
Re: revised w/nocase text, considering titlecase and cased

This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.

On Sat, May 10, 2014 at 6:59 AM, John Cowan <cowan@xxxxxxxxxxxxxxxx> wrote:
Alex Shinn scripsit:

> > Note that both (w/nocase upper) and (w/nocase lower)
> > are effectively ways to access the Unicode "cased"
> > property (L&),
>
> I don't think so, no, unless I misunderstand how "w/nocase" works.
> Having case is not synonymous with being part of a casing pair: there
> are lower case letters like ẗ (t with diaeresis) that have no upper
> case equivalents,

So (w/nocase upper) would not include ẗ, since there's
no uppercase equivalent to map from.  Technically full
case folding is allowed, so that T followed by U+0308
COMBINING DIAERESIS would match ẗ, but I don't
think we can include such conditionally matching
characters in the expanded char set.
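
To make the ẗ case concrete, here is a small Python sketch (Python rather than Scheme only because its standard library exposes the Unicode full case mappings directly; it says nothing about how an SRFI 115 implementation would do this). It checks that U+1E97 is a lowercase letter whose uppercase form is a two-code-point sequence, and that the two forms agree only under full case folding of whole strings.

```python
import unicodedata

t_diaeresis = "\u1e97"  # LATIN SMALL LETTER T WITH DIAERESIS

# General category of the character: a lowercase letter.
category = unicodedata.category(t_diaeresis)

# Python's str.upper() applies the full case mapping, which for this
# character is a sequence of two code points rather than a single one.
upper_mapping = t_diaeresis.upper()
upper_names = [unicodedata.name(c) for c in upper_mapping]

# Full case folding of both forms gives the same two-code-point string,
# so they match caselessly as strings, not as single characters.
folds_equal = t_diaeresis.casefold() == "T\u0308".casefold()

print(category, upper_names, folds_equal)
```

Since the uppercase side is a sequence and not a single character, it cannot be a member of a char-set, which is the difficulty described above.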

On the other hand, it seems surprising that w/nocase
on one case isn't equivalent to the union of all cases,
and that (w/nocase upper) != (w/nocase lower).
Surprising enough that it might be worth making an
exception for this.
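
The asymmetry can be checked with a rough Python sketch. The `nocase` function below is a hypothetical model, not the SRFI 115 definition: it assumes w/nocase expands a char set to every single code point whose full case folding equals the folding of some member, and it takes `upper` and `lower` to be general categories Lu and Ll.

```python
import sys
import unicodedata

ALL_CHARS = [chr(cp) for cp in range(sys.maxunicode + 1)]

def chars_in_category(cat):
    """All code points whose general category is `cat` (e.g. 'Lu')."""
    return {c for c in ALL_CHARS if unicodedata.category(c) == cat}

def nocase(chars):
    """Hypothetical model of w/nocase: every single code point whose
    case folding equals the case folding of some member of `chars`."""
    folds = {c.casefold() for c in chars}
    return {c for c in ALL_CHARS if c.casefold() in folds}

nocase_upper = nocase(chars_in_category("Lu"))
nocase_lower = nocase(chars_in_category("Ll"))

t_diaeresis = "\u1e97"       # lowercase, no single-character uppercase
math_bold_a = "\U0001d400"   # MATHEMATICAL BOLD CAPITAL A, no lowercase mapping

print(t_diaeresis in nocase_upper, t_diaeresis in nocase_lower)
print(math_bold_a in nocase_upper, math_bold_a in nocase_lower)
```

Under this model each expanded set contains a character the other lacks, so neither equals the union of all cased letters.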

It seems in Perl the /i modifier doesn't affect Unicode
properties at all, but does affect ASCII:

$ perl -e 'print "ok\n" if "t" =~ /\p{Lu}/i'
$ perl -e 'print "ok\n" if "t" =~ /[[:upper:]]/i'
$ perl -Mlocale -e 'print "ok\n" if "t" =~ /[[:upper:]]/i'

This is inconsistent, so not worth following.  Also, for
the sake of efficiency, it's probably better to allow
mapping (w/nocase upper) and (w/nocase lower) to an
existing char-set.  Thus when translating to an existing
posix-style regexp syntax, (w/nocase upper) becomes
"\\p{L&}" instead of a ridiculously long explicit char set.

> and the mathematical letters at U+1D400 et seqq. have
> case but don't form casing pairs.

... What were they smoking?