
Re: revised w/nocase text, considering titlecase and cased

On Sat, May 10, 2014 at 6:59 AM, John Cowan <cowan@xxxxxxxxxxxxxxxx> wrote:
Alex Shinn scripsit:

> Note that both (w/nocase upper) and (w/nocase lower)
> are effectively ways to access the Unicode "cased"
> property (L&),

I don't think so, no, unless I misunderstand how "w/nocase" works.
Having case is not synonymous with being part of a casing pair: there
are lower case letters like ẗ (t with diaeresis) that have no upper
case equivalents,

So (w/nocase upper) would not include ẗ since there's
no uppercase equivalent to map from.  Technically full
case folding is allowed so that T followed by U+0308
COMBINING DIAERESIS would match ẗ, but I don't
think we can include conditionally included characters
in the expanded char set.
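
To make the ẗ case concrete, here is a small sketch in Python (used
only for illustration, it is not part of the proposal).  It relies on
the fact that Python's str.upper() applies the full Unicode case
mappings; the helper has_single_char_upper is a hypothetical name
introduced for this example.

```python
import unicodedata

t_diaeresis = "\u1e97"  # LATIN SMALL LETTER T WITH DIAERESIS

# The character is a lowercase letter (category Ll), so it "has case" ...
assert unicodedata.category(t_diaeresis) == "Ll"

# ... but its full uppercase mapping is two code points: T + U+0308.
upper_seq = t_diaeresis.upper()
print([f"U+{ord(c):04X}" for c in upper_seq])


def has_single_char_upper(ch: str) -> bool:
    """True if ch maps to exactly one, different, uppercase character.

    Hypothetical helper: a character failing this test cannot be reached
    from an uppercase char set by a one-to-one case mapping.
    """
    up = ch.upper()
    return len(up) == 1 and up != ch


print(has_single_char_upper("t"))          # plain ASCII t pairs with T
print(has_single_char_upper(t_diaeresis))  # no single-character partner
```

Under this reading, a char-set-level expansion of upper never picks up
ẗ, because only the two-character sequence maps to it.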

On the other hand, it seems surprising that w/nocase
applied to one case isn't equivalent to the union of
all cases, and that (w/nocase upper) != (w/nocase lower).
Surprising enough that it might be worth making an
exception for this.

It seems in Perl the /i modifier doesn't affect Unicode
properties at all, but does affect ASCII:

$ perl -e 'print "ok\n" if "t" =~ /\p{Lu}/i'
$ perl -e 'print "ok\n" if "t" =~ /[[:upper:]]/i'
$ perl -Mlocale -e 'print "ok\n" if "t" =~ /[[:upper:]]/i'

This is inconsistent, so not worth following.  Also, for
the sake of efficiency, it's probably better to allow
mapping (w/nocase upper) and (w/nocase lower) to an
existing char-set.  Thus when translating to an existing
POSIX-style regexp syntax, (w/nocase upper) becomes
"\\p{L&}" instead of a ridiculously long explicit char set.
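
As a rough illustration of the gap between the two readings, the
following Python sketch compares a one-step case expansion of the
uppercase letters with the set of all L& letters (taken here to mean
categories Lu, Ll and Lt).  The function nocase_expand is a
hypothetical helper, and restricting it to single-character mappings
is an assumption about how a char-set-level w/nocase would behave, not
the semantics of any existing implementation.

```python
import sys
import unicodedata


def nocase_expand(chars):
    """Add the single-character upper/lower mappings of each character.

    Hypothetical helper: multi-character (full) mappings are skipped,
    since they cannot be represented as members of a char set.
    """
    out = set()
    for ch in chars:
        out.add(ch)
        for mapped in (ch.lower(), ch.upper()):
            if len(mapped) == 1:
                out.add(mapped)
    return out


upper = set()  # general category Lu
cased = set()  # L& taken as Lu + Ll + Lt
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    cat = unicodedata.category(ch)
    if cat == "Lu":
        upper.add(ch)
    if cat in ("Lu", "Ll", "Lt"):
        cased.add(ch)

expanded = nocase_expand(upper)

# Cased letters that the one-step expansion of `upper` never reaches.
missing = cased - expanded
print(len(upper), len(expanded), len(cased), len(missing))
```

The exact counts depend on the Unicode version bundled with the Python
build, so none are quoted here; the point is only that missing is
non-empty, which is what mapping to L& would paper over.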

 and the mathematical letters at U+1D400 et seqq. have
case but don't form casing pairs.

... What were they smoking?