Re: revised w/nocase text, considering titlecase and cased

This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.

On Sat, May 10, 2014 at 6:59 AM, John Cowan <cowan@xxxxxxxxxxxxxxxx> wrote:

Alex Shinn scripsit:

> Note that both (w/nocase upper) and (w/nocase lower)
> are effectively ways to access the Unicode "cased"
> property (L&),

I don't think so, no, unless I misunderstand how "w/nocase" works.
Having case is not synonymous with being part of a casing pair: there
are lower case letters like ẗ (t with diaeresis) that have no upper
case equivalents,

So (w/nocase upper) would not include ẗ since there's
no uppercase equivalent to map from. Technically full
case folding is allowed so that T followed by U+0308
COMBINING DIAERESIS would match ẗ, but I don't
think we can include conditionally included characters
in the expanded char set.
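
As a rough illustration of the point about ẗ, here is a small Python sketch. It is not part of SRFI 115; it assumes Python's str.upper() and str.casefold() (which apply full Unicode case mapping and full case folding) as a stand-in for what a regexp engine might do, and the helper names are made up for the example.

```python
import unicodedata

T_DIAERESIS = "\u1e97"       # ẗ LATIN SMALL LETTER T WITH DIAERESIS
T_PLUS_COMBINING = "T\u0308"  # T followed by U+0308 COMBINING DIAERESIS


def has_single_char_uppercase(ch):
    """True if ch uppercases to exactly one, different, character.

    Hypothetical helper: a length-1 result of str.upper() is used as an
    approximation of "has an uppercase equivalent to map from".
    """
    up = ch.upper()
    return len(up) == 1 and up != ch


def fold_equal(a, b):
    """Compare two strings under full case folding."""
    return a.casefold() == b.casefold()


# ẗ is a lowercase letter ...
print(unicodedata.category(T_DIAERESIS))
# ... with no one-character uppercase form, unlike plain "t" ...
print(has_single_char_uppercase(T_DIAERESIS), has_single_char_uppercase("t"))
# ... yet it matches T + U+0308 once full case folding is applied.
print(fold_equal(T_DIAERESIS, T_PLUS_COMBINING))
```

So a char set built from one-to-one mappings cannot reach ẗ from the upper-case side, while a string-level comparison with full folding can.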

On the other hand, it just seems surprising enough that
w/nocase on one case isn't equivalent to the union of
all cases. And that (w/nocase upper) != (w/nocase lower).
Enough so that it might be worth making an exception
for this.
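
To make the asymmetry concrete, here is a hedged Python sketch. It assumes that `upper` and `lower` can be approximated by the general categories Lu and Ll, and that w/nocase expands a char set only through single-character case mappings; both are simplifications for illustration, and `nocase_expand` is a made-up name, not anything defined by the SRFI.

```python
import sys
import unicodedata


def nocase_expand(chars):
    """Close a set of characters under single-character case mappings."""
    expanded = set(chars)
    for ch in chars:
        for mapped in (ch.lower(), ch.upper()):
            if len(mapped) == 1:  # skip multi-character (full) mappings
                expanded.add(mapped)
    return expanded


# Approximate `upper` and `lower` by general category (an assumption).
upper, lower = set(), set()
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    cat = unicodedata.category(ch)
    if cat == "Lu":
        upper.add(ch)
    elif cat == "Ll":
        lower.add(ch)

nocase_upper = nocase_expand(upper)
nocase_lower = nocase_expand(lower)

# ẗ (U+1E97) is only reachable from the lower-case side.
print("\u1e97" in nocase_upper, "\u1e97" in nocase_lower)
# MATHEMATICAL BOLD CAPITAL A (U+1D400) only from the upper-case side.
print("\U0001d400" in nocase_upper, "\U0001d400" in nocase_lower)
# Hence the two expanded sets differ.
print(nocase_upper == nocase_lower)
```

Under these assumptions neither expansion equals the union of all cased letters, which is the surprise described above.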

It seems in Perl the /i modifier doesn't affect Unicode
properties at all, but does affect ASCII:

$ perl -e 'print "ok\n" if "t" =~ /\p{Lu}/i'
$ perl -e 'print "ok\n" if "t" =~ /[[:upper:]]/i'
$ perl -Mlocale -e 'print "ok\n" if "t" =~ /[[:upper:]]/i'

This is inconsistent, so not worth following. Also for
the sake of efficiency it's probably better to allow
mapping cased upper and lower to an existing char-set.
Thus when translating to an existing POSIX-style
regexp syntax, (w/nocase upper) becomes "\\p{L&}"
instead of a ridiculously long explicit char set.

and the mathematical letters at U+1D400 et seqq. have
case but don't form casing pairs.

... What were they smoking?

Alex