Re: revised w/nocase text, considering titlecase and cased



On Mon, May 12, 2014 at 2:54 PM, John Cowan <cowan@xxxxxxxxxxxxxxxx> wrote:
Alex Shinn scripsit:

>   As a special case, the pre-defined named character sets
>   upper and lower (and their aliases upper-case and lower-case)
>   are defined to match all characters with the cased property (L&).
>   Note also all other pre-defined named character sets are
>   equivalent to themselves under w/nocase.
>
>   Rationale: The differences between the case insensitive
>   lower and upper and the cased property are few and unlikely
>   to match user intention.  Moreover, unlike the algorithmically
>   mapped upper and lower char-sets, the cased property is
>   readily available in most Unicode implementations.

Looks good to me.

I think this language should also be added:

    Note that placing a sequence consisting of a base character
    and combining characters into a character string representing
    a character set will not do what the user probably expects;
    it will create a character set pattern containing the base
    character and the combining character(s) as alternatives.
    For the same reason, it is inadvisable to apply Unicode
    normalization to such strings.

The description would go after the language for the (<string>)
literal char-set sre, which currently says:

  The set of chars as formed by (string->char-set <string>).

How about just adding:

  Note that string->char-set works on code points,
  not grapheme clusters, so any combining characters in
  <string> will be treated separately from any preceding
  base characters.
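
To make the code-point behaviour concrete, here is a rough sketch in
Python rather than Scheme. The helper string_to_char_set is a
hypothetical stand-in for string->char-set, not its actual
implementation; it assumes only that the set gets one member per code
point, as the note says.

```python
import unicodedata

def string_to_char_set(s):
    """Hypothetical analogue of string->char-set: one member per code point."""
    return set(s)

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points,
# although a user sees a single accented letter.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# The decomposed string yields a two-member set (base and accent as
# alternatives); the composed string yields a one-member set.
print(sorted(hex(ord(c)) for c in string_to_char_set(decomposed)))
print(sorted(hex(ord(c)) for c in string_to_char_set(composed)))
```

The two sets differ, so normalizing the string changes which
characters the resulting char-set matches.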

The only remaining issue is whether we want to expose the
cased and titlecase char-sets.
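
For reference, a small sketch in Python of what the two sets would
contain. It assumes, as the text above does, that "cased" means the
general categories Lu, Ll and Lt (L&), and that "titlecase" means Lt
alone. The predicate names is_cased and is_titlecase are made up for
this illustration and are not part of any proposal.

```python
import unicodedata

def is_titlecase(ch):
    """True if ch has general category Lt (titlecase letter)."""
    return unicodedata.category(ch) == "Lt"

def is_cased(ch):
    """True if ch has general category Lu, Ll or Lt (written L& above)."""
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")

# U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON is one
# of the few titlecase letters; it is cased but neither Lu nor Ll.
for ch in ("A", "a", "\u01c5", "1"):
    print(hex(ord(ch)), unicodedata.category(ch), is_cased(ch), is_titlecase(ch))
```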

-- 
Alex