
Re: Should SRFI-115 character sets match extended grapheme clusters?



On Mon, May 12, 2014 at 6:39 AM, John Cowan <cowan@xxxxxxxxxxxxxxxx> wrote:
Mark H Weaver scripsit:

> Normalization is an important part of the solution, but it alone does
> not solve the problem where no precomposed character exists.  Figure 5
> of TR15 gives some examples where NFC produces more than one codepoint
> per character.

Ah, I understand now.  The trouble is that normalization of a char-set
pattern causes it to mean something completely different.  Thus ("á")
(i.e. ("\xE1;")) matches the character #\xE1;, whereas ("á")
(i.e. ("a\x301;")), although canonically equivalent to it, matches the
disjunction of #\x61; and #\x301;.  They will never match the same thing,
which is counterintuitive.  Unfortunately, I don't see what can be done
about this other than to issue stern warnings in the documentation.
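
To make the mismatch concrete, here is a small illustration.  It is
written in Python rather than Scheme, and it uses Python's re character
classes as a stand-in for SRE char-sets (an assumption made purely for
illustration; SRFI-115 itself is not involved):

```python
import re
import unicodedata

precomposed = "\u00e1"   # "á" as the single code point U+00E1
decomposed = "a\u0301"   # "a" followed by U+0301 COMBINING ACUTE ACCENT

# The two spellings are canonically equivalent ...
equivalent = unicodedata.normalize("NFC", decomposed) == precomposed

# ... but used as character classes they mean different things.
class_pre = re.compile("[" + precomposed + "]")   # the set {U+00E1}
class_dec = re.compile("[" + decomposed + "]")    # the set {U+0061, U+0301}

pre_on_pre = class_pre.fullmatch(precomposed)     # a match
dec_on_pre = class_dec.fullmatch(precomposed)     # None
dec_on_a = class_dec.fullmatch("a")               # a match
dec_on_accent = class_dec.fullmatch("\u0301")     # a match
pre_on_a = class_pre.fullmatch("a")               # None
```

The decomposed class accepts a bare "a" or a bare accent, but never the
precomposed character, which is the counterintuitive behavior
described above.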

Alex, do you think you can make a w/norm or norm-char-set SRE pattern
work?  It would mean transforming a charset pattern containing "a\x301;"
to one that contains "\xE1;", and also transforming a pattern containing
"f\x301;" (which has no precomposed form) into (seq #\f #\x301;).
In the general case it would produce an alternation of sequences,
and would have to normalize the part of the text being matched as well
(unless it comes in two flavors, one for NFD and the other for NFC).
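
A rough sketch of the pattern-side half of that transformation, again in
Python for illustration.  The names split_clusters and norm_char_set are
hypothetical, and grouping by nonzero canonical combining class is a
simplification: it is not the same thing as extended grapheme cluster
segmentation.

```python
import unicodedata

def split_clusters(s):
    """Group each base code point with the combining marks that follow it.

    Simplification: "combining mark" here means a nonzero canonical
    combining class, which does not cover every grapheme extender.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def norm_char_set(members):
    """Split char-set members into single code points and sequences.

    Returns (singles, sequences): clusters whose NFC form is one code
    point can stay in an ordinary char-set; the rest must be matched
    as sequences, so the whole pattern becomes an alternation.
    """
    singles, sequences = set(), []
    for cluster in split_clusters(members):
        nfc = unicodedata.normalize("NFC", cluster)
        if len(nfc) == 1:
            singles.add(nfc)
        else:
            sequences.append(nfc)
    return singles, sequences

# "a\u0301" composes to U+00E1; "f\u0301" has no precomposed form.
singles, sequences = norm_char_set("a\u0301f\u0301")
```

Here singles ends up holding "\xE1;" and sequences holds "f\x301;", which
corresponds to an alternation of a char-set and (seq #\f #\x301;).  The
text being matched would still need to be in the same normalization form.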

Normalization came up among the early issues and was dismissed
because of lack of implementation support and unclear costs in
new implementations.  I think good recommended practice for now
is to just normalize both inputs and patterns separately.
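
A minimal sketch of that practice, shown in Python for illustration (the
helper names nfc and search_normalized are hypothetical, and Python's re
stands in for an SRE matcher):

```python
import re
import unicodedata

def nfc(s):
    """Return s in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)

def search_normalized(pattern, text):
    """Normalize pattern and text separately, then search.

    Match offsets refer to the normalized text, not the original.
    """
    return re.search(nfc(pattern), nfc(text))

pattern = "caf\u00e9"           # precomposed e-acute in the pattern
text = "cafe\u0301 au lait"     # decomposed e-acute in the input

raw = re.search(pattern, text)              # None: code points differ
normalized = search_normalized(pattern, text)
```

This covers literal text, but it does not help a char-set whose member
has no precomposed form, since NFC leaves such a member as two code
points.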

-- 
Alex