Mark H Weaver scripsit:

> Normalization is an important part of the solution, but it alone does
> not solve the problem where no precomposed character exists. Figure 5
> of TR15 gives some examples where NFC produces more than one codepoint
> per character.

Ah, I understand now. The trouble is that normalization of a char-set
pattern causes it to mean something completely different. Thus ("á")
(i.e. ("\xE1;")) matches the character #\xE1, whereas ("á")
(i.e. ("a\x301;")), although canonically equivalent to it, matches the
disjunction of #\x61 and #\x301. They will never match the same thing,
which is counterintuitive. Unfortunately, I don't see what can be done
about this other than to issue stern warnings in the documentation.
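
To make the mismatch concrete, here is a small sketch. It is in Python
rather than Scheme, only because Python's standard unicodedata module
makes the normalization calls easy to check; the variable names are
mine and purely illustrative. A char-set is modeled as a plain set of
code points.

```python
import unicodedata

precomposed = "\u00e1"   # á as the single code point U+00E1
decomposed = "a\u0301"   # a followed by U+0301 COMBINING ACUTE ACCENT

# The two spellings are canonically equivalent...
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# ...but read as char-sets (sets of code points) they have nothing in common.
set_precomposed = set(precomposed)   # one member: U+00E1
set_decomposed = set(decomposed)     # two members: U+0061 and U+0301
assert set_precomposed.isdisjoint(set_decomposed)

# f + acute has no precomposed form, so NFC still yields two code points.
f_acute_nfc = unicodedata.normalize("NFC", "f\u0301")
assert len(f_acute_nfc) == 2
```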
Alex, do you think you can make a w/norm or norm-char-set SRE pattern
work? It would mean transforming a charset pattern containing "a\x301;"
to one that contains "\xE1;", and also transforming a pattern containing
"f\x301;" (which has no precomposed form) into (seq #\f #\x301).
In the general case it would produce an alternation of sequences,
and would have to normalize the part of the text being matched as well
(unless it comes in two flavors, one for NFD and the other for NFC).
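
A rough sketch of that transformation, again in Python and not a
proposal for the actual SRE implementation: the names clusters,
norm_char_set and matches_first are hypothetical. It assumes the NFC
flavor and approximates a "character" as a base code point plus any
following code points with a nonzero canonical combining class, which
is cruder than real grapheme cluster segmentation.

```python
import unicodedata

def clusters(s):
    """NFC-normalize s and group each base code point with the
    combining marks (nonzero combining class) that follow it."""
    out = []
    for ch in unicodedata.normalize("NFC", s):
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch      # attach the mark to the preceding base
        else:
            out.append(ch)     # start a new cluster
    return out

def norm_char_set(chars):
    """Split a char-set string into single code points (an ordinary
    char-set) and multi-code-point sequences (the alternation)."""
    cs = set(clusters(chars))
    singles = {c for c in cs if len(c) == 1}
    sequences = sorted(c for c in cs if len(c) > 1)
    return singles, sequences

def matches_first(chars, text):
    """True if the first cluster of the normalized text belongs to the
    normalized char-set."""
    singles, sequences = norm_char_set(chars)
    first = clusters(text)[:1]
    return bool(first) and (first[0] in singles or first[0] in sequences)
```

With this, matches_first("a\u0301", "\u00e1bc") and
matches_first("\u00e1", "a\u0301bc") both succeed, while
matches_first("f\u0301", "foo") fails because the text's first cluster
is a bare f.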