This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.
Mark H Weaver scripsit:

> Normalization is an important part of the solution, but it alone does
> not solve the problem where no precomposed character exists. Figure 5
> of TR15 gives some examples where NFC produces more than one codepoint
> per character.

Ah, I understand now. The trouble is that normalization of a char-set pattern causes it to mean something completely different. Thus ("á") (i.e. ("\xE1;")) matches the character #\xE1;, whereas ("á") (i.e. ("a\x301;")), although canonically equivalent to it, matches the disjunction of #\x61; and #\x301;. They will never match the same thing, which is counterintuitive.

Unfortunately, I don't see what can be done about this other than to issue stern warnings in the documentation.

Alex, do you think you can make a w/norm or norm-char-set SRE pattern work? It would mean transforming a char-set pattern containing "a\x301;" into one that contains "\xE1;", and also transforming a pattern containing "f\x301;" (which has no precomposed form) into (seq #\f #\x301;). In the general case it would produce an alternation of sequences, and would have to normalize the part of the text being matched as well (unless it comes in two flavors, one for NFD and the other for NFC).

-- 
John Cowan http://www.ccil.org/~cowan cowan@xxxxxxxx
We pledge allegiance to the penguin and to the intellectual property regime for which he stands, one world under Linux, with free music and open source software for all. --Julian Dibbell on Brazil, edited
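
The transformation asked about above can be illustrated with a rough sketch. It is written in Python rather than Scheme, purely so that the standard unicodedata and re modules can be used, and it is not part of SRFI 115 or of any implementation of it. The names normalize_char_set and compile_norm_char_set are hypothetical, and grouping by combining class is a simplification of real canonical clustering. The sketch splits a char-set string into base-plus-combining-mark clusters, puts each into NFC, and builds an alternation of the resulting sequences, which is then matched against NFC-normalized text.

```python
import re
import unicodedata


def normalize_char_set(members: str) -> list[str]:
    """Hypothetical helper: turn a char-set string into NFC clusters.

    Assumption: a cluster is a character with combining class 0 followed
    by any characters with a non-zero combining class. This is simpler
    than full canonical or grapheme clustering.
    """
    clusters: list[str] = []
    for ch in unicodedata.normalize("NFD", members):
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return [unicodedata.normalize("NFC", c) for c in clusters]


def compile_norm_char_set(members: str) -> re.Pattern:
    """Hypothetical helper: build an alternation of the NFC clusters."""
    clusters = normalize_char_set(members)
    # Longest alternatives first, so multi-codepoint clusters are tried
    # before any shorter alternative.
    ordered = sorted(set(clusters), key=len, reverse=True)
    return re.compile("|".join(re.escape(c) for c in ordered))


def matches(pattern: re.Pattern, text: str) -> bool:
    """Normalize the subject text to NFC before matching, as the pattern is."""
    return pattern.fullmatch(unicodedata.normalize("NFC", text)) is not None


# A naive char-set treats the decomposed string as two separate members.
naive = set("a\u0301")

# The normalized form yields one precomposed member for a + U+0301,
# and a two-codepoint sequence for f + U+0301.
pat = compile_norm_char_set("a\u0301f\u0301")
```

With this sketch, both spellings of the accented a match the same pattern, while a bare "f" does not, because f followed by U+0301 survives only as a two-codepoint sequence.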