On Thu, Feb 12, 2004 at 07:41:07AM +0100, Ken Dickey wrote:
> Back to dumb questions.
> I assume that it is useful to distinguish the two goals of
> extending programming language identifiers
> and processing Unicode data.
> For identifiers, either we have EQ? preserving literals, or
> "literalization of bits" (I.e. string preservation).
> So w.r.t. identifiers, why is normalization needed at all? To my mind,
> normalization is a library procedure (set of procedures) for dealing
> with Unicode data/codepoints.
Normalization is a way to eliminate "trivial" differences between
strings. There are often several ways to encode exactly the same
character (grapheme), and normalization is a procedure for folding all
of the variants down to a single, canonical encoding.
If you're doing a simple test for exact string equality (string=, for
example, but not string-ci=) then normalization is both necessary and
sufficient to prepare for it. It's necessary, because without it,
trivial differences will result in false negatives. It's also sufficient
for a simple grapheme-by-grapheme (or binary) comparison.
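As a sketch of the point above, using Python's standard `unicodedata` module: the same grapheme can arrive as one precomposed code point or as a base letter plus a combining mark, and a naive equality test gives a false negative until both sides are normalized.

```python
import unicodedata

# "e with acute" encoded two ways:
composed = "\u00e9"        # U+00E9, precomposed
decomposed = "e\u0301"     # "e" + U+0301 COMBINING ACUTE ACCENT

# Naive comparison sees two different code-point sequences
print(composed == decomposed)  # False

# Normalizing both sides (here to NFC) makes simple equality sufficient
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)  # True
```

Either NFC or NFD works here, as long as both operands use the same form.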
> Defining valid identifier syntax such that case folding of
> (unnormalized) identifier literals should be sufficient.
> What am I missing?
If you're already folding case or otherwise saying "these characters are
equivalent" (i.e., using string collation for equality testing), then I
suppose you don't *need* to normalize. I think it does simplify
processing a bit, because you deal with all the encoding quirks first,
then you deal with the language quirks.
Or to put it another way, case folding is just a specific kind of
normalization that removes the "trivial encoding differences" between
variants of the letter A.
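To illustrate "variants of the letter A" collapsing under folding, here is a small Python sketch: compatibility normalization (NFKC) plus Unicode case folding maps Latin A, a, and the fullwidth compatibility variant all onto one string.

```python
import unicodedata

variants = ["A", "a", "\uFF21"]  # Latin A, Latin a, FULLWIDTH LATIN CAPITAL LETTER A

# NFKC removes the compatibility (fullwidth) difference,
# casefold() removes the case difference
folded = {unicodedata.normalize("NFKC", v).casefold() for v in variants}
print(folded)  # {'a'}
```

Note the ordering: the encoding quirk (fullwidth form) is handled by normalization first, then the language quirk (case) by folding, exactly as described above.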
By the way, regarding the issue I brought up about Latin B vs Greek B:
After posting, I realized that it might be better to handle that with
collation rules instead of normalization (folding). Then again, I
suppose that it doesn't make much of a difference. The two operations
are equivalent with regard to equality testing (although they do have
different side effects).
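For what it's worth, Python's `unicodedata` confirms that normalization alone never unifies the two letters: Latin B and Greek capital beta are distinct code points with no canonical or compatibility decomposition, so any equivalence between them would indeed have to come from a collation (or folding) rule layered on top.

```python
import unicodedata

latin_b = "\u0042"      # LATIN CAPITAL LETTER B
greek_beta = "\u0392"   # GREEK CAPITAL LETTER BETA

# No normalization form maps one onto the other
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    a = unicodedata.normalize(form, latin_b)
    b = unicodedata.normalize(form, greek_beta)
    print(form, a == b)  # all False
```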
Hm, I wonder whether the UC expects people with special collation needs
to use NFC for normalization, followed by a domain-specific folding or
collation step? That's kind of weird, though, because the second step
will often include some (but not all) of the compatibility mappings.
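The overlap between NFC and a later folding step can be seen with a compatibility character like the fi ligature: NFC leaves it alone (its decomposition is compatibility-only, not canonical), while NFKC applies the compatibility mapping. A domain-specific folding step after NFC would have to re-implement exactly those mappings it cares about.

```python
import unicodedata

lig = "\ufb01"  # LATIN SMALL LIGATURE FI

print(unicodedata.normalize("NFC", lig))   # unchanged: compatibility decomposition only
print(unicodedata.normalize("NFKC", lig))  # 'fi': compatibility mapping applied
```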
Bradd W. Szonye