Re: Encodings.

This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 contain all messages, not just those from before July 7th, 2015.

On Thu, Feb 12, 2004 at 07:41:07AM +0100, Ken Dickey wrote:
> Back to dumb questions.
> I assume that it is useful to distinguish the two goals of
> 	extending programming language identifiers
> and	processing Unicode data.
> For identifiers, either we have EQ? preserving literals, or
> "literalization of bits" (I.e. string preservation).
> So w.r.t. identifiers, why is normalization needed at all? To my mind,
> normalization is a library procedure (set of procedures) for dealing
> with Unicode data/codepoints.

Normalization is a way to eliminate "trivial" differences between
strings. There are often several ways to encode exactly the same character
(grapheme), and normalization is a procedure for folding all of the
variants down to a single, canonical encoding.
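To make this concrete, here is a small sketch in Python (used here only because its standard library exposes Unicode normalization; the same idea applies to any Scheme implementation's Unicode support). It shows two encodings of the grapheme "é" that differ at the code-point level until normalized:

```python
import unicodedata

# Two encodings of the same grapheme "é":
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Different code-point sequences, so a naive comparison says "not equal"
print(precomposed == decomposed)        # False

# NFC folds both variants down to the single precomposed form
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                   # True
```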

If you're doing a simple test for exact string equality (string=, for
example, but not string-ci=) then normalization is both necessary and
sufficient to prepare for it. It's necessary, because without it,
trivial differences will result in false negatives. It's also sufficient
for a simple grapheme-by-grapheme (or binary) comparison.
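A sketch of that "normalize, then compare code point by code point" strategy, again in Python for illustration (the helper name `nfc_equal` is my own, not from any standard):

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Code-point-wise equality after canonical (NFC) normalization.

    Sufficient for exact equality (string=-style), but deliberately
    not case-insensitive (string-ci=-style)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(nfc_equal("caf\u00e9", "cafe\u0301"))   # True: only encoding differs
print(nfc_equal("caf\u00e9", "CAF\u00c9"))    # False: case is not "trivial"
```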

> Defining valid identifier syntax such that case folding of
> (unnormalized) identifier literals should be sufficient.
> What am I missing?

If you're already folding case or otherwise saying "these characters are
equivalent" (i.e., using string collation for equality testing), then I
suppose you don't *need* to normalize. I think it does simplify
processing a bit, because you deal with all the encoding quirks first,
then you deal with the language quirks.

Or to put it another way, case folding is just a specific kind of
normalization, one that removes the "trivial encoding differences"
between variants of the letter A.
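The analogy can be seen directly: just as NFC collapses encoding variants, case folding collapses case variants (Python's `str.casefold` implements Unicode full case folding, shown here purely as an illustration):

```python
# Case folding is "normalization for case": it maps all case variants
# of a letter to one canonical (folded) form.
print("A".casefold() == "a".casefold())                    # True

# Full case folding even handles one-to-many mappings,
# e.g. German sharp s folds to "ss":
print("STRASSE".casefold() == "stra\u00dfe".casefold())    # True
```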

By the way, regarding the issue I brought up about Latin B vs Greek B:
After posting, I realized that it might be better to handle that with
collation rules instead of normalization (folding). Then again, I
suppose that it doesn't make much of a difference. The two operations
are equivalent with regard to equality testing (although they do have
different side effects).
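For what it's worth, no Unicode normalization form, canonical or compatibility, unifies Latin B with Greek Beta; a quick check (Python, illustrative only) confirms that equating them really would require a collation or tailoring rule rather than folding:

```python
import unicodedata

latin_b = "B"          # U+0042 LATIN CAPITAL LETTER B
greek_beta = "\u0392"  # U+0392 GREEK CAPITAL LETTER BETA

# Even the compatibility forms (NFKC/NFKD) keep the two scripts distinct,
# so identifying them is a collation decision, not a normalization one.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, unicodedata.normalize(form, greek_beta) == latin_b)  # all False
```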

Hm, I wonder whether the UC expects people with special collation needs
to use NFC for normalization, followed by a domain-specific folding or
collation step? That's kind of weird, though, because the second step
will often include some (but not all) of the compatibility
Bradd W. Szonye