This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
bear scripsit:

> [...] specifying only the Right Thing [...]

Unicode is not about the Right Thing; it's about doing the best thing
possible in the circumstances.

> This problem is much less severe if your characters are grapheme
> clusters.  But thanks to Ligatures (which cannot appear in canonical
> strings) and eszett (which, unfortunately, can) it is not completely
> eliminated by moving to grapheme clusters.

There are four kinds of normalization, only two of which (and the two
least commonly used) remove ligatures.  Canonical normalizations do not
remove ligatures, with the sole exception of U+FB1F, HEBREW LIGATURE
YIDDISH YOD YOD PATAH.  Compatibility normalizations do remove most
ligatures (there are some characters called LIGATURE for historical
reasons which do not function as ligatures).

> Right; substrings that aren't valid strings, or which combine into
> something that isn't the original string, can result when you split
> grapheme clusters; This happens when you take substrings on arbitrary
> codepoint boundaries, or do buffered operations on arbitrary codepoint
> boundaries, or any of a number of other things.

These things turn out not to be the case.  They are true if you split
strings on arbitrary *octet* or *code unit* boundaries, but if you
stick to *codepoint* boundaries, they are not true.  Any sequence of
codepoints is a valid string, and no amount of taking apart and putting
back together can change the validity or the interpretation of the
string.

> But these are problems that go away if your characters are grapheme
> clusters.

The description of grapheme clusters in Unicode makes it clear that
they are neither correct nor complete in all circumstances, just yet
another global definition that provides a fairly good approximation.

> I don't believe in this.  If you're going to limit it to ASCII,
> then 'ascii' ought to be in its name.

I agree.

> The thing is, if underspecified these operations will be nearly
> useless.  Portable code will be unable to rely on them doing any
> particular thing.

I agree with that too.

> It's my opinion that the only way to make normalization transparent to
> the programmer and user is to use grapheme-cluster characters instead
> of codepoint characters.  Normalization consists in altering codepoint
> sequences within grapheme clusters only; if this is your character
> unit, then it can be done without disrupting character indexes or
> counts, saving everyone a lot of headaches.

You do realize that there is a countable infinity of different grapheme
clusters?

> One thing about normalization: Ligatures do not exist in normalized
> text, because they have canonical decompositions.

Not so; see above.

> As for conversion between different normalized forms, I think that the
> unicode normalization form is properly a property of the port through
> which data is read or written.  The port reads codepoints in some
> normalization form, and delivers _characters_ represented according to
> the abstraction you use internally.  Likewise, it accepts abstract
> characters and writes codepoints in some normalization form.

That's an interesting idea, but IMHO too radical at present.

> This introduces a distinction between text ports (which read and write
> characters, full-stop) and binary ports (which read and write octets).
> If you want to read or write characters on a binary port, you *SHOULD*
> have to state explicitly what encoding to use.

Indeed.  That, however, has to do with encodings, not normalization
forms.

-- 
John Cowan  http://www.ccil.org/~cowan  <jcowan@xxxxxxxxxxxxxxxxx>
"Any legal document draws most of its meaning from context.  A telegram
that says 'SELL HUNDRED THOUSAND SHARES IBM SHORT' (only 190 bits in
5-bit Baudot code plus appropriate headers) is as good a legal document
as any, even sans digital signature." --me
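[Editorial note: the two normalization claims above — that canonical
normalization preserves most ligatures while compatibility normalization
removes them, and that any codepoint-boundary substring is a valid
string — can be checked mechanically.  A minimal sketch in Python, used
here only because its `unicodedata` module exposes all four Unicode
normalization forms; the thread itself concerns Scheme:]

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI has a *compatibility* decomposition.
lig = "\ufb01"

# Canonical normalization (NFC/NFD) leaves the ligature alone...
assert unicodedata.normalize("NFC", lig) == lig
assert unicodedata.normalize("NFD", lig) == lig

# ...while compatibility normalization (NFKC/NFKD) decomposes it.
assert unicodedata.normalize("NFKC", lig) == "fi"

# Any slice taken on a codepoint boundary is itself a valid string,
# even when it splits a grapheme cluster: "e" + COMBINING ACUTE ACCENT.
s = "caf" + "e\u0301"
head, tail = s[:4], s[4:]   # split between base char and combining mark
assert head + tail == s     # reassembly restores the original exactly
```

Note that the split pieces, while valid strings, may render oddly on
their own (`tail` is a bare combining mark); validity and display
quality are separate questions, which is the substance of the grapheme
cluster debate above.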