This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 contain all messages, not just those from before July 7th, 2015.
On Sat, 14 Feb 2004, Ken Dickey wrote:

> A Scheme implementation which properly reads the two files should
> end up with the identifier occurrences [stored in different
> encodings] denoted above represented by symbols which are eq? (NB:
> _not_ eqv?) to each other.  If not, I term this "broken".

Yup. Agreed. Conforming Unicode systems read different encodings (sequences of bytes) or canonicalizations (sequences of codepoints) and recognize them as being the _same_ string (sequence of abstract characters).

This is the strongest single reason why I decided, for my own implementation, that The Right Thing was to draw boundaries for character operations at the character level rather than the codepoint level. Unicode forces the abstraction of "character" to a level higher than representation or encoding, but each file is still a proper sequence of characters, and each identifier that must be equated is the same sequence of characters, even if not the same sequence of codepoints.

So if in an NFD file I get a sequence of codepoints that goes R, e, combining grave accent, s, u, m, e, combining acute accent, and in an NFC file I read a sequence of codepoints that's R, e-with-grave, s, u, m, e-with-acute, then as a conforming implementation of Unicode I *MUST* recognize that these are the same sequence of characters and treat them as such. In the case of Scheme, that means my compiler must understand that they are the same identifier.

> [In the absence of reflection] one should be able to consistently
> replace all occurrences of an identifier in the same scope without
> changing the meaning/behavior of a program.  If not, I term the
> situation "broken".

I'll say "right", but as you note above, there is always the possibility of reflection, since Scheme has symbol->string, string->symbol, and eval.
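The NFD/NFC equivalence above is easy to check concretely outside Scheme; here is a minimal sketch using Python's standard unicodedata module, with the two codepoint sequences spelled out exactly as in the example above (the variable names are mine, not anything from this thread):

```python
import unicodedata

# The same six characters as two different codepoint sequences:
nfd_style = "Re\u0300sume\u0301"  # R, e, combining grave, s, u, m, e, combining acute
nfc_style = "R\u00e8sum\u00e9"    # R, e-with-grave, s, u, m, e-with-acute

# As raw codepoint sequences they differ...
assert nfd_style != nfc_style

# ...but normalized to a common form they compare equal, which is what
# a conforming reader must establish before interning the identifier.
assert unicodedata.normalize("NFC", nfd_style) == unicodedata.normalize("NFC", nfc_style)
```

A Scheme reader taking the character-level view would do the equivalent normalization before string->symbol, so both files yield the same (eq?) symbol.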
Programs that don't use those procedures will not, generally, need to agree on a syntax for Unicode identifiers other than a simple escape mechanism that allows them to be written in ASCII.

> There are many concepts which come in paired/binary parts: on/off,
> up/down, et cetera, which have no meaning without both parts.
> [...]
> So if a glyph/character does not have a case variant, considering it
> to be lower case makes no logical sense.  I view this as an abuse of
> terminology.  Being outside of normal logic, I term this "bizarre"
> and if pressed, probably "broken" as well.

This happens in one case (eszett) for a singular reason: the uppercase form of this *ONE* lowercase letter is *TWO* uppercase letters. There are many other instances in Unicode in which a character's lowercase and uppercase forms must be represented by a different number of codepoints, and if you regard codepoints as characters these instances appear to have the same problem (isolated lowercase forms or isolated uppercase forms).

> So in all this discussion of multiple canonical forms (another
> misuse of terminology, IMHO) multiple normal forms, et cetera, I am
> looking for a description of how to keep  and  from being
> broken.

The set of Unicode codepoints is not a character set that has these properties. The set of characters that can be represented by sequences of these codepoints is a character set that has these properties.

> If satisfying the Unicode Standard means breaking , then I say
> "Don't do that!".

No. Satisfying Unicode means, precisely, *NOT* breaking . Regardless of the encoding of the file (sequences of bytes or codepoints), Unicode requires the system to recognize that these identifiers are in fact the same sequences of characters.

The multiple "canonicalizations" that people are worried about (NFC/NFD versus NFKC/NFKD) can properly be regarded as two character sets.
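Both points above (case mappings that change the number of codepoints, and NFC/NFD versus NFKC/NFKD behaving as two different character sets) can be demonstrated concretely. A small sketch, again using Python's unicodedata and full case mappings rather than anything Scheme-specific:

```python
import unicodedata

# Eszett: one lowercase codepoint whose uppercase form is two codepoints.
assert "\u00df".upper() == "SS"    # ß -> SS
# Another length-changing case mapping: the fi ligature.
assert "\ufb01".upper() == "FI"    # ﬁ -> FI

# NFC keeps the mathematical font variants of "a" distinct;
# NFKC folds them all to the plain letter.
variants = ["a", "\U0001d41a", "\U0001d44e", "\U0001d482"]  # plain, bold, italic, bold-italic
assert len({unicodedata.normalize("NFC", v) for v in variants}) == 4
assert {unicodedata.normalize("NFKC", v) for v in variants} == {"a"}
```

This is the sense in which the NFKC/NFKD character set is coarser: the folding is many-to-one, so converting a file to NFKC loses those font-style distinctions.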
The NFC/NFD character set includes many distinctions smaller than the NFKC/NFKD character set can make, and there is a "standard" mapping between the two character sets in which there are many instances where NFC/NFD characters are distinct but are mapped to the "same" NFKC/NFKD character. For example, counting mathematical forms, there are about a dozen unaccented lowercase Latin letter a's in the NFC/NFD character set, varying mainly by font. All of these map to the same NFKC/NFKD character.

Inappropriately converting a file in which the distinctions are important is a lot like converting a word-processor document in which different fonts are important to plain ASCII - it loses information.

I think it is up to the implementor's discretion whether his Scheme regards its "character set" as the NFKC/NFKD character set or the NFC/NFD character set. Both character sets are, technically, infinite, but the NFKC/NFKD character set is a proper subset of the NFC/NFD character set.

Hope this helps,

				Bear