This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 contain all messages, not just those from before July 7th, 2015.
On Thu, 12 Feb 2004, Ken Dickey wrote:

> I assume that it is useful to distinguish the two goals of
> extending programming language identifiers and processing
> Unicode data.

For temporary solutions and band-aids, yes. But Scheme is a Lisp, and our code is data and our data is code. Our identifier-naming rules ultimately *can* affect our program behavior, whereas in C and similar languages they cannot. Every implementation that deals with Unicode at all seriously is going to have to create rules for distinguishing Unicode identifiers, and to the extent that they adopt *different* rules, there will be enduring and sometimes very subtle portability problems, and bugs where code works slightly differently on one system than it does on another.

> So w.r.t. identifiers, why is normalization needed at all? To my
> mind, normalization is a library procedure (set of procedures) for
> dealing with Unicode data/codepoints.

Normalization is needed because Unicode provides many different ways to represent what is intended to be _EXACTLY_ the same string. When you see a string containing, say, a lowercase 'a' with an accent grave, you don't know whether that's one codepoint (what Unicode calls a 'precomposed character') or two codepoints (what Unicode calls a 'base character' plus a 'combining mark'). Furthermore, Unicode editors are not required to distinguish these forms, and may convert one to the other, and back, arbitrarily and without your knowledge. In fact, for most operations where it might matter, they are required *not* to distinguish such forms, so that if you 'text search' for one you should also find all instances of the other.

Editors and compilers typically convert all Unicode text to a preferred normalized encoding instantly when it is read. There are four sanctioned choices: NFC, NFD, NFKC, and NFKD. The first two and the last two may be converted to and from each other without loss; they are truly equivalent.
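To make the equivalence concrete — the thread is about Scheme, but as a quick sketch in Python (whose standard `unicodedata` module exposes the four forms) rather than any particular Scheme implementation:

```python
import unicodedata

# Two spellings of the same abstract string 'a with grave accent':
decomposed = "a\u0300"    # base 'a' + U+0300 COMBINING GRAVE ACCENT (two codepoints)
precomposed = "\u00e0"    # U+00E0 LATIN SMALL LETTER A WITH GRAVE (one codepoint)

# They differ codepoint-for-codepoint...
assert decomposed != precomposed

# ...but normalization maps both spellings to a single canonical form:
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# NFC and NFD are lossless with respect to each other: a round trip
# through the other form recovers the original exactly.
assert unicodedata.normalize("NFC", unicodedata.normalize("NFD", precomposed)) == precomposed
```

A compiler that compared identifiers by raw codepoints, without first normalizing, would treat these two spellings as different names.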
However, NFC/NFD allow much finer-grained distinctions between similar strings than NFKC/NFKD. Identifiers that are distinct in the first two forms may become identical in the latter two, and once converted to one of the latter two forms they no longer contain sufficient information to convert them back.

> Defining valid identifier syntax such that case folding of
> (unnormalized) identifier literals should be sufficient.
> What am I missing?

You're missing all the tools and utilities out there that are programmed with the expectation and requirement that they can arbitrarily impose or change normalization forms without changing the text of the documents they handle. There is no escaping this; even Emacs and Notepad do it.

> Another note. Characters are currently dealt with in a fairly abstract
> manner. It would seem that in dealing with Unicode data as binary data
> (codepoints), R6RS/SRFI/... must define a binary IO API.

I think that's true. While programmers should never have to care about the internal representation of characters, they do have to care about writing them in a format acceptable to other systems and reading the different forms of them that are written by other systems.

Bear
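As a postscript on the compatibility forms: a short Python sketch (the character chosen here is just one convenient example) of how NFKC erases a distinction that NFC preserves, irreversibly:

```python
import unicodedata

circled_one = "\u2460"   # U+2460 CIRCLED DIGIT ONE
plain_one = "1"

# Under canonical normalization (NFC) the two remain distinct:
assert unicodedata.normalize("NFC", circled_one) != unicodedata.normalize("NFC", plain_one)

# Under compatibility normalization (NFKC) they collapse together,
# so two identifiers that differed now collide:
assert unicodedata.normalize("NFKC", circled_one) == plain_one

# And the conversion is one-way: nothing in the NFKC output records
# that the input was U+2460 rather than a plain '1'.
```

This is why a standard that normalizes identifiers to NFKC commits to coarser identifier equality than one that normalizes to NFC, and cannot later undo that choice for existing source text.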