This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 contain all messages, not just those from before July 7th, 2015.
> At Tue, 10 Feb 2004 13:06:28 -0800 (PST), Tom Lord wrote:
>> There is an easy example of why such a category is desirable in
>> computing. Let's suppose that I'm going to specify the lexical
>> syntax of identifiers in a programming language. As part of that
>> specification, I'll need to identify this category. (For an example,
>> see "Unicode Technical Report #31: Identifier and Pattern Syntax",
>> http://www.unicode.org/reports/tr31/tr31-2.html)

Alex Shinn wrote:
> We may want to take that report with a grain of salt for Scheme. A
> simpler approach would be to define Scheme identifiers as everything
> _excluding_ the reserved punctuation characters, optionally allowing
> Unicode variations on those characters and extending the definition of
> whitespace. Most Schemes already work in this manner, despite the
> fact that R5RS uses an inclusive list ....

Agreed. It has the same basic flaw as Annex 7 of UTR 15: It isn't a syntax for programming-language identifiers, it's a syntax for C-family identifiers! Both reports blithely ignore the fact that not all languages restrict identifiers to letters, numbers, and underscores. Even COBOL permits dashes!

The other thing I didn't care for (in UTR 15) was the recommendation to use NFC for case-sensitive languages and NFKC for case-insensitive languages. NFC is designed for round-trip conversions, and it often uses different encodings for visually indistinguishable symbols. For example, the letters "ffi" and the "ffi" ligature are distinct under NFC (IIRC). That's a very bad property for programming language identifiers.

Unfortunately, NFKC isn't perfect either. One thing I especially dislike is that it flattens the differences between the mathematical alphabets. Here you have a case where graphemes *are* visually distinguishable, and for good reason, but the normalization form treats them as identical.
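Both complaints are easy to check. What follows is only a minimal sketch, assuming Python's standard unicodedata module (Python is not part of the discussion above, and same_identifier is a hypothetical helper invented for this illustration). It compares two spellings of an identifier after normalizing both under a given form:

```python
import unicodedata

# Look the characters up by their official Unicode names so that no
# code point has to be typed by hand.
ligature = unicodedata.lookup("LATIN SMALL LIGATURE FFI")
double_struck_j = unicodedata.lookup("MATHEMATICAL DOUBLE-STRUCK SMALL J")
italic_j = unicodedata.lookup("MATHEMATICAL ITALIC SMALL J")


def same_identifier(a: str, b: str, form: str) -> bool:
    """Hypothetical helper: are two spellings equal after normalization?"""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# NFC keeps the ligature distinct from the three plain letters.
print(same_identifier("ffi", ligature, "NFC"))
# NFKC folds the ligature into the plain letters.
print(same_identifier("ffi", ligature, "NFKC"))
# NFKC also folds the two mathematical letters into a plain "j" ...
print(same_identifier(double_struck_j, italic_j, "NFKC"))
# ... while NFC leaves them distinct.
print(same_identifier(double_struck_j, italic_j, "NFC"))
```

Under NFC the ligature and the plain letters compare unequal, and under NFKC the double-struck and italic letters compare equal. Those are the two behaviors objected to here.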
If you're working on a sublanguage for symbolic mathematics, you might be tempted to write "double-struck small letter j" for a unit vector and "italic small letter j" for the imaginary unit. But NFKC folds them together. You'll need to modify NFKC for mathematics, or track the semantic data separately (which amounts to the same thing).

It's especially bad if you're considering a language for typesetting mathematics! (Not that anybody would ever want to implement a TeX-like language in Scheme, right?) The Unicode character set is well-suited to that task, but the normalization forms aren't, IMO.

Some of this is only tangentially relevant to Scheme, I realize. However, I don't think the identifier requirements were particularly well-thought-out. The standard normalization forms seem poorly suited for precision tasks like source code. "If it looks the same, it may or may not be the same thing" may work well enough for word processors, but it's not good for compilers. And there's still the annoying fact that these UTRs basically imply, "You can have identifiers for any language you want, as long as it's C!"

--
Bradd W. Szonye
http://www.szonye.com/bradd