This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 contain all messages, not just those from before July 7th, 2015.
Ken Dickey wrote:
> Let's say there are two scheme source files, each of which uses the
> "same" identifier in the same global (module global) scope/context.
> We say that in a RNRS Scheme the identifier names or denotes the same
> value.
>
> Let's say the two files are stored in different encodings (say utf-8
> and ucs-2) and processed by different but conforming Unicode systems
> (text editors, Scheme read/write, whatever) so that identifiers still
> appear the same when displayed but are stored in different encodings.
>
> A Scheme implementation which properly reads the two files should end
> up with the identifier occurrences denoted above represented by
> symbols which are eq? (NB: _not_ eqv?) to each other. If not, I term
> this "broken".

That's the essence of the conformance requirement I quoted earlier. If a process claims to support Unicode, UTF-8, and UCS-2, then all of the many ways of encoding the same character or symbol *must* be recognized as canonically identical (EQ? in Scheme terms).

Also, that's the whole reason that the "normalization forms" exist: they make it easier to compare text for canonical equivalence. They're very similar to the C function strxfrm, which transforms a string into a form that is easier to compare and collate. (The major difference is that a normal form is still directly printable; the result of strxfrm is not.)

A process *could* keep all text in its original format, with some characters in fully-composed (NFC) UTF-8 and others in fully-decomposed (NFD) UTF-32. Checking for canonical equivalence would then involve a complicated function that interprets and normalizes characters on the fly (like using "strcoll" in C). But it's usually easier to transform all strings into some "normal" form first -- make them all NFC UTF-32, say -- and use a simple binary comparison (like using "strxfrm" followed by "strcmp" in C).

Your earlier idea of using separate programs to convert and then process characters was on the right track.
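To make the "normalize first, then compare bytes" point concrete, here is a small sketch (in Python rather than Scheme, using the standard unicodedata module) showing that two canonically equivalent encodings of the same character compare unequal code point by code point, but equal once both are put into a common normal form:

```python
import unicodedata

# The same character "e-acute" in two canonically equivalent encodings:
composed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE (NFC form)
decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT (NFD form)

# A naive code-point comparison says they differ...
print(composed == decomposed)  # False

# ...but normalizing both to one form (here NFC) and then doing a plain
# comparison -- the strxfrm-then-strcmp strategy -- shows they are
# canonically identical.
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True
```

A Scheme reader that interns identifiers without first normalizing them this way is exactly the "broken" case described above: the two source files would yield symbols that are not eq?.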
The only flaw (and it's a big flaw) was the idea of putting them into completely separate processes.

> So if a glyph/character does not have a case variant, considering it
> to be lower case makes no logical sense. I view this as an abuse of
> terminology. Being outside of normal logic, I term this "bizarre" and
> if pressed, probably "broken" as well.

In that case, the German language is "broken." The ß character is lowercase, but it's generally impossible to use it in a round-trip "raise and lower case" process.

There are many situations where you need to know what words mean in order to judge whether they're really the "same thing." Computers aren't very good at that, so programmers need to be careful about (for example) using homonyms as identifiers. If you have two identifiers named "resume," you can't expect the computer to tell them apart just because one of them means "restart" and the other means "curriculum vitae."

The trouble with ß is that the computer must understand the meaning of words just to do "mechanical" transformations like changing case! Many programmers assume that you can change case without understanding, but ß is the classic example of why that is impossible. (It's not the only example. For instance, what's the lowercase version of SMITH? In English, that depends on whether it's a surname or a job title.)

I think this is getting off on a bit of a tangent from what you're talking about, so please excuse me for going off on a rant! Programmers -- at least English-speaking ones -- have historically oversimplified the issues related to lettercasing, and some of the problems have made their way into the Scheme standard.
--
Bradd W. Szonye
http://www.szonye.com/bradd
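The ß round-trip failure mentioned above is easy to demonstrate. A quick Python sketch (str.upper and str.lower here follow the Unicode default case mappings, under which ß uppercases to "SS"):

```python
word = "straße"  # German for "street"; contains lowercase ß

upper = word.upper()  # ß uppercases to the two-letter sequence "SS"
print(upper)          # STRASSE

# Lowercasing again cannot recover the original: "SS" maps back to "ss",
# not ß, so the raise-and-lower round trip loses information.
print(upper.lower())          # strasse
print(upper.lower() == word)  # False
```

Mapping "ss" back to ß would require knowing the word -- exactly the kind of linguistic understanding that "mechanical" case transformation is assumed not to need.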