This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
On Tue, 12 Jul 2005, Michael Sperber wrote:

> bear <bear@xxxxxxxxx> writes:
>
>> Particularly, some characters, particularly accented characters,
>> have uppercase and lowercase versions which are different numbers of
>> codepoints.  Thus, in the "codepoint equals character" model, one
>> case is a character and the other case -- isn't.
>
> I don't quite understand what you're saying: the locale-independent
> case mappings in UnicodeData.txt always map a single scalar value to a
> single scalar value.  Sure, it doesn't always do what your locale
> thinks (as you point out), but this case mapping doesn't require
> "multi-codepoint characters."

Okay, after performing a quick check: the characters that require
multi-codepoint mappings simply don't have altercases specified in
UnicodeData.txt.  What I was thinking of were characters like U+FB01
small ligature fi, which has no corresponding single-codepoint
uppercase.  Finding the uppercase of such a character is not going to
work -- but okay, that's the sacrifice you make for the single
codepoint/single character confusion.

>> Sixth, is there any way for a scheme implementation to support
>> characters and strings in additional encodings different from
>> unicode and not necessarily subsets of it, and remain compliant?
>
> I don't think so, at least not in the way you envision.  I don't think
> that's necessary or even a good idea, either.  This SRFI effectively
> hijacks the char and string datatypes and says that the abstractions
> for accessing them deal in Unicode.  Any representation that allows
> you to do that -- i.e., implement STRING-REF, CHAR->INTEGER, and
> INTEGER->CHAR and so on in a way compatible with the SRFI -- is fine,
> but I believe you're thinking about representations where that's not
> the case.

Hmmm.  I'm still of the opinion that making the programming abstraction
more closely match the end-user abstraction (i.e., with glyph=character
rather than codepoint=character) is just plain better, in many ways.
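(A quick illustration of the U+FB01 point, in Python rather than Scheme,
since Python's str.upper() applies the full Unicode case mappings from
SpecialCasing.txt rather than only the single-codepoint field of
UnicodeData.txt:)

```python
import unicodedata

lig = "\ufb01"
print(unicodedata.name(lig))        # LATIN SMALL LIGATURE FI

# The full uppercase mapping is the TWO codepoints "FI";
# there is no single-codepoint uppercase form of U+FB01.
print(lig.upper())                  # FI
print(len(lig), len(lig.upper()))   # 1 2

# And the round trip does not come back to the ligature:
print(lig.upper().lower())          # fi  (two ordinary letters, not U+FB01)
```

So any implementation restricted to single-codepoint case mappings has
to leave characters like this one alone.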
It gives me the screaming willies that under Unicode, strings which to
the eye look identical can have different lengths, have no codepoint at
any particular index in common, and sort relative to each other such
that an infinite number of unrelated strings go between them.  To me,
it is the codepoint=character model that is introducing representation
artifacts, and the glyph=character model comes a lot closer to avoiding
them.

But we've been there, and I've talked about that, at length.  People
seem determined to do it this way, and people with other languages seem
to be doing it mostly this way too.  I'm convinced that requiring the
"wrong" approach in a way that outlaws a better one is a wrong thing,
but I'm realistic by now that nobody else is going to be convinced.

Also, I'm not entirely happy about banning characters and character
sets that aren't subsets of Unicode.  In the first place, there are a
lot of characters that aren't in Unicode and are likely never to be --
ask a Chinese person to write his own address without using one and
you'll begin to see the problem.  And in the second place, characters
have traditionally been used to describe a lot of non-character
entities -- and while some of these come through in control codes,
others, including the very useful keystroke-description codes from,
e.g., MIT Scheme, simply don't.

Bear
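(For the record, the "identical-looking strings" complaint above is
easy to demonstrate; here is a sketch in Python, using the precomposed
e-acute versus e plus a combining accent:)

```python
import unicodedata

s1 = "\u00e9"    # "é" as one precomposed codepoint
s2 = "e\u0301"   # "e" + COMBINING ACUTE ACCENT -- renders identically

print(len(s1), len(s2))   # 1 2   -- different lengths
print(s1 == s2)           # False -- no codepoint at index 0 in common

# An unrelated string sorts between them in codepoint order:
print(s2 < "f" < s1)      # True

# Normalizing (here to NFC) collapses the pair to the same codepoints:
print(unicodedata.normalize("NFC", s2) == s1)   # True
```

Every string of the form "e\u0301..." likewise sorts between the two,
which is where the "infinite number of unrelated strings" comes from.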