This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
On Mon, 22 Dec 2003, Thomas Bushnell, BSG wrote:

> Matthew Flatt <mflatt@xxxxxxxxxxx> writes:
>
>> * For Scheme characters, pick a specific encoding, probably one of
>> UTF-16, UTF-32, UCS-2, or UCS-4 (but I don't know which is the right
>> choice).
>
> Wrong. A Scheme character should be a codepoint. The representation
> of code points as sequences of bytes should be under the hood.

I'm using a homebrewed Scheme system where the character set is infinite. Char->integer may return a bignum. Each character is a Unicode codepoint plus a non-defective sequence of Unicode combining codepoints. The Unicode documentation refers to these entities as "graphemes."

MIT Scheme uses a 13-bit character set: 8-bit ASCII plus 5 buckybits. They have characters running around in their set that have nothing to do with Unicode.

I figure I'm going to wind up doing translation no matter what, because C just isn't capable of hiding the differences between character sizes correctly. But I'm not going to give up grapheme-characters, because I strongly feel that they are the "Right Thing." And at some point I may add buckybits just for the hell of it.

My point is that it does no good to assume anything about a Scheme's internal representation of characters. Some Schemes are going to deal with an infinite character set, not limited at all to Unicode codepoints. So maybe you should pay some attention to cases where there's no corresponding character in Unicode (the MIT Scheme character "super-meta-J") or where the Unicode correspondence to a Scheme character is multiple codepoints (the grapheme-character "Latin Capital Letter A/Ring Above/Accent Grave").

Bear
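The two non-codepoint character models described above can be sketched in a few lines of Scheme. This is a hypothetical encoding for illustration only, not the actual internals of Bear's system or of MIT Scheme; the 21-bit packing and the buckybit positions are assumptions.

```scheme
;; Sketch of the two character models discussed above.
;; Hypothetical encodings, not the internals of either system.

;; 1. Grapheme characters: a base codepoint plus combining
;;    codepoints, packed 21 bits apiece (enough for any Unicode
;;    scalar value, max #x10FFFF).  char->integer on a
;;    multi-codepoint grapheme then yields a bignum.
(define codepoint-bits 21)

(define (grapheme->integer codepoints)
  ;; Fold the codepoint list into one (possibly big) integer.
  (let loop ((cps codepoints) (n 0))
    (if (null? cps)
        n
        (loop (cdr cps)
              (+ (* n (expt 2 codepoint-bits)) (car cps))))))

;; Latin Capital Letter A + Combining Ring Above + Combining
;; Grave Accent: three codepoints, one character, one integer
;; far too large for any fixed-width character type.
(grapheme->integer '(#x41 #x30A #x300))

;; 2. MIT-Scheme-style buckybit characters: 8 bits of ASCII plus
;;    5 modifier bits.  The bit assignments here are illustrative.
(define (bucky-char ascii-code . modifier-bits)
  (let loop ((mods modifier-bits) (n ascii-code))
    (if (null? mods)
        n
        (loop (cdr mods)
              (+ n (expt 2 (+ 8 (car mods))))))))

;; "super-meta-J": ASCII #x4A with (say) meta = bit 8 and
;; super = bit 10 set.  The result fits in 13 bits but
;; corresponds to no Unicode codepoint at all.
(bucky-char #x4A 0 2)
```

Either way, the integer a character maps to is an implementation detail, which is the point: code that assumes `char->integer` yields a Unicode codepoint in a fixed range breaks under both models.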