This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.
Per Bothner <email@example.com> writes:

>>> What does char->integer return? How does char<? work? What is your
>>> proposed implementation for a "character" in the Unicode world, given
>>> that it is not a code-point? How would you store characters in a
>>> string?
>>
>> Storage is irrelevant. An implementation would be free to store
>> characters however it wished. char->integer and char<? can return
>> whatever the implementation pleases. I would rather drop them, since
>> they have nothing really to do with characters. They are functions on
>> *code points*, which are there because the R5RS authors did not bother
>> to distinguish code points from characters.
>
> I'm asking how *you* would implement a "character" data type.
> Assume you have 32-bit "scheme values". Would you make characters
> immediate/unboxed values? In that case, assume you have 28 bits.
> Or are characters pointers to objects in memory? If so, how are
> they managed? Are equal characters eq? Suppose I have a UTF-8
> input file. What does read-char do? What is a string - an array
> of 32-bit Scheme values or could it be more compact?

I would probably have two different sorts of characters, just as most
Scheme systems have two different kinds of integers. Most characters
can be encoded unboxed as single Unicode code points. Some, which
require more than one code point, would either need to be larger
unboxed values (if the system permits), or boxed objects. I suspect it
would be efficient to attempt a uniquization of the boxed objects when
characters are being used as isolated values (though I'm not certain of
this).

Strings could easily be arrays of Unicode code points, though I'm not
certain that this is the best option, because it would impede random
access to characters. (On the other hand, since you would like to call
the code points "characters", you also would not be able to have random
access to Unicode's abstract characters.)
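To make the distinction concrete, here is a minimal sketch in Python (not Scheme, and not taken from the SRFI). It assumes a deliberately simplified notion of "character": a code point followed by any code points with a nonzero canonical combining class. Real segmentation rules are more involved. The names `to_characters` and `intern_character` are hypothetical, chosen only for illustration.

```python
import unicodedata

def to_characters(code_points: str) -> list[str]:
    """Group a code point sequence into base-plus-combining-mark units.

    Simplifying assumption: a unit is one code point followed by any
    code points whose canonical combining class is nonzero.
    """
    units: list[str] = []
    for cp in code_points:
        if units and unicodedata.combining(cp) != 0:
            units[-1] += cp   # attach the combining mark to the previous unit
        else:
            units.append(cp)  # start a new unit
    return units

# Hypothetical uniquization table: equal multi-code-point characters
# map to one shared object, so an identity test behaves like equality.
_interned: dict[str, str] = {}

def intern_character(unit: str) -> str:
    return _interned.setdefault(unit, unit)

s = "e\u0301a"            # e, COMBINING ACUTE ACCENT, a
chars = to_characters(s)  # two units: "e\u0301" and "a"
```

Indexing `s` gives constant-time access to code points only (`s[1]` is the bare combining accent), while finding the second character requires the linear scan in `to_characters`. That is the random-access trade-off described above.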

I would have no objection to strings having two interfaces, one that
operates on the characters and one that operates on the code points,
though I'm hesitant about standardizing that.

As for reading a file in UTF-8, that's like reading a file in any
encoding. The process of taking a sequence of bytes and mapping them to
a sequence of characters requires a mapping function. A splufty system
would need to be able to read UTF-8, ASCII, ISO Latin 1, ISO Latin 2,
etc. There is no encoding-generic implementation of read-char; you need
to know the encoding of the input stream to implement it correctly.

As suggested above, since we agree that a string is *implemented* as a
sequence of code points, perhaps in UTF-8, we can both implement Unicode
strings the same way.

Thomas
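
The point about read-char depending on the encoding can be illustrated with a small Python sketch. It uses the standard library's incremental decoders. The `read_char` function here is a hypothetical helper, not any Scheme implementation's procedure, and it returns one code point at a time rather than one abstract character in the sense argued for above.

```python
import codecs
import io

def read_char(stream, decoder):
    """Read bytes until the decoder yields one code point; '' at end of input."""
    while True:
        byte = stream.read(1)
        if not byte:
            # Flush the decoder; raises UnicodeDecodeError on a truncated sequence.
            return decoder.decode(b"", final=True)
        text = decoder.decode(byte)
        if text:
            return text

data = b"\xc3\xa9"  # the same two bytes, read under two encodings

utf8_char = read_char(io.BytesIO(data), codecs.getincrementaldecoder("utf-8")())
latin1_char = read_char(io.BytesIO(data), codecs.getincrementaldecoder("latin-1")())
```

Under UTF-8 the two bytes form the single code point U+00E9, while under ISO Latin 1 the first byte alone is already U+00C3. The byte stream by itself does not determine the result, so the decoder has to be supplied.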