This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
Matthew Flatt wrote:
> * Where "char *" is used for strings (e.g., "expected_explanation" for a type error), define it to be an ASCII or Latin-1 encoding (I prefer the latter).
No, it should be UTF-8.
> * For Scheme characters, pick a specific encoding, probably one of UTF-16, UTF-32, UCS-2, or UCS-4 (but I don't know which is the right choice).
Standardizing a specific encoding either forces Scheme implementations to standardize encodings internally or forces expensive conversions.

[Slightly off-topic - I doubt anybody will follow my recommendation.]

But if you're going to pick an encoding, I think UTF-8 is "right" - except for old APIs. (You can't do random access from a character number, but there is never any actual need for that. You need sequential access plus random access to previously seen characters, which byte offsets give you.)

A perceived problem with using UTF-8 is that you can't replace a 1-byte character by a 3-byte character in place. But that is just a symptom of another problem: a fixed-size mutable "string" is a useless data structure, only useful for implementing higher-level data structures.

So if I was designing a Scheme dialect for internationalization, I'd do away with mutable strings. You'd have uniform byte arrays (for implementation) and "texts". The latter are implemented using a byte buffer with a gap (as in an Emacs buffer). Constant strings are a special case of texts. For compatibility with old Scheme code that uses character indexes, a "string" would be a text with a 1-element index cache to map a character index to a buffer index.

--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/
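To make the "1-element index cache" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the class name `Utf8Text` and its methods are hypothetical, the text is held in a plain immutable UTF-8 byte string rather than a gap buffer, and the input is assumed to be valid UTF-8. It shows how a single remembered (character index, byte offset) pair could make the index-based loops typical of old string code cheap without any fixed-width encoding.

```python
class Utf8Text:
    """Hypothetical sketch: UTF-8 bytes plus a one-entry index cache."""

    def __init__(self, s: str):
        self.buf = s.encode("utf-8")   # assumed valid UTF-8
        self.cache_char = 0            # character index of the cached position
        self.cache_byte = 0            # byte offset where that character starts

    @staticmethod
    def _is_continuation(b: int) -> bool:
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
        return (b & 0xC0) == 0x80

    def _seek(self, char_index: int) -> int:
        """Return the byte offset of char_index, walking from the cached pair."""
        if char_index < 0:
            raise IndexError(char_index)
        ci, bi = self.cache_char, self.cache_byte
        n = len(self.buf)
        while ci < char_index:         # step forward one character at a time
            if bi >= n:
                raise IndexError(char_index)
            bi += 1
            while bi < n and self._is_continuation(self.buf[bi]):
                bi += 1
            ci += 1
        while ci > char_index:         # step backward one character at a time
            bi -= 1
            while self._is_continuation(self.buf[bi]):
                bi -= 1
            ci -= 1
        if bi >= n:
            raise IndexError(char_index)
        # Remember the position so a nearby lookup is a short walk.
        self.cache_char, self.cache_byte = ci, bi
        return bi

    def char_at(self, char_index: int) -> str:
        """Return the character at char_index (the analogue of string-ref)."""
        start = self._seek(char_index)
        end = start + 1
        while end < len(self.buf) and self._is_continuation(self.buf[end]):
            end += 1
        return self.buf[start:end].decode("utf-8")


# Usage: "a", e-acute (2 bytes), euro sign (3 bytes), "b".
text = Utf8Text("a\u00e9\u20acb")
for i in range(4):
    # Each lookup starts from the previous one, so the loop is linear overall.
    print(i, text.char_at(i), text.cache_byte)
```

A sequential scan by character index costs one short step per lookup because each call resumes from the cached pair; only a jump to a distant index pays for a longer walk. A fuller version along the lines described above would keep the bytes in a gap buffer and adjust the cached byte offset whenever the gap moves.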