Re: character strings versus byte strings
Matthew Flatt wrote:
> * Where "char *" is used for strings (e.g., "expected_explanation" for
>   a type error), define it to be an ASCII or Latin-1 encoding (I
>   prefer the latter).
No, it should be UTF-8.
> * For Scheme characters, pick a specific encoding, probably one of
>   UTF-16, UTF-32, UCS-2, or UCS-4 (but I don't know which is the right
Standardizing a specific encoding either forces Scheme implementations
to standardize their internal encoding or forces expensive conversions.
[Slightly off-topic - I doubt anybody will follow my recommendation.]
But if you're going to pick an encoding, I think UTF-8 is "right" -
except for old APIs. (You can't do random access by character
number, but there is never any actual need for that. You need
sequential access plus random access to previously seen characters,
which byte offsets give you.)
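To illustrate the point about sequential access, here is a minimal C sketch (the helper name `utf8_next` is mine, not from any particular implementation). The length of a UTF-8 character is determined entirely by its first byte, so stepping character by character needs only byte offsets, never a character index:

```c
#include <stddef.h>

/* Hypothetical helper: given a byte offset i that sits on a UTF-8
   character boundary, return the byte offset of the next character.
   The first byte of a sequence encodes its length. */
static size_t utf8_next(const char *s, size_t i) {
    unsigned char b = (unsigned char) s[i];
    if (b < 0x80) return i + 1;       /* 1-byte (ASCII) */
    else if (b < 0xE0) return i + 2;  /* 2-byte sequence */
    else if (b < 0xF0) return i + 3;  /* 3-byte sequence */
    else return i + 4;                /* 4-byte sequence */
}
```

A loop calling `utf8_next` from offset 0 visits every character in order; remembering any offset it produces gives you random access back to that character later, which is all the access pattern most string algorithms need.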
A perceived problem with using UTF-8 is that you can't replace
a 1-byte character by a 3-byte character in place. But that is just a
symptom of another problem: a fixed-size mutable "string" is
a useless data structure, only useful for implementing higher
level data structures.
So if I were designing a Scheme dialect for internationalization,
I'd do away with mutable strings. You'd have uniform byte arrays
(for implementation) and "texts". The latter are implemented
using a byte buffer with a gap (as in an Emacs buffer). Constant
strings are a special case of texts.
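A rough sketch of such a "text" in C, under my own naming (none of this comes from an existing Scheme implementation): the live bytes sit on either side of a movable gap, so insertions and variable-width replacements near the gap are cheap, exactly as in an Emacs buffer.

```c
#include <stdlib.h>
#include <string.h>

/* Gap-buffer text: bytes occupy buf[0 .. gap_start) and
   buf[gap_end .. cap); the hole between them absorbs insertions. */
typedef struct {
    char *buf;
    size_t gap_start, gap_end, cap;
} text_t;

static void text_init(text_t *t, size_t cap) {
    t->buf = malloc(cap);
    t->gap_start = 0;
    t->gap_end = cap;
    t->cap = cap;
}

/* Insert n bytes at the gap (growth when the gap fills is omitted
   for brevity; it would realloc and re-open the gap). */
static void text_insert(text_t *t, const char *s, size_t n) {
    memcpy(t->buf + t->gap_start, s, n);
    t->gap_start += n;
}

/* Length in bytes of the stored text (total minus the gap). */
static size_t text_length(const text_t *t) {
    return t->cap - (t->gap_end - t->gap_start);
}
```

Replacing a 1-byte character by a 3-byte one is then just: move the gap to the character, widen the gap over the old byte, and insert the new bytes.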
For compatibility with old Scheme code that uses character indexes,
a "string" would be a text with a 1-element index cache to map a
character index to a buffer index.
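The index cache might look like this in C (again an illustrative sketch, with names I've invented): the string view remembers the last (character index, byte offset) pair, so a sequential scan via character indexes costs O(1) per lookup instead of rescanning from the start each time.

```c
#include <stddef.h>

/* A "string" view of UTF-8 text with a 1-element index cache. */
typedef struct {
    const char *bytes;
    size_t nbytes;
    size_t cached_char;  /* character index of cached position */
    size_t cached_byte;  /* corresponding byte offset */
} cstring_t;

/* Is this byte a UTF-8 continuation byte (10xxxxxx)? */
static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Map a character index to a byte offset, scanning forward or
   backward from the cached position, then update the cache. */
static size_t char_to_byte(cstring_t *s, size_t want) {
    size_t ci = s->cached_char, bi = s->cached_byte;
    while (ci < want) {   /* forward: skip over one whole character */
        bi++;
        while (bi < s->nbytes && is_cont((unsigned char) s->bytes[bi])) bi++;
        ci++;
    }
    while (ci > want) {   /* backward: back up past continuation bytes */
        bi--;
        while (bi > 0 && is_cont((unsigned char) s->bytes[bi])) bi--;
        ci--;
    }
    s->cached_char = ci;
    s->cached_byte = bi;
    return bi;
}
```

Old code doing `(string-ref s i)` for i = 0, 1, 2, ... then runs at the same speed as on a flat array, since each call moves the cache one character forward.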