
Re: character strings versus byte strings

Matthew Flatt wrote:

 * Where "char *" is used for strings (e.g., "expected_explanation" for
   a type error), define it to be an ASCII or Latin-1 encoding (I
   prefer the latter).

No, it should be UTF-8.

 * For Scheme characters, pick a specific encoding, probably one of
   UTF-16, UTF-32, UCS-2, or UCS-4 (but I don't know which is the right

Standardizing a specific encoding either forces Scheme implementations
to use that encoding internally or forces expensive conversions.

[Slightly off-topic - I doubt anybody will follow my recommendation.]

But if you're going to pick an encoding, I think UTF-8 is "right" -
except for old APIs.  (You can't do random access from a character
number, but there is never any actual need for that.  You need
sequential access plus random access to previously seen characters,
which byte offsets give you.)
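
A minimal sketch of that access pattern, assuming well-formed UTF-8
input.  The helper names utf8_len and char_at are hypothetical, made up
for illustration: scan forward once, save the byte offset of each
character you might want again, and jump back by offset rather than by
character number.

```python
from typing import List, Tuple


def utf8_len(lead: int) -> int:
    """Byte length of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def char_at(buf: bytes, offset: int) -> Tuple[str, int]:
    """Decode the character at a byte offset; also return the next offset."""
    end = offset + utf8_len(buf[offset])
    return buf[offset:end].decode("utf-8"), end


buf = "naïve 日本".encode("utf-8")

# Sequential scan, remembering the byte offset where each character began.
seen: List[Tuple[int, str]] = []
pos = 0
while pos < len(buf):
    ch, nxt = char_at(buf, pos)
    seen.append((pos, ch))
    pos = nxt

# Going back to a previously seen character needs only its saved offset.
saved_offset = seen[6][0]
revisited, _ = char_at(buf, saved_offset)
```

The saved offsets play the role that character indexes play in
index-based code; no character-number-to-offset mapping is computed.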

A perceived problem with using UTF-8 is that you can't replace
a 1-byte character by a 3-byte character in place.  But that is
just a symptom of another problem:  a fixed-size mutable "string"
is of little use on its own; it is only useful for implementing
higher-level data structures.
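
To make the perceived problem concrete, here is a small sketch.  It
uses a memoryview over a bytearray as a stand-in for a fixed-size
mutable string (that stand-in is an assumption for illustration, not
anything a Scheme implementation does): the 3-byte replacement does not
fit in the 1-byte slot, while a growable buffer simply shifts the tail.

```python
buf = bytearray("price: $5", "utf-8")
dollar = buf.index(b"$")          # byte offset of the 1-byte character
euro = "€".encode("utf-8")        # U+20AC encodes to 3 bytes in UTF-8

# A memoryview cannot change the size of the underlying storage.
fixed = memoryview(buf)
try:
    fixed[dollar:dollar + 1] = euro
    fits = True
except ValueError:
    fits = False
fixed.release()                   # allow the bytearray to be resized again

# A growable byte sequence accepts the same edit by moving the tail.
buf[dollar:dollar + 1] = euro
result = buf.decode("utf-8")
```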

So if I were designing a Scheme dialect for internationalization,
I'd do away with mutable strings.  You'd have uniform byte arrays
(for implementation) and "texts".  The latter are implemented
using a byte buffer with a gap (as in an Emacs buffer).  Constant
strings are a special case of texts.
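
A rough sketch of such a "text", assuming UTF-8 content and byte
offsets that fall on character boundaries.  The class name Text, its
method names, and the growth policy are all hypothetical choices for
illustration, not a description of Emacs or of any Scheme
implementation.

```python
class Text:
    """UTF-8 bytes held in one buffer with a movable gap (illustrative only)."""

    def __init__(self, initial: str = "", gap: int = 16):
        data = initial.encode("utf-8")
        self.buf = bytearray(data) + bytearray(gap)
        self.gap_start = len(data)       # first byte of the gap
        self.gap_end = len(self.buf)     # first byte after the gap

    def _move_gap(self, pos: int) -> None:
        """Slide the gap so that it begins at content byte offset pos."""
        if pos < self.gap_start:
            n = self.gap_start - pos
            self.buf[self.gap_end - n:self.gap_end] = self.buf[pos:self.gap_start]
            self.gap_start -= n
            self.gap_end -= n
        elif pos > self.gap_start:
            n = pos - self.gap_start
            self.buf[self.gap_start:self.gap_start + n] = \
                self.buf[self.gap_end:self.gap_end + n]
            self.gap_start += n
            self.gap_end += n

    def _grow(self, need: int) -> None:
        """Widen the gap by at least need bytes (doubling is an arbitrary policy)."""
        extra = max(need, len(self.buf))
        self.buf[self.gap_end:self.gap_end] = bytearray(extra)
        self.gap_end += extra

    def replace(self, start: int, end: int, new: str) -> None:
        """Replace content bytes [start, end) with text of any encoded length."""
        data = new.encode("utf-8")
        self._move_gap(end)
        self.gap_start = start           # deleting widens the gap backwards
        if len(data) > self.gap_end - self.gap_start:
            self._grow(len(data))
        self.buf[self.gap_start:self.gap_start + len(data)] = data
        self.gap_start += len(data)

    def __str__(self) -> str:
        return (self.buf[:self.gap_start] + self.buf[self.gap_end:]).decode("utf-8")


t = Text("price: $5", gap=1)
t.replace(7, 8, "€")      # a 1-byte character becomes a 3-byte character
t.replace(0, 5, "cost")   # an edit elsewhere only moves the gap
```

With the gap at the edit point, replacing a character by a longer or
shorter one costs no more than any other edit, which is why the
1-byte-versus-3-byte objection stops mattering.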

For compatibility with old Scheme code that uses character indexes,
a "string" would be a text with a 1-element index cache to map a
character index to a buffer index.
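
One way the compatibility layer could look, sketched over an immutable
UTF-8 byte string for brevity instead of a gap buffer.  The class
IndexedString, the method ref, and the rescan-from-start policy for
backward lookups are assumptions for illustration.

```python
class IndexedString:
    """Character-indexed access to UTF-8 bytes with a one-entry index cache."""

    def __init__(self, s: str):
        self.buf = s.encode("utf-8")
        self.cached = (0, 0)   # (character index, byte offset) last resolved

    def _offset(self, index: int) -> int:
        """Map a character index to a byte offset, scanning from the cache."""
        if index < 0:
            raise IndexError("string index out of range")
        ci, bo = self.cached
        if index < ci:
            ci, bo = 0, 0      # simplest policy: rescan from the start
        while ci < index:
            if bo >= len(self.buf):
                raise IndexError("string index out of range")
            bo += 1
            # Skip continuation bytes, which all look like 10xxxxxx.
            while bo < len(self.buf) and (self.buf[bo] & 0xC0) == 0x80:
                bo += 1
            ci += 1
        if bo >= len(self.buf):
            raise IndexError("string index out of range")
        self.cached = (ci, bo)
        return bo

    def ref(self, index: int) -> str:
        """Return the character at a character index, like string-ref."""
        start = self._offset(index)
        end = start + 1
        while end < len(self.buf) and (self.buf[end] & 0xC0) == 0x80:
            end += 1
        return self.buf[start:end].decode("utf-8")


s = IndexedString("naïve 日本")
chars = [s.ref(i) for i in range(8)]   # a typical index loop from old code
```

An ascending index loop advances the cache by one character per call,
so the common pattern stays linear overall even though a single lookup
from a cold cache is not constant time.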
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/