
Re: character strings versus byte strings




Matthew Flatt wrote:

 * Where "char *" is used for strings (e.g., "expected_explanation" for
   a type error), define it to be an ASCII or Latin-1 encoding (I
   prefer the latter).

No, it should be UTF-8.

 * For Scheme characters, pick a specific encoding, probably one of
   UTF-16, UTF-32, UCS-2, or UCS-4 (but I don't know which is the right
   choice).

Standardizing a specific encoding either forces Scheme implementations
to standardize their internal encodings or forces expensive conversions.

[Slightly off-topic - I doubt anybody will follow my recommendation.]

But if you're going to pick an encoding, I think UTF-8 is "right" -
except for old APIs.  (You can't do random access by character
number, but there is never any actual need for that.  You need
sequential access plus random access to previously seen characters,
which byte offsets give you.)

A perceived problem with using UTF-8 is that you can't replace
a 1-byte character by a 3-byte character.  But that is just a
symptom of another problem:  a fixed-size mutable "string" is
a useless data structure, only useful for implementing higher-level
data structures.

So if I were designing a Scheme dialect for internationalization,
I'd do away with mutable strings.  You'd have uniform byte arrays
(for implementation) and "texts".  The latter are implemented
using a byte buffer with a gap (as in an Emacs buffer).  Constant
strings are a special case of texts.

For compatibility with old Scheme code that uses character indexes,
a "string" would be a text with a 1-element index cache to map a
character index to a buffer index.
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/