encoding strings in memory
I have two main concerns about the proposed change:
1. Strings will almost certainly have to be represented
as arrays of 32-bit entities, since string-set! allows one to whack any
character. This representation wastes memory, since the overwhelmingly
common case is to use characters only from the Basic Multilingual Plane
(0x0000 to 0xFFFF). For applications we write, the majority of characters
are ASCII, even though our software is used around the world. Consequently,
we use UTF-8 for storing strings, even though we run on Microsoft Windows.
2. Changing strings to use 32-bit characters will make
foreign function interfaces difficult, since the major platforms use UTF-16-LE
and UTF-8. It will also break all existing foreign-function code
that relies on strings being 8-bit bytes.
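To put rough numbers on the memory argument in point 1, here is a small sketch in Python (chosen only for brevity). The function name encoded_sizes and the two sample strings are made up for this illustration; the byte counts themselves follow from the encodings.

```python
def encoded_sizes(text):
    """Return the byte count of text in three encodings (no BOM, no terminator)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
        "utf-32-le": len(text.encode("utf-32-le")),
    }

# Hypothetical sample data: 12 ASCII characters, and 3 BMP characters outside ASCII.
ascii_sizes = encoded_sizes("hello, world")
bmp_sizes = encoded_sizes("日本語")

print(ascii_sizes)
print(bmp_sizes)
```

For the ASCII-heavy text that dominates our applications, the 32-bit representation takes four times the space of UTF-8. UTF-8 only loses to UTF-16-LE on text that is mostly non-ASCII, and even there it stays smaller than the 32-bit form.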
Chez Scheme uses 8-bit characters currently, and it works
very nicely on operating systems that support UTF-8. In particular,
I've used it at home on Mac OS X and get most of the benefits of the proposed
change: I can deal with Unicode strings in a consistent way.
This doesn't work as well in Microsoft Windows, because
Chez Scheme uses the current ANSI encoding, which replaces most Unicode characters
with question marks. However, I have used it to store UTF-16-LE
strings with success, and I've also written code that converts UTF-8 to
UTF-16-LE for the foreign-function interface.
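The two behaviours described above can be sketched as follows. This is not my actual conversion code, just an illustration in Python. The function names are hypothetical, cp1252 merely stands in for "the current ANSI encoding" (the real code page depends on the system locale), and the trailing 16-bit NUL is an assumption about what a wide-string foreign function expects.

```python
def utf8_to_utf16le(utf8_bytes):
    """Transcode UTF-8 bytes to UTF-16-LE bytes, appending a 16-bit NUL terminator."""
    text = utf8_bytes.decode("utf-8")  # raises UnicodeDecodeError on malformed input
    return text.encode("utf-16-le") + b"\x00\x00"

def to_ansi_lossy(text, codepage="cp1252"):
    """Encode to a single-byte code page; unrepresentable characters become b'?'."""
    return text.encode(codepage, errors="replace")

# Characters outside the BMP come out as a surrogate pair (two 16-bit units).
wide = utf8_to_utf16le("h\u00e9".encode("utf-8"))
lossy = to_ansi_lossy("日本語")
```

The conversion is lossless in both directions for well-formed input, which is why passing UTF-8 through an 8-bit string and transcoding at the boundary works where the ANSI path does not.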
It seems to me that keeping char 8-bit and string as an
array of 8-bit bytes would be the least disruptive change. Implementations
could specify that UTF-8 is used when communicating with the outside world,
that is, in file and process operations and the foreign function interface.
This would be trivial to implement in UTF-8-friendly OSes and not
difficult in Microsoft Windows.