
encoding strings in memory




I have two main concerns about the proposed change:

1. Strings will almost certainly have to be represented as arrays of 32-bit entities, since string-set! allows one to whack any character.  This representation wastes memory, since the overwhelmingly common case is to use characters only from the Basic Multilingual Plane (0x0000 to 0xFFFF).  For applications we write, the majority of characters are ASCII, even though our software is used around the world.  Consequently, we use UTF-8 for storing strings, even though we run on Microsoft Windows (UTF-16-LE).

2. Changing strings to use 32-bit characters will make foreign function interfaces difficult, since the major platforms use UTF-16-LE and UTF-8.  It will also break all existing foreign-function code that relies on strings being 8-bit bytes.
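As a sketch of why the boundary is not free (my own code, not any particular FFI layer): with 32-bit characters, every string crossing to a UTF-16-LE platform must at minimum be re-encoded, including splitting non-BMP code points into surrogate pairs:

```c
#include <stdint.h>

/* Encode one code point as UTF-16, as an FFI layer targeting a
   UTF-16-LE platform would have to do for each character of a
   32-bit string.  Returns the number of 16-bit units written. */
static int to_utf16(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) {            /* BMP character: one unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                 /* supplementary: surrogate pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

By contrast, code that today passes an 8-bit Scheme string straight through as a C `char *` would simply break.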

Chez Scheme currently uses 8-bit characters, and that works very nicely on operating systems that support UTF-8.  In particular, I've used it at home on Mac OS X and get most of the benefits of the proposed change, namely that I can deal with Unicode strings in a consistent way.

This doesn't work as well on Microsoft Windows, because Chez Scheme uses the current ANSI code page, which turns most Unicode characters into question marks.  However, I have successfully used it to store UTF-16-LE strings, and I've also written code that converts UTF-8 to UTF-16-LE for the foreign-function interface.
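The conversion I mean is roughly the following (a simplified sketch of my own, not the code I actually use with Chez; it assumes well-formed input, and a real converter must also reject invalid sequences):

```c
#include <stddef.h>
#include <stdint.h>

/* Convert well-formed UTF-8 to UTF-16-LE bytes, suitable for
   handing to a Windows "W" API through an FFI.  Returns the
   number of bytes written to out. */
static size_t utf8_to_utf16le(const unsigned char *in, size_t len,
                              unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i < len; ) {
        uint32_t cp;
        if (in[i] < 0x80) {                   /* 1-byte sequence */
            cp = in[i]; i += 1;
        } else if ((in[i] & 0xE0) == 0xC0) {  /* 2-byte sequence */
            cp = ((uint32_t)(in[i] & 0x1F) << 6) | (in[i+1] & 0x3F);
            i += 2;
        } else if ((in[i] & 0xF0) == 0xE0) {  /* 3-byte sequence */
            cp = ((uint32_t)(in[i] & 0x0F) << 12)
               | ((uint32_t)(in[i+1] & 0x3F) << 6) | (in[i+2] & 0x3F);
            i += 3;
        } else {                              /* 4-byte sequence */
            cp = ((uint32_t)(in[i] & 0x07) << 18)
               | ((uint32_t)(in[i+1] & 0x3F) << 12)
               | ((uint32_t)(in[i+2] & 0x3F) << 6) | (in[i+3] & 0x3F);
            i += 4;
        }
        if (cp < 0x10000) {                   /* BMP: one unit, LE */
            out[o++] = (unsigned char)(cp & 0xFF);
            out[o++] = (unsigned char)(cp >> 8);
        } else {                              /* surrogate pair */
            uint32_t v = cp - 0x10000;
            uint16_t hi = (uint16_t)(0xD800 | (v >> 10));
            uint16_t lo = (uint16_t)(0xDC00 | (v & 0x3FF));
            out[o++] = (unsigned char)(hi & 0xFF);
            out[o++] = (unsigned char)(hi >> 8);
            out[o++] = (unsigned char)(lo & 0xFF);
            out[o++] = (unsigned char)(lo >> 8);
        }
    }
    return o;
}
```

Nothing here is expensive; the cost is a copy at the boundary, which you pay in any scheme that doesn't store strings in the platform's native encoding.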

It seems to me that keeping char 8-bit and string as an array of 8-bit bytes would be the least disruptive change.  The implementations could specify that UTF-8 is used when communicating with the outside world, namely file and process operations and the foreign function interface.  This would be trivial to implement in UTF-8-friendly OSes and not difficult in Microsoft Windows.

Bob