
Re: encoding strings in memory



bburger@xxxxxxxxxxx wrote:
1. Strings will almost certainly have to be represented as arrays of 32-bit entities, since string-set! allows one to whack any character. This representation wastes memory, since the overwhelmingly common case is to use characters only from the Basic Multilingual Plane (0x0000 to 0xFFFF). For applications we write, the majority of characters are ASCII, even though our software is used around the world. Consequently, we use UTF-8 for storing strings, even though we run on Microsoft Windows (UTF-16-LE).

We have the same problem in the Java world. Native strings and characters are 16-bit Unicode. This would be fine 99% of the time. However, using characters above 0xFFFF requires surrogate pairs.

The problem is string-ref and string-set!. Existing Java-String-based encodings have string-ref return *half* of a surrogate pair. This is no problem for most applications, where you just want to print or copy strings. It's not really a problem for intelligent code that deals with composed characters, which needs to work with variable-length strings anyway. It is a problem for the code in between, which does something with each individual character.
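To make the "half of a surrogate pair" point concrete, here is what Java itself does: charAt (the analogue of an index-based string-ref) hands back one UTF-16 code unit, while codePointAt reassembles the pair into the real character.

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above the BMP, so Java
        // stores it as a surrogate pair occupying two char slots.
        String s = new StringBuilder().appendCodePoint(0x1D11E).toString();

        System.out.println(s.length());                       // 2 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1 character

        // charAt(0) returns only the high surrogate -- not a character.
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true

        // codePointAt(0) returns the full code point.
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1d11e
    }
}
```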

Note that even these applications don't actually need a linear mapping from indexes to characters. I.e. arithmetic on indexes in a string is never (well, hardly ever) useful or meaningful. All we need is a "position" magic cookie, similar to stdio's fpos_t.
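A sketch of the "position cookie" idea, using Java's existing API: the int below happens to be a code-unit offset, but the loop only ever treats it as an opaque cursor, advanced by Character.charCount rather than by pos + 1, so it never lands in the middle of a surrogate pair.

```java
public class CursorDemo {
    public static void main(String[] args) {
        // "A", then U+1F600 (an astral character), then "B".
        String s = "A" + new StringBuilder().appendCodePoint(0x1F600) + "B";

        // pos is a magic cookie: no arithmetic on it except "advance".
        int pos = 0;
        while (pos < s.length()) {
            int cp = s.codePointAt(pos);
            System.out.printf("U+%04X%n", cp);
            pos += Character.charCount(cp); // 1 for BMP, 2 for astral
        }
    }
}
```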

One solution is to have multiple "modes". A string may start out in 8-bit mode, switch to 16-bit mode when a 16-bit character is inserted, and then switch to 32-bit mode when a still larger character is inserted. This means the entire string has to be copied when such a character is inserted, but the amortized cost per character is constant. It also means that we need 32 bits per character for the entire string, even if there is only a single character > 0xFFFF.
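A minimal sketch of such a mode-switching string (the class and method names are hypothetical, not any existing library): the backing array starts as byte[] and is copied to char[] or int[] the first time a wider character is stored, so each widening copies the whole string exactly once.

```java
// Hypothetical mode-switching mutable string: widens 8 -> 16 -> 32 bits
// on demand; after widening to 32 bits, every slot costs 32 bits.
public class AdaptiveString {
    private Object data; // byte[], char[], or int[]

    public AdaptiveString(int length) {
        data = new byte[length]; // start in 8-bit mode
    }

    public int length() {
        if (data instanceof byte[]) return ((byte[]) data).length;
        if (data instanceof char[]) return ((char[]) data).length;
        return ((int[]) data).length;
    }

    public int get(int i) { // string-ref analogue: returns a code point
        if (data instanceof byte[]) return ((byte[]) data)[i] & 0xFF;
        if (data instanceof char[]) return ((char[]) data)[i];
        return ((int[]) data)[i];
    }

    public void set(int i, int codePoint) { // string-set! analogue
        if (codePoint > 0xFFFF && !(data instanceof int[])) widenTo32();
        else if (codePoint > 0xFF && data instanceof byte[]) widenTo16();
        if (data instanceof byte[]) ((byte[]) data)[i] = (byte) codePoint;
        else if (data instanceof char[]) ((char[]) data)[i] = (char) codePoint;
        else ((int[]) data)[i] = codePoint;
    }

    private void widenTo16() {
        byte[] old = (byte[]) data;
        char[] wide = new char[old.length];
        for (int i = 0; i < old.length; i++) wide[i] = (char) (old[i] & 0xFF);
        data = wide;
    }

    private void widenTo32() {
        int n = length();
        int[] wide = new int[n];
        for (int i = 0; i < n; i++) wide[i] = get(i); // read via old mode
        data = wide;
    }
}
```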

2. Changing strings to use 32-bit characters will make foreign function interfaces difficult, since the major platforms use UTF-16-LE and UTF-8. It will also break all existing foreign-function code that relies on strings being 8-bit bytes.

The "mode-switching" solution doesn't solve that problem - it makes it worse.

It seems to me that keeping char 8-bit and string as an array of 8-bit bytes would be the least disruptive change.

But what does string-ref return?

I have an idea; see next message.
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/