Re: encoding strings in memory
bburger@xxxxxxxxxxx wrote:
1. Strings will almost certainly have to be represented as arrays of
32-bit entities, since string-set! allows one to whack any character.
This representation wastes memory, since the overwhelmingly common case
is to use characters only from the Basic Multilingual Plane (0x0000 to
0xFFFF). For applications we write, the majority of characters are
ASCII, even though our software is used around the world. Consequently,
we use UTF-8 for storing strings, even though we run on Microsoft
Windows (UTF-16-LE).
We have the same problem in the Java world. Native strings and
characters are 16-bit Unicode. This would be fine 99% of the time.
However, use of characters above 0xFFFF requires using surrogate pairs.
The problem is string-ref and string-set!. Existing Java-String-based
encodings have string-ref return *half* of a surrogate pair. This is no
problem for most applications, if you just want to print or copy
strings. It's not really a problem for intelligent code that deals with
composed characters, which needs to work with variable-length strings
anyway. It is a problem for intermediate code that does something with
each individual character.
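
To make the surrogate-pair issue concrete, here is a small sketch using only standard java.lang.String and Character methods. The class name SurrogateDemo and the sample string are made up for illustration.

```java
public class SurrogateDemo {
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies above 0xFFFF, so a Java String
    // stores it as the surrogate pair \uD834 \uDD1E.
    static final String S = "a\uD834\uDD1Eb";

    // Length as seen by index-based code: 16-bit code units.
    static int unitCount(String s) {
        return s.length();
    }

    // Length in actual characters (code points).
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(unitCount(S));                            // 4
        System.out.println(codePointCount(S));                       // 3
        // charAt(1) is only the first half of the pair.
        System.out.println(Character.isHighSurrogate(S.charAt(1)));  // true
        // codePointAt(1) reads both halves and yields the real character.
        System.out.println(Integer.toHexString(S.codePointAt(1)));   // 1d11e
    }
}
```

Code that loops over indexes 0..length()-1 and treats each charAt result as a character is the "intermediate code" that breaks here.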
Note that even these applications don't actually need a linear mapping
from indexes to characters. I.e. arithmetic on indexes in a string is
never (well, hardly ever) useful or meaningful. All we need is a
"position" magic cookie, similar to stdio's fpos_t.
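
A rough sketch of what such a position cookie could look like on top of a Java String. The names PositionDemo, Pos, start, next, and charAt are hypothetical, not an existing API. Only String.codePointAt and String.offsetByCodePoints are real library calls.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionDemo {
    // Opaque position: callers can advance it and dereference it, but they
    // are not meant to do arithmetic on it (compare stdio's fpos_t).
    static final class Pos {
        private final int unitIndex;   // hidden index into the 16-bit units
        private Pos(int unitIndex) { this.unitIndex = unitIndex; }
    }

    static Pos start() { return new Pos(0); }

    static boolean atEnd(String s, Pos p) { return p.unitIndex >= s.length(); }

    // Returns the whole character at the position, never half a pair.
    static int charAt(String s, Pos p) { return s.codePointAt(p.unitIndex); }

    // Steps over one character, which may be one or two 16-bit units.
    static Pos next(String s, Pos p) {
        return new Pos(s.offsetByCodePoints(p.unitIndex, 1));
    }

    // Example client: visits each character without ever seeing an index.
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (Pos p = start(); !atEnd(s, p); p = next(s, p)) {
            out.add(charAt(s, p));
        }
        return out;
    }
}
```

For example, codePoints("a\uD834\uDD1Eb") visits three characters, the middle one being 0x1D11E, even though the underlying string has four 16-bit units.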
One solution is to have multiple "modes". A string may start out in
8-bit mode, and switch to 16-bit mode when a 16-bit character is
inserted, and then switch to 32-bit mode when a still larger character
is inserted. This means the entire string has to be copied when a
single character is inserted, but the amortized cost per character is
constant. It also means that we need 32 bits per character for the
entire string, even if there is only a single character > 0xFFFF.
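
A minimal sketch of the mode-switching idea, assuming a fixed-length mutable string with string-set!-style updates. The class AdaptiveString and its methods are hypothetical, and characters are handled as plain code point values.

```java
public class AdaptiveString {
    // Exactly one of these arrays is non-null at any time.
    private byte[] b8;    // 8-bit mode: every character is <= 0xFF
    private char[] b16;   // 16-bit mode: every character is <= 0xFFFF
    private int[] b32;    // 32-bit mode: anything goes
    private final int length;

    public AdaptiveString(int length) {
        this.length = length;
        this.b8 = new byte[length];   // start out in the narrowest mode
    }

    public int length() { return length; }

    public int bitsPerChar() { return b8 != null ? 8 : b16 != null ? 16 : 32; }

    // string-ref: constant time in every mode.
    public int get(int i) {
        if (b8 != null) return b8[i] & 0xFF;
        if (b16 != null) return b16[i];
        return b32[i];
    }

    // string-set!: widens the whole string first if cp does not fit.
    public void set(int i, int cp) {
        int needed = cp > 0xFFFF ? 32 : cp > 0xFF ? 16 : 8;
        if (needed > bitsPerChar()) widen(needed);
        if (b8 != null) b8[i] = (byte) cp;
        else if (b16 != null) b16[i] = (char) cp;
        else b32[i] = cp;
    }

    // Copies every character into a wider array; the string never narrows.
    private void widen(int bits) {
        if (bits == 16) {
            char[] w = new char[length];
            for (int k = 0; k < length; k++) w[k] = (char) get(k);
            b16 = w;
            b8 = null;
        } else {
            int[] w = new int[length];
            for (int k = 0; k < length; k++) w[k] = get(k);
            b32 = w;
            b8 = null;
            b16 = null;
        }
    }
}
```

Storing a single character above 0xFFFF moves the whole string to 32 bits per character, which is the memory cost described above.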
2. Changing strings to use 32-bit characters will make foreign function
interfaces difficult, since the major platforms use UTF-16-LE and UTF-8.
It will also break all existing foreign-function code that relies on
strings being 8-bit bytes.
The "mode-switching" solution doesn't solve that problem - it makes it
worse.
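
As an illustration of the conversion a foreign-function boundary would need when strings are held as 32-bit characters, here is a sketch using standard Java calls. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class FfiBoundary {
    // Re-encodes an array of 32-bit characters as UTF-16-LE bytes,
    // as a Windows-style API would expect.
    static byte[] toUtf16le(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length)
                .getBytes(StandardCharsets.UTF_16LE);
    }

    // Re-encodes the same characters as UTF-8 bytes.
    static byte[] toUtf8(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length)
                .getBytes(StandardCharsets.UTF_8);
    }
}
```

Every call across the boundary pays for a copy and a re-encoding. With mode switching, the foreign code additionally cannot know in advance which of the three in-memory layouts it will be handed.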
It seems to me that keeping char 8-bit and string as an array of 8-bit
bytes would be the least disruptive change.
But what does string-ref return?
I have an idea; see next message.
--
--Per Bothner
per@xxxxxxxxxxx http://per.bothner.com/