
Re: Surrogates and character representation



I'm surprised that nobody in the threads about
constant-time access to Unicode strings has 
mentioned adaptive encoding forms.

My plan (and stalled code) works that way.  If a
string contains only codepoints in 0..255, store it as bytes.
If they all fall in 0..ffff, use 16 bits per codepoint;
otherwise, use 32.

All access to a given codepoint position is O(1) that way.
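Some mutations are worst-case linear in the length of the
string (storing a codepoint too wide for the current form
means re-encoding the whole string) but can be expected
to be O(1) in the usual case.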
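
A minimal sketch of the idea in Python, not my actual (stalled)
code.  The names AdaptiveString, width_for and set are made up
for illustration, and the array module stands in for raw storage.

```python
from array import array

# Hypothetical names throughout; this is an illustration only.
# Typecodes for 1-, 2- and 4-byte unsigned elements. 'H' is 2 bytes on
# common platforms; 'I' is normally 4 bytes, with 'L' as a fallback.
_TYPECODE = {1: 'B', 2: 'H', 4: 'I' if array('I').itemsize >= 4 else 'L'}


def width_for(cp):
    """Smallest element width in bytes (1, 2 or 4) that can hold cp."""
    if cp <= 0xFF:
        return 1
    if cp <= 0xFFFF:
        return 2
    return 4


class AdaptiveString:
    def __init__(self, codepoints=()):
        cps = list(codepoints)
        # The widest codepoint decides the width used for every element.
        self.width = width_for(max(cps, default=0))
        self.units = array(_TYPECODE[self.width], cps)

    def __len__(self):
        return len(self.units)

    def __getitem__(self, i):
        # One element per codepoint, so indexing is O(1).
        return self.units[i]

    def set(self, i, cp):
        need = width_for(cp)
        if need > self.width:
            # Worst case: re-encode the whole string at the wider width.
            self.units = array(_TYPECODE[need], self.units.tolist())
            self.width = need
        # Usual case: the codepoint fits and this is a plain O(1) store.
        self.units[i] = cp
```

This sketch never narrows a string after a wide codepoint is
overwritten; whether to do that is a separate design choice.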

Some strings need to be converted before being passed to
functions provided by the native environment.  On
the other hand, natural-language text is likely to be
stored space-efficiently.

This technique internally uses non-standard and 
restricted encoding forms.  Surrogates are never
used in this representation as a stand-in for a 
wider character and so there is no difficulty
handling unpaired surrogates.  (Even concatenating
one string ending in a high surrogate with one
beginning with a low surrogate produces the desirable
result: a string with two adjacent unpaired surrogates.)
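
To illustrate the concatenation case, here is a small Python
sketch.  Strings are modeled as plain lists of codepoints, and
both function names are hypothetical.  The second function shows,
for contrast, what happens if the same values are read as UTF-16
code units.

```python
def concat_codepoints(a, b):
    """Concatenate two strings held as lists of codepoints.

    Each element is a whole codepoint, so surrogates are ordinary
    values and nothing is merged at the seam.
    """
    return a + b


def decode_utf16_units(units):
    """For contrast: read the same values as UTF-16 code units.

    A high surrogate followed by a low surrogate is combined into one
    supplementary codepoint; anything else passes through unchanged.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


left = [0x61, 0xD83D]    # ends in an unpaired high surrogate
right = [0xDE00, 0x62]   # begins with an unpaired low surrogate
joined = concat_codepoints(left, right)
```

Here joined holds four codepoints with the two surrogates still
separate, while decode_utf16_units(joined) would fuse them into
the single codepoint 0x1F600.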

-t