This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
I'm surprised that nobody in the threads about constant-time access to Unicode strings has mentioned adaptive encoding forms. My plan (and stalled code) works that way.

If a string contains only codepoints in 0..255, store it as bytes. If it contains only codepoints in 0..ffff, use 16 bits per codepoint. Otherwise, use 32. Access to any codepoint position is O(1) that way.

Some mutations are worst-case linear in the length of the string (storing a codepoint too wide for the current representation forces the whole string to be widened) but can be expected-case O(1). Some strings need to be converted before being passed to functions provided by the native environment. On the other hand, linguistic text is likely to be space efficient.

This technique internally uses non-standard and restricted encoding forms. Surrogates are never used in this representation as a stand-in for a wider character, so there is no difficulty handling unpaired surrogates. (Even concatenating one string ending in a high surrogate with one beginning with a low surrogate produces the desirable result: a string with two adjacent unpaired surrogates.)

-t
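
For illustration, here is a minimal sketch of the adaptive scheme described in this message. It is not the stalled code mentioned above, and it is written in Python rather than Scheme only for brevity. The names `AdaptiveString`, `ref`, `set`, `concat` and `width` are hypothetical, and the sketch assumes the `array` typecode `'I'` is at least 4 bytes wide, which holds on common platforms.

```python
from array import array

# Narrowest-first storage widths: (array typecode, largest codepoint that fits).
# Assumption: 'I' is at least 4 bytes wide on the host platform.
_WIDTHS = [("B", 0xFF), ("H", 0xFFFF), ("I", 0x10FFFF)]


def _typecode_for(max_cp):
    """Return the narrowest typecode able to hold max_cp."""
    for code, limit in _WIDTHS:
        if 0 <= max_cp <= limit:
            return code
    raise ValueError("codepoint out of range: %r" % (max_cp,))


class AdaptiveString:
    """Hypothetical mutable codepoint string with uniform-width storage."""

    def __init__(self, codepoints=()):
        cps = list(codepoints)
        # Pick the width from the largest codepoint present.
        self._units = array(_typecode_for(max(cps, default=0)), cps)

    def __len__(self):
        return len(self._units)

    @property
    def width(self):
        """Bytes used per codepoint in the current representation."""
        return self._units.itemsize

    def ref(self, i):
        # O(1): every codepoint occupies the same number of bytes.
        return self._units[i]

    def set(self, i, cp):
        # O(1) if cp fits the current width; otherwise the whole
        # string is copied into a wider array, which is O(n).
        needed = _typecode_for(cp)
        if array(needed).itemsize > self._units.itemsize:
            self._units = array(needed, self._units.tolist())
        self._units[i] = cp

    def concat(self, other):
        # Codepoints are copied one for one. Surrogates are ordinary
        # values here, so a trailing high surrogate and a leading low
        # surrogate stay two separate, unpaired codepoints.
        return AdaptiveString(self._units.tolist() + other._units.tolist())


s = AdaptiveString([0x68, 0x69])   # "hi": one byte per codepoint
s.set(0, 0x3BB)                    # Greek lambda forces widening to 16 bits
joined = AdaptiveString([0xD83D]).concat(AdaptiveString([0xDE00]))
```

In this sketch a string never narrows again after a wide codepoint is overwritten. Whether and when to re-narrow is a design choice the message leaves open.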