[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

constant-time access to variable-width encodings



Here is an idea for an implementation strategy for using UTF-8 or UTF-16 encoding of strings without breaking constant-time string-ref. Obviously R6RS should not require or assume this implementation, but it would be nice if it could be written to *allow* it.

Assume a special character:
#\partial - an incomplete character.
We can define it as U+D800 (start of the high surrogates area), since that is never a valid Unicode character.

The goal is to allow implementations to use plain 8-bit UTF-8 or 16-bit UTF-16 string encodings, while still allowing constant-time string-ref. Both of these encoding have the nice property that it is trivial to detect whether a stored (8-bit or 16-bit) character is a complete character or whether it is part of a multibyte encoding or surrogate pair.

The proposal is to allow string-ref to return #\partial for some indexes representing non-initial bytes or low-surrogate values. Assume a string using UTF-8:
"Rød" (Norwegian for "red") - i.e. {#\R, #\xF8, #\d}.
The UTF-8 representation is: {#x52, #xc3, #xb8, #x64 }.
(string-ref "Rød" 0) => #\R
(string-ref "Rød" 1) => #\ø
(string-ref "Rød" 2) => #\partial
(string-ref "Rød" 3) => #\d
(string-length "Rød") => 4 ;; Not 3!

I.e. the complete character value is returned for the index of its first byte/half, and #\partial is returned for subsequence indexes.

The character #\partial is generally ignored. Specifically, it is ignored when printing or by string-set! or the (string char ...) function. The character routines also generally "ignore" it:
(char-upcase #\partial) => #\partial
...
(char-alphabetic? #\partial) => #f
...

The string-length function returns the "allocated" length, which is the same as the number of character *including* any #\partial characters. Thus existing code generally needs no change. There is seldom a need to test explicitly for #\partial - it is treated like a zero-width "filler", and user code can treat it as such. That only difference from a normal (zero-width) character is that it is never explicitly stored in a string. But that's an application detail.

This brings us to string-set! and other side-effecting string procedures. The obvious problem is: what happens if you replace a 1-byte character with a multibyte character or vice versa? In that case you may have to widen or narrow the string. That may seem expensive, but in practice is unlikely to be an issue. Random access of strings is not something people generally do. Most of the time people copy a string or fill it in left-to-right, which means that "replacing" an existing character isn't a issue. However, it does mean that a string may need a variable-size buffer. But that is needed anyway.

Note that mutable fixed-width strings really make no sense: most strings are immutable, once constructed. If you do need to mutate a string, a fixed-length string is useless. A fixed-size mutable buffer only makes sense because it is easy to implement, not because it is useful.

So let's make (mutable) strings variable-length. The implementation is trivial: Each string object contains a pointer to a u8 or u16 buffer, plus a current length, plus a buffer size (which might be stored with the buffer).

(Shared substrings are a possibility in this model, but I won't discuss them further.)

The preferred way to construct a string is now this function:
(string-append! string char-or-string ...)
  Append (in place) each char-or-string to the end of the string.
  If an argument is the #\partial character it is ignored.

This is a cheap constant-time (on average) operation. But note that appending a character may change (string-length string) by an implementation-defined amount: If the character requires multiple buffer (u8 or u16) positions, it may increase the string-length by more than 1, and if it is #\partial it doesn't change the length. However, appending a string always causes string-length to increase by the string-length of the added strings.

It is also reasonable to provide:
(string-replace! string start end replacement-string)
  Replace (in place) (substring start end) by replacement-string.

Now we can implement string-set! in terms of string-replace!:
(define (string-set! string k char)
  (let ((end (start-of-next-char string k)))
     (string-replace! string k end (make-string 1 char))))
where (start-of-next-char string k) is the index of the next real (non-#\partial) character whose index is > k, or (string-length string) if there is no such character.

Note that (substring string start end) is undefined if (string-ref string start) *or* (string-ref string end) is #\partial.

Note that (make-string k char) creates k copies of char, so the resulting string-length may be different from k. If char if #\partial then the resulting string-length is 0.

This may seem a radical proposal, but it actually doesn't change/break many R5RS idioms/code.
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/