
constant-time access to variable-width encodings

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: srfi-75@xxxxxxxxxxxxxxxxx
Subject: constant-time access to variable-width encodings
From: Per Bothner <per@xxxxxxxxxxx>
Date: Wed, 13 Jul 2005 11:12:57 -0700
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
User-agent: Mozilla Thunderbird 1.0.2-6 (X11/20050513)

Here is an idea for an implementation strategy for using UTF-8 or UTF-16 encoding of strings without breaking constant-time string-ref. Obviously R6RS should not require or assume this implementation, but it would be nice if it could be written to *allow* it.


Assume a special character:
#\partial - an incomplete character.

We can define it as U+D800 (start of the high surrogates area), since that is never a valid Unicode character.

The goal is to allow implementations to use plain 8-bit UTF-8 or 16-bit UTF-16 string encodings, while still allowing constant-time string-ref. Both of these encodings have the nice property that it is trivial to detect whether a stored (8-bit or 16-bit) character is a complete character or whether it is part of a multibyte encoding or surrogate pair.

The proposal is to allow string-ref to return #\partial for some indexes representing non-initial bytes or low-surrogate values. Assume a string using UTF-8:

"Rød" (Norwegian for "red") - i.e. {#\R, #\xF8, #\d}.
The UTF-8 representation is: {#x52, #xc3, #xb8, #x64 }.
(string-ref "Rød" 0) => #\R
(string-ref "Rød" 1) => #\ø
(string-ref "Rød" 2) => #\partial
(string-ref "Rød" 3) => #\d
(string-length "Rød") => 4 ;; Not 3!

I.e. the complete character value is returned for the index of its first byte/half, and #\partial is returned for subsequent indexes.
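The lookup rule above can be sketched in R7RS Scheme over a raw bytevector. This is only an illustration under stated assumptions: the names `utf8-string-ref`, `continuation-byte?` and `sequence-length` are hypothetical, the buffer is assumed to hold well-formed UTF-8, and the symbol `partial` stands in for #\partial because most implementations reject (integer->char #xD800).

```scheme
(import (scheme base))

;; Stand-in for the proposed #\partial character (an assumption of this sketch).
(define partial 'partial)

;; A UTF-8 continuation byte has the bit pattern 10xxxxxx, i.e. #x80..#xBF.
(define (continuation-byte? b) (and (>= b #x80) (< b #xC0)))

;; Number of bytes in the sequence introduced by lead byte b.
(define (sequence-length b)
  (cond ((< b #x80) 1)
        ((< b #xE0) 2)
        ((< b #xF0) 3)
        (else 4)))

;; Hypothetical string-ref over a bytevector of well-formed UTF-8.
;; It inspects at most four bytes, so the cost does not depend on k.
(define (utf8-string-ref buf k)
  (let ((b (bytevector-u8-ref buf k)))
    (if (continuation-byte? b)
        partial
        (let* ((n (sequence-length b))
               ;; Payload bits of the lead byte, with the length marker removed.
               (first-bits (case n
                             ((1) b)
                             ((2) (- b #xC0))
                             ((3) (- b #xE0))
                             (else (- b #xF0)))))
          ;; Each continuation byte contributes six more bits.
          (let loop ((i 1) (code first-bits))
            (if (= i n)
                (integer->char code)
                (loop (+ i 1)
                      (+ (* code 64)
                         (- (bytevector-u8-ref buf (+ k i)) #x80)))))))))

(define rod (bytevector #x52 #xC3 #xB8 #x64))  ; UTF-8 bytes of "Rød"
```

With this sketch, (utf8-string-ref rod 1) yields the character U+00F8 and (utf8-string-ref rod 2) yields the `partial` stand-in, while (bytevector-length rod) is 4, matching the string-length shown in the example.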

The character #\partial is generally ignored. Specifically, it is ignored when printing or by string-set! or the (string char ...) function. The character routines also generally "ignore" it:

(char-upcase #\partial) => #\partial
...
(char-alphabetic? #\partial) => #f
...

The string-length function returns the "allocated" length, which is the same as the number of characters *including* any #\partial characters. Thus existing code generally needs no change. There is seldom a need to test explicitly for #\partial - it is treated like a zero-width "filler", and user code can treat it as such. The only difference from a normal (zero-width) character is that it is never explicitly stored in a string. But that's an application detail.

This brings us to string-set! and other side-effecting string procedures. The obvious problem is: what happens if you replace a 1-byte character with a multibyte character or vice versa? In that case you may have to widen or narrow the string. That may seem expensive, but in practice is unlikely to be an issue. Random access of strings is not something people generally do. Most of the time people copy a string or fill it in left-to-right, which means that "replacing" an existing character isn't an issue. However, it does mean that a string may need a variable-size buffer. But that is needed anyway.

Note that mutable fixed-width strings really make no sense: most strings are immutable, once constructed. If you do need to mutate a string, a fixed-length string is useless. A fixed-size mutable buffer only makes sense because it is easy to implement, not because it is useful.

So let's make (mutable) strings variable-length. The implementation is trivial: Each string object contains a pointer to a u8 or u16 buffer, plus a current length, plus a buffer size (which might be stored with the buffer).

(Shared substrings are a possibility in this model, but I won't discuss them further.)


The preferred way to construct a string is now this function:
(string-append! string char-or-string ...)
  Append (in place) each char-or-string to the end of the string.
  If an argument is the #\partial character it is ignored.

This is a cheap constant-time (on average) operation. But note that appending a character may change (string-length string) by an implementation-defined amount: If the character requires multiple buffer (u8 or u16) positions, it may increase the string-length by more than 1, and if it is #\partial it doesn't change the length. However, appending a string always causes string-length to increase by the string-length of the added strings.
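As a rough illustration of why appending is constant time on average, here is a minimal R7RS sketch of a string object holding a u8 buffer and a current length, with the buffer size taken from the buffer itself. The names `vstring`, `ensure-room!` and `vstring-append-char!` are hypothetical, only the single-character case of string-append! is covered, and the symbol `partial` again stands in for #\partial.

```scheme
(import (scheme base))

;; A string object: a u8 buffer plus the number of bytes currently in use.
;; The buffer size is stored with the buffer (its bytevector-length).
(define-record-type vstring
  (raw-vstring buffer len)
  vstring?
  (buffer vstring-buffer set-vstring-buffer!)
  (len vstring-length set-vstring-length!))

(define (make-vstring) (raw-vstring (make-bytevector 8 0) 0))

;; Make room for `extra` more bytes. Capacity at least doubles on each
;; reallocation, so the copying cost is amortized over many appends.
(define (ensure-room! s extra)
  (let* ((buf (vstring-buffer s))
         (needed (+ (vstring-length s) extra)))
    (when (> needed (bytevector-length buf))
      (let ((new (make-bytevector (max needed (* 2 (bytevector-length buf))) 0)))
        (bytevector-copy! new 0 buf 0 (vstring-length s))
        (set-vstring-buffer! s new)))))

;; Append one character in place; the partial stand-in is ignored.
(define (vstring-append-char! s ch)
  (unless (eq? ch 'partial)
    (let ((bytes (string->utf8 (string ch))))
      (ensure-room! s (bytevector-length bytes))
      (bytevector-copy! (vstring-buffer s) (vstring-length s) bytes)
      (set-vstring-length! s (+ (vstring-length s) (bytevector-length bytes))))))

;; Build "Rød" one character at a time, recording the length after each step.
(define s (make-vstring))
(vstring-append-char! s #\R)
(define len-after-R (vstring-length s))        ; 1
(vstring-append-char! s (integer->char #xF8))  ; ø takes two bytes in UTF-8
(define len-after-o (vstring-length s))        ; 3
(vstring-append-char! s 'partial)              ; ignored
(define len-after-partial (vstring-length s))  ; 3
(vstring-append-char! s #\d)
(define len-after-d (vstring-length s))        ; 4
```

The recorded lengths show the behaviour described above: one append can raise the length by 2, and appending the partial stand-in leaves it unchanged.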


It is also reasonable to provide:
(string-replace! string start end replacement-string)
  Replace (in place) (substring string start end) by replacement-string.

Now we can implement string-set! in terms of string-replace!:
(define (string-set! string k char)
  (let ((end (start-of-next-char string k)))
     (string-replace! string k end (make-string 1 char))))

where (start-of-next-char string k) is the index of the next real (non-#\partial) character whose index is > k, or (string-length string) if there is no such character.
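For a UTF-8 buffer, start-of-next-char could look roughly like the following R7RS sketch. It is an assumption-laden illustration: it takes the raw bytevector and its used length as separate arguments (unlike the two-argument form above), assumes well-formed UTF-8, and `continuation-byte?` is a hypothetical helper.

```scheme
(import (scheme base))

;; A UTF-8 continuation byte has the bit pattern 10xxxxxx, i.e. #x80..#xBF.
(define (continuation-byte? b) (and (>= b #x80) (< b #xC0)))

;; Index of the next real (non-partial) position after k, or len if none.
;; In well-formed UTF-8 at most three continuation bytes are skipped,
;; so the loop runs a bounded number of times.
(define (start-of-next-char buf len k)
  (let loop ((i (+ k 1)))
    (if (and (< i len) (continuation-byte? (bytevector-u8-ref buf i)))
        (loop (+ i 1))
        i)))

(define rod (bytevector #x52 #xC3 #xB8 #x64))  ; UTF-8 bytes of "Rød"
```

For the "Rød" buffer, (start-of-next-char rod 4 1) is 3, so replacing the character at index 1 covers both bytes of ø, and (start-of-next-char rod 4 3) is 4, the length of the string.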

Note that (substring string start end) is undefined if (string-ref string start) *or* (string-ref string end) is #\partial.

Note that (make-string k char) creates k copies of char, so the resulting string-length may be different from k. If char is #\partial then the resulting string-length is 0.

This may seem a radical proposal, but it actually doesn't change/break many R5RS idioms/code.

--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/

Follow-Ups:
- Re: constant-time access to variable-width encodings
  - From: Ray Blaak
- Re: constant-time access to variable-width encodings
  - From: Shiro Kawai
