[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Octet vs Char (Re: strings draft)

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.

On Mon, 26 Jan 2004, Ken Dickey wrote:

>Well color me dumb, but I don't see why getting O(1) is such a big deal.
>Let's say an implementation reads UTF-8 chars.  A uni-string could be
>represented as a tuple.  The required part is a byte-vector of characters
>that fit into 8 bits.  If characters have 16 bit representation, use a
>second, 16-bit vector and a map of which chars are the long ones.  If 16 bit,
>index into the 2nd array, else into the first.  The 2nd array could be
>shortened by using the byte-vector to hold indexes into the 16-bit array in
>the mapped/unused bytevector slots.  In the "worst case" (storage wise), we
>only need the 16-bit array.  The mapping could be a bit-vector or whatever
>makes sense in a particular implementation.
>It is true that there are more code paths in this scheme (sic), but each kind
>of representation "knows that" and this still makes refs & sets ~ O(1).
>What am I missing here?

O(1) reference or character setting comes at the expense of O(n) insertions,
deletions, and non-identical-sized replacements.

EG, if I change "the" to "a" at the beginning of a long string, and
I've represented it as a vector to get O(1) reference time, the rest of
the string has to be copied to move it two character spaces in memory.

This is no big deal, on the same order as a function call overhead, for
strings of 250 characters or less.  But it starts to be a very big deal
when the string is the size of a large novel, around  2 million

Likewise, when the application decides it wants to make a copy of a
string and change just the copy, the naive implementation using strings
made of byte vectors copies the whole thing to a new vector allocated
for a new variable, then copies it again within that vector to move the
bulk of the new string 2 spaces.  A more sophisticated implementation
still copies the whole thing at least once.

Implementing ropes allows the system instead to copy just the root node
of the string's Btree (constant-size unit) in order to copy a string,
and then copy tree nodes down to the substring where "the" will be
changed to "a", make a copy of just that substring, and shift the bulk
of just that copy 2 characters.  If the strings are large, this saves
you a bunch of CPU and a bunch of memory, since the bulk of the structure
- all the tree nodes except a single root-to-leaf path and all the
substrings but the one changed - is still shared between the two strings.

Below, I'm comparing ropes to the simplest type of vector string,
in which all characters are equal width.  You've offered a
technique to get similar performance with strings having unequal-
width characters (such as UTF-8 encoding) and I think it'd work,
but here are still compelling reasons in at least some applications
to choose ropes.

Here's how the operations break down (use a constant-width font):
                     "Rope" string     |  vector string
indexed reference;                     |
                        O(log n)       |  O(1)
indexed mutation                       |
                        O(log n)       |   O(1)
string-copy                            |
                        O(1),          |  O(n),
                        Omega(1)       |  omega(n)
unequal-size mutation                  |
                       O(log n)        |  O(n)
                       Omega(log n) *1 |  Omega(1)

*1; Unequal-size mutation uses Omega(log n) memory only if the data
    is shared with another string (ie, it uses part of the memory
    spaces that has already been saved in a string-copy operation).
    If the data is unshared, as with vector strings, it's Omega(1).

As you can see, vector strings win on indexed reference and indexed
mutation, but rope strings win everywhere else. It is possible to use
more memory doing an unequal-size mutation, but that's only in the
case that the vector string would already have used even more memory
in a string-copy operation.

Order(1) indexed reference and indexed single-character mutation is
nice, but the time and space taken by the other operations starts
becoming untenable when the size of the strings gets sufficiently

If you need to know which is best for your application, you may
want to instrument your code and use a profiler to find out just
how much time you're spending copying strings.