
Re: Octet vs Char (Re: strings draft)




On Tue, 27 Jan 2004, Ken Dickey wrote:

>On Tuesday 27 January 2004 09:32 am, bear wrote:
>> On Mon, 26 Jan 2004, Ken Dickey wrote:
>> >Well color me dumb, but I don't see why getting O(1) is such a big deal.
>...
>> O(1) reference or character setting comes at the expense of O(n)
>> insertions, deletions, and non-identical-sized replacements.
>>
>> EG, if I change "the" to "a" at the beginning of a long string, and
>> I've represented it as a vector to get O(1) reference time, the rest of
>> the string has to be copied to move it two character spaces in memory.

>I was puzzled by the ropes discussion here because it seemed to be orthogonal
>to the Unicode discussion.  I now see that it's because it _is_ orthogonal to
>the Unicode discussion.

The only thing that Unicode has to do with it is that Unicode
makes non-identical-sized replacements more likely, and makes
it more likely that the programmer will not realize that a given
operation involves one.  Replacing one codepoint with another
may mean replacing a character that takes 1 octet of UTF-8 with
one that takes 3 octets, or vice versa.  This sort of thing is
amenable to your proposed approach of indexed fallback into
another vector.
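
To make the octet arithmetic concrete, here is a small Python
sketch (Python only for illustration; the helper names
utf8_width and replace_codepoint are hypothetical, not part of
any proposal here).  It swaps one codepoint for another and
reports how much the UTF-8 encoding grows.

```python
def utf8_width(ch):
    """Number of octets needed to express one codepoint in UTF-8."""
    return len(ch.encode("utf-8"))


def replace_codepoint(s, index, ch):
    """Replace the codepoint at `index`; return the new string and the
    change in UTF-8 octet length caused by the replacement."""
    new = s[:index] + ch + s[index + 1:]
    delta = len(new.encode("utf-8")) - len(s.encode("utf-8"))
    return new, delta


text = "cost: $5"
# Swap DOLLAR SIGN (1 octet) for EURO SIGN U+20AC (3 octets).
changed, delta = replace_codepoint(text, 6, "\u20ac")

print(utf8_width("$"), utf8_width("\u20ac"))  # octet widths
print(len(text), len(changed))                # codepoint counts
print(delta)                                  # growth in octets
```

The codepoint count is unchanged, but every octet after the
replaced character has to move in a flat UTF-8 buffer.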

But replacing a character with a combining sequence of multiple
codepoints, or vice versa, is also likely; in fact the Unicode
Consortium's canonicalization algorithms do this all the time.
In this case you're looking at things like replacing

U+212B ANGSTROM SIGN
with
U+0041 LATIN CAPITAL LETTER A, U+030A COMBINING RING ABOVE

and if your implementation treats the former as one character
and the latter as two characters, which most do, you wind up
with the same need to copy the rest of the string that changing
"a" to "the" caused in ASCII strings.  This is not amenable to
your proposed approach of indexed fallback into another vector.
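
A minimal Python sketch of that example, using the standard
unicodedata module (Python is just for illustration, and the
sample string is made up).  It shows canonical decomposition
turning one codepoint into two, so the string gets longer when
indexed by codepoint.

```python
import unicodedata

angstrom = "\u212b"  # ANGSTROM SIGN

# Canonical decomposition (NFD) yields A + COMBINING RING ABOVE.
nfd = unicodedata.normalize("NFD", angstrom)
# Canonical composition (NFC) yields the single codepoint U+00C5.
nfc = unicodedata.normalize("NFC", angstrom)

print([hex(ord(c)) for c in nfd])
print([hex(ord(c)) for c in nfc])

# In a longer string, everything after the sign shifts by one position.
sample = "1 \u212b = 0.1 nm"
decomposed = unicodedata.normalize("NFD", sample)
print(len(sample), len(decomposed))
```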

What this means is that, while on an absolute level Unicode and
rope representation are orthogonal issues, Unicode has patterns
of likely use that rely heavily on the most expensive operations
of vector representations.

And of course both came up here because the first draft of the
FFI SRFI wanted a C pointer to a mutable memory area containing
the internal representation of a Scheme string, and has to know
this kind of "detail" to even make sense of what it finds there.

As a result of the discussions here, I'm now considering
adding more types of string values, each with its own read
syntax and conversions:  For example,

#,(Latin-1-vector "hello world")
 would be an octet vector where each octet is a Latin-1
 character.  This would make binary I/O using string-like
 constructions possible and give C programs the kind of
 FFI value they wanted. No characters outside Latin-1
 would be allowed, of course.
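
 As a rough sketch of the intended behaviour, in Python rather
 than Scheme: latin1_vector is a hypothetical stand-in for the
 read syntax above, and using Python's "latin-1" codec to
 enforce the restriction is my assumption.

```python
def latin1_vector(s):
    """Return a mutable octet vector with one octet per character.

    Raises UnicodeEncodeError if `s` contains a character outside
    Latin-1 (codepoints above U+00FF).
    """
    return bytearray(s.encode("latin-1"))


v = latin1_vector("hello world")
print(len(v))          # one octet per character

# Same-sized replacement is O(1) and happens in place.
v[0] = ord("j")
print(v.decode("latin-1"))

# Characters outside Latin-1 are rejected.
try:
    latin1_vector("\u20ac")
    rejected = False
except UnicodeEncodeError:
    rejected = True
print(rejected)
```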

#,(UTF32-vector "hello world")
 would be a "string" indexed by Unicode codepoint rather
 than by character.  Handy for FFI, and also allows people
 to create invalid or non-canonical combining sequences,
 assign values that aren't even mapped codepoints to arbitrary
 locations, or do other linguistically wrong operations.
 However, converting it to a regular string would canonicalize
 it, and would fail if it contained non-characters.
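
 A sketch of that conversion in Python, under loud assumptions:
 utf32_vector and to_string are hypothetical names, I picked NFC
 as the canonical form, and I read "non-characters" as the
 Unicode noncharacters plus surrogates and out-of-range values.
 None of that is settled here.

```python
import unicodedata
from array import array


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two codepoints of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def utf32_vector(s):
    """Mutable vector of codepoints, indexed by codepoint."""
    return array("L", (ord(c) for c in s))


def to_string(vec):
    """Convert a codepoint vector to a canonicalized (NFC) string.

    Raises ValueError on out-of-range values, surrogates, and
    noncharacters (an assumed reading of "non-characters").
    """
    for cp in vec:
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF or is_noncharacter(cp):
            raise ValueError("not a character: U+%04X" % cp)
    return unicodedata.normalize("NFC", "".join(chr(cp) for cp in vec))


# A + COMBINING RING ABOVE is two slots in the vector ...
vec = utf32_vector("A\u030a")
print(len(vec))
# ... but canonicalizes to the single codepoint U+00C5.
print(len(to_string(vec)))

# Arbitrary values can be assigned; conversion is where they fail.
vec[1] = 0xFFFF
try:
    to_string(vec)
    failed = False
except ValueError:
    failed = True
print(failed)
```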

				Bear