[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Octet vs Char (Re: strings draft)
On Monday 26 January 2004 02:13 pm, Matthew Dempsky wrote:
> From my understanding there are a few issues being discussed here
> (with possibly overlapping consequences) -- O(1) performance of
> strings and using strings for octet sequences.
>
> SRFI-4 defines homogenous numeric vectors which seem like they would
> suffice for implementing octet streams independant of strings and, if
> not, a few Scheme implementations (larceny, MIT/GNU-scheme) provide
> byte-vectors.
>
> As for O(1) performance, there's undoubtably a need for
> non-O(1)-string-like-heavy-weight-objects (herein called ropes) which
> some implementations choose to implement and call them strings (as
> permitted by r5rs). The current proposal seems to urge
> implementations to provide these as a seperate data type.
Well color me dumb, but I don't see why getting O(1) is such a big deal.
Let's say an implementation reads UTF-8 chars. A uni-string could be
represented as a tuple. The required part is a byte-vector of characters
that fit into 8 bits. If characters have 16 bit representation, use a
second, 16-bit vector and a map of which chars are the long ones. If 16 bit,
index into the 2nd array, else into the first. The 2nd array could be
shortened by using the byte-vector to hold indexes into the 16-bit array in
the mapped/unused bytevector slots. In the "worst case" (storage wise), we
only need the 16-bit array. The mapping could be a bit-vector or whatever
makes sense in a particular implementation.
It is true that there are more code paths in this scheme (sic), but each kind
of representation "knows that" and this still makes refs & sets ~ O(1).
What am I missing here?
Thanks,
-KenD