
Re: Octet vs Char (Re: strings draft)



On Monday 26 January 2004 02:13 pm, Matthew Dempsky wrote:
> From my understanding there are a few issues being discussed here
> (with possibly overlapping consequences) -- O(1) performance of
> strings and using strings for octet sequences.
>
> SRFI-4 defines homogeneous numeric vectors which seem like they would
> suffice for implementing octet streams independent of strings and, if
> not, a few Scheme implementations (larceny, MIT/GNU-scheme) provide
> byte-vectors.
>
> As for O(1) performance, there's undoubtedly a need for
> non-O(1)-string-like-heavy-weight-objects (herein called ropes) which
> some implementations choose to implement and call strings (as
> permitted by r5rs).  The current proposal seems to urge
> implementations to provide these as a separate data type.

Well color me dumb, but I don't see why getting O(1) is such a big deal.

Let's say an implementation reads UTF-8 characters.  A uni-string could be 
represented as a tuple.  The required part is a byte-vector holding the 
characters that fit into 8 bits.  If some characters need a 16-bit 
representation, add a second, 16-bit vector plus a map of which positions 
hold the long characters.  On a ref, consult the map: if the character at 
that index is 16-bit, index into the second array; otherwise index into the 
first.  The second array can be kept short by using the otherwise-unused 
byte-vector slots to hold indexes into the 16-bit array.  In the "worst 
case" (storage-wise), we only need the 16-bit array.  The map could be a 
bit-vector or whatever makes sense in a particular implementation.
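A minimal sketch of that tuple representation (in Python for brevity; the
class and field names are illustrative, not from any Scheme implementation,
and the byte-as-index trick as written caps a string at 256 wide characters):

```python
class UniString:
    """Two-vector string: 8-bit slots plus an overflow array for wide chars."""

    def __init__(self, s):
        self.narrow = bytearray(len(s))    # the required byte-vector
        self.wide = []                     # 16-bit (or wider) code points
        self.is_wide = [False] * len(s)    # the "map" (a bit-vector in practice)
        for i, ch in enumerate(s):
            cp = ord(ch)
            if cp < 256:
                self.narrow[i] = cp
            else:
                self.is_wide[i] = True
                # reuse the unused byte slot as an index into the wide array
                self.narrow[i] = len(self.wide)
                self.wide.append(cp)

    def ref(self, i):
        # O(1): one map test plus one array index either way
        if self.is_wide[i]:
            return chr(self.wide[self.narrow[i]])
        return chr(self.narrow[i])

    def set(self, i, ch):
        # O(1) while the new char stays in the same width class;
        # crossing classes would need a representation fixup (not shown)
        cp = ord(ch)
        if cp < 256 and not self.is_wide[i]:
            self.narrow[i] = cp
        elif cp >= 256 and self.is_wide[i]:
            self.wide[self.narrow[i]] = cp
        else:
            raise NotImplementedError("width change needs representation fixup")
```

For example, in "aé√z" the é (U+00E9) still fits in a byte, so only the √
(U+221A) lands in the wide array; every ref is still a test plus one index.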

It is true that there are more code paths in this scheme (sic), but each kind 
of representation "knows" which case it is in, and refs & sets stay ~ O(1).

What am I missing here?

Thanks,
-KenD