
Re: constant-time access to variable-width encodings




On Wed, 13 Jul 2005, Per Bothner wrote:

> Shiro Kawai wrote:

>> I feel a bit uncomfortable, though, with the fact that indexes
>> and string-length differ among different implementations, or
>> even in the same implementations with different character
>> encodings.

> I can see an issue if you try to write that out using one
> implementation, and read it back in with another.  Not sure how
> important that is.

Actually, it's supposed to be a non-problem for Unicode-compliant
applications, because the Unicode string equivalence algorithm is
*required* to treat strings as equivalent regardless of how the
graphemes within them are encoded.

Speaking of which, the current draft of the SRFI is not
Unicode-compliant, in that its string=? predicate does not detect
strings which are "canonically equivalent" according to the
Unicode Consortium's required string equivalence algorithm.  The
Consortium defines strings as equal if they contain equivalent
sequences of graphemes, and you're defining strings as equal if
they contain equivalent sequences of codepoints.
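
To make the difference concrete, here is a minimal sketch in
Python rather than Scheme (only because its standard library
ships the normalization tables).  The function names
codepoint_equal and canonically_equal are made up for this
illustration and come from neither the SRFI draft nor any
library; the comparison assumes that normalizing both strings to
NFD is an acceptable way to test canonical equivalence.

```python
import unicodedata


def codepoint_equal(a: str, b: str) -> bool:
    """Codepoint-by-codepoint comparison (hypothetical name)."""
    return a == b


def canonically_equal(a: str, b: str) -> bool:
    """Compare after normalizing both sides to NFD (hypothetical name).

    Assumption: two strings are canonically equivalent exactly when
    their NFD forms are identical.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The same grapheme, e with acute accent, encoded two ways.
precomposed = "\u00e9"    # one codepoint: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # two codepoints: "e" + COMBINING ACUTE ACCENT

print(codepoint_equal(precomposed, decomposed))
print(canonically_equal(precomposed, decomposed))
print(len(precomposed), len(decomposed))
```

The two strings differ in length and share no codepoint, yet
they compare equal once both are normalized.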

Aaaand, this is yet another problem that goes away if you embrace
glyph=character instead of codepoint=character.  With Unicode,
you *CANNOT* make assumptions about how strings are represented.
Two strings which are "equal" under Unicode's required
equivalence predicates may be of different lengths and have not a
single codepoint in common, and the differences are purely
representation artifacts.  If you embrace glyph=character then at
least a given string will portably be a fixed number of
characters, and a Unicode-aware char=? predicate can bury
representation artifacts below the level of notice of the
programmer or user.
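
A rough sketch of what glyph=character could look like, again in
Python for convenience.  The helper name glyphs is hypothetical,
and the segmentation rule is a deliberate simplification: each
codepoint with a nonzero canonical combining class is attached to
the base before it.  A real implementation would presumably
follow the Unicode grapheme cluster rules, which also cover cases
this ignores (Hangul jamo and joiner sequences, for example).

```python
import unicodedata


def glyphs(s: str) -> list[str]:
    """Split s into base+combining-mark clusters (hypothetical helper).

    Simplification: a codepoint joins the previous cluster only if its
    canonical combining class is nonzero.  Each cluster is normalized
    to NFC so that equivalent clusters compare equal as plain strings.
    """
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # combining mark: extend current cluster
        else:
            clusters.append(ch)     # base codepoint: start a new cluster
    return [unicodedata.normalize("NFC", c) for c in clusters]


precomposed = "caf\u00e9"    # 4 codepoints
decomposed = "cafe\u0301"    # 5 codepoints

print(len(precomposed), len(decomposed))
print(len(glyphs(precomposed)), len(glyphs(decomposed)))
print(glyphs(precomposed) == glyphs(decomposed))
```

Under this rule both spellings have the same length, and a
per-cluster comparison standing in for char=? never sees the
encoding difference.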

				Bear