Re: constant-time access to variable-width encodings

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

On Wed, 13 Jul 2005, Per Bothner wrote:

> Shiro Kawai wrote:

>> I feel a bit uncomfortable, though, with the fact that indexes
>> and string-length differ among different implementations, or
>> even in the same implementations with different character
>> encodings.

> I can see an issue if you try to write that out using one
> implementation, and read it back in with another.  Not sure how
> important that is.

Actually, it's supposed to be a non-problem for Unicode-compliant
applications, because the Unicode string equivalence algorithm is
*required* to treat strings as equivalent regardless of how the
graphemes within them are encoded.

Speaking of which, the current draft of the SRFI is not
Unicode-compliant in that its string=? predicate does not detect
strings which are "canonically equivalent" according to the
Unicode Consortium's required string equivalence algorithm.  They
define strings as equal if they contain a sequence of graphemes
which are equivalent, and you're defining strings as equal if
they contain a sequence of codepoints which are equivalent.
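
To make the distinction concrete, here is a minimal sketch in Python
(not Scheme, and not anything from the SRFI draft).  The function names
codepoint_equal and canonically_equal are made up for illustration; the
only library call is the standard unicodedata.normalize.  It assumes
that comparing NFD-normalized forms is an acceptable way to test
canonical equivalence.

```python
import unicodedata

def codepoint_equal(a, b):
    # Equality as the draft's string=? is described above:
    # same codepoints in the same order.
    return a == b

def canonically_equal(a, b):
    # Assumption: two strings are canonically equivalent when their
    # NFD (canonical decomposition) forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "caf\u00e9"    # e-acute as one codepoint, U+00E9
decomposed = "cafe\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(codepoint_equal(precomposed, decomposed))
print(canonically_equal(precomposed, decomposed))
```

The two strings display identically, yet only the second predicate
treats them as the same string.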

Aaaand, this is yet another problem that goes away if you embrace
glyph=character instead of codepoint=character.  With Unicode,
you *CANNOT* make assumptions about how strings are represented.
Two strings which are "equal" under Unicode's required
equivalence predicates may be of different lengths and have not a
single codepoint in common, and the differences are purely
representation artifacts.  If you embrace glyph=character then at
least a given string will portably be a fixed number of
characters, and a Unicode-aware char=? predicate can bury
representation artifacts below the level of notice of the
programmer or user.
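
A small illustration of the "different lengths, no codepoint in common"
case, again as a hedged Python sketch.  The helper approx_char_count is
hypothetical and only approximates a glyph count by counting
non-combining codepoints after canonical decomposition; real grapheme
segmentation is more involved than this.

```python
import unicodedata

def approx_char_count(s):
    # Rough stand-in for "number of glyphs": decompose canonically,
    # then count codepoints whose combining class is 0 (base characters).
    # This is an approximation, not full grapheme cluster segmentation.
    decomposed = unicodedata.normalize("NFD", s)
    return sum(1 for c in decomposed if unicodedata.combining(c) == 0)

angstrom_sign = "\u212b"     # U+212B ANGSTROM SIGN
a_with_ring = "A\u030a"      # "A" followed by U+030A COMBINING RING ABOVE

# Codepoint view: different lengths, no codepoint shared.
print(len(angstrom_sign), len(a_with_ring))
print(set(angstrom_sign) & set(a_with_ring))

# Canonical view: the same string, and the same approximate glyph count.
print(unicodedata.normalize("NFC", angstrom_sign)
      == unicodedata.normalize("NFC", a_with_ring))
print(approx_char_count(angstrom_sign), approx_char_count(a_with_ring))
```

Under the codepoint view the lengths are 1 and 2; under the
approximate glyph view both strings are one character long.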