
Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.




On Mon, 25 Jul 2005, Alan Watson wrote:

>Files actually provide a fairly close analogy to the commonest means of
>representing Unicode strings.
>
>Imagine a file system that implements files as streams of bytes. Now
>imagine that you want to read the Nth *line*. The only way to do this is
>to read through the file until you have encountered N-1 newlines. This is
>like finding the Nth character when using UTF-8 for strings.
>
>Now imagine a file system that implements files as enumerated
>random-access records and uses exactly one record for each line. You can
>directly read the Nth line. This is like finding the Nth character when
>using UTF-32 for strings.
>
>Now imagine a file system that implements files as enumerated
>random-access records and uses one or more records for each line. This is
>like using UTF-16 for strings.
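
To make the quoted analogy concrete, here is a minimal Python
sketch (an illustration only; the function names are
hypothetical).  Finding the Nth codepoint in UTF-8 means scanning
bytes and skipping continuation bytes, while in a fixed-width
4-byte encoding the offset is plain arithmetic.

```python
def utf8_offset_of_char(data: bytes, n: int) -> int:
    """Return the byte offset of the n-th codepoint (0-based) in UTF-8 data.

    Assumes `data` is well-formed UTF-8. Runs in time linear in the offset.
    """
    count = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new codepoint.
        if byte & 0xC0 != 0x80:
            if count == n:
                return offset
            count += 1
    raise IndexError("codepoint index out of range")


def utf32_offset_of_char(n: int) -> int:
    """Byte offset of the n-th codepoint in a fixed-width 4-byte encoding."""
    return 4 * n


# One codepoint each of 1, 2, 3, and 4 UTF-8 bytes, then a final 'z'.
sample = "a\u00e9\u20ac\U0001d11ez".encode("utf-8")
```

Here `utf8_offset_of_char(sample, 3)` has to walk past the first
six bytes before it finds the fourth codepoint, whereas
`utf32_offset_of_char(3)` needs no scan at all.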

FWIW, I'm representing strings as trees, where each
branch of the tree records how many units are contained
within the leaves depending from that branch.  Initially
my only units were grapheme clusters, but now I keep
track of newlines and codepoints as well.

So, it's possible to navigate the triple-indexed tree by
any of the three indexes (codepoints, graphemes, or lines)
and get to the right leaf.  Of course, having gotten
there, you may need to count from the beginning of the
leaf buffer to get to exactly where you wanted to be.
The result is that it's not actual constant-time random
access, but it's O(log n) access in any of the reasonable
indexes.  One important thing is that the size of the leaf
buffers is strictly limited to 1020 codepoints or fewer,
so you never have to start counting from the beginning of
the string.  Another is that different leaves can use
different fixed-width representations:  Latin-1, BMP
(16-bit), or UTF-32, depending on the widest codepoint in
that leaf's buffer.  In practice, 99% or more of every
long string is represented in ASCII/Latin-1.
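
A rough Python sketch of the idea, under stated assumptions:  this
is not my actual implementation, the names (Leaf, Branch, build,
char_at, line_start) are made up for illustration, the tree is a
plain binary split with no rebalancing or editing, and the
grapheme index is left out because the Python standard library
has no grapheme segmentation.  Only the 1020 limit and the three
leaf widths come from the description above.

```python
from dataclasses import dataclass

MAX_LEAF = 1020  # leaf size limit, in codepoints


@dataclass
class Leaf:
    text: str  # at most MAX_LEAF codepoints

    @property
    def codepoints(self) -> int:
        return len(self.text)

    @property
    def lines(self) -> int:
        return self.text.count("\n")

    @property
    def width(self) -> int:
        """Bytes per codepoint a fixed-width buffer for this leaf needs."""
        widest = max(map(ord, self.text), default=0)
        if widest <= 0xFF:
            return 1  # Latin-1
        if widest <= 0xFFFF:
            return 2  # BMP
        return 4      # UTF-32


class Branch:
    def __init__(self, left, right):
        self.left, self.right = left, right
        # Each branch records how many units its leaves contain.
        self.codepoints = left.codepoints + right.codepoints
        self.lines = left.lines + right.lines


def build(text: str):
    """Split text into a tree whose leaves hold at most MAX_LEAF codepoints."""
    if len(text) <= MAX_LEAF:
        return Leaf(text)
    mid = len(text) // 2
    return Branch(build(text[:mid]), build(text[mid:]))


def char_at(node, i: int) -> str:
    """Return the i-th codepoint (0-based), descending by codepoint counts."""
    if not 0 <= i < node.codepoints:
        raise IndexError("codepoint index out of range")
    while isinstance(node, Branch):
        if i < node.left.codepoints:
            node = node.left
        else:
            i -= node.left.codepoints
            node = node.right
    return node.text[i]


def line_start(node, k: int) -> int:
    """Return the codepoint index where line k (0-based) begins."""
    if not 0 <= k <= node.lines:
        raise IndexError("line index out of range")
    if k == 0:
        return 0
    need, base = k, 0  # newlines still to pass, codepoints skipped so far
    while isinstance(node, Branch):
        if need <= node.left.lines:
            node = node.left
        else:
            need -= node.left.lines
            base += node.left.codepoints
            node = node.right
    # Inside the leaf, count from the start of the leaf buffer only.
    pos = -1
    for _ in range(need):
        pos = node.text.index("\n", pos + 1)
    return base + pos + 1
```

Both lookups walk from the root to one leaf and then count inside
a buffer of bounded size, which is where the logarithmic access
comes from.  A grapheme index would be one more count stored in
each branch and summed the same way.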

I think a lot of people won't want to do this, but my
point is that it is possible to strike a good compromise
and multi-index your strings so that you can have fast
access in any of several representations.

				Bear