This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
On Mon, 25 Jul 2005, Alan Watson wrote: >Files actually provide a fairly close analogy to the commonest means of >representing Unicode strings. > >Imagine a file system that implements files as streams of bytes. Now >imagine that you want to read the Nth *line*. The only way to do this is >to read through the file until you have encounted N-1 newlines. This is >like finding the Nth character when using UTF-8 for strings. > >Now imagine a file system that implements files as enumerated >random-access records and uses exactly one record for each line. You can >directly read the Nth line. This is like finding the Nth character when >using UCS-32 for strings. > >Now imagine a file system that implements files as enumerated >random-access records and uses one or more record for each line. This is >like using UTF-16 for strings. FWIW, I'm representing strings as trees, where each branch of the tree records how many units are contained within the leaves depending from that branch. Initially my only units were grapheme clusters, but now I keep track of newlines and codepoints as well. So, it's possible to navigate the triple-indexed tree by any of the three indexes (codepints, graphemes, or lines) and get to the right leaf. Of course, having gotten there, you may need to count from the beginning of the leaf buffer to get to exactly where you wanted to be. The result is that it's not actual constant-time random access, but it's log(n) access in any of the reasonable indexes. One important thing is that the size of the leaf buffers is strictly limited to 1020 codepoints or less. You don't have to start counting from the beginning of the string. Another is that different leaves can use different fixed-width representations: Latin1, BMP, or UTF-32 depending on what is the widest codepoint in that leaf's buffer. In practice, 99% or more of every long string is represented in ascii/Latin1. I think a lot of people won't want to do this, but my point is that it is possible to strike a good compromise and multi-index your strings so that you can have fast access in any of several representations. Bear