This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
On 7/28/05, Tom Emerson <tree@xxxxxxxxxxxxx> wrote:
> William D Clinger writes:
> Per Bothner wrote:
> > Random accesses to a position in a string that has not
> > been previously accessed is not in itself useful.
>
> In computational linguistics it is common to utilize standoff markup,
> where features in a text are tagged in a separate file via character
> ranges into the original. For example, we may have a file indicating
> that certain prepositional phrases appear at offsets [25,40) and
> [125,160) in the original file. I'm regularly dealing with
> multimegabyte text files with such standoff markup and not having
> random access is a detriment in these applications.

You're missing Per's point. Those features have to have been
assigned by some previous text processing, which had to know the
location in the text in order to choose a tag. Those locations could
just as easily be represented by opaque pointers as by codepoint
offsets. To store these pointers in a separate file they just need to
be serializable. The obvious pointer representation for UTF-8 strings
would be the byte offset, an integer, which serializes as is.

-- 
Alex
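
As an illustration of the byte-offset idea, here is a minimal sketch in Python rather than Scheme. It is not from the thread: the sample text, the helper names tag_phrase and extract, and the choice of JSON as the standoff file format are all assumptions made for the example. The tagging pass scans the text once and records a half-open byte range; a later pass reads the range back and slices the bytes directly, without counting codepoints.

```python
import json

# Hypothetical sample text; the non-ASCII "ë" makes byte and codepoint offsets differ.
text = "Zoë sat on the mat."
data = text.encode("utf-8")


def tag_phrase(data, phrase):
    """Tagging pass: return the half-open byte range [start, end) of phrase.

    This stands in for whatever text processing assigns the tags; it already
    knows the position it found, so it can record it as a byte offset.
    """
    needle = phrase.encode("utf-8")
    start = data.find(needle)
    if start < 0:
        raise ValueError("phrase not found")
    return (start, start + len(needle))


def extract(data, span):
    """Later pass: recover the tagged text by slicing bytes, no codepoint scan."""
    start, end = span
    return data[start:end].decode("utf-8")


span = tag_phrase(data, "on the mat")

# The byte offsets are plain integers, so they serialize as is.
serialized = json.dumps(span)
restored = tuple(json.loads(serialized))

print(restored)                   # byte range: (9, 19)
print(text.index("on the mat"))   # codepoint offset of the same phrase: 8
print(extract(data, restored))    # on the mat
```

The byte range (9, 19) and the codepoint offset 8 differ because "ë" occupies two bytes in UTF-8. Either value locates the phrase, but only the byte offset can be used without walking the string from the start.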