Re: Surrogates and character representation

William D Clinger writes:
Per Bothner wrote:
> Random accesses to a position in a string that has not
> been previously accessed is not in itself useful.

In computational linguistics it is common to utilize standoff markup,
where features in a text are tagged in a separate file via character
ranges into the original. For example, we may have a file indicating
that certain prepositional phrases appear at offsets [25,40) and
[125,160) in the original file. I regularly deal with multi-megabyte
text files with such standoff markup, and lacking random access to
arbitrary character offsets is a real detriment in these applications.
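
To make the access pattern concrete, here is a minimal sketch in Python
(chosen purely for illustration). The (label, start, end) tuple layout, the
sample sentence, and the name extract_spans are all hypothetical, not the
format of any particular tool. Each annotation is a half-open character
range [start, end) into the original text:

```python
from typing import List, Tuple

# Hypothetical standoff record: (label, start, end), where [start, end)
# is a half-open range of character offsets into the original text.
Annotation = Tuple[str, int, int]


def extract_spans(text: str, annotations: List[Annotation]) -> List[Tuple[str, str]]:
    """Return (label, covered_text) for each standoff annotation."""
    spans = []
    for label, start, end in annotations:
        # Reject ranges that do not lie inside the text.
        if not (0 <= start <= end <= len(text)):
            raise ValueError(f"range [{start},{end}) is outside the text")
        # Each lookup jumps straight to an offset that may never have
        # been visited before, in whatever order the markup lists it.
        spans.append((label, text[start:end]))
    return spans


# Made-up sample data: two prepositional phrases tagged by offset.
text = "The cat sat on the mat in the hall."
annotations = [("PP", 12, 22), ("PP", 23, 34)]

for label, covered in extract_spans(text, annotations):
    print(label, repr(covered))
```

Every range is resolved by indexing directly into the text. If reaching an
offset meant scanning from the start of the string, each lookup would cost
time proportional to the offset, which adds up quickly on multi-megabyte
files with many annotations.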

Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"