
Re: Surrogates and character representation




On Thu, 28 Jul 2005, Alan Watson wrote:

>So, two questions:
>
>(1) Are your "random" accesses into your corpus linguistics strings
>really random, do they have significant locality, or could they be
>arranged to have significant locality?

Speaking for myself, I would say they are as close to random as
makes no difference.  I typically suck the large string into
memory, pull in its indexes from another file, and then consult
my indexes for members of a particular synonym group and go to
fifty or five hundred locations in the string to gather details
about the context in which those words were used.
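
The access pattern described above can be sketched roughly as follows.
This is only an illustration, not the actual code: the names `contexts`,
`index` and `width` are made up here, and the index is assumed to be a
plain mapping from each word to the numeric offsets at which it occurs.

```python
def contexts(corpus, index, words, width=20):
    """Gather a window of text around every indexed occurrence of `words`.

    `index` is assumed to map a word to a list of numeric offsets into
    `corpus`; both names are hypothetical.
    """
    hits = []
    for word in words:
        for off in index.get(word, ()):
            # Jump straight to the offset; no scan from the start of the string.
            start = max(0, off - width)
            end = min(len(corpus), off + len(word) + width)
            hits.append((off, corpus[start:end]))
    return hits
```

Each lookup is an independent jump to an arbitrary offset, which is why
the cost of indexing by position matters so much here.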

Now I could sort the accesses and do them from lowest to highest
offset, thus simulating locality.  But, particularly with relatively
rare words, the gaps between occurrences have a Poisson random
distribution, typically measured in megabytes.

The problem with doing this in terms of something other than
numeric offsets isn't locality though, not really; the problem
is serialization.  The corpus is a multi-megabyte object which
lives on the disk.  And none of the implementations of "marks"
I've seen has marks that persist across different instances
of the string, or are serializable.  There's a big upfront
investment in reading the corpus, recognizing words, parsing
sentences, and building indexes.  That's work I don't want to
repeat every time I pull the thing into memory, so having
done that, I want to be able to write the string (and the
indexes) and read the string and indexes back in when I'm
getting ready to do more work, and still have the indexes refer
to the correct places in the string.
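
A minimal sketch of that round trip, assuming the index is just a table
of numeric offsets. The function names, the JSON index format and the
two-file layout are all assumptions made for illustration; the point is
only that plain integers survive being written out and read back.

```python
import json
import re


def build_index(corpus):
    """The expensive step: map each word to the offsets where it occurs."""
    index = {}
    for m in re.finditer(r"\w+", corpus):
        index.setdefault(m.group(), []).append(m.start())
    return index


def save(corpus, index, text_path, index_path):
    # newline="" turns off newline translation, which would shift offsets.
    with open(text_path, "w", encoding="utf-8", newline="") as f:
        f.write(corpus)
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load(text_path, index_path):
    with open(text_path, "r", encoding="utf-8", newline="") as f:
        corpus = f.read()
    with open(index_path, "r", encoding="utf-8") as f:
        index = json.load(f)
    return corpus, index
```

After `load`, every stored offset still points at the same word, so
`build_index` need not be run again. An in-memory mark object offers no
such guarantee unless it can itself be written out and restored.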

>(2) Could you live with linear complexity to extract classes of substrings?

It would be a serious problem.  "Linear" becomes really onerous
when talking about long strings - one of the reasons I implemented
ropes for string representation.
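
For readers unfamiliar with ropes, here is a toy sketch of the idea. It
is not the implementation mentioned above, and the class and function
names are invented for illustration. Each interior node records the
length of its left subtree, so finding a position means walking down
the tree rather than scanning the text.

```python
class Leaf:
    def __init__(self, text):
        self.text = text
        self.length = len(text)


class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.weight = left.length            # characters in the left subtree
        self.length = left.length + right.length


def build(chunks):
    """Build a balanced rope from a non-empty list of string chunks."""
    if len(chunks) == 1:
        return Leaf(chunks[0])
    mid = len(chunks) // 2
    return Node(build(chunks[:mid]), build(chunks[mid:]))


def char_at(rope, i):
    """Return the character at offset i by descending from the root."""
    if not 0 <= i < rope.length:
        raise IndexError(i)
    while isinstance(rope, Node):
        if i < rope.weight:
            rope = rope.left
        else:
            i -= rope.weight
            rope = rope.right
    return rope.text[i]
```

With a balanced tree the number of nodes visited grows with the depth
of the tree, not with the length of the string.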

				Bear