This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
On Thu, 28 Jul 2005, Alan Watson wrote:

> So, two questions:
>
> (1) Are your "random" accesses into your corpus linguistics strings
> really random, do they have significant locality, or could they be
> arranged to have significant locality?

Speaking for myself, I would say they are as close to random as makes no difference. I typically suck the large string into memory, pull in its indexes from another file, and then consult my indexes for members of a particular synonym group and go to fifty or five hundred locations in the string to gather details about the context in which those words were used.

Now I could sort the accesses and do them from lowest to highest offset, thus simulating locality. But, particularly with relatively rare words, the gaps between occurrences follow a Poisson distribution and are typically measured in megabytes.

The problem with doing this in terms of something other than numeric offsets isn't locality, though, not really; the problem is serialization. The corpus is a multi-megabyte object which lives on the disk. And none of the implementations of "marks" I've seen has marks that persist across different instances of the string, or that are serializable. There's a big up-front investment in reading the corpus, recognizing words, parsing sentences, and building indexes. That's work I don't want to repeat every time I pull the thing into memory, so having done that, I want to be able to write the string (and the indexes) out, read the string and indexes back in when I'm getting ready to do more work, and still have the indexes refer to the correct places in the string.

> (2) Could you live with linear complexity to extract classes of
> substrings?

It would be a serious problem. "Linear" becomes really onerous when talking about long strings - one of the reasons I implemented ropes for string representation.

Bear
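To make the serialization point concrete, here is a minimal Scheme sketch (not from the original message; the index layout, file names, and the use of a plain association list are illustrative assumptions). Numeric offsets survive a WRITE/READ round trip and still index correctly into a fresh copy of the corpus string, which is exactly what opaque, implementation-internal marks do not give you:

```scheme
;; Sketch: an index as an association list mapping each word to a
;; list of numeric character offsets into the corpus string.
;; Offsets are plain numbers, so WRITE/READ serializes them as-is.

(define (save-index index filename)
  ;; Write the index to disk in READable form.
  (call-with-output-file filename
    (lambda (port) (write index port))))

(define (load-index filename)
  ;; Read the index back; its offsets still refer to the correct
  ;; places in any re-read instance of the corpus string.
  (call-with-input-file filename read))

;; Gathering context around each occurrence is then one SUBSTRING
;; per offset -- constant time per access on a flat string.
;; SORT here is the implementation-provided list sort (e.g. SRFI 95);
;; sorting lowest-to-highest is the "simulated locality" mentioned above.
(define (contexts corpus offsets window)
  (map (lambda (i)
         (substring corpus
                    (max 0 (- i window))
                    (min (string-length corpus) (+ i window))))
       (sort offsets <)))
```

An index keyed by marks, by contrast, would have to be rebuilt from scratch on every load, repeating the word-recognition and sentence-parsing work each time.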