Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: Alan Watson <a.watson@xxxxxxxxxxxxxxxx>
Subject: Re: Surrogates and character representation
From: bear <bear@xxxxxxxxx>
Date: Thu, 28 Jul 2005 15:35:40 -0700 (PDT)
Cc: Shiro Kawai <shiro@xxxxxxxx>, tree@xxxxxxxxxxxxx, per@xxxxxxxxxxx, srfi-75@xxxxxxxxxxxxxxxxx
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
In-reply-to: <42E90FA0.6070005@xxxxxxxxxxxxxxxx>
References: <42E8546F.9000407@xxxxxxxxxxx> <17128.22540.135687.288180@xxxxxxxxxxxxxxxxxxxxxx> <Pine.LNX.4.58.0507280119280.28883@xxxxxxxxxxxxxx> <20050728.000652.1016278026.shiro@xxxxxxxx> <42E90FA0.6070005@xxxxxxxxxxxxxxxx>

On Thu, 28 Jul 2005, Alan Watson wrote:

>So, two questions:
>
>(1) Are your "random" accesses into your corpus linguistics strings
>really random, do they have significant locality, or could they be
>arranged to have significant locality?

Speaking for myself, I would say they are as close to random as
makes no difference.  I typically suck the large string into
memory, pull in its indexes from another file, and then consult
my indexes for members of a particular synonym group and go to
fifty or five hundred locations in the string to gather details
about the context in which those words were used.
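
The access pattern described above can be sketched roughly as follows. This is only an illustration, not the code from the message: the function name `gather_contexts`, the `width` parameter, and the tiny corpus and index are all hypothetical, and the index is assumed to map each word to a list of character offsets into the string.

```python
# Hypothetical sketch: names and data are illustrative, not from the original message.

def gather_contexts(corpus, index, words, width=20):
    """Return (word, offset, context) for every indexed occurrence of `words`.

    `index` is assumed to map a word to the character offsets at which it
    starts in `corpus`. Each lookup jumps straight to an offset, so the
    accesses land wherever the word happens to occur.
    """
    results = []
    for word in words:
        for offset in index.get(word, []):
            start = max(0, offset - width)
            end = min(len(corpus), offset + len(word) + width)
            results.append((word, offset, corpus[start:end]))
    return results


# A toy corpus; a real one would be many megabytes, with the index read from a file.
corpus = "the quick fox saw a swift hare; a fast dog chased the quick hare"
index = {"quick": [4, 54], "swift": [20], "fast": [34]}

for word, offset, context in gather_contexts(corpus, index, ["quick", "swift", "fast"], width=6):
    print(f"{word!r} at {offset}: {context!r}")
```

With a synonym group of rare words, the offsets visited this way are scattered across the whole string, which is why constant-time access by numeric offset matters for this workload.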

Now I could sort the accesses and do them from lowest to highest
offset, thus simulating locality.  But, particularly with relatively
rare words, the gaps between occurrences have a Poisson random
distribution, and are typically measured in megabytes.

The problem with doing this in terms of something other than
numeric offsets isn't locality though, not really; the problem
is serialization.  The corpus is a multi-megabyte object which
lives on the disk.  And none of the implementations of "marks"
I've seen has marks that persist across different instances
of the string, or are serializable.  There's a big upfront
investment in reading the corpus, recognizing words, parsing
sentences, and building indexes.  That's work I don't want to
repeat every time I pull the thing into memory, so having
done that, I want to be able to write the string (and the
indexes) and read the string and indexes back in when I'm
getting ready to do more work, and still have the indexes refer
to the correct places in the string.
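
A minimal sketch of why plain numeric offsets serialize so easily, under stated assumptions: the file layout (UTF-8 text plus a JSON index), the function names `save` and `load`, and the sample data are all hypothetical and not taken from the message. The offsets are assumed to be character offsets, so they stay valid as long as the text is decoded to the same sequence of characters on reload.

```python
# Hypothetical sketch: the file format and names are assumptions for illustration.
import json
import os
import tempfile


def save(corpus, index, text_path, index_path):
    """Write the corpus and its offset index to two separate files."""
    # newline="" disables newline translation, which would otherwise shift offsets.
    with open(text_path, "w", encoding="utf-8", newline="") as f:
        f.write(corpus)
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load(text_path, index_path):
    """Read the corpus and index back; no re-parsing or re-indexing is needed."""
    with open(text_path, "r", encoding="utf-8", newline="") as f:
        corpus = f.read()
    with open(index_path, "r", encoding="utf-8") as f:
        index = json.load(f)
    return corpus, index


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "corpus.txt")
    index_path = os.path.join(tmp, "corpus.idx.json")
    save("a swift hare; a fast dog", {"swift": [2], "fast": [16]}, text_path, index_path)
    reloaded_corpus, reloaded_index = load(text_path, index_path)
    for word, offsets in reloaded_index.items():
        for offset in offsets:
            # Each stored offset still points at the start of its word.
            print(word, offset, reloaded_corpus[offset:offset + len(word)] == word)
```

Because the index holds nothing but integers, it survives being written out and read back by a different process, which is the property that in-memory "marks" tied to one string instance lack.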

>(2) Could you live with linear complexity to extract classes of substrings?

It would be a serious problem.  "Linear" becomes really onerous
when talking about long strings, which is one of the reasons I
implemented ropes for string representation.
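
For readers unfamiliar with ropes, here is a minimal sketch of the general idea. It is not the implementation referred to in the message; the class and function names and the `leaf_size` parameter are hypothetical. A rope stores the string as a tree of short pieces, with each node recording the total length beneath it, so reaching a given offset costs a walk down the tree rather than a scan from the start of the string.

```python
# Hypothetical sketch of a rope; not the implementation mentioned in the message.

class Leaf:
    def __init__(self, text):
        self.text = text
        self.length = len(text)


class Node:
    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = left.length + right.length


def build(text, leaf_size=8):
    """Build a balanced rope by recursively splitting the text in half."""
    if len(text) <= leaf_size:
        return Leaf(text)
    mid = len(text) // 2
    return Node(build(text[:mid], leaf_size), build(text[mid:], leaf_size))


def char_at(rope, i):
    """Walk from the root to the leaf holding index i; cost is the tree depth."""
    if not 0 <= i < rope.length:
        raise IndexError(i)
    while isinstance(rope, Node):
        if i < rope.left.length:
            rope = rope.left
        else:
            i -= rope.left.length
            rope = rope.right
    return rope.text[i]


def substring(rope, start, end):
    """Return the text in [start, end), visiting only subtrees that overlap it."""
    parts = []

    def walk(node, lo, hi):
        if lo >= hi:
            return  # this subtree does not overlap the requested range
        if isinstance(node, Leaf):
            parts.append(node.text[lo:hi])
            return
        n = node.left.length
        walk(node.left, lo, min(hi, n))
        walk(node.right, max(lo - n, 0), hi - n)

    walk(rope, max(start, 0), min(end, rope.length))
    return "".join(parts)


text = "the quick brown fox jumps over the lazy dog"
rope = build(text, leaf_size=4)
print(char_at(rope, 16), substring(rope, 4, 19))
```

In a balanced rope like this one, extracting a substring costs roughly the tree depth plus the length of the extracted text, independent of how far into the string the substring lies.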

				Bear
