[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: freshman-level Boyer-Moore fast string search

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: srfi-75@xxxxxxxxxxxxxxxxx
Subject: Re: freshman-level Boyer-Moore fast string search
From: William D Clinger <will@xxxxxxxxxxx>
Date: Mon, 01 Aug 2005 16:32:00 +0200
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
User-agent: Gnus/5.110003 (No Gnus v0.3) XEmacs/21.5-b21 (berkeley-unix)

Dr Alan Watson quoting me:
> > Suppose each string s is represented by a vector of 2^21 elements,
> > where element i consists of a list of numbers, in IEEE double
> > precision, that represent the indexes within s at which the
> > character c appears, where c is the Unicode scalar value f(i),
> > where f is represented by a global association list that maps
> > scalar values to indexes (i.e. f-inverse).  SRFI-75 allows that
> > representation, yet penalizes it.  I also consider it to be a
> > poor representation.
> 
> You consider this a poor representation because it does not
> allow constant-time random access or because it does not allow
> linear-time traversal? Or simply because of the space cost?

Yes.  And don't forget the implementation complexity, cost of
mutation, potential for race conditions in multithreaded systems,
and incompatibility with Java, C#, and C++ cultural expectations.

> Consider an implementation that internally represents strings
> as if they were doubly-linked lists and cached the position of
> the last index. This allows linear-time traversal but not
> constant-time random access. Would you consider this a poor
> implementation?

For some purposes, perhaps, but not for others.  IMO it is far
more reasonable than the representation I described above.

BTW, I once used a doubly-linked-list-of-lines representation to
represent the text of editor buffers in an Emacs-like editor I
wrote in Scheme.  Some of the editor's commands treated the text
as a single long linear array of characters; other commands
treated the text as an array of strings, with double indexing
(line+column) to obtain a character.  Aided by a couple of cached
index translations, this representation ran acceptably fast on an
8 MHz 68000 with 1 Mby of RAM, and ran very nicely on the faster
and larger Macintoshes that came later.  It was never released as
a product, but I used it at home for more than a decade.

That experience is one of the reasons I disregard some of the
criticisms I have read here concerning SRFI-75's alleged bias
against interesting representations.

I have expressed my opinion that implementations should provide
multiple representations for strings, transparently and adaptively,
using the API described in SRFI-75 and to be described in future
SRFIs.  Not every one of those representations has to do all things
well.  Indeed, the impossibility of doing all things well with a
single representation is the reason I favor a multiplicity.

Will

Prev by Date: Normalization vs. grapheme clusters
Next by Date: Line-feed required for a line termination?
Previous by thread: Re: freshman-level Boyer-Moore fast string search
Next by thread: Normalization vs. grapheme clusters
Index(es):
- Date
- Thread