[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: freshman-level Boyer-Moore fast string search

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

Alex Shinn wrote:
 > Well, UTF-8 is in theory 1-4 bytes, and in practice you almost never
 > see 4-byte codepoints, and 1 byte codepoints are extremely common.

Sorry, I was thinking UTF-8 is 1-6 bytes, but that's out of date.

 > Is the three-stage assumption for binary trees?

No.  Stage 1: If the Unicode scalar value is less than 256, then
use it as an index into an array.  Stage 2: If it's greater than 255,
then look it up in a hash table or balanced search tree.  (A third
stage isn't necessary, but I thought it might suggest to you how
dividing a 21-bit scalar value into three 7-bit chunks can simulate,
with no real loss of efficiency apart from memory considerations,
the UTF-8 encoding that you thought was faster.)

 > You can't just dismiss a lg(m) factor.

I'm not.  I put that factor into my description of the FLBM algorithm
just to show you how easy it is to do better than O(m+Sigma) time and
space during the preprocessing phase.  Please remember that the whole
point of that algorithm is to make the basic ideas easy enough for
freshmen to understand.  In a real implementation of the real
Boyer-Moore algorithm, I would use something at least as fast as the
two-stage lookup described above.

 > Out of curiosity, what string representations does SRFI-75 penalize
 > which you consider to be poor?

Suppose each string s is represented by a vector of 2^21 elements,
where element i consists of a list of numbers, in IEEE double
precision, that represent the indexes within s at which the
character c appears, where c is the Unicode scalar value f(i),
where f is represented by a global association list that maps
scalar values to indexes (i.e. f-inverse).  SRFI-75 allows that
representation, yet penalizes it.  I also consider it to be a
poor representation.