[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: freshman-level Boyer-Moore fast string search

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



William D Clinger wrote:
Alex Shinn wrote:
 > Thus Bothner's statement still stands without a counter-example.

What you mean, of course, is that Bothner's statement still stands
without a counterexample you are willing to admit.

To deny that the FLBM is a counterexample, you must deny that the
FLBM is a useful algorithm.  To deny that the real Boyer-Moore
algorithm is a counterexample, you must deny that it is a useful
algorithm.  You can't get out of that by insisting that the
Boyer-Moore algorithm is useful only when the strings being
searched are represented in UTF-8, as silly as that claim would
be, because it is easy to construct examples for which the
algorithm must access all of the bytes that encode every
character that it examines in the UTF-8 strings being searched.

In case you have forgotten Bothner's statement, here it is:
"Random accesses to a position in a string that has not been
previously accessed is not in itself useful."

I might quibble that BM isn't really random access:  It's really
sequential scanning, but sometimes you're able to skip ahead more
than a single character at a time.  Also, note that character
indexes aren't inputs to the algorthm, - i.e. they aren't semenatically
meaningful.  The discussion is whether the performance of one
particular algorithm (and its variations) is better when using a
UTF-32 representation or a UTF-8 representation.  It's similar to
asking what kind of indexes would work best, which I'm also not
in a position to judge.

I won't deny that BM is useful, but I suspect it is *less* useful
now that people search XML files and other complex text data structures
with embedded controls and tree structure.  Or they use special
indexes.

Also, remember that with today's hardware what matters performace-wise
are cache misses, and UTF-8 uses less memory that UTF-32.  Any
theoretical performance advantage of UTF-32 for BM has to make up
for that fact.  Even if UTF-8 has to access (say) twice as many bytes,
that doesn't matter if it ends up accessing fewer cache "words'.
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/