
Re: freshman-level Boyer-Moore fast string search


In my previous message concerning the Freshman-Level Boyer-Moore
algorithm, I wrote:

 > In the worst case, the FLBM algorithm requires O(mn lg m) time.
 > In the best case, however, the FLBM algorithm runs in O(n/m lg m)
 > time.  This is also the usual case in practice.

That last sentence isn't true.  It's true for the real Boyer-Moore
algorithm, but not for the FLBM.

Alex Shinn wrote:
 > What you describe is not the Boyer-Moore algorithm.  It is a
 > (freshman level variant of) a variant of the Boyer-Moore algorithm

My message was perfectly clear about that.

 > which trades a time and space cost of Sigma in the preprocessing
 > step for a constant factor of lg(m) in both steps.

Why do you continue to speak as though the Boyer-Moore algorithm's
preprocessing must be done so stupidly?  This isn't rocket science.
My description of the FLBM algorithm illustrates the basic idea of
a less stupid representation, and you can find it worked out in
more detail at http://www.codeproject.com/csharp/bmsearch.asp
for the real Boyer-Moore algorithm.
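To make the "less stupid representation" concrete, here is a minimal sketch (my own illustration, not the code at the linked page): store last occurrences only for characters that actually appear in the pattern, so the bad-character preprocessing costs O(m) time and space regardless of the alphabet size sigma. With a hash table the lookups are expected O(1); with a balanced tree keyed on characters they are O(lg m), which is where the FLBM's extra lg m factor comes from.

```python
def bad_character_table(pattern):
    """Build a sparse bad-character table: map each character of the
    pattern to the index of its last occurrence.  O(m) time and space;
    no term proportional to the alphabet size ever appears."""
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = i
    return table

def bad_character_shift(table, mismatched_char, j):
    """Shift suggested by the bad-character rule after a mismatch at
    pattern index j.  Characters absent from the pattern fall back to
    -1, i.e. the window skips past the mismatched character entirely."""
    return max(1, j - table.get(mismatched_char, -1))
```

A dense array indexed by code point would need O(sigma) initialization; the dictionary above touches only the m characters of the pattern, which is the whole point of the dispute over the O(m+sigma) claim.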

If, as you say, the real Boyer-Moore algorithm can be made to run
just as fast with UTF-8, then I was wrong to give it as an example
of an algorithm that performs better with constant-time random
access to characters than with constant-time random access to the
bytes of a UTF-8 encoding.  In that case, please accept the
Freshman-Level Boyer-Moore algorithm as my example instead.

 > However, the point of the discussion is that people are objecting to
 > explicitly forcing strings to be character arrays.

That may be *your* point, but it isn't *my* point.  So far as I can
tell, absolutely no one has proposed, whether explicitly or implicitly,
that strings be forced to be character arrays.  In particular, SRFI-75
does not propose that.

I got into this because Per Bothner stated that "Random accesses to a
position in a string that has not been previously accessed is not in
itself useful."  I cited the Boyer-Moore string search to show that
Bothner's statement is untrue.
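The random-access point can be seen directly in a bad-character-only member of the Boyer-Moore family (a Horspool-style sketch of my own, not anyone's proposed implementation): the probe text[i + j] reads a position the scan has never visited, and in favorable cases whole stretches of the text are skipped without ever being read at all.

```python
def bm_search(text, pattern):
    """Bad-character-only Boyer-Moore search.  Returns the index of the
    first occurrence of pattern in text, or -1.  The read text[i + j]
    below is a random access to a position not previously touched."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Sparse last-occurrence table: O(m), independent of alphabet size.
    last = {ch: i for i, ch in enumerate(pattern)}
    i = 0
    while i <= n - m:
        j = m - 1
        # Compare right to left; on a lucky mismatch at j == m - 1,
        # the window jumps up to m positions in one step.
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return i
        i += max(1, j - last.get(text[i + j], -1))
    return -1
```

Under a variable-width encoding with no position index, that jump degenerates into a sequential walk over the skipped bytes, which is exactly why constant-time random access is not "not in itself useful."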

My dispute with you arose over your claim that "the true cost of BM
includes a preprocessing step of O(m+sigma) in both time and space,
where m is the pattern length and sigma is the size of the alphabet",
and that this implies that "UTF-8 wins hands down for string
searching."  Your first claim is utter nonsense, leaving the
second without support.

 > If we have string-pointers in Scheme then the above conversion to
 > index need not even be performed.  Why penalize the fastest
 > implementation?

In general, there is no fastest implementation.  Implementations
are not totally ordered by speed or by asymptotic time complexity.
The fastest implementation for one problem is unlikely to be the
fastest for all.

Even if there were a fastest implementation, I am unaware of any
proposal to penalize it.  I must admit, however, that I am in the
habit of ignoring proposals that I cannot take seriously.  If you
think there is a serious proposal on the table that would penalize
what you regard as the fastest implementation of something, then I
suggest you identify the proposal, the fastest implementation, and
the something.