[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: freshman-level Boyer-Moore fast string search

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



On 7/29/05, William D Clinger <cesura@xxxxxxxxxxx> wrote:
> Alex Shinn wrote:
>  > What is being done stupidly?  Any freshman textbook will include Sigma
>  > in the analysis of BM (at least mine, "Algorithms" by CL&R does, and
>  > all of the web analysis I Googled for did as well).
> 
> You were arguing that UTF-8 "wins hands down for string searching".

I retract that, it was too strong a statement.  I would still personally
prefer UTF-8 for string searching any day, but acknowledge that this
is debatable.

> Had it been clear that you meant to say nothing more than that
> UTF-8 wins hands down, provided the implementor doesn't know any
> better than to use the naive representations that are often found
> in freshman textbooks, then I would not have responded to you.

I never suggested there weren't alternate algorithms.  But when you
make a change to an algorithm significant enough to alter the
asymptotic time, it is typical to refer to that as a variant of the
algorithm.  Regardless, I did acknowledge the variants, and I quoted
your own algorithms and your own running time analysis accurately,
and don't believe I deserve this hostility.

>  > Ah, so this uses a hash table instead of a binary tree.  For
>  > asymptotic notation this will remove the lg(m) factor, though
>  > it does incur some not insignificant constant overhead.
> 
> I will assume you are smart enough to realize that a potentially
> three-stage lookup based on Unicode scalar values, which usually
> terminates in the first stage of lookup, is essentially similar
> to the potentially six-step lookup required for UTF-8, which
> usually involves only one step.

Well, UTF-8 is in theory 1-4 bytes, and in practice you almost never
see 4-byte codepoints, and 1 byte codepoints are extremely common.

Is the three-stage assumption for binary trees?  That would suggest
the search strings are never longer than 8 characters.  In practice I
frequently search for much longer strings - "call-with-current-continuation"
at 30 characters (5 stage lookup) is fairly common, and mult-line
searches are not unheard of.  You can't just dismiss a lg(m) factor.

If you're interested, I'll gladly benchmark a UTF-8 string search
against any UTF-32 string search you want to provide.

> > It seems you are suggesting non-fixed-length arrays as being
> > poor choices of representation.
>
> It seems you are not reading very carefully.

Out of curiosity, what string representations does SRFI-75 penalize
which you consider to be poor?

>  > Thus Bothner's statement still stands without a counter-example.
> 
> What you mean, of course, is that Bothner's statement still stands
> without a counterexample you are willing to admit.

[...]

> In case you have forgotten Bothner's statement, here it is:
> "Random accesses to a position in a string that has not been
> previously accessed is not in itself useful."

You are correct, and I was mistaken, I was mentally translating this
to "Random accesses to a position in a string that has not been
previously accessed is not *necessary*," and was arguing strongly
on this point because of the previous draft of the SRFI which restricted
strings to character arrays.

"Usefulness" is fairly meaningless to argue about - people can find
a use for just about anything.  I will stick to the claim that random
access is not necessary, for performance or other reasons.

-- 
Alex