[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: freshman-level Boyer-Moore fast string search
On 7/29/05, William D Clinger <cesura@xxxxxxxxxxx> wrote:
> Alex Shinn wrote:
> > What is being done stupidly? Any freshman textbook will include Sigma
> > in the analysis of BM (at least mine, "Algorithms" by CL&R does, and
> > all of the web analysis I Googled for did as well).
> You were arguing that UTF-8 "wins hands down for string searching".
I retract that, it was too strong a statement. I would still personally
prefer UTF-8 for string searching any day, but acknowledge that this
> Had it been clear that you meant to say nothing more than that
> UTF-8 wins hands down, provided the implementor doesn't know any
> better than to use the naive representations that are often found
> in freshman textbooks, then I would not have responded to you.
I never suggested there weren't alternate algorithms. But when you
make a change to an algorithm significant enough to alter the
asymptotic time, it is typical to refer to that as a variant of the
algorithm. Regardless, I did acknowledge the variants, and I quoted
your own algorithms and your own running time analysis accurately,
and don't believe I deserve this hostility.
> > Ah, so this uses a hash table instead of a binary tree. For
> > asymptotic notation this will remove the lg(m) factor, though
> > it does incur some not insignificant constant overhead.
> I will assume you are smart enough to realize that a potentially
> three-stage lookup based on Unicode scalar values, which usually
> terminates in the first stage of lookup, is essentially similar
> to the potentially six-step lookup required for UTF-8, which
> usually involves only one step.
Well, UTF-8 is in theory 1-4 bytes, and in practice you almost never
see 4-byte codepoints, and 1 byte codepoints are extremely common.
Is the three-stage assumption for binary trees? That would suggest
the search strings are never longer than 8 characters. In practice I
frequently search for much longer strings - "call-with-current-continuation"
at 30 characters (5 stage lookup) is fairly common, and mult-line
searches are not unheard of. You can't just dismiss a lg(m) factor.
If you're interested, I'll gladly benchmark a UTF-8 string search
against any UTF-32 string search you want to provide.
> > It seems you are suggesting non-fixed-length arrays as being
> > poor choices of representation.
> It seems you are not reading very carefully.
Out of curiosity, what string representations does SRFI-75 penalize
which you consider to be poor?
> > Thus Bothner's statement still stands without a counter-example.
> What you mean, of course, is that Bothner's statement still stands
> without a counterexample you are willing to admit.
> In case you have forgotten Bothner's statement, here it is:
> "Random accesses to a position in a string that has not been
> previously accessed is not in itself useful."
You are correct, and I was mistaken, I was mentally translating this
to "Random accesses to a position in a string that has not been
previously accessed is not *necessary*," and was arguing strongly
on this point because of the previous draft of the SRFI which restricted
strings to character arrays.
"Usefulness" is fairly meaningless to argue about - people can find
a use for just about anything. I will stick to the claim that random
access is not necessary, for performance or other reasons.