[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: freshman-level Boyer-Moore fast string search

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



William D Clinger wrote:
With random access to the characters of s, the operation
described by the previous paragraph runs in O(lg m) time.
With random access to the bytes of s represented in UTF-8,
however, the previous paragraph takes O(m) time.  The problem
with UTF-8 is that we can't access s[i+m-1] without looking
at all of the bytes that encode the characters that precede
it.  We can look at the byte with index i+m-1, and we might
even be able to calculate the character whose UTF-8 encoding
includes the byte at byte index i+m-1 (I don't know Unicode
well enough to know whether this is possible), but knowing
that character doesn't help.  For the FLBM algorithm to work,
we have to know the *character* with index i+m-1.

But a FLBM using UTF-8 strings would naturally index and compare
*bytes*, not characters.  I.e. the obvious algorithm would
let m be the number of *bytes* in the UTF-8 encoding of s0;
n the total number of bytes in the strings to be searched,
and the i in the results <s,i> would be byte offsets.

The pre-processing would be to generate a table, such that:
For each byte that occurs within s0, the table contains the index of the last occurrence of that byte within s0.

So the question is whether this is still true: "In the best case,
however, the FLBM algorithm runs in O(n/m lg m) time. This is also the usual case in practice." I suspect so. Of course n and m are larger,
since they now measure bytes rather than characters, but these are
constant factors that will be dwarfed by other issues (such as cache
misses).
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/