[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: freshman-level Boyer-Moore fast string search



William D Clinger wrote:
With random access to the characters of s, the operation
described by the previous paragraph runs in O(lg m) time.
With random access to the bytes of s represented in UTF-8,
however, the previous paragraph takes O(m) time.  The problem
with UTF-8 is that we can't access s[i+m-1] without looking
at all of the bytes that encode the characters that precede
it.  We can look at the byte with index i+m-1, and we might
even be able to calculate the character whose UTF-8 encoding
includes the byte at byte index i+m-1 (I don't know Unicode
well enough to know whether this is possible), but knowing
that character doesn't help.  For the FLBM algorithm to work,
we have to know the *character* with index i+m-1.

But a FLBM using UTF-8 strings would naturally index and compare
*bytes*, not characters.  I.e. the obvious algorithm would
let m be the number of *bytes* in the UTF-8 encoding of s0;
n the total number of bytes in the strings to be searched,
and the i in the results <s,i> would be byte offsets.

The pre-processing would be to generate a table, such that:
For each byte that occurs within s0, the table contains the index of the last occurrence of that byte within s0.

So the question is whether this is still true: "In the best case,
however, the FLBM algorithm runs in O(n/m lg m) time. This is also the usual case in practice." I suspect so. Of course n and m are larger,
since they now measure bytes rather than characters, but these are
constant factors that will be dwarfed by other issues (such as cache
misses).
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/