[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: freshman-level Boyer-Moore fast string search

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

On 7/29/05, William D Clinger <cesura@xxxxxxxxxxx> wrote:
> The purpose of this note is to show how random access
> to the characters of a string improves the efficiency
> of string searching beyond what is possible with random
> access to the bytes of a UTF-8 string.
> This can be demonstrated using the Boyer-Moore algorithm
> for fast string searching

What you describe is not the Boyer-Moore algorithm.  It is a
(freshman level variant of) a variant of the Boyer-Moore algorithm
which trades a time and space cost of Sigma in the preprocessing
step for a constant factor of lg(m) in both steps.  This is probably
a worthwhile trade-off when Sigma is the full size of Unicode, but
there are other optimizations worth considering.

You say above you'll give us random access to the bytes of the UTF-8
string, and I want to use that.  I want to use something like strstr(3),
which works on bytes and returns a pointer.  This could use the real
BM, without the lg(m) factor.  Thus our costs become:

UTF-32 BM:
    preprocess: O(m + Sigma) time and space, Sigma huge
    worst case:  O(mn) time
    best case:    O(m/n) time

UTF-32 BM with binary tree offset lookup:
    preprocess: O(m lg(m)) time + O(m) space
    worst case:  O(mn lg(m)) time
    best case:    O(m/n lg(m)) time

UTF-8 with byte-oriented BM:
    preprocess: O(m + Sigma) time and space, Sigma small
    worst case:  O(mn) time
    best case:    O(m/n) time

Now, if you really wanted to convert the UTF-8 pointer to a codepoint
index, that could be done with a post-processing step of O(n).  This of
course only needs to be done when the string is found, so it will factor
into the average case, but not the best case.

However, the point of the discussion is that people are objecting to
explicitly forcing strings to be character arrays.  If we have string-pointers
in Scheme then the above conversion to index need not even be
performed.  Why penalize the fastest implementation?