This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
I found FLBM buried in my repository and thought it might be illustrative to describe it in terms of current Schemes, since I wasn't doing a good job of explaining previously. It's written in R5RS + STRING-FOLD. (define (simple-boyer-mooyer-search pat s) (define (max-char s) (string-fold (lambda (c hi) (max (char->integer c) hi)) 0 pat)) (let* ((pat-len (string-length pat)) (str-len (string-length s)) (high (max (max-char pat) (max-char s))) (shift (make-vector (+ high 1) pat-len))) (do ((i 0 (+ i 1))) ; fill shift table ((= i pat-len)) (vector-set! shift (char->integer (string-ref pat i)) (- pat-len i 1))) (let check-from-index ((i (- pat-len 1))) (if (>= i str-len) ; failure #f (let scan-back ((j (- pat-len 1)) (k i)) (cond ((negative? j) ; success (+ k 1)) ((eqv? (string-ref pat j) (string-ref s k)) (scan-back (- j 1) (- k 1))) (else ; jump to next potential index (check-from-index (+ i (vector-ref shift (char->integer (string-ref s i)))))))))))) This works with whatever string representation you choose. Now, if you use this in the latest MzScheme which uses UCS4 internally, it will correctly work with the vanilla BM running times, including a potentially huge sigma (it makes some effort to reduce this and can be further improved). If you use this with Gauche Scheme, which can use a number of variable width encodings internally such as UTF-8, it will perform absolutely horribly, with the runtimes William Clinger gave earlier for the naive UTF-8 search. If you use this with MIT-Scheme it will work on MIT-Scheme strings, which use Latin-1 internally. It will have the vanilla BM running times with a small sigma of 256. However, if you treat your strings as though they were UTF-8 encoded, the algorithm will continue to work properly with the same running times (modified by the fact that n and m are now byte length, not codepoint lengths). This is because if you find a UTF-8 string within a UTF-8 string, it's guaranteed to be on a correct character boundary and therefore a valid match. You do have to keep in mind the returned result is a byte offset, not a codepoint offset. -- Alex