[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation

I found FLBM buried in my repository and thought it might
be illustrative to describe it in terms of current Schemes,
since I wasn't doing a good job of explaining previously.
It's written in R5RS + STRING-FOLD.

(define (simple-boyer-mooyer-search pat s)
  (define (max-char s)
    (string-fold (lambda (c hi) (max (char->integer c) hi)) 0 pat))
  (let* ((pat-len (string-length pat))
         (str-len (string-length s))
         (high (max (max-char pat) (max-char s)))
         (shift (make-vector (+ high 1) pat-len)))
    (do ((i 0 (+ i 1)))  ; fill shift table
        ((= i pat-len))
      (vector-set! shift (char->integer (string-ref pat i)) (- pat-len i 1)))
    (let check-from-index ((i (- pat-len 1)))
      (if (>= i str-len) ; failure
        (let scan-back ((j (- pat-len 1)) (k i))
            ((negative? j)  ; success
             (+ k 1))
            ((eqv? (string-ref pat j) (string-ref s k))
             (scan-back (- j 1) (- k 1)))
            (else           ; jump to next potential index
              (+ i (vector-ref shift (char->integer (string-ref s i))))))))))))

This works with whatever string representation you choose.

Now, if you use this in the latest MzScheme which uses UCS4
internally, it will correctly work with the vanilla BM running times,
including a potentially huge sigma (it makes some effort to reduce
this and can be further improved).

If you use this with Gauche Scheme, which can use a number of
variable width encodings internally such as UTF-8, it will perform
absolutely horribly, with the runtimes William Clinger gave earlier
for the naive UTF-8 search.

If you use this with MIT-Scheme it will work on MIT-Scheme strings,
which use Latin-1 internally.  It will have the vanilla BM running
times with a small sigma of 256.  However, if you treat your
strings as though they were UTF-8 encoded, the algorithm will
continue to work properly with the same running times (modified
by the fact that n and m are now byte length, not codepoint lengths).
This is because if you find a UTF-8 string within a UTF-8 string,
it's guaranteed to be on a correct character boundary and therefore
a valid match.  You do have to keep in mind the returned result is
a byte offset, not a codepoint offset.