[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 are here. Eventually, the entire history will be moved there, including any new messages.

William D Clinger wrote:
Alex Shinn quoting me:
 > > I acknowledged that the algorithm will still work.  My point
 > > is that its asymptotic complexity may be degraded.
 > I'm sorry, perhaps I'm just misunderstanding, but if the exact
 > same algorithm, in fact the exact same code, can be used,
 > how is the asymptotic complexity affected?

The different representations for the string being searched
(UTF-8 vs UTF-32) change the problem.  The main difference
is that UTF-32 admits random access of characters, while
UTF-8 does not.  UTF-8 admits random access of bytes, but
there is no way to convert a byte value obtained by random
access into a UTF-8 string into the character offset on
which the Boyer-Moore algorithm depends,

I don't claim to understand Boyer-Moore, beyond what I gather
is the key insight: if you're searching for "ABC" and character
N is "D" there is no point in checking characters N+1 or N+2.
(Hence the non-obvious part of building the appropriate delta
tables before you start.)

But I'm puzzled, since I think Alex's point is this:  A Boyer-Moore
implementation that works on 8-bits characters (e.g. Latin-1) will
work unchanged on UTF-8 characters.  Naively, one would think it
would have the same performance characteristics.  I guess that
statistics of multi-byte characters might throw off the "delta"
tables so they delta will tend to be smaller.  (However, don't
go into details for my sake: I suspect t would take me too much
effort to delve into it.)

As Per Bothner agreed, that can degrade the asympototic
complexity of the algorithm from O(n/m) to O(n).

I did?
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/