[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation



On 7/28/05, Alan Watson <a.watson@xxxxxxxxxxxxxxxx> wrote:
> William D Clinger wrote:
> > Per Bothner wrote:
> >  > Random accesses to a position in a string that has not
> >  > been previously accessed is not in itself useful.
> >
> > Untrue.  The Boyer-Moore algorithm for fast string
> > searching uses random accesses to positions that have
> > not been previously accessed [1].
> 
> Yes, but I think you can implement this for UTF-8 or UTF-16 strings
> using offsets to the underlying bytes or shorts. I don't think that you
> need character offsets.

True.  In fact any existing Boyer-Moore search that works at
the 8-bit byte level will work unmodified on UTF-8 strings.

Furthermore, the true cost of BM includes a preprocessing step
of O(m+sigma) in both time and space, where m is the pattern
length and sigma is the size of the alphabet.  For the byte-level
UTF-8 search, sigma is 256, but for UTF-32 sigma is 2^21 = 2097152.
For patterns that are going to be repeatedly searched, it is reasonable
to keep the table for UTF-8 in memory, but the table for UTF-32,
if using a SRFI-4 u32vector for offsets, would incur an extra 8MB of
memory per pattern.  This isn't reasonable - in practice you'll want
to perform the preprocessing step every time for UTF-32.  UTF-8 wins
hands down for string searching.

-- 
Alex