[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



On 7/28/05, Alan Watson <a.watson@xxxxxxxxxxxxxxxx> wrote:
> William D Clinger wrote:
> > Per Bothner wrote:
> >  > Random accesses to a position in a string that has not
> >  > been previously accessed is not in itself useful.
> >
> > Untrue.  The Boyer-Moore algorithm for fast string
> > searching uses random accesses to positions that have
> > not been previously accessed [1].
> 
> Yes, but I think you can implement this for UTF-8 or UTF-16 strings
> using offsets to the underlying bytes or shorts. I don't think that you
> need character offsets.

True.  In fact any existing Boyer-Moore search that works at
the 8-bit byte level will work unmodified on UTF-8 strings.

Furthermore, the true cost of BM includes a preprocessing step
of O(m+sigma) in both time and space, where m is the pattern
length and sigma is the size of the alphabet.  For the byte-level
UTF-8 search, sigma is 256, but for UTF-32 sigma is 2^21 = 2097152.
For patterns that are going to be repeatedly searched, it is reasonable
to keep the table for UTF-8 in memory, but the table for UTF-32,
if using a SRFI-4 u32vector for offsets, would incur an extra 8MB of
memory per pattern.  This isn't reasonable - in practice you'll want
to perform the preprocessing step every time for UTF-32.  UTF-8 wins
hands down for string searching.

-- 
Alex