[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



On 7/28/05, William D Clinger <cesura@xxxxxxxxxxx> wrote:
> 
> You certainly don't need character offsets to do a string
> search, but the naive algorithm without random access to
> characters is O(mn).  The Boyer-Moore algorithm improves
> this to O(n/m) in many cases.  I believe one can construct
> artificial examples to show that some O(n/m) cases would
> degrade to an intermediate complexity, or even back to O(mn),
> in UTF-8 or UTF-16 without character offsets.

This is not correct.  Any search algorithm that works on bytes
will work on on UTF-8 strings.  That is, given a C function that
searches for a char* within a char* (e.g. strstr) then that will
return the correct result if the arguments are UTF-8 encoded,
no matter what algorithm is used.

It is in fact UTF-32 that has additional overhead for Boyer-Moore
searches as mentioned in my previous mail.

-- 
Alex