[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation



On 7/28/05, William D Clinger <cesura@xxxxxxxxxxx> wrote:
> 
> You certainly don't need character offsets to do a string
> search, but the naive algorithm without random access to
> characters is O(mn).  The Boyer-Moore algorithm improves
> this to O(n/m) in many cases.  I believe one can construct
> artificial examples to show that some O(n/m) cases would
> degrade to an intermediate complexity, or even back to O(mn),
> in UTF-8 or UTF-16 without character offsets.

This is not correct.  Any search algorithm that works on bytes
will work on on UTF-8 strings.  That is, given a C function that
searches for a char* within a char* (e.g. strstr) then that will
return the correct result if the arguments are UTF-8 encoded,
no matter what algorithm is used.

It is in fact UTF-32 that has additional overhead for Boyer-Moore
searches as mentioned in my previous mail.

-- 
Alex