[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: William D Clinger <cesura@xxxxxxxxxxx>
Subject: Re: Surrogates and character representation
From: Per Bothner <per@xxxxxxxxxxx>
Date: Thu, 28 Jul 2005 22:56:39 -0700
Cc: srfi-75@xxxxxxxxxxxxxxxxx
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
In-reply-to: <y9lek9i2n8j.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
References: <y9lek9i2n8j.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
User-agent: Mozilla Thunderbird 1.0.6-1.1.fc4 (X11/20050720)

William D Clinger wrote:

Alex Shinn quoting me:
 > > I acknowledged that the algorithm will still work.  My point
 > > is that its asymptotic complexity may be degraded.
 >
 > I'm sorry, perhaps I'm just misunderstanding, but if the exact
 > same algorithm, in fact the exact same code, can be used,
 > how is the asymptotic complexity affected?

The different representations for the string being searched
(UTF-8 vs UTF-32) change the problem.  The main difference
is that UTF-32 admits random access of characters, while
UTF-8 does not.  UTF-8 admits random access of bytes, but
there is no way to convert a byte value obtained by random
access into a UTF-8 string into the character offset on
which the Boyer-Moore algorithm depends,


I don't claim to understand Boyer-Moore, beyond what I gather
is the key insight: if you're searching for "ABC" and character
N is "D" there is no point in checking characters N+1 or N+2.
(Hence the non-obvious part of building the appropriate delta
tables before you start.)

But I'm puzzled, since I think Alex's point is this:  A Boyer-Moore
implementation that works on 8-bits characters (e.g. Latin-1) will
work unchanged on UTF-8 characters.  Naively, one would think it
would have the same performance characteristics.  I guess that
statistics of multi-byte characters might throw off the "delta"
tables so they delta will tend to be smaller.  (However, don't
go into details for my sake: I suspect t would take me too much
effort to delve into it.)

As Per Bothner agreed, that can degrade the asympototic
complexity of the algorithm from O(n/m) to O(n).


I did?
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/

Follow-Ups:
- Re: Surrogates and character representation
  - From: William D Clinger
- Re: Surrogates and character representation
  - From: Alex Shinn

References:
- Re: Surrogates and character representation
  - From: William D Clinger

Prev by Date: Re: freshman-level Boyer-Moore fast string search
Next by Date: Re: freshman-level Boyer-Moore fast string search
Previous by thread: Re: Surrogates and character representation
Next by thread: Re: Surrogates and character representation
Index(es):
- Date
- Thread