[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation



William D Clinger wrote:
Referring to the Boyer-Moore fast string searching algorithm,
Alan Watson wrote:
 > Yes, but I think you can implement this for UTF-8 or UTF-16
 > strings using offsets to the underlying bytes or shorts.  I
 > don't think that you need character offsets.

You certainly don't need character offsets to do a string
search, but the naive algorithm without random access to
characters is O(mn).  The Boyer-Moore algorithm improves
this to O(n/m) in many cases.  I believe one can construct
artificial examples to show that some O(n/m) cases would
degrade to an intermediate complexity, or even back to O(mn),
in UTF-8 or UTF-16 without character offsets.  I don't know
how often examples of those cases would arise in practice.

n = string length
m = pattern length

I can see four cases when UTF-8 is the underlying representation:

(a) You have access to the underlying byte vector and you want a byte index. O(n/m). Life is sweet.

(b) You have access to the underlying byte vector but you want a character index. O(n/m) to find the byte index then O(n) to convert it to a character index. Life is fairly sweet.

(c) You do not have access to the underlying byte vector, there is no caching of character index to byte index conversions, and you want a character index. O(n²/m), I think, because basically each character index is an O(n) operation. You say (nm). Either way, life sucks.

(d) You do not have access to the underlying byte vector, the implementation caches the last two character index to byte index conversions, and you want a character index. O(n), I think. Life is fairly sweet.

Case (d) works out not too badly because, I think, your next character index is always just a few characters (up to m) from one of the last two character indexes. Yes? I think you could even get away with the implementation caching only the last index, provided it knows how to search backwards as well as forwards from this point (pretty easy with UTF-8).

I just think
it's a good idea to understand the efficiency issues before
we dismiss them.  [...] SRFI-75 does penalize certain poor choices of
representation, and I think that's good too.

Yes. I was simply making the point that UTF-8 is not such a losing representation as one might think initially.

I appreciate the fact that some implementations will want to
use the same representation as some other language or system,
even if that is not a particularly good representation.  From
that point of view, I think the main problem with SRFI-75 is
that it requires mutable strings, which (in the presence of
concurrency or an obsession with object identity) make it hard
to change the representation transparently---code written in
some other language or in a concurrent thread might notice the
change, even if the Scheme code in this thread doesn't.

Good point, but I think that if you impose an extra layer of indirection, you might be able to solve these problems (at least for the other language reading the Scheme string). For example, instead of having the Scheme implementation say to C "here is a pointer to a UTF-8/UTF-16/UCS-32 string that represents this Scheme string", you have it say "here is a pointer to a pointer to ...". Ditto for the length.

Regards,

Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Astronómico Nacional de México