
Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: William D Clinger <cesura@xxxxxxxxxxx>, srfi-75@xxxxxxxxxxxxxxxxx
Subject: Re: Surrogates and character representation
From: Alan Watson <a.watson@xxxxxxxxxxxxxxxx>
Date: Wed, 27 Jul 2005 17:14:35 -0500
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
In-reply-to: <42E7F7D8.2020201@xxxxxxxxxxx>
Organization: Centro de Radioastronomía y Astrofísica UNAM
References: <42E7F7D8.2020201@xxxxxxxxxxx>
User-agent: Mozilla Thunderbird 1.0 (X11/20050317)

William D Clinger wrote:

> Referring to the Boyer-Moore fast string searching algorithm,
> Alan Watson wrote:
> > Yes, but I think you can implement this for UTF-8 or UTF-16
> > strings using offsets to the underlying bytes or shorts.  I
> > don't think that you need character offsets.
>
> You certainly don't need character offsets to do a string
> search, but the naive algorithm without random access to
> characters is O(mn).  The Boyer-Moore algorithm improves
> this to O(n/m) in many cases.  I believe one can construct
> artificial examples to show that some O(n/m) cases would
> degrade to an intermediate complexity, or even back to O(mn),
> in UTF-8 or UTF-16 without character offsets.  I don't know
> how often examples of those cases would arise in practice.


n = string length
m = pattern length

I can see four cases when UTF-8 is the underlying representation:

(a) You have access to the underlying byte vector and you want a byte
index. O(n/m). Life is sweet.

(b) You have access to the underlying byte vector but you want a
character index. O(n/m) to find the byte index then O(n) to convert it
to a character index. Life is fairly sweet.

(c) You do not have access to the underlying byte vector, there is no
caching of character index to byte index conversions, and you want a
character index. O(n²/m), I think, because basically each character
index is an O(n) operation. You say O(mn). Either way, life sucks.

(d) You do not have access to the underlying byte vector, the
implementation caches the last two character index to byte index
conversions, and you want a character index. O(n), I think. Life is
fairly sweet.
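
To make cases (a) and (b) concrete, here is a minimal Python sketch. The
function names are made up for illustration, and bytes.find only stands
in for a Boyer-Moore search; it is not meant as anyone's actual
implementation. It relies on the fact that a valid UTF-8 pattern can
only match a valid UTF-8 string at a character boundary, so the search
itself never has to decode anything.

```python
def is_continuation(byte):
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def byte_to_char_index(data, byte_index):
    # Case (b): count the bytes before byte_index that start a
    # character. This is the O(n) conversion step.
    return sum(1 for b in data[:byte_index] if not is_continuation(b))


def find_char_index(text, pattern):
    # Hypothetical helper: search on the raw bytes, then convert.
    data = text.encode("utf-8")
    needle = pattern.encode("utf-8")
    byte_index = data.find(needle)  # case (a): a byte index is enough
    if byte_index < 0:
        return -1
    return byte_to_char_index(data, byte_index)
```

For example, find_char_index("Astrofísica y Radioastronomía", "Radio")
gives the same answer as the ordinary character-level search, even
though the match sits at a different byte offset because of the
two-byte "í".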

Case (d) works out not too badly because, I think, your next character
index is always just a few characters (up to m) from one of the last two
character indexes. Yes? I think you could even get away with the
implementation caching only the last index, provided it knows how to
search backwards as well as forwards from this point (pretty easy with
UTF-8).
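
A rough Python sketch of the single-cache variant, assuming the string
is held as valid UTF-8 bytes. The class name Utf8Cursor and its methods
are hypothetical, purely for illustration. It remembers the last
(character index, byte index) pair and walks forwards or backwards from
there, so a lookup costs time proportional to the distance from the
previous lookup rather than from the start of the string.

```python
class Utf8Cursor:
    """Caches the last (character index, byte index) conversion."""

    def __init__(self, data):
        self.data = data  # bytes, assumed to be valid UTF-8
        self.char_index = 0
        self.byte_index = 0

    def byte_offset(self, char_index):
        if char_index < 0:
            raise IndexError("negative character index")
        data = self.data
        ci, bi = self.char_index, self.byte_index
        # Walk forwards: step over one lead byte plus its continuations.
        while ci < char_index:
            if bi >= len(data):
                raise IndexError("character index out of range")
            bi += 1
            while bi < len(data) and (data[bi] & 0xC0) == 0x80:
                bi += 1
            ci += 1
        # Walk backwards: step back until we land on a lead byte.
        while ci > char_index:
            bi -= 1
            while (data[bi] & 0xC0) == 0x80:
                bi -= 1
            ci -= 1
        self.char_index, self.byte_index = ci, bi
        return bi

    def char_at(self, char_index):
        start = self.byte_offset(char_index)
        if start >= len(self.data):
            raise IndexError("character index out of range")
        end = start + 1
        while end < len(self.data) and (self.data[end] & 0xC0) == 0x80:
            end += 1
        return self.data[start:end].decode("utf-8")
```

The backward walk is what makes one cached position sufficient: in
UTF-8 any byte of the form 10xxxxxx is a continuation byte, so the start
of the previous character can be found without decoding from the
beginning.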

> I just think
> it's a good idea to understand the efficiency issues before
> we dismiss them.  [...] SRFI-75 does penalize certain poor choices of
> representation, and I think that's good too.

Yes. I was simply making the point that UTF-8 is not such a losing
representation as one might think initially.

> I appreciate the fact that some implementations will want to
> use the same representation as some other language or system,
> even if that is not a particularly good representation.  From
> that point of view, I think the main problem with SRFI-75 is
> that it requires mutable strings, which (in the presence of
> concurrency or an obsession with object identity) make it hard
> to change the representation transparently---code written in
> some other language or in a concurrent thread might notice the
> change, even if the Scheme code in this thread doesn't.

Good point, but I think that if you impose an extra layer of
indirection, you might be able to solve these problems (at least for the
other language reading the Scheme string). For example, instead of
having the Scheme implementation say to C "here is a pointer to a
UTF-8/UTF-16/UTF-32 string that represents this Scheme string", you have
it say "here is a pointer to a pointer to ...". Ditto for the length.


Regards,

Alan
--
Dr Alan Watson
Centro de Radioastronomía y Astrofísica
Universidad Astronómico Nacional de México
