[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation



Referring to the Boyer-Moore fast string searching algorithm,
Alan Watson wrote:
 > Yes, but I think you can implement this for UTF-8 or UTF-16
 > strings using offsets to the underlying bytes or shorts.  I
 > don't think that you need character offsets.

You certainly don't need character offsets to do a string
search, but the naive algorithm without random access to
characters is O(mn).  The Boyer-Moore algorithm improves
this to O(n/m) in many cases.  I believe one can construct
artificial examples to show that some O(n/m) cases would
degrade to an intermediate complexity, or even back to O(mn),
in UTF-8 or UTF-16 without character offsets.  I don't know
how often examples of those cases would arise in practice.

Others have named other applications of random access to
characters within a string.

I am not trying to argue that the asymptotic efficiency of
these or any other particular algorithms should be decisive,
nor am I trying to argue that a constant factor of 4 in space
efficiency should be or should not be decisive.  I just think
it's a good idea to understand the efficiency issues before
we dismiss them.

As you might divine from my comments in the SRFI-70 list,
I prefer to let implementors figure out how to represent
strings.  It isn't all that hard for an implementation to
change the representation, transparently, if it turns out
that a string is being used for random access to characters,
or that non-Latin-1 characters are being stored into a string
whose constituent characters originally lay within the Latin-1
subset.  SRFI-75 doesn't dictate any particular representation,
and that's good.  SRFI-75 does penalize certain poor choices of
representation, and I think that's good too.

I appreciate the fact that some implementations will want to
use the same representation as some other language or system,
even if that is not a particularly good representation.  From
that point of view, I think the main problem with SRFI-75 is
that it requires mutable strings, which (in the presence of
concurrency or an obsession with object identity) make it hard
to change the representation transparently---code written in
some other language or in a concurrent thread might notice the
change, even if the Scheme code in this thread doesn't.  On
the other hand, no single representation has emerged in other
languages as the obvious standard representation that we should
adopt for Scheme.  The best we can do is to design our semantics
as best we can, allow implementors to design their representations
as best they can, and let users decide which systems work best
for their purposes.

Will