[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation

 > But I'm puzzled, since I think Alex's point is this:  A Boyer-Moore
 > implementation that works on 8-bits characters (e.g. Latin-1) will
 > work unchanged on UTF-8 characters.

Alex was right about that, and I was wrong.  (I was wrong
for several reasons, which varied over time.  At one point
I thought "unchanged" meant the Scheme code would remain
unchanged while STRING-REF was changed from an O(1) to
an O(n) operation, or vice versa.  What he really meant
was that the C code would remain unchanged while the data
was changed from UCS-4 to UTF-8, or something like that.)

 > Naively, one would think it
 > would have the same performance characteristics....

It does, roughly speaking.  Alex, however, claimed that the
Boyer-Moore algorithm actually performs *better* on UTF-8,
because he thought the preprocessing phase requires time
and space proportional to the size of the alphabet.  That
was what I understood to be his main point, and it was the
reason I felt I had to respond to him.  He was wrong about

 > > As Per Bothner agreed, that can degrade the asympototic
 > > complexity of the algorithm from O(n/m) to O(n).
 > I did?

No, you didn't.  I misunderstood someone else's post; on top
of that, I saw your name in the cc: field and thought it was
the from: field.  Sorry.