[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



 > But I'm puzzled, since I think Alex's point is this:  A Boyer-Moore
 > implementation that works on 8-bits characters (e.g. Latin-1) will
 > work unchanged on UTF-8 characters.

Alex was right about that, and I was wrong.  (I was wrong
for several reasons, which varied over time.  At one point
I thought "unchanged" meant the Scheme code would remain
unchanged while STRING-REF was changed from an O(1) to
an O(n) operation, or vice versa.  What he really meant
was that the C code would remain unchanged while the data
was changed from UCS-4 to UTF-8, or something like that.)

 > Naively, one would think it
 > would have the same performance characteristics....

It does, roughly speaking.  Alex, however, claimed that the
Boyer-Moore algorithm actually performs *better* on UTF-8,
because he thought the preprocessing phase requires time
and space proportional to the size of the alphabet.  That
was what I understood to be his main point, and it was the
reason I felt I had to respond to him.  He was wrong about
that.

 > > As Per Bothner agreed, that can degrade the asympototic
 > > complexity of the algorithm from O(n/m) to O(n).
 >
 > I did?

No, you didn't.  I misunderstood someone else's post; on top
of that, I saw your name in the cc: field and thought it was
the from: field.  Sorry.

Will