Re: Surrogates and character representation
Alex Shinn quoting me:
> > I acknowledged that the algorithm will still work. My point
> > is that its asymptotic complexity may be degraded.
> I'm sorry, perhaps I'm just misunderstanding, but if the exact
> same algorithm, in fact the exact same code, can be used,
> how is the asymptotic complexity affected?
The different representations for the string being searched
(UTF-8 vs UTF-32) change the problem. The main difference
is that UTF-32 admits random access of characters, while
UTF-8 does not. UTF-8 admits random access of bytes, but
there is no way to convert the position of a randomly
accessed byte within a UTF-8 string into the character
offset on which the Boyer-Moore algorithm depends, short of
reading all the characters that precede that byte.
As Per Bothner agreed, that can degrade the asymptotic
complexity of the algorithm from O(n/m) to O(n).
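To make the byte-vs-character distinction concrete, here is a
small Python sketch (my own illustration, not code from this
thread) showing that a byte index into a UTF-8 string is reachable
in O(1), but recovering the corresponding character offset requires
scanning every preceding byte:

```python
def char_offset_of_byte(utf8_bytes: bytes, byte_index: int) -> int:
    """Map a byte position in UTF-8 data to a character offset by
    counting the non-continuation bytes (those not of the form
    0b10xxxxxx) that precede it -- an O(byte_index) scan, not O(1)."""
    return sum(1 for b in utf8_bytes[:byte_index] if b & 0xC0 != 0x80)

data = "naïve".encode("utf-8")   # 'ï' occupies two bytes
# data[4] is an O(1) byte access, but char_offset_of_byte(data, 4)
# must walk all four preceding bytes to learn the character offset.
```

It is exactly this linear scan that turns the character-offset
arithmetic Boyer-Moore relies on into an O(n) cost when the text
is kept in UTF-8.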
> Please note that I said "additional overhead." Nowhere did I say
> or suggest that UTF-32 incurred an asymptotic penalty. However,
> c1 and c2 are not just constants - they are typically larger than
> m, and therefore should not be ignored.
Your belief that c1 and c2 are larger than m depends upon
your implicit assumption that the automaton will be
represented in some absurdly naive way. With any reasonable
representation, c1 and c2 are negligibly small.
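One reasonable representation, sketched below in Python (my own
illustration of the standard bad-character table, not a claim about
any particular implementation under discussion), stores the shift
table sparsely, so preprocessing takes one O(m) pass and at most m
entries of space:

```python
def bad_character_table(pattern: str) -> dict:
    """Boyer-Moore bad-character shifts, stored sparsely in a dict.
    One pass over the pattern: O(m) time, at most m entries of
    space. Characters absent from the table implicitly shift by
    the full pattern length, so no alphabet-sized array is built."""
    m = len(pattern)
    return {c: m - 1 - i for i, c in enumerate(pattern[:-1])}

table = bad_character_table("example")
```

With a representation like this, the preprocessing constants are
proportional to m itself, not to some large multiple of it.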
The Boyer-Moore fast string search algorithm is an important
algorithm that serves as a well-known example of the issue we
are discussing, but its asymptotic complexity (both time and
space) is not trivial to analyze, and that has been an
impediment to our discussion. In another message I will
present and analyze the Freshman-Level Boyer-Moore algorithm
(FLBM), a simplified version that is easy enough to explain
in freshman-level courses, yet exhibits the asymptotic
behavior that is relevant to this discussion.
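For readers who want the flavor in advance: Horspool's well-known
simplification of Boyer-Moore (my own sketch below, not the FLBM
to be presented later) shows how character-indexed shifts let the
search skip roughly m characters per step in the good case, which
is the O(n/m) behavior at issue:

```python
def horspool_search(text: str, pattern: str) -> int:
    """Horspool's simplification of Boyer-Moore. Each failed
    alignment shifts by the bad-character distance of the text
    character under the pattern's last position, often skipping
    about m characters at a time (the O(n/m) good case). Returns
    the index of the first match, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    i = 0
    while i + m <= n:
        if text[i:i + m] == pattern:
            return i
        # Absent characters shift by the full pattern length m.
        i += shift.get(text[i + m - 1], m)
    return -1
```

The shift lookup assumes O(1) access to the character at a given
character offset, which is precisely what UTF-32 provides and
UTF-8 does not.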