This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
William D Clinger wrote:
> Alex Shinn quoting me:
>
> > > I acknowledged that the algorithm will still work. My point
> > > is that its asymptotic complexity may be degraded.
> >
> > I'm sorry, perhaps I'm just misunderstanding, but if the exact
> > same algorithm, in fact the exact same code, can be used,
> > how is the asymptotic complexity affected?
>
> The different representations for the string being searched (UTF-8 vs
> UTF-32) change the problem. The main difference is that UTF-32 admits
> random access to characters, while UTF-8 does not. UTF-8 admits random
> access to bytes, but there is no way to convert a byte value obtained by
> random access into a UTF-8 string into the character offset on which the
> Boyer-Moore algorithm depends.
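[To make the byte-vs-character distinction concrete, here is an illustrative Python sketch; it is not code from the thread. Recovering the character offset of a given byte index in a UTF-8 string requires a linear scan from the start, whereas UTF-32 allows constant-time index arithmetic. The function name and example string are inventions for illustration.]

```python
def char_offset_of_byte(utf8: bytes, byte_index: int) -> int:
    """Count the characters preceding byte_index by a linear scan.

    UTF-8 continuation bytes have the form 0b10xxxxxx, so counting
    the non-continuation bytes gives the character offset. This is an
    O(n) operation, unlike the O(1) arithmetic UTF-32 permits.
    """
    return sum(1 for b in utf8[:byte_index] if (b & 0xC0) != 0x80)

s = "naïve"                       # 5 characters
u = s.encode("utf-8")             # 6 bytes: 'ï' encodes as two bytes
print(char_offset_of_byte(u, 4))  # byte 4 ('v') is character 3
```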
I don't claim to understand Boyer-Moore beyond what I gather is the key insight: if you're searching for "ABC" and character N is "D", there is no point in checking characters N+1 or N+2. (Hence the non-obvious part: building the appropriate delta tables before you start.) But I'm puzzled, since I think Alex's point is this: a Boyer-Moore implementation that works on 8-bit characters (e.g. Latin-1) will work unchanged on UTF-8 characters. Naively, one would think it would have the same performance characteristics. I guess that the statistics of multi-byte characters might throw off the "delta" tables, so the deltas will tend to be smaller. (However, don't go into details for my sake: I suspect it would take me too much effort to delve into it.)
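[The "delta" tables under discussion can be sketched as follows. This is a minimal Boyer-Moore-Horspool variant in Python, an illustration rather than code from the thread; it uses only the bad-character shift. Run over raw bytes it works unchanged on UTF-8 input, which is the point both sides agree on; the complexity question is about how large the shifts stay.]

```python
def horspool_search(text: bytes, pat: bytes) -> int:
    """Return the byte index of the first occurrence of pat, or -1."""
    m = len(pat)
    skip = {b: m for b in range(256)}   # default shift: whole pattern
    for i, b in enumerate(pat[:-1]):
        skip[b] = m - 1 - i             # distance from the pattern's end
    i = 0
    while i + m <= len(text):
        if text[i:i + m] == pat:
            return i
        i += skip[text[i + m - 1]]      # shift by the "delta" entry
    return -1

# Works unchanged on UTF-8 bytes, but each multi-byte character puts
# several distinct byte values into the table, so shifts tend to shrink.
print(horspool_search("naïve naïveté".encode(), "naïveté".encode()))  # 7
```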
> As Per Bothner agreed, that can degrade the asymptotic complexity of the
> algorithm from O(n/m) to O(n).
I did?
--
	--Per Bothner
per@xxxxxxxxxxx   http://per.bothner.com/