This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
William D Clinger wrote:
Alex Shinn wrote: > Thus Bothner's statement still stands without a counter-example. What you mean, of course, is that Bothner's statement still stands without a counterexample you are willing to admit. To deny that the FLBM is a counterexample, you must deny that the FLBM is a useful algorithm. To deny that the real Boyer-Moore algorithm is a counterexample, you must deny that it is a useful algorithm. You can't get out of that by insisting that the Boyer-Moore algorithm is useful only when the strings being searched are represented in UTF-8, as silly as that claim would be, because it is easy to construct examples for which the algorithm must access all of the bytes that encode every character that it examines in the UTF-8 strings being searched. In case you have forgotten Bothner's statement, here it is: "Random accesses to a position in a string that has not been previously accessed is not in itself useful."
I might quibble that BM isn't really random access: It's really sequential scanning, but sometimes you're able to skip ahead more than a single character at a time. Also, note that character indexes aren't inputs to the algorthm, - i.e. they aren't semenatically meaningful. The discussion is whether the performance of one particular algorithm (and its variations) is better when using a UTF-32 representation or a UTF-8 representation. It's similar to asking what kind of indexes would work best, which I'm also not in a position to judge. I won't deny that BM is useful, but I suspect it is *less* useful now that people search XML files and other complex text data structures with embedded controls and tree structure. Or they use special indexes. Also, remember that with today's hardware what matters performace-wise are cache misses, and UTF-8 uses less memory that UTF-32. Any theoretical performance advantage of UTF-32 for BM has to make up for that fact. Even if UTF-8 has to access (say) twice as many bytes, that doesn't matter if it ends up accessing fewer cache "words'. -- --Per Bothner per@xxxxxxxxxxx http://per.bothner.com/