This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
Alex Shinn wrote: > What is being done stupidly? Any freshman textbook will include Sigma > in the analysis of BM (at least mine, "Algorithms" by CL&R does, and > all of the web analysis I Googled for did as well). You were arguing that UTF-8 "wins hands down for string searching". Had it been clear that you meant to say nothing more than that UTF-8 wins hands down, provided the implementor doesn't know any better than to use the naive representations that are often found in freshman textbooks, then I would not have responded to you. > Ah, so this uses a hash table instead of a binary tree. For > asymptotic notation this will remove the lg(m) factor, though > it does incur some not insignificant constant overhead. I will assume you are smart enough to realize that a potentially three-stage lookup based on Unicode scalar values, which usually terminates in the first stage of lookup, is essentially similar to the potentially six-step lookup required for UTF-8, which usually involves only one step. > It seems you are suggesting non-fixed-length arrays as being > poor choices of representation. It seems you are not reading very carefully. > Indeed, without some sort of > string pointer or cursor object SRFI-75 does indeed penalize > UTF-8 and ropes. I disagree. SRFI-75 is neutral with respect to UTF-8 and ropes. The scope of SRFI-75 is limited to support for Unicode in three of Scheme's existing data types; it does not add new datatypes. So far as I can tell, however, SRFI-75 is entirely compatible with "some sort of string pointer or cursor object" proposal, and I think it would be a good idea for someone to propose such things in some other SRFI. > Thus Bothner's statement still stands without a counter-example. What you mean, of course, is that Bothner's statement still stands without a counterexample you are willing to admit. To deny that the FLBM is a counterexample, you must deny that the FLBM is a useful algorithm. To deny that the real Boyer-Moore algorithm is a counterexample, you must deny that it is a useful algorithm. You can't get out of that by insisting that the Boyer-Moore algorithm is useful only when the strings being searched are represented in UTF-8, as silly as that claim would be, because it is easy to construct examples for which the algorithm must access all of the bytes that encode every character that it examines in the UTF-8 strings being searched. In case you have forgotten Bothner's statement, here it is: "Random accesses to a position in a string that has not been previously accessed is not in itself useful." Will