This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
On 7/29/05, William D Clinger <cesura@xxxxxxxxxxx> wrote: > > Alex Shinn wrote: > > What you describe is not the Boyer-Moore algorithm. It is a > > (freshman level variant of) a variant of the Boyer-Moore algorithm > > My message was perfectly clear about that. It was perfectly clear that you were describing a "freshman level variant." It was not perfectly clear that it was a variant of something that was not the true BM. > > which trades a time and space cost of Sigma in the preprocessing > > step for a constant factor of lg(m) in both steps. > > Why do you continue to speak as though the Boyer-Moore algorithm's > preprocessing must be done so stupidly? What is being done stupidly? Any freshman textbook will include Sigma in the analysis of BM (at least mine, "Algorithms" by CL&R does, and all of the web analysis I Googled for did as well). That's one option - the textbook definition of the BM algorithm. You need to modify the algorithm if you want to get rid of Sigma. You're proposal used a binary tree to construct the lookup, and your own analysis thus correctly factored in a lg(m) to the overall cost. I represented both of these approaches accurately in my summary. > My description of the FLBM algorithm illustrates the basic idea of > a less stupid representation, and you can find it worked out in > more detail at http://www.codeproject.com/csharp/bmsearch.asp > for the real Boyer-Moore algorithm. Ah, so this uses a hash table instead of a binary tree. For asymptotic notation this will remove the lg(m) factor, though it does incur some not insignificant constant overhead. > If, as you say, the real Boyer-Moore algorithm can be made to run > just as fast with UTF-8 Can be made to run? Perhaps you're not grasping the fact that no changes need to be made to an octet string search algorithm to work properly with UTF-8. Here is a concrete example of a UTF-8 string search example in Chicken: (define string-search (foreign-lambda* int ((c-string haystack) (c-string needle)) "char *result = strstr(haystack, needle); return result ? (result - haystack) : -1;")) (display (string-search "HERE IS A SIMPLE EXAMPLE" "EXAMPLE")) (newline) I used the C FFI to emphasize we're just calling out to a C function that searches at the octet level. This function could easily use the standard BM algorithm, with no need to specially handle UTF-8. Do you not understand how the nature of UTF-8 encoding is such that this is a viable approach? > > However, the point of the discussion is that people are objecting to > > explicitly forcing strings to be character arrays. > > That may be *your* point, but it isn't *my* point. So far as I can > tell, absolutely no one has proposed, whether explicitly or implicitly, > that strings be forced to be character arrays. In particular, SRFI-75 > does not propose that. The previous draft did in fact propose that, and was objected to: http://srfi.schemers.org/srfi-75/mail-archive/msg00055.html I see the current draft has changed "fixed-length array" to "sequence". I'm sorry I did not catch this change - perhaps a changelog would be helpful. However, from your previous comments: http://srfi.schemers.org/srfi-75/mail-archive/msg00250.html SRFI-75 doesn't dictate any particular representation, and that's good. SRFI-75 does penalize certain poor choices of representation, and I think that's good too. It seems you are suggesting non-fixed-length arrays as being poor choices of representation. Indeed, without some sort of string pointer or cursor object SRFI-75 does indeed penalize UTF-8 and ropes. > I got into this because Per Bothner stated that "Random accesses to a > position in a string that has not been previously accessed is not in > itself useful." I cited the Boyer-Moore string search to show that > Bothner's statement is untrue. The points are inherently related. The reason random access is being disputed is because it is the only advantage of fixed with string representations. I then pointed out that BM is in fact simpler in UTF-8, and can use the exact same algorithm, therefore can be no slower. We can dispute whether it is in fact faster till the cows come home, because the real cost is in the average case time which depends on the actual data used and hidden constants. Thus Bothner's statement still stands without a counter-example. -- Alex