This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
On 7/29/05, William D Clinger <cesura@xxxxxxxxxxx> wrote: > The purpose of this note is to show how random access > to the characters of a string improves the efficiency > of string searching beyond what is possible with random > access to the bytes of a UTF-8 string. > > This can be demonstrated using the Boyer-Moore algorithm > for fast string searching What you describe is not the Boyer-Moore algorithm. It is a (freshman level variant of) a variant of the Boyer-Moore algorithm which trades a time and space cost of Sigma in the preprocessing step for a constant factor of lg(m) in both steps. This is probably a worthwhile trade-off when Sigma is the full size of Unicode, but there are other optimizations worth considering. You say above you'll give us random access to the bytes of the UTF-8 string, and I want to use that. I want to use something like strstr(3), which works on bytes and returns a pointer. This could use the real BM, without the lg(m) factor. Thus our costs become: UTF-32 BM: preprocess: O(m + Sigma) time and space, Sigma huge worst case: O(mn) time best case: O(m/n) time UTF-32 BM with binary tree offset lookup: preprocess: O(m lg(m)) time + O(m) space worst case: O(mn lg(m)) time best case: O(m/n lg(m)) time UTF-8 with byte-oriented BM: preprocess: O(m + Sigma) time and space, Sigma small worst case: O(mn) time best case: O(m/n) time Now, if you really wanted to convert the UTF-8 pointer to a codepoint index, that could be done with a post-processing step of O(n). This of course only needs to be done when the string is found, so it will factor into the average case, but not the best case. However, the point of the discussion is that people are objecting to explicitly forcing strings to be character arrays. If we have string-pointers in Scheme then the above conversion to index need not even be performed. Why penalize the fastest implementation? -- Alex