This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
The purpose of this note is to show how random access to the characters of a string improves the efficiency of string searching beyond what is possible with random access to the bytes of a UTF-8 string. This can be demonstrated using the Boyer-Moore algorithm for fast string searching, but that algorithm is fairly hard to analyze. For this note, I use a simplified form of the Boyer-Moore algorithm, which I call the Freshman- Level Boyer-Moore (FLBM) algorithm because it is simple enough to be explained in freshman-level courses. The difference between the full Boyer-Moore algorithm and the FLBM algorithm is that the automaton used in the FLBM algorithm returns to its start state following every random access into the string being searched. The FLBM algorithm ------------------ Given: a string s0 of length m for which we are to search, and a set of strings to be searched for occurrences of s0 as a substring. Let n be the total length of the strings to be searched. Returns: a set of pairs of the form <s,i>, where s is one of the strings that was searched, i is an index into s, and s0 is a substring of s beginning at i. Preprocessing ------------- The FLBM algorithm begins by pre-processing s0 to compute the following table. For each character that occurs within s0, the table contains the index of the last occurrence of that character within s0. Using red-black search trees, this table can be constructed in O(m lg m) time, can be searched in O(lg m) time, and occupies O(m) space. Hence the preprocessing step takes O(m lg m) time and O(m) space. Initialization -------------- To search for s0 within a string s, set the initial value of i to 0. Main step --------- If i+m-1 is less than the number of characters within s, let s[i+m-1] be the (i+m-1)th character of s. If s[i+m-1] does not occur within s0, then set i to i+m. Otherwise let k be the index of the last occurrence of s[i+m-1] within s0, and set i to i+m-1-k. With random access to the characters of s, the operation described by the previous paragraph runs in O(lg m) time. With random access to the bytes of s represented in UTF-8, however, the previous paragraph takes O(m) time. The problem with UTF-8 is that we can't access s[i+m-1] without looking at all of the bytes that encode the characters that precede it. We can look at the byte with index i+m-1, and we might even be able to calculate the character whose UTF-8 encoding includes the byte at byte index i+m-1 (I don't know Unicode well enough to know whether this is possible), but knowing that character doesn't help. For the FLBM algorithm to work, we have to know the *character* with index i+m-1. Without knowledge of that character, we cannot use our table to compute the new value of i. Furthermore we cannot fall back on preprocessing the UTF-8 representation of s0 and using that table instead, because the computation of the new value for i would depend upon knowing that all characters of s are encoded using some fixed number of bytes, which is not true of UTF-8. Now to finish the explanation of the algorithm. If k=m-1, then setting i to i+m-1-k did not change the value of i. In that case, and in that case only, we use a naive string comparison to see whether s0 is a substring of s starting at index i. If it is, then we have found a match; whether we have found a match or not, we set i to i+1 and continue. Then repeat the main step of the algorithm. Stop when i+m-1 is no longer a valid character index into s. Analysis -------- I will assume m < n. (If this assumption is not true, then the problem is vacuous: s0 cannot be a substring of any string to be searched.) In the worst case, the FLBM algorithm requires O(mn lg m) time. In the best case, however, the FLBM algorithm runs in O(n/m lg m) time. This is also the usual case in practice. With UTF-8 encoding, the FLBM algorithm requires O(mn lg m) time in the worst case, and O(n lg m) time in the best case. In the usual case, FLBM/UTF-8 takes O(n lg m) time. This usual-case performance for UTF-8 is asymptotically worse than for encodings that allow random access to characters. The FLBM algorithm requires O(m) space regardless of the encoding used for strings. Will