This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
William D Clinger wrote:
With random access to the characters of s, the operation described by the previous paragraph runs in O(lg m) time. With random access to the bytes of s represented in UTF-8, however, the previous paragraph takes O(m) time. The problem with UTF-8 is that we can't access s[i+m-1] without looking at all of the bytes that encode the characters that precede it. We can look at the byte with index i+m-1, and we might even be able to calculate the character whose UTF-8 encoding includes the byte at byte index i+m-1 (I don't know Unicode well enough to know whether this is possible), but knowing that character doesn't help. For the FLBM algorithm to work, we have to know the *character* with index i+m-1.
But a FLBM using UTF-8 strings would naturally index and compare *bytes*, not characters. I.e. the obvious algorithm would let m be the number of *bytes* in the UTF-8 encoding of s0; n the total number of bytes in the strings to be searched, and the i in the results <s,i> would be byte offsets. The pre-processing would be to generate a table, such that:For each byte that occurs within s0, the table contains the index of the last occurrence of that byte within s0.
So the question is whether this is still true: "In the best case,however, the FLBM algorithm runs in O(n/m lg m) time. This is also the usual case in practice." I suspect so. Of course n and m are larger,
since they now measure bytes rather than characters, but these are constant factors that will be dwarfed by other issues (such as cache misses). -- --Per Bothner per@xxxxxxxxxxx http://per.bothner.com/