This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
On 7/28/05, Alan Watson <a.watson@xxxxxxxxxxxxxxxx> wrote: > William D Clinger wrote: > > Per Bothner wrote: > > > Random accesses to a position in a string that has not > > > been previously accessed is not in itself useful. > > > > Untrue. The Boyer-Moore algorithm for fast string > > searching uses random accesses to positions that have > > not been previously accessed . > > Yes, but I think you can implement this for UTF-8 or UTF-16 strings > using offsets to the underlying bytes or shorts. I don't think that you > need character offsets. True. In fact any existing Boyer-Moore search that works at the 8-bit byte level will work unmodified on UTF-8 strings. Furthermore, the true cost of BM includes a preprocessing step of O(m+sigma) in both time and space, where m is the pattern length and sigma is the size of the alphabet. For the byte-level UTF-8 search, sigma is 256, but for UTF-32 sigma is 2^21 = 2097152. For patterns that are going to be repeatedly searched, it is reasonable to keep the table for UTF-8 in memory, but the table for UTF-32, if using a SRFI-4 u32vector for offsets, would incur an extra 8MB of memory per pattern. This isn't reasonable - in practice you'll want to perform the preprocessing step every time for UTF-32. UTF-8 wins hands down for string searching. -- Alex