[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

*To*: srfi-75@xxxxxxxxxxxxxxxxx*Subject*: Re: freshman-level Boyer-Moore fast string search*From*: William D Clinger <cesura@xxxxxxxxxxx>*Date*: Fri, 29 Jul 2005 10:33:57 +0200*Delivered-to*: srfi-75@xxxxxxxxxxxxxxxxx*References*: <42E991CE.5050806@xxxxxxxxxxx> <5fb7e0870507282012453c30dd@xxxxxxxxxxxxxx> <42E9B37A.80008@xxxxxxxxxxx> <5fb7e0870507282305772e2cc2@xxxxxxxxxxxxxx>*User-agent*: Gnus/5.110003 (No Gnus v0.3) XEmacs/21.5-b20 (berkeley-unix)

Alex Shinn wrote: > Well, UTF-8 is in theory 1-4 bytes, and in practice you almost never > see 4-byte codepoints, and 1 byte codepoints are extremely common. Sorry, I was thinking UTF-8 is 1-6 bytes, but that's out of date. > Is the three-stage assumption for binary trees? No. Stage 1: If the Unicode scalar value is less than 256, then use it as an index into an array. Stage 2: If it's greater than 255, then look it up in a hash table or balanced search tree. (A third stage isn't necessary, but I thought it might suggest to you how dividing a 21-bit scalar value into three 7-bit chunks can simulate, with no real loss of efficiency apart from memory considerations, the UTF-8 encoding that you thought was faster.) > You can't just dismiss a lg(m) factor. I'm not. I put that factor into my description of the FLBM algorithm just to show you how easy it is to do better than O(m+Sigma) time and space during the preprocessing phase. Please remember that the whole point of that algorithm is to make the basic ideas easy enough for freshmen to understand. In a real implementation of the real Boyer-Moore algorithm, I would use something at least as fast as the two-stage lookup described above. > Out of curiosity, what string representations does SRFI-75 penalize > which you consider to be poor? Suppose each string s is represented by a vector of 2^21 elements, where element i consists of a list of numbers, in IEEE double precision, that represent the indexes within s at which the character c appears, where c is the Unicode scalar value f(i), where f is represented by a global association list that maps scalar values to indexes (i.e. f-inverse). SRFI-75 allows that representation, yet penalizes it. I also consider it to be a poor representation. Will

**Follow-Ups**:**Re: freshman-level Boyer-Moore fast string search***From:*Alan Watson

**References**:**Re: freshman-level Boyer-Moore fast string search***From:*Alex Shinn

**Re: freshman-level Boyer-Moore fast string search***From:*Alex Shinn

- Prev by Date:
**Re: Surrogates and character representation** - Next by Date:
**Re: Surrogates and character representation** - Previous by thread:
**Re: freshman-level Boyer-Moore fast string search** - Next by thread:
**Re: freshman-level Boyer-Moore fast string search** - Index(es):