This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
Alex Shinn wrote: > Well, UTF-8 is in theory 1-4 bytes, and in practice you almost never > see 4-byte codepoints, and 1 byte codepoints are extremely common. Sorry, I was thinking UTF-8 is 1-6 bytes, but that's out of date. > Is the three-stage assumption for binary trees? No. Stage 1: If the Unicode scalar value is less than 256, then use it as an index into an array. Stage 2: If it's greater than 255, then look it up in a hash table or balanced search tree. (A third stage isn't necessary, but I thought it might suggest to you how dividing a 21-bit scalar value into three 7-bit chunks can simulate, with no real loss of efficiency apart from memory considerations, the UTF-8 encoding that you thought was faster.) > You can't just dismiss a lg(m) factor. I'm not. I put that factor into my description of the FLBM algorithm just to show you how easy it is to do better than O(m+Sigma) time and space during the preprocessing phase. Please remember that the whole point of that algorithm is to make the basic ideas easy enough for freshmen to understand. In a real implementation of the real Boyer-Moore algorithm, I would use something at least as fast as the two-stage lookup described above. > Out of curiosity, what string representations does SRFI-75 penalize > which you consider to be poor? Suppose each string s is represented by a vector of 2^21 elements, where element i consists of a list of numbers, in IEEE double precision, that represent the indexes within s at which the character c appears, where c is the Unicode scalar value f(i), where f is represented by a global association list that maps scalar values to indexes (i.e. f-inverse). SRFI-75 allows that representation, yet penalizes it. I also consider it to be a poor representation. Will