This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
Referring to the Boyer-Moore fast string searching algorithm, Alan Watson wrote: > Yes, but I think you can implement this for UTF-8 or UTF-16 > strings using offsets to the underlying bytes or shorts. I > don't think that you need character offsets. You certainly don't need character offsets to do a string search, but the naive algorithm without random access to characters is O(mn). The Boyer-Moore algorithm improves this to O(n/m) in many cases. I believe one can construct artificial examples to show that some O(n/m) cases would degrade to an intermediate complexity, or even back to O(mn), in UTF-8 or UTF-16 without character offsets. I don't know how often examples of those cases would arise in practice. Others have named other applications of random access to characters within a string. I am not trying to argue that the asymptotic efficiency of these or any other particular algorithms should be decisive, nor am I trying to argue that a constant factor of 4 in space efficiency should be or should not be decisive. I just think it's a good idea to understand the efficiency issues before we dismiss them. As you might divine from my comments in the SRFI-70 list, I prefer to let implementors figure out how to represent strings. It isn't all that hard for an implementation to change the representation, transparently, if it turns out that a string is being used for random access to characters, or that non-Latin-1 characters are being stored into a string whose constituent characters originally lay within the Latin-1 subset. SRFI-75 doesn't dictate any particular representation, and that's good. SRFI-75 does penalize certain poor choices of representation, and I think that's good too. I appreciate the fact that some implementations will want to use the same representation as some other language or system, even if that is not a particularly good representation. From that point of view, I think the main problem with SRFI-75 is that it requires mutable strings, which (in the presence of concurrency or an obsession with object identity) make it hard to change the representation transparently---code written in some other language or in a concurrent thread might notice the change, even if the Scheme code in this thread doesn't. On the other hand, no single representation has emerged in other languages as the obvious standard representation that we should adopt for Scheme. The best we can do is to design our semantics as best we can, allow implementors to design their representations as best they can, and let users decide which systems work best for their purposes. Will