This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
On Tue, 27 Jan 2004, Ken Dickey wrote:

>On Tuesday 27 January 2004 09:32 am, bear wrote:
>> On Mon, 26 Jan 2004, Ken Dickey wrote:
>> >Well color me dumb, but I don't see why getting O(1) is such a big deal.
>...
>> O(1) reference or character setting comes at the expense of O(n)
>> insertions, deletions, and non-identical-sized replacements.
>>
>> EG, if I change "the" to "a" at the beginning of a long string, and
>> I've represented it as a vector to get O(1) reference time, the rest of
>> the string has to be copied to move it two character spaces in memory.

>I was puzzled by the ropes discussion here because it seemed to be orthogonal
>to the Unicode discussion. I now see that its because it _is_ orthogonal to
>the Unicode discussion.

The only thing that Unicode has to do with it is that Unicode makes non-identical-sized replacements more likely, and makes it more likely that the programmer will not realize that a given operation involves non-identical-sized replacements.

Replacing one codepoint with another may wind up being a replacement of a character that takes 1 octet of UTF-8 to express with a character that takes 3 octets of UTF-8 to express, or vice versa. This sort of thing is amenable to your proposed approach of indexed fallback into another vector.

But replacing a character with a combining sequence of multiple codepoints, or vice versa, is also likely; in fact the Unicode Consortium's canonicalization algorithms do this all the time. In this case you're looking at things like replacing

    U+212B ANGSTROM SIGN

with

    U+0041 LATIN CAPITAL LETTER A, U+030A COMBINING RING ABOVE

and if your implementation treats the former as one character and the latter as two characters, which most do, you wind up with the same need to copy the rest of the string that changing "a" to "the" caused in ASCII strings. This is not amenable to your proposed approach of indexed fallback into another vector.
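The two kinds of size change described above can be reproduced with a short Python sketch. It is only an illustration: the sample text and the names `before`, `after`, and `decomposed` are made up here, and a plain Python list of codepoints stands in for a vector-backed string; no Scheme implementation is being described.

```python
import unicodedata

# Same-count replacement, different encoded size: "a" needs 1 octet of
# UTF-8, while U+20AC EURO SIGN needs 3.
assert len("a".encode("utf-8")) == 1
assert len("\u20ac".encode("utf-8")) == 3

# Canonical decomposition (NFD) turns ANGSTROM SIGN into two codepoints.
decomposed = unicodedata.normalize("NFD", "\u212b")
assert [hex(ord(c)) for c in decomposed] == ["0x41", "0x30a"]

# Model a string as a flat vector of codepoints (hypothetical layout).
before = [ord(c) for c in "5 \u212b wide"]
i = before.index(0x212B)

# Replacing one element with two forces every later element to move up
# by one slot, which is the O(n) copy that a vector representation pays.
after = before[:i] + [ord(c) for c in decomposed] + before[i + 1:]

print(len(before), len(after))  # the vector grew by one codepoint
print(before.index(ord("w")), after.index(ord("w")))  # the tail moved
```

The slice-and-concatenate step copies the whole tail, just as changing "a" to "the" would in an ASCII vector.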
What this means is that, while on an absolute level Unicode and rope representation are orthogonal issues, Unicode has patterns of likely use that rely heavily on the most expensive operations of vector representations.

And of course both came up here because the first draft of the FFI SRFI wanted a C pointer to a mutable memory area containing the internal representation of a Scheme string, and has to know this kind of "detail" to even make sense of what it finds there.

As a result of the discussions here, I'm now considering adding more types of string values, each with its own read syntax and conversions. For example,

    #,(Latin-1-vector "hello world")

would be an octet vector where each octet is a Latin-1 character. This would make binary I/O using string-like constructions possible and give C programs the kind of FFI value they wanted. No characters outside Latin-1 would be allowed, of course.

    #,(UTF32-vector "hello world")

would be a "string" indexed by Unicode codepoint rather than by character. Handy for FFI, and also allows people to create invalid or non-canonical combining sequences, assign values that aren't even mapped codepoints to arbitrary locations, or do other linguistically wrong operations. However, converting it to a regular string would canonicalize it, and would fail if it contained non-characters.

				Bear
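The behaviour proposed for the two vector types can be approximated in Python as a rough sketch. All function names below are hypothetical, and several details are assumptions rather than anything stated in the message: NFC is used as the canonical form, "non-characters" is read as the Unicode noncharacter codepoints, and surrogates are rejected as well.

```python
import unicodedata

def latin1_vector(text):
    """One octet per character; raises UnicodeEncodeError outside Latin-1."""
    return text.encode("latin-1")

def utf32_vector(text):
    """A list indexed by codepoint, with no validity checks on later edits."""
    return [ord(ch) for ch in text]

def is_noncharacter(cp):
    # U+FDD0..U+FDEF, plus the last two codepoints of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def utf32_vector_to_string(cps):
    """Convert to a regular string: reject bad values, then canonicalize."""
    for cp in cps:
        if is_noncharacter(cp) or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"U+{cp:04X} cannot appear in a regular string")
    # Assumption: NFC is the canonical form applied on conversion.
    return unicodedata.normalize("NFC", "".join(chr(cp) for cp in cps))

octets = latin1_vector("hello world")      # 11 octets, one per character
cps = utf32_vector("A")
cps.append(0x030A)                         # add a combining mark by hand
composed = utf32_vector_to_string(cps)     # canonicalized to one codepoint
```

Under these assumptions, the hand-built sequence U+0041, U+030A comes back as the single codepoint U+00C5, while a vector containing U+FFFE fails to convert.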