This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
On Thu, 22 Jan 2004, Tom Lord wrote:

>As you admit:
>
> > If we restrict discussion to the characters that can appear in
> > canonical string representation (meaning no ligatures)
>
>Which is fine for certain implementations but not for the standard.
>
> > Every cased character in unicode, with the single exception of
> > eszett, has a lower-case and an upper-case;
>
>A single exception is an exception nonetheless.  I think that what we
>should take from such exceptions is not their singularity (who knows
>what tomorrow may bring) but rather the "in general" programmatic
>structure that they require.

True, and this single exception means that I will need, like every
other implementor, to caution users that case-changing operations may
change the length of a string and the indexes of codepoints after the
problematic character.  However, I can make the actual occurrence of
this relevant only to German-using audiences, for whom it will be well
known and well understood.

>Yeah, I think you ought to alpha-rename STRING-REF and STRING-SET! and
>various other procedures in your code.  I think you want
>COMBINING-SEQUENCE? not CHAR? for this infinite set you have in mind.

I'm pretty sure I don't.  I'm talking about characters, not
codepoints.  Codepoints are representation; characters are ideas.
Some characters take more than one codepoint to represent.

> > I would say that if the character set is not restricted to a known
> > width, I think it's handier with codepoint separators, especially
> > since most characters are in the 0..255 or 0..65535 range.
> >
> > Instead of writing #\U+C32000000AF for a combining sequence
>
>I don't think you get to do that.  I don't think combining sequences
>can be CHAR?.  Sorry.  Useful data type -- yes.  CHAR? -- no.

Eh, okay.  There's no real difference in what can be expressed.  The
U+ notation strongly suggests unicode, but I can compatibly use it for
a character set whose first 2^21 mappings are the same mappings as
unicode.
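The eszett caveat above is easy to demonstrate.  Here is a minimal
illustration (in Python, whose str.upper() applies the full Unicode
case mapping; this is not code from any Scheme implementation under
discussion) of a case-changing operation that changes a string's
length and shifts the indexes of everything after the eszett:

```python
# Full Unicode uppercasing maps U+00DF (eszett) to "SS", so upcasing
# can lengthen a string; every character after the eszett moves.
s = "straße"
t = s.upper()

print(t)        # STRASSE
print(len(s))   # 6
print(len(t))   # 7
```

Any implementation that exposes codepoint indexes therefore has to
document that STRING-UPCASE-style operations are not length-preserving.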
> > > /=== R6RS Recommendation:
> > >
> > > R6RS should strongly encourage implementations to make the
> > > expected-case complexity of STRING-REF and STRING-SET! O(1).
> > >
> > > \========
> >
> > I'd fail this.  My strings are O(log N) access where N is the
> > length of the string.
>
>Really?  You'd fail?

On small strings you almost certainly wouldn't:

  strings up to ~1000 general characters, or 4080 latin-1 chars: O(1).
    These are a single vector or "strand".
  strings up to ~3000 general characters, or 12240 latin-1 chars: O(2).
    Traverse root node, operate in one of three child node strands.
  strings up to ~9000 general characters, or 36720 latin-1 chars: O(3).
    Root node, branch node, strand.
  strings up to ~27000 general characters, or 110160 latin-1 chars: O(4).
    Root node, branch node, branch node, strand.

and so on.  It's a 2/3 B-tree of "strands", where each strand is a
4-kbyte vector.  4 32-bit values per strand are used for overhead.  If
the characters are in the latin-1 range, a strand can contain up to
4080 characters.  But any strand which contains any characters outside
latin-1 is represented using a 32-bit representation, and potentially
several consecutive 32-bit values within such a strand may belong to
the same character.

The loss is that referring to an indexed character in a string is
O(log n).  The win is that inserting or deleting characters is also
O(log n), and I can avoid spending a lot of storage on copies and a
lot of CPU on making copies, since different strings can share
strands.

>just because they're small.  If your trees are at all balanced, you
>probably wouldn't fail on "medium" strings either.

It's a self-balancing tree structure, but the length of the path from
root to strand is still proportional to the logarithm base 3 of the
string length.  This is a law of the universe with trees.

>think you should look at some form of self-balancing tree -- then you
>won't fail at all.  (Assuming, that is, that you don't come away from
>"STRING?
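The strand-tree representation described above can be sketched in
miniature.  This is not Bear's implementation; it is a toy rope-style
structure (in Python, with hypothetical names, and a deliberately tiny
strand size) showing the one idea that matters here: each internal
node caches the character count of its subtrees, so STRING-REF-style
access is a single root-to-leaf descent, O(log n) in the string
length:

```python
# Toy "strand tree": leaves hold small character vectors ("strands"),
# internal nodes cache the character count of each subtree so that
# indexed access follows a single root-to-leaf path.

STRAND_MAX = 4  # tiny strand size, for illustration only

class Strand:
    def __init__(self, chars):
        self.chars = chars            # list of characters
        self.length = len(chars)

class Node:
    def __init__(self, children):
        self.children = children      # up to three subtrees
        self.length = sum(c.length for c in children)

def build(chars):
    """Build a balanced tree of strands over a character sequence."""
    strands = [Strand(chars[i:i + STRAND_MAX])
               for i in range(0, len(chars), STRAND_MAX)] or [Strand([])]
    level = strands
    while len(level) > 1:
        level = [Node(level[i:i + 3]) for i in range(0, len(level), 3)]
    return level[0]

def tree_ref(tree, index):
    """STRING-REF analogue: descend using cached subtree lengths."""
    while isinstance(tree, Node):
        for child in tree.children:
            if index < child.length:
                tree = child
                break
            index -= child.length
    return tree.chars[index]
```

With this shape, two strings produced by insertion or concatenation
can point at the same unchanged Strand objects, which is where the
storage and copying savings Bear describes come from.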
>is a tree" as I think you probably should -- the tree is
>something else, not STRING?.)

A string is a sequence of characters.  I looked at strings and the
operations I'm doing on them in language processing.  I insert,
delete, prepend, append, and copy strings a lot.  When insertion or
deletion requires copying the rest of the string to shift it forward
or back one or two places, and every string has to have its own
storage for data, 99% of which is copied wholesale from other strings,
and you have to do all that copying, it's intolerably slow.

The balanced-tree representation with potentially shared leaves
supports those operations better than the vector representation,
_especially_ as strings grow larger than a few kilobytes.  I'm working
with text corpora, so the average string length is large -- anything
from a ten-kilobyte Wall Street Journal article to a two-megabyte
novel.

I'll cheerfully create routines that copy an entire string into a
single vector of a specific representation for an FFI to play with,
and I'll create routines that copy them back into trees when the FFI
says it's done playing.  But I won't go back to using vectors
internally and winding up waiting while megabytes of data get copied
back and forth to move something up or down a few character spaces
every time one word is inserted or deleted near the front.

> > > While R6RS should not require that CHAR? be a subset of Unicode,
> > > it should specify the semantics of string indexes for strings
> > > which _are_ subsets of Unicode.

I'll go along with that, and I have no trouble conforming.  In that
case the "large characters" may simply be regarded as lying outside
unicode.

> > Computing the codepoint-index on demand would require a traversal
> > of the string, an O(N) operation, using my current representation.
> > That's clearly intolerable.  But in the same tree structure where I
> > now just keep character indexes, I can add additional fields for
> > codepoint indexes as well, making it an O(log N) operation.
>
> And if you were to use self-balancing trees, it would be an
> expected-case O(1) operation.

???  Balanced trees are still trees, and if the tree is exp(n) long
there are still O(n) links to follow from the root to a leaf.  Can you
explain what you mean?

> Only because you are in a mindset where you want CHAR? to be a
> combining char sequence.  There are so many problems with permitting
> that as a conformant Scheme that I think it has to be rejected.  You
> need to pick a different type for what you currently call CHAR?.

I want char? to be a character, and to never have to care, on the
scheme side of things, about how exactly it's represented, whether
it's in unicode as a single codepoint or several, or etc.  I don't
give a flying leap whether unicode regards it as a combining sequence
or not, and I don't want to *have* to give a flying leap about whether
it's represented as a combining sequence or not.

If it functions linguistically as a character, I want to be able to
write scheme code that treats it as a character.  I don't want to ever
have to worry or think about the representation, until I'm doing I/O
or passing a value across an FFI and the representation becomes
important.

				Bear
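The dual bookkeeping described in the quoted passage -- keeping a
codepoint count alongside the character count in each tree node -- can
be sketched like so.  This is a hypothetical Python illustration, not
code from any SRFI 50 implementation: each leaf holds "characters"
that may each be a combining sequence of several codepoints, every
node caches both counts, and converting a character index to a
codepoint index becomes one root-to-leaf descent instead of a full
O(N) traversal:

```python
# Each leaf holds a list of characters, where one "character" may be a
# combining sequence of several codepoints (e.g. "e" + U+0301).
# Nodes cache both counts, so character-index -> codepoint-index is a
# single descent from the root.

class Leaf:
    def __init__(self, chars):
        self.chars = chars                       # e.g. ["a", "e\u0301"]
        self.nchars = len(chars)
        self.ncodepoints = sum(len(c) for c in chars)

class Node:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.nchars = left.nchars + right.nchars
        self.ncodepoints = left.ncodepoints + right.ncodepoints

def codepoint_index(tree, char_index):
    """Codepoint offset of the character at char_index."""
    offset = 0
    while isinstance(tree, Node):
        if char_index < tree.left.nchars:
            tree = tree.left
        else:
            char_index -= tree.left.nchars
            offset += tree.left.ncodepoints
            tree = tree.right
    # inside the leaf: count codepoints of the characters before it
    return offset + sum(len(c) for c in tree.chars[:char_index])
```

For the string "aébc" with é stored as e + combining acute, the
character at index 2 ("b") sits at codepoint index 3, and the descent
that computes this touches only one path through the tree.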