This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.
On Wed, 21 Jan 2004, Tom Lord wrote:

> By somewhat reasonable expectation, there must be at least 256
> distinct Scheme characters and INTEGER->CHAR must be defined for all
> integers in the range `0..255'.  There are many circumstances in
> which conversions between octets and characters are desirable and
> the requirements of this expectation say that such conversion is
> always possible.  It is quite possible to imagine implementations in
> which this is not the case: in which, for example, a (fully general)
> octet stream can not be read and written using READ-CHAR and DISPLAY
> (applied to characters).  Such an implementation might introduce
> non-standard procedures for reading and writing octets and
> representing arrays of octets.  While such non-standard extensions
> may be desirable for independent reasons, I see no good reason not
> to define at least a subset of Scheme characters which is mapped to
> the set of octet values.

I think that this is a problem.  We need a portable method of
reading/writing an arbitrary octet stream, full stop.  As characters
become more complicated than octets, the two concepts must be divorced
from each other; otherwise there will be endless hair as this exception
or that rears its ugly head.

So I'd propose READ-CHAR and DISPLAY, which read or write "a
character", abstracting away issues of encoding, multibyte character
sets, endianness, etc. according to either application defaults or port
properties, and two new routines, READ-OCTET and WRITE-OCTET, which
read and write binary values exactly eight bits wide and take or return
exact integers in the 0..255 range.

In fact, READ-OCTET and WRITE-OCTET would in that case become
primitive, since READ-CHAR and DISPLAY could be implemented in terms of
them but the reverse would not be true.  This neatly sidesteps the
issue of needing character mappings for every member of the range
128-255, and separates the ideas of octet and character at the lowest
level.
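To make the layering concrete, here is a rough Python sketch (the
names are invented; the post proposes Scheme procedures, and the UTF-8
framing below is just one possible port encoding) of a character
reader built on top of a primitive octet reader:

```python
import io

def read_octet(port):
    """Primitive: read one octet (an exact integer 0..255), or None at EOF."""
    b = port.read(1)
    return b[0] if b else None

def read_char(port):
    """Derived: assemble one character from octets, assuming the port
    carries UTF-8.  The leading octet determines how many follow."""
    first = read_octet(port)
    if first is None:
        return None            # EOF
    if first < 0x80:
        extra = 0              # ASCII: a single octet
    elif first >= 0xF0:
        extra = 3              # 4-octet sequence
    elif first >= 0xE0:
        extra = 2              # 3-octet sequence
    else:
        extra = 1              # 2-octet sequence
    octets = bytes([first] + [read_octet(port) for _ in range(extra)])
    return octets.decode("utf-8")

port = io.BytesIO("aß€".encode("utf-8"))
print(read_char(port), read_char(port), read_char(port))  # a ß €
```

The point is the direction of the dependency: read_char is definable
on top of read_octet, but a fully general octet reader cannot be
recovered from a character reader.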
> Pika is of the "approximately 2^21 characters" variety.
>
> Specifically, the Pika CHAR? type will in effect be a _superset_ of
> the set of Unicode codepoints.  Each 21-bit codepoint will
> correspond to a Pika character.  For each such character, there
> will be (2^4-1) (15) additional related characters representing the
> basic code point modified by a combination of any of four
> "buckybits".

FWIW, I'm using Unicode codepoints in the private use area as
combining characters to represent buckybits.  I think this is
compatible, but conversions between the two representations will
introduce yet more hair.

> R5RS requires a partial ordering of characters in which upper and
> lower case variants of "the same character" are treated as equal.
>
> Most problematically: R5RS requires that every alphabetic character
> have both an upper and lower case variant.  This is a problem
> because Unicode defines abstract characters which, at least
> intuitively, are alphabetic -- but which lack such case mappings.

This problem goes away in the infinite-character-set universe.  If we
restrict discussion to the characters that can appear in canonical
string representation (meaning no ligatures), every cased character in
Unicode, with the single exception of eszett, has both a lower-case and
an upper-case form; the catch is that the uppercase and lowercase
versions may require different numbers of combining codepoints to
represent.  Eszett is in a class by itself, being a canonical lowercase
character whose uppercase form is, linguistically, a different number
of characters, as well as a different number of codepoints.

I was initially driven to the multi-codepoint representation by the
attempt to solve this particular problem in reconciling the Unicode
standard with R5RS, and I wound up with a 99.999% solution.
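Both behaviors (the differing codepoint counts and the eszett
exception) can be observed with any Unicode-aware case mapping; for
instance, with Python's built-in, locale-independent case conversions:

```python
# Eszett: lowercase is one codepoint, but its uppercase form "SS" is
# two codepoints (and, linguistically, two characters).
assert "ß".upper() == "SS"
assert "ß".lower() == "ß"

# A cased character whose case variants differ in codepoint count:
# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE lowercases to
# "i" followed by U+0307 COMBINING DOT ABOVE, so one codepoint in,
# two codepoints out.
assert len("\u0130") == 1
assert "\u0130".lower() == "i\u0307"
assert len("\u0130".lower()) == 2

print("case-mapping examples hold")
```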
> We'll explore the topic further, later, but briefly: it does not
> appear that "good Unicode support" and "R5RS requirements for case
> mappings" are compatible -- at least not in a simple way.

There's simple and there's simple and there's simple.  It works
"simply", at least from the programmer's POV, in the world where
strings are aggressively canonicalized and characters use a
multi-codepoint representation, with the sole exception of the eszett
character.  This is complicated to implement, but all the complication
is under the hood from the programmer's POV.

>** What _is_ a Character, Anyway
>
> /=== R6RS Recommendation:
>
>     R6RS should explicitly define a _portable_character_set_
>     containing the characters mentioned earlier: `a..z', `A..Z',
>     space, formfeed, newline, carriage return, and the punctuation
>     required by Scheme syntax.
>
>     Additionally, R6RS should define an _optional_ syntax for
>     Unicode codepoints.  I propose:
>
>         #\U+XXXXX
>
>     and in strings:
>
>         \U+XXXXX.
>
>     where XXXXX is an (arbitrary length) string of hexadecimal
>     digits.

It's important to note the terminating '.' in the representation for
use in strings; otherwise an ambiguity is introduced.

If the character set is not restricted to a known width, I think it's
handier with codepoint separators, especially since most characters
are in the 0..255 or 0..65535 range.  Instead of writing
#\U+C32000000AF for a combining sequence, it would be handier and
clearer (and easier to parse, since you don't have to do bignum
modulus operations to work out where to break up the number read) to
write #\U+C32:AF or similar.  This has the additional advantage that
it doesn't "guess" the size of the codepoints (currently fixed at 21
bits, but given 32 bits in the above example and the current
most-general encoding).  It becomes handier and clearer yet if some of
the most useful entities have names, so you can write something like
#\A:Macron or equivalently #\A:AF as a character.
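To illustrate why the colon-separated form is easier to handle, here
is a hypothetical parser sketch in Python (the literal syntax is only
a proposal, and the function name is invented; the combining codepoint
used below is U+0304 COMBINING MACRON purely as an example):

```python
def parse_char_literal(text):
    """Parse a hypothetical #\\U+XXXX:YYYY character literal into a
    string of codepoints.  Each colon-separated hex field is one
    codepoint, so no fixed codepoint width has to be guessed and no
    bignum arithmetic is needed to split the fields apart."""
    prefix = "#\\U+"
    assert text.startswith(prefix)
    fields = text[len(prefix):].split(":")
    return "".join(chr(int(field, 16)) for field in fields)

# "A" plus a combining macron, written as two hex fields:
result = parse_char_literal("#\\U+41:304")
print(len(result), hex(ord(result[0])), hex(ord(result[1])))  # 2 0x41 0x304
```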
>* Scheme Strings Meet Unicode
>
> /=== R6RS Recommendation:
>
>     R6RS should strongly encourage implementations to make the
>     expected-case complexity of STRING-REF and STRING-SET! O(1).
>
> \========

I'd fail this.  My strings are O(log N) access, where N is the length
of the string.  All told, I'd say this is in fact a performance win,
since it means I can do copy-on-write tricks with small substrings
(strands) of the string rather than copy the whole string every time
somebody wants to save both the original version and a
slightly-changed version of it (which happens a lot when people are
editing a multi-megabyte document and there's an undo stack).

> Most of the possible answers to "what is a Scheme character" are
> consistent with the view that characters correspond to (possibly a
> subset of) Unicode codepoints.
>
> One of the possible answers to that question has the CHAR? type
> correspond to a _sequence_ of Unicode code points.
>
> /=== R6RS Recommendation:
>
>     While R6RS should not require that CHAR? be a subset of Unicode,
>     it should specify the semantics of string indexes for strings
>     which _are_ subsets of Unicode.
>
>     Specifically, if a Scheme string consists of nothing but Unicode
>     codepoints (including substrings which form combining
>     sequences), string indexes _must_ be Unicode codepoint offsets.
>
> \========
>
> That proposed modification to R6RS presents a (hopefully small)
> problem for Ray Dillinger.  He would like (for quite plausible
> reasons) to have CHAR? values which correspond to a _sequence_ of
> Unicode codepoints.  While I have some ideas about how to
> _partially_ reconcile his ideas with this proposal, I'd like to hear
> his thoughts on the matter.

Computing the codepoint index on demand would require a traversal of
the string, an O(N) operation using my current representation.  That's
clearly intolerable.
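For concreteness, here is a minimal persistent-tree sketch in Python
(invented classes, not the actual Pika representation) showing O(log
N) character access together with the copy-on-write trick: an edit
builds a new path to the root while sharing every untouched strand
with the original version:

```python
class Leaf:
    """A strand: a small contiguous run of characters."""
    def __init__(self, s):
        self.s, self.length = s, len(s)
    def ref(self, i):
        return self.s[i]

class Concat:
    """Interior node; its children are shared, never copied."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length
    def ref(self, i):
        if i < self.left.length:
            return self.left.ref(i)
        return self.right.ref(i - self.left.length)

def replace_char(node, i, ch):
    """Return a new tree with position i changed; the old tree is
    untouched, so both versions stay live (e.g. on an undo stack)."""
    if isinstance(node, Leaf):
        return Leaf(node.s[:i] + ch + node.s[i + 1:])
    if i < node.left.length:
        return Concat(replace_char(node.left, i, ch), node.right)
    return Concat(node.left,
                  replace_char(node.right, i - node.left.length, ch))

old = Concat(Leaf("hell"), Leaf("o world"))
new = replace_char(old, 4, "!")
print(old.ref(4), new.ref(4))   # o !
print(new.left is old.left)     # True: the untouched strand is shared
```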
But in the same tree structure where I now keep only character indexes,
I can add additional fields for codepoint indexes as well, making it an
O(log N) operation.  This would add a constant factor to the processing
times for my string operations, since I'd have to update two sets of
indexes instead of one on a write; but it's feasible, and it would add
other useful capabilities on the Scheme side, where I'd be introducing
new functions based on codepoint indexes in addition to the existing
functions based on character indexes.

However, I do strongly feel that these additional routines should be
just that: additional.  They do not deal with "characters" per se, but
with a single method of representing characters.  So they're in the
same spirit as things that apply binary bitmasks to floating-point
constants, or other such representation-dependent tricks.

				Bear
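P.S.  The dual-index idea above, sketched in Python with invented
names (a toy, not the real tree): each interior node caches both a
character count and a codepoint count, so converting a character index
to a codepoint index is O(log N) rather than O(N):

```python
class Leaf:
    """Stores a list of characters, each possibly several codepoints."""
    def __init__(self, chars):
        self.cs = chars
        self.chars = len(chars)
        self.points = sum(len(c) for c in chars)

class Node:
    """Interior node caching both sizes of its subtree."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.chars = left.chars + right.chars
        self.points = left.points + right.points

def char_to_codepoint_index(node, i):
    """Convert a character index to a codepoint index in O(log N):
    descend by the cached character counts, summing the codepoint
    counts of the subtrees skipped over on the left."""
    offset = 0
    while isinstance(node, Node):
        if i < node.left.chars:
            node = node.left
        else:
            i -= node.left.chars
            offset += node.left.points
            node = node.right
    return offset + sum(len(c) for c in node.cs[:i])

# "a", then "a" + combining macron (two codepoints), then "b":
tree = Node(Leaf(["a", "a\u0304"]), Leaf(["b"]))
print([char_to_codepoint_index(tree, i) for i in range(3)])  # [0, 1, 3]
```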