This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
I apologize that I have not had time to read the entire thread on this SRFI. The following describes what I did in BitC (a statically typed language, but similar in "feel" to scheme in many respects) in case it is helpful. If nothing else, it may be that enumerating the problems and issues I considered will be helpful to the SRFI-75 discussion. I will broaden in a few places. The introduction of unicode re-opens issues of input and output normalization and also issues of compilation unit character sets. All of these issues must be dealt with consistently. For reference, the relevant parts of the BitC specification may be found at: Compilation unit character set and normalization: http://www.coyotos.org/docs/bitc/spec.html#2 Character Literals: http://www.coyotos.org/docs/bitc/spec.html#2.4.3 String Literals: http://www.coyotos.org/docs/bitc/spec.html#2.4.4 Some comments: 1. The historical scheme (and LISP) syntax for character literals, which I borrowed, is unfortunate. There is no way to lexically reconcile the special character literal tokens with the escaping mechanism that is used within strings. We tried for a while, and gave up. 2. Having given up, we then adopted many of the single character backslash escapes from C. It is exceedingly convenient to be able to write an embedded newline or carriage return, the escape conventions are well known to nearly all programmers, and they are already recognized by many implementations of Scheme in any case. Some form of escaping is necessary in any case to admit double quote within strings. I recommend adopting these. 3. There is an issue with newline processing in input and output (which probably is the subject of a different SRFI). Platforms do not agree about newline conventions in text files. A regrettable consequence is that character streams require specification at open time as to whether they are being opened for binary or text processing. One regrettable consequence of this is that the R5RS specification for open-output-file and open-input-file is inadequate. A second argument needs to be added to specify newline processing conventions. Note that this also became an issue for UNIX STDIO, and that acceptance of "t" and "b" in the file mode argument to fopen() is now mandated by the C standard. This is also an issue for string ports. In general, any operation that opens a port must specify the desired processing for newlines. 4. In considering what to do about identifiers, I concluded that the problem should be divided into "first characters" and "follow characters". This aligned things nicely with the existing Unicode identifier model, and it was sufficient to then add a few punctuation characters to the legal set. The set of additional characters was taken from the Common LISP standard, but it should not be hard to adapt it to scheme. 5. Since integer literal tokenization already needs to handle a leading radix, I saw no reason to preclude this for character code points as well. 6. I went back and forth several times on numeric encoding of code points within strings. I found no answer that would satisfy everyone, but I did arrive at some conclusions: 1. It seems unlikely that the unicode character set is done growing. This suggests that the code point embedding in strings wants to be delimited. 2. Setting aside the risk of code point growth, the most common code point embedding syntax is extremely error prone. For a C programmer, it is easy to misread \0767 and especially so when the length of code points is variable. 3. Since the longer code points are quite long indeed, it seems undesirable to require a fixed-length sequence of digits following the '\' (or whatever delimiter is selected) 4. Nearly all unicode literature uses the convention U+xxxx to describe characters. Given a choice of bad tokens, familiar ones are better than unfamiliar ones. 5. Some languages distinguish between u+xxxx U+xxxxxxxx this is an artifact of having adopted unicode prior to unicode 3, when it was still believed that 16 bit code points would suffice. After some discussion within our group, we settled on \{U+xxx...xxxx} within strings #\{U+xxx...xxxx} character literals While the delimiting is not necessary in character literals, it is visually parallel and it simplifies tokenization slightly. 7. I am NOT convinced that the current handling of white space in BitC strings is right. Given that escapes were permitted, and that the human eye has a hard time differentiating different whitespace characters that may render equivalently, I concluded that it was better to restrict the legal set of non-escaped white space within string literals (and units of compilation) rather than expand it. This approach has two advantages: (1) I know it will work, and (2) it is a compatible change to expand the legal white space later. 8. On advice from several other language designers, I chose normalization C and UTF-8 for input format. Normalization C seems to be generally agreed as preferred. UTF-8 is backward compatible with 7-bit ISO-LATIN-1. Automatically detecting the UTF encoding of an input unit is perilous. 9. Once you have a variable-length character representation, it becomes necessary to incorporate separate means for reading bytes from input streams. For example this is needed if the programmer wishes to construct code to process files in (e.g.) UTF-32. This raises a question about newline canonicalization. My suggestion is that the port's handling of newlines should be independent of the caller. That is, read-byte on a text-mode port that would normally convert the input \r\n to \n should return the byte corresponding to \n. If you want unmangled bytes, use binary mode input. The same argument does *not* apply for read-char, because it is the nature of read-char to process the bytes in order to determine character length. 10. string-length must return length-in characters. For serialization purposes, it *may* be advisable to add string-byte-length as well; we have not come to a decision about this. 11. Strings now, more than ever, are not just vectors of characters (though this should be a feasible implementation). There is *excellent* discussion of the issues in the libicu documentation, and I strongly recommend reading that. 12. The libicu package has gotten almost all of this stuff right, and is very widely used. I strongly recommend using it as a model for the character processing library that necessarily goes with scheme. 13. Because the Unicode specification is updated and corrected, it is necessary for the Scheme standard to specify a version.