This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.
On Sun, 2006-05-21 at 07:31 -0700, bear wrote:

> If you use a character encoding that has multibyte sequences
> for some unicode codepoints, you can be left with up to seven
> bytes that are the "trailing part" of a codepoint before the
> next codepoint begins.

Nit: 5 bytes. The maximum legal code point in UTF-8 is 6 bytes.

> And given combining codepoints and
> variation selectors, the next codepoint may not begin a new
> character itself.

Actually, this raises two very important points:

1. The correct primitive is READ-CODEPOINT, not READ-CHAR. READ-CHAR
   is a library routine. Implication: text ports are not primitive
   either, and (whatever they may be named) should be understood as
   codepoint ports.

2. The standard must define a normalization form as well as an
   encoding for input units of compilation. The universal answer in
   every other system/language that I have seen has been: UTF-8,
   normalization form C. I suspect that this will generate much
   debate as everyone expresses opinions about it, and will then
   turn out to be the "obvious" answer in hindsight, so let's get it
   over with :-)

One very nice attribute of this approach is that the implementation
can use READ-CODEPOINT (which has unambiguous behavior) directly,
leaving READ-CHAR and text ports as a matter for library
implementation.

shap
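
To make the layering in the message above concrete, here is a rough sketch in Python (not Scheme, and not taken from the SRFI). The names `read_codepoint`, `CharReader`, and `read_char` are hypothetical stand-ins for READ-CODEPOINT and READ-CHAR. Two assumptions are worth flagging: the decoder follows the RFC 3629 definition of UTF-8, which caps a sequence at 4 bytes rather than the 6 allowed by the older definition, and "character" is approximated as a base code point plus any following code points with a nonzero combining class, normalized to form C. Variation selectors and full grapheme cluster rules are deliberately left out.

```python
import io
import unicodedata


def read_codepoint(port):
    """Decode one UTF-8 code point from a binary port; return None at EOF."""
    first = port.read(1)
    if not first:
        return None
    b0 = first[0]
    if b0 < 0x80:
        return b0  # single-byte (ASCII) case
    # Lead byte determines how many continuation bytes follow and the
    # smallest value that legitimately needs a sequence of that length.
    if 0xC2 <= b0 <= 0xDF:
        need, cp, lowest = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        need, cp, lowest = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:
        need, cp, lowest = 3, b0 & 0x07, 0x10000
    else:
        raise ValueError(f"invalid UTF-8 lead byte 0x{b0:02x}")
    tail = port.read(need)
    if len(tail) != need or any((b & 0xC0) != 0x80 for b in tail):
        raise ValueError("truncated or malformed UTF-8 sequence")
    for b in tail:
        cp = (cp << 6) | (b & 0x3F)
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("overlong or out-of-range code point")
    return cp


class CharReader:
    """Library-level 'text port' layered on the code point primitive."""

    def __init__(self, port):
        self._port = port
        self._pending = None  # one code point of lookahead

    def _next(self):
        if self._pending is not None:
            cp, self._pending = self._pending, None
            return cp
        return read_codepoint(self._port)

    def read_char(self):
        """Return a base code point plus combining marks, in NFC; None at EOF."""
        base = self._next()
        if base is None:
            return None
        cluster = [chr(base)]
        while True:
            cp = self._next()
            if cp is None:
                break
            if unicodedata.combining(chr(cp)) == 0:
                self._pending = cp  # starts the next character; push back
                break
            cluster.append(chr(cp))
        return unicodedata.normalize("NFC", "".join(cluster))
```

The point of the split is that `read_codepoint` has one well-defined answer for any valid byte stream, while everything debatable (what counts as a character, which normalization form to apply) lives in `CharReader` and could be swapped for a different library policy without touching the primitive.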