
Re: Why are byte ports "ports" as such?



On Sun, 2006-05-21 at 07:31 -0700, bear wrote:
> If you use a character encoding that has multibyte sequences
> for some unicode codepoints, you can be left with up to seven
> bytes that are the "trailing part" of a codepoint before the
> next codepoint begins.

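Nit: 5 bytes. The longest encoding of a single code point in UTF-8 is
6 bytes, so at most 5 trailing bytes can follow the lead byte. (RFC 3629
caps UTF-8 at U+10FFFF and 4 bytes, which leaves at most 3.)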
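
To make the count concrete, here is a small Python sketch that maps a lead byte to the number of continuation bytes that follow it. It assumes the original 1-to-6-byte UTF-8 layout, and `trailing_bytes` is a hypothetical helper name, not part of any proposal in this thread.

```python
def trailing_bytes(lead: int) -> int:
    """Number of continuation bytes that follow a UTF-8 lead byte.

    Structural check only, using the original 1-to-6-byte layout.
    It does not reject overlong forms, and RFC 3629 would further
    restrict valid lead bytes to 0xF4 and below.
    """
    if lead < 0x80:
        return 0  # 0xxxxxxx: single-byte (ASCII)
    if lead < 0xC0:
        raise ValueError("continuation byte, not a lead byte")
    if lead < 0xE0:
        return 1  # 110xxxxx
    if lead < 0xF0:
        return 2  # 1110xxxx
    if lead < 0xF8:
        return 3  # 11110xxx
    if lead < 0xFC:
        return 4  # 111110xx
    if lead < 0xFE:
        return 5  # 1111110x
    raise ValueError("0xFE and 0xFF never appear in UTF-8")
```

The largest value this can return is 5, which is the bound given above.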

>   And given combining codepoints and
> variation selectors, the next codepoint may not begin a new
> character itself.

Actually, this raises two very important points:

  1. The correct primitive is READ-CODEPOINT, not READ-CHAR.
     READ-CHAR is a library routine.

     Implication: text ports are not primitive either, and
     (whatever they may be named) should be understood as
     codepoint ports.

  2. The standard must define a normalization form as well as
     an encoding for input units of compilation.

     The universal answer in every other system/language that I have
     seen has been: UTF-8, normalization form C.

     I suspect that this will generate much debate as everyone
     expresses opinions about it, and will then turn out to be
     the "obvious" answer in hindsight, so let's get it over
     with :-)
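
To illustrate why point 2 matters, here is a short Python sketch using the standard `unicodedata` module. It shows two code point sequences for the same visible text that compare unequal until both are put in normalization form C. The helper name `same_after_nfc` is hypothetical.

```python
import unicodedata

def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE

# Raw code point sequences differ, so a naive comparison fails.
raw_equal = decomposed == composed

# After NFC both are the single code point U+00E9.
nfc_equal = same_after_nfc(decomposed, composed)
```

Without a mandated normalization form, two source files could spell the "same" identifier with different code point sequences.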

One very nice attribute of this approach is that the implementation can
use READ-CODEPOINT (which has unambiguous behavior) directly, leaving
READ-CHAR and text ports as a matter for library implementation.
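
A minimal Python sketch of that layering, under stated assumptions: input is UTF-8 restricted to the RFC 3629 range, and `read_codepoint`, `CodepointPort`, and `read_char` are hypothetical names, not a proposed API. The character grouping here (a base code point plus any following mark code points) is a deliberate simplification, not full grapheme cluster segmentation.

```python
import io
import unicodedata
from typing import BinaryIO, Optional

def read_codepoint(port: BinaryIO) -> Optional[int]:
    """Primitive: decode one UTF-8 code point, or None at end of input.

    Simplified: does not reject overlong 3/4-byte forms or surrogates.
    """
    first = port.read(1)
    if not first:
        return None
    lead = first[0]
    if lead < 0x80:
        return lead
    if 0xC2 <= lead <= 0xDF:
        n, cp = 1, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        n, cp = 2, lead & 0x0F
    elif 0xF0 <= lead <= 0xF4:
        n, cp = 3, lead & 0x07
    else:
        raise ValueError(f"invalid UTF-8 lead byte 0x{lead:02x}")
    tail = port.read(n)
    if len(tail) != n or any(b & 0xC0 != 0x80 for b in tail):
        raise ValueError("truncated or malformed UTF-8 sequence")
    for b in tail:
        cp = (cp << 6) | (b & 0x3F)
    return cp

class CodepointPort:
    """Code point port with one code point of pushback."""
    def __init__(self, port: BinaryIO) -> None:
        self._port = port
        self._pending: Optional[int] = None

    def read(self) -> Optional[int]:
        if self._pending is not None:
            cp, self._pending = self._pending, None
            return cp
        return read_codepoint(self._port)

    def unread(self, cp: int) -> None:
        self._pending = cp

def read_char(port: CodepointPort) -> Optional[str]:
    """Library routine: a base code point plus any following marks."""
    cp = port.read()
    if cp is None:
        return None
    parts = [chr(cp)]
    while True:
        nxt = port.read()
        if nxt is None:
            break
        # Categories Mn, Mc, Me cover combining marks and variation selectors.
        if not unicodedata.category(chr(nxt)).startswith("M"):
            port.unread(nxt)
            break
        parts.append(chr(nxt))
    return "".join(parts)

port = CodepointPort(io.BytesIO("e\u0301x".encode("utf-8")))
first_char = read_char(port)   # 'e' plus the combining accent
second_char = read_char(port)  # 'x'
```

Only `read_codepoint` touches bytes; everything about what counts as a "character" lives in `read_char`, which can be replaced without changing the primitive.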


shap