Issues with Unicode

I apologize that I have not had time to read the entire thread on this
SRFI. The following describes what I did in BitC (a statically typed
language, but similar in "feel" to Scheme in many respects) in case it
is helpful. If nothing else, enumerating the problems and issues I
considered may be useful to the SRFI-75 discussion.

I will broaden the scope in a few places. The introduction of Unicode
re-opens issues of input and output normalization, and also of
compilation-unit character sets. All of these issues must be dealt with
consistently.

For reference, the relevant parts of the BitC specification may be found
at:

  Compilation unit character set and normalization:
    http://www.coyotos.org/docs/bitc/spec.html#2

  Character Literals:
    http://www.coyotos.org/docs/bitc/spec.html#2.4.3

  String Literals:
    http://www.coyotos.org/docs/bitc/spec.html#2.4.4


Some comments:

1. The historical Scheme (and LISP) syntax for character literals, which
I borrowed, is unfortunate. There is no way to lexically reconcile the
special character literal tokens with the escaping mechanism that is
used within strings. We tried for a while, and gave up.
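To illustrate the conflict using standard Scheme syntax: the newline
character is written #\newline, but within a string the same character
would be written "\n" under the usual escape convention. A unified token
such as

	#\n

cannot be resolved lexically: it could denote either the letter n or an
escaped newline.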

2. Having given up, we then adopted many of the single-character
backslash escapes from C. It is exceedingly convenient to be able to
write an embedded newline or carriage return, the escape conventions are
well known to nearly all programmers, and they are already recognized by
many implementations of Scheme. Some form of escaping is necessary in
any case to admit a double quote within strings. I recommend adopting
these.
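For concreteness, the familiar C escapes would allow strings such as the
following (the exact set to adopt is, of course, up to the SRFI):

	"line one\nline two"       ; \n  newline
	"column\tcolumn"           ; \t  horizontal tab
	"she said \"hi\""          ; \"  embedded double quote
	"a backslash: \\"          ; \\  backslash itself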

3. There is an issue with newline processing in input and output (which
probably is the subject of a different SRFI). Platforms do not agree
about newline conventions in text files. A regrettable consequence is
that character streams require specification at open time as to whether
they are being opened for binary or text processing.

One consequence of this is that the R5RS specification for
open-output-file and open-input-file is inadequate. A second argument
needs to be added to specify the newline-processing convention. Note
that this also became an issue for UNIX STDIO: the C standard now
mandates acceptance of "b" in the file mode argument to fopen(), and
many implementations additionally accept a nonstandard "t".

This is also an issue for string ports.

In general, any operation that opens a port must specify the desired
processing for newlines.
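A minimal sketch of what such an interface might look like; the second
argument and its symbols are hypothetical, not part of any existing
standard:

	; Hypothetical extended signatures; 'text and 'binary are invented here.
	(open-input-file  "report.txt" 'text)    ; \r\n (or \r) normalized to \n
	(open-input-file  "image.png"  'binary)  ; bytes passed through unmangled
	(open-output-file "report.txt" 'text)    ; \n written per platform convention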

4. In considering what to do about identifiers, I concluded that the
problem should be divided into "first characters" and "follow
characters". This aligned things nicely with the existing Unicode
identifier model, and it was sufficient to then add a few punctuation
characters to the legal set. The set of additional characters was taken
from the Common Lisp standard, but it should not be hard to adapt it to
Scheme.
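A sketch of the division in Scheme terms; unicode-id-start? and
unicode-id-continue? are hypothetical stand-ins for the Unicode ID_Start
and ID_Continue properties, and the extra punctuation shown is R5RS's
own set rather than the Common Lisp one:

	; Hypothetical predicates standing in for Unicode ID_Start/ID_Continue.
	(define special-initials
	  (string->list "!$%&*/:<=>?^_~"))       ; R5RS special initial characters

	(define (identifier-start? ch)
	  (or (unicode-id-start? ch)
	      (memv ch special-initials)))

	(define (identifier-follow? ch)
	  (or (unicode-id-continue? ch)
	      (memv ch special-initials)
	      (memv ch (string->list "+-.@"))))  ; R5RS special subsequents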

5. Since integer literal tokenization already needs to handle a leading
radix, I saw no reason to preclude this for character code points as
well.
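In Scheme terms, integer literals already admit radix prefixes; the
character forms below are hypothetical illustrations of extending the
same idea to code points, not proposed syntax:

	#x41  #o101  #b1000001     ; the integer 65 in hex, octal, and binary
	#\{U+0041}                 ; code point in the (conventionally hex) U+ form
	#\{#o101}                  ; a hypothetical radixed equivalent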

6. I went back and forth several times on numeric encoding of code
points within strings. I found no answer that would satisfy everyone,
but I did arrive at some conclusions:

  1. It seems unlikely that the Unicode character set is done
     growing. This suggests that the code point embedding in strings
     wants to be delimited.

  2. Setting aside the risk of code point growth, the most common
     code point embedding syntax is extremely error prone. For a C
     programmer, it is easy to misread
         \0767
     (is that a single escape, or the three-digit escape \076 followed
     by the character 7?), and especially so when the length of code
     points is variable.

  3. Since the longer code points are quite long indeed, it seems
     undesirable to require a fixed-length sequence of digits
     following the '\' (or whatever delimiter is selected).

  4. Nearly all Unicode literature uses the convention

	U+xxxx

     to describe characters. Given a choice of bad tokens, familiar
     ones are better than unfamiliar ones.

  5. Some languages distinguish between

	u+xxxx  U+xxxxxxxx

     This is an artifact of having adopted Unicode prior to Unicode 3.0,
     when it was still believed that 16-bit code points would suffice.

After some discussion within our group, we settled on

	\{U+xxx...xxxx}     within strings
	#\{U+xxx...xxxx}    character literals

While the delimiting is not necessary in character literals, it is
visually parallel and it simplifies tokenization slightly.
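For example, GREEK SMALL LETTER LAMDA (U+03BB) would be written:

	"a \{U+03BB}-expression"    ; within a string
	#\{U+03BB}                  ; as a character literal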

7. I am NOT convinced that the current handling of white space in BitC
strings is right. Given that escapes are permitted, and that the human
eye has a hard time distinguishing whitespace characters that may render
identically, I concluded that it was better to restrict the legal set of
non-escaped white space within string literals (and compilation units)
rather than expand it. This approach has two advantages: (1) I know it
will work, and (2) expanding the legal white space later is a compatible
change.
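The confusability is real: U+0020 (SPACE), U+00A0 (NO-BREAK SPACE), and
U+2003 (EM SPACE) typically render indistinguishably. A minimal sketch
of the restrictive policy, assuming (hypothetically) that only space and
newline are legal unescaped:

	; Sketch: restrict unescaped whitespace in string literals to
	; SPACE and LINEFEED; anything else must be written as an escape.
	(define (legal-literal-whitespace? ch)
	  (or (char=? ch #\space)
	      (char=? ch #\newline)))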

8. On advice from several other language designers, I chose
normalization form C (NFC) and UTF-8 as the input format. NFC seems to
be generally agreed upon as the preferred form. UTF-8 is backward
compatible with 7-bit ASCII (the lower half of ISO-LATIN-1).
Automatically detecting the UTF encoding of an input unit is perilous.
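To see why a single normalization form matters: the same text "é" has
two valid encodings, and unnormalized comparison would treat them as
different strings. NFC selects the precomposed form:

	U+00E9           ; LATIN SMALL LETTER E WITH ACUTE (the NFC form)
	U+0065 U+0301    ; LATIN SMALL LETTER E + COMBINING ACUTE ACCENT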

9. Once you have a variable-length character representation, it becomes
necessary to provide a separate means for reading bytes from input
streams. For example, this is needed if the programmer wishes to
construct code to process files in (e.g.) UTF-32. This raises a question
about newline canonicalization. My suggestion is that the port's
handling of newlines should be independent of which operation the caller
uses. That is, read-byte on a text-mode port that would normally convert
the input \r\n to \n should return the byte corresponding to \n. If you
want unmangled bytes, use binary mode input.

The same argument does *not* apply for read-char, because it is the
nature of read-char to process the bytes in order to determine character
length.
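A sketch of the suggested behavior; read-byte is hypothetical, since
R5RS defines no byte-level input:

	; Suppose the underlying file contains the bytes 61 0D 0A 62 ("a\r\nb").
	; On a text-mode port, newline conversion happens below both readers:
	;   (read-char p)  =>  #\a  #\newline  #\b
	;   (read-byte p)  =>  97   10         98     ; the \r is already gone
	; On a binary-mode port, read-byte yields 97 13 10 98, unmangled.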

10. string-length must return the length in characters. For
serialization purposes, it *may* be advisable to add string-byte-length
as well; we have not come to a decision about this.
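For example, assuming a UTF-8 internal encoding and using the escape
syntax from point 6 (string-byte-length being the undecided addition):

	(string-length "\{U+03BB}x")        ; => 2, two characters
	(string-byte-length "\{U+03BB}x")   ; => 3, U+03BB occupies 2 bytes in UTF-8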

11. Strings now, more than ever, are not just vectors of characters
(though this should be a feasible implementation). There is *excellent*
discussion of the issues in the libicu documentation, and I strongly
recommend reading that.

12. The libicu package has gotten almost all of this stuff right, and is
very widely used. I strongly recommend using it as a model for the
character processing library that necessarily goes with Scheme.

13. Because the Unicode specification is updated and corrected over
time, the Scheme standard must specify which Unicode version it
conforms to.