Re: Issues with Unicode

[Note:  Due to a typo, Shapiro's response to me was not forwarded to
the srfi-75 mailing list.  I have therefore included it in full
below, as well as an unusually full response.]

Jonathan S. Shapiro scripsit:

> On Sun, 2006-04-23 at 12:23 -0400, John Cowan wrote:
> > > 4. In considering what to do about identifiers, I concluded that the
> > > problem should be divided into "first characters" and "follow
> > > characters". This aligned things nicely with the existing Unicode
> > > identifier model, and it was sufficient to then add a few punctuation
> > > characters to the legal set. The set of additional characters was taken
> > > from the Common LISP standard, but it should not be hard to adapt it to
> > > scheme.
> > 
> > IMHO this is over-conservative, and prevents us from exploiting the wealth
> > of mathematical operators and symbols in the standard.  Having wrestled
> > with this issue in drafting XML 1.1, I now firmly believe that identifiers
> > should be defined inclusively, not exclusively, in the Lisp family.
> > See the discussion on p. 132 of the Unicode Standard version 4.0
> > (online at http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf ).
> I do not understand this point. Let us talk about a concrete case: the
> BitC identifier specification, which can be found at:
>   http://www.coyotos.org/docs/bitc/spec.html#2.2
> The list of ``extended alphabetic characters''  might be different for
> Scheme, but you seem to imply that this approach is inappropriate in
> general. After reading that subsection, can you say what issues this
> approach does not address?

I'll quote
http://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax ,
which does a better job than I can:

        The disadvantage of working with the syntactic classes defined
        above is the storage space needed for the detailed definitions,
        plus the fact that with each new version of the Unicode Standard
        new characters are added, which an existing parser would not be
        able to recognize. In other words, the recommendations based on
        that table are not upwardly compatible.

        This problem can be addressed by turning the question
        around. Instead of defining the set of code points that are
        allowed, define a small, fixed set of code points that are
        reserved for syntactic use and allow everything else (including
        unassigned code points) as part of an identifier. All parsers
        written to this specification would behave the same way for all
        versions of the Unicode Standard, because the classification of
        code points is fixed forever.

        The drawback of this method is that it allows ``nonsense''
        to be part of identifiers because the concerns of lexical
        classification and of human intelligibility are separated. Human
        intelligibility can, however, be addressed by other means, such
        as usage guidelines that encourage a restriction to meaningful
        terms for identifiers. For an example of such guidelines, see
        the XML 1.1 specification by the W3C [XML1.1].

        By increasing the set of disallowed characters, a reasonably
        intuitive recommendation for identifiers can be achieved. This
        approach uses the full specification of identifier classes, as
        of a particular version of the Unicode Standard, and permanently
        disallows any characters not recommended in that version for
        inclusion in identifiers. All code points unassigned as of that
        version would be allowed in identifiers, so that any future
        additions to the standard would already be accounted for. This
        approach ensures both upwardly compatible identifier stability
        and a reasonable division of characters into those that do and
        do not make human sense as part of identifiers.

        Some additional extensions to the list of disallowed code points
        can be made to further constrain ``unnatural'' identifiers. For
        example, one could include unassigned code points in blocks of
        characters set aside for future encoding as symbols, such as
        mathematical operators.

        With or without such fine-tuning, such a compromise approach
        still incurs the expense of implementing large lists of code
        points. While they no longer change over time, it is a matter
        of choice whether the benefit of enforcing somewhat word-like
        identifiers justifies their cost.

        Alternatively, one can use the properties described below,
        and allow all sequences of characters to be identifiers that
        are neither pattern syntax nor pattern whitespace. This has the
        advantage of simplicity and small tables, but allows many more
        ``unnatural'' identifiers.

        R2      Alternative Identifiers

                To meet this requirement, an implementation shall define
                identifiers to be any string of characters that contains
                neither Pattern_White_Space nor Pattern_Syntax characters.

                Or, it shall declare that it uses a modification, and
                provide a precise list of characters that are added
                to or removed from the sets of code points defined by
                these properties.
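
As a rough illustration of the "whatever is not reserved is allowed"
approach quoted above, here is a minimal Python sketch.  The
Pattern_White_Space set is the short list given in UAX #31; the
RESERVED_SYNTAX set is purely a hypothetical stand-in for a Lisp-like
reader (in the sense of the "modification" clause of R2) and is not the
real Pattern_Syntax list.  The name is_identifier is likewise made up
for this example.

```python
# Pattern_White_Space as listed in UAX #31 (a small, stable set).
PATTERN_WHITE_SPACE = frozenset({
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020,
    0x0085, 0x200E, 0x200F, 0x2028, 0x2029,
})

# ASSUMPTION: a hypothetical reserved set for a Lisp-like reader.
# This is NOT the Pattern_Syntax property, which is far larger.
RESERVED_SYNTAX = frozenset(ord(c) for c in "()[]{}\";'`,")


def is_identifier(text):
    """Accept any non-empty string with no reserved or whitespace code point.

    Nothing here consults the character database, so the answer does not
    change when a new Unicode version assigns more characters.
    """
    if not text:
        return False
    return not any(
        ord(ch) in PATTERN_WHITE_SPACE or ord(ch) in RESERVED_SYNTAX
        for ch in text
    )
```

Under these assumptions, is_identifier("\u2260") and
is_identifier("\u2200") both hold, while "a b" and "(a)" are rejected.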

> I believe that the specified "extended identifier characters" cover the
> "wealth of mathematical operators". If not, then by all means expand it
> further, but do so in specific form.

Well, if we look at Pattern_Syntax, we see that about 2760 characters
are permanently banned from use inside identifiers, including several
codepoint ranges where UTC can add additional characters of this type.
All the other non-whitespace Unicode characters thus become usable
in identifiers.  This list is far too long to discuss in detail,
but one can point to some obvious cases.

For example, why should U+003D EQUALS SIGN be permitted but U+2260
NOT EQUAL TO be forbidden?  Why should we write "forall" rather than
U+2200 FOR ALL, the inverted "A"?  What's wrong with U+22D9 VERY MUCH
GREATER-THAN rather than ">>"?  Consider the ceiling and floor
operators.  Consider the APL symbols with their well-known APL
meanings.  Consider the arrows.  Consider the proper signs for
logical AND and OR.  Consider the diamond and square used in
modal logic for possibility and necessity.  I could go on for
a long, long time.

These characters are probably irrelevant to languages that make
hard distinctions between "operators" and "identifiers", but Lispy
languages have always used the same space for both.

> > >   1. It seems unlikely that the unicode character set is done
> > >      growing. This suggests that the code point embedding in strings
> > >      wants to be delimited.
> > 
> > Care to put your money where your mouth is?  Henry Thompson of the W3C
> > incautiously bet me back in 2001 that within five years Unicode would
> > have exceeded its architectural maximum of 17*65536 = 1,114,112 scalar
> > values.  He's already conceded.
> Yes.

Okay.  Name a price, a date, and (if you like) a third party to hold
the money.  "I'm a poor man, your Majesty", so I'll set an upper
limit of US$100.

> Please re-read my statement: "It seems *unlikely* that the unicode
> character set is done growing." Therefore, character code points should
> be delimited. I chose #\{U+xxx.xxx}. The important points are (1) some
> unambiguous bracketing, and (2) some unambiguous human-comprehensible
> indicator that this is intended to be taken as a Unicode code point.
> Since the textual convention for *writing* Unicode code points seems to
> be (almost universally) "U+xxx..xxx", I chose to adapt this. It is not
> perfect. It works sufficiently well and is visually distinctive.

I do agree that delimiting Unicode character references is appropriate
(carrying around 6-digit forms all the time is annoying and potentially
confusing), and I have no objection to your particular choice of
delimiters, only to your rationale.
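
For concreteness, a delimited reference of this kind can be read with
a few lines of Python.  This is only a sketch: it assumes the notation
is exactly #\{U+ followed by hex digits and a closing brace, which is
my reading of the proposal rather than a settled syntax, and the name
parse_char_ref is made up.  The range check uses the architectural
limit mentioned earlier in the thread.

```python
import re

# 17 planes of 65536 code points each: the highest code point is 0x10FFFF.
MAX_CODE_POINT = 17 * 65536 - 1

# ASSUMPTION: the delimited form is  #\{U+<hex digits>}  and nothing else.
_CHAR_REF = re.compile(r"#\\\{U\+([0-9A-Fa-f]+)\}")


def parse_char_ref(token):
    """Return the character named by a delimited reference such as #\\{U+2260}."""
    match = _CHAR_REF.fullmatch(token)
    if match is None:
        raise ValueError("not a delimited character reference: %r" % token)
    code_point = int(match.group(1), 16)
    if code_point > MAX_CODE_POINT:
        raise ValueError("beyond U+10FFFF: %X" % code_point)
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code point: %X" % code_point)
    return chr(code_point)
```

Because the closing brace ends the reference, the number of hex digits
does not need to be fixed in advance, which is the mechanical benefit
of delimiting regardless of how large the code space is.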

> > ...Even the Han script, which is
> > already far and away the largest and is even still growing to some
> > extent, isn't going to get us past Plane 3.  No, until we meet the
> > Galactic Empire, Unicode's current architecture is secure.
> I recall similar arguments about 16 bit address spaces. The delimiters
> resolve both a mechanical and a human parse ambiguity. They do no harm,
> and they may serve to protect us in future.

Comparisons to address spaces are irrelevant.  See
for more on this point.

> > > 9. Once you have a variable-length character representation, it becomes
> > > necessary to incorporate separate means for reading bytes from input
> > > streams. For example this is needed if the programmer wishes to
> > > construct code to process files in (e.g.) UTF-32. This raises a question
> > > about newline canonicalization. My suggestion is that the port's
> > > handling of newlines should be independent of the caller. That is,
> > > read-byte on a text-mode port that would normally convert the input \r\n
> > > to \n should return the byte corresponding to \n. If you want unmangled
> > > bytes, use binary mode input.
> > 
> > The trouble there is that you may want to convert \r\n to \n even if the
> > encoding of the port is UTF-16 or something else not ASCII-compatible;
> > indeed, randomly removing \r bytes (as opposed to characters) will
> > randomly corrupt UTF-16 streams.  Newline handling has to be done
> > after character decoding.
> No. Newline decoding is a policy decision about the *meaning* of the
> newline. The issue is precisely to decide what *is* the character
> decoding.

I think we are talking past each other here.  There is no way in
principle to tell what a newline is until you have decoded the
stream of bytes into characters.  The "t" hack in fopen assumes
that all encodings are upward compatible with ASCII, which turns
out not to be the case.  In addition, we now have five representations
of newline:  CR, LF, CR+LF, NEL (U+0085), LS (U+2028).
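
To make the ordering concrete, here is a small Python sketch (the
function names are mine, not from any proposal).  It decodes first and
only then maps the five newline forms listed above to LF, and it also
shows what happens when CR bytes are stripped from a UTF-16 stream
before decoding.

```python
import re

# CR+LF must be tried first so it collapses to a single LF.
_NEWLINE = re.compile("\r\n|[\r\n\x85\u2028]")


def normalize_newlines(text):
    """Map CR, LF, CR+LF, NEL (U+0085) and LS (U+2028) to LF."""
    return _NEWLINE.sub("\n", text)


def read_text(raw, encoding):
    """Decode bytes to characters first, then canonicalize newlines."""
    return normalize_newlines(raw.decode(encoding))


def strip_cr_bytes(raw):
    """The byte-level approach: remove every 0x0D byte, whatever it belongs to."""
    return raw.replace(b"\r", b"")
```

With raw = "a\r\n\u010d".encode("utf-16-le"), read_text(raw,
"utf-16-le") gives "a\n\u010d", whereas strip_cr_bytes(raw) also
removes the 0x0D byte that is half of U+010D and shifts everything
after it, so the result no longer decodes to the original text.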

> More generally, *any* decision about what constitutes a character
> codepoint constitutes a form of interpretation. Fundamentally, I am
> arguing that there need to be uninterpreted and interpreted ports.
> However distasteful some may find this on purist grounds, it has worked
> well for many years in practice, and no other approach has been advanced
> that can claim this. Indeed, other approaches have been sufficiently
> malformed as to lead to general acceptance of the distinction between
> "text" and "raw" I/O descriptors.

Provided you deal with the fact that UTF-16 and EBCDIC and other
non-ASCII-compatible encodings cannot be used over "text"
I/O descriptors, this is fine.

> > > 13. Because the Unicode specification is updated and corrected, it is
> > > necessary for the Scheme standard to specify a version.
> > 
> > I disagree.  Almost all the changes are upward compatible, and by careful
> > wording it is possible to avoid those areas where non-upward-compatible
> > changes are possible...
> If you can find a wording that satisfies this objective, I shall be
> happy to adopt it for BitC as well. I considered the matter (though not
> exhaustively), and could not find such a wording.
> For example, various "upwards compatible" corrections have added
> characters to the set of legal identifier characters. This poses a
> problem: if your implementation of Scheme accepts Unicode 4.1 plus
> corrections a+b+c, and my implementation accepts Unicode 4.1, then there
> are well-formed programs that your system will accept and my system will
> not. This is exactly the sort of problem that compatibility seeks to
> preempt.

The method explained above was designed to bypass this problem.
With it, whatever is not forbidden is permitted.

> I do not propose that we should inhibit implementations from adopting
> upward-compatible corrections to Unicode. Rather, I propose that we
> should state the *least* version of Unicode that a compliant system must
> accept, and further that we encourage implementations (or some external
> checking program) to provide some checking mode in which they interpret
> compliance strictly, in order to ensure that a developer can check the
> universal acceptance compliance of their programs.

I have no problem with that.
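
A strict checking mode of this kind might look like the following
Python sketch.  It uses the Unicode 3.2 database that Python's
unicodedata module happens to ship (unicodedata.ucd_3_2_0) purely as a
stand-in baseline; the version discussed in this thread is different,
and the function name is hypothetical.

```python
import unicodedata

# ASSUMPTION: Unicode 3.2 stands in for "the least version a compliant
# system must accept"; it is simply the old database Python provides.
BASELINE = unicodedata.ucd_3_2_0


def unassigned_in_baseline(text):
    """Return the characters of text that are unassigned (Cn) in the baseline.

    A program whose identifiers yield an empty list relies only on
    characters every baseline-compliant implementation already knows.
    """
    return [ch for ch in text if BASELINE.category(ch) == "Cn"]
```

An identifier containing U+2260 passes such a check, while one
containing a character added after the baseline version is reported.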

> shap

John Cowan  cowan@xxxxxxxx  http://ccil.org/~cowan
In the sciences, we are now uniquely privileged to sit side by side
with the giants on whose shoulders we stand.
        --Gerald Holton