
Re: Encodings.

This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 contain all messages, not just those from before July 7th, 2015.

On Thu, 12 Feb 2004, Ken Dickey wrote:

> I assume that it is useful to distinguish the two goals of
>	extending programming language identifiers
> and	processing Unicode data.

For temporary solutions and band-aids, yes.  But Scheme is a Lisp, and
our code is data and our data is code.  Our identifier-naming rules,
ultimately, *can* affect our program behavior, whereas in C and similar
languages they cannot.

Every implementation that deals with Unicode at all seriously is going
to have to create rules for distinguishing Unicode identifiers, and to
the extent that they adopt *different* rules, there will be enduring
and sometimes very subtle portability problems, and bugs where code
works slightly differently on one system than it does on another.

> So w.r.t. identifiers, why is normalization needed at all? To my
> mind, normalization is a library procedure (set of procedures) for
> dealing with Unicode data/codepoints.

Normalization is needed because Unicode provides many different ways
to represent what is intended to be _EXACTLY_ the same string.  When
you see a string containing, say, a lowercase 'a' with a grave accent,
you don't know whether that's one codepoint (what Unicode calls a
'precomposed character') or two codepoints (what Unicode calls a 'base
character' plus a 'combining mark').
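
To make the two representations concrete, here is a minimal sketch.
It is in Python rather than Scheme, only because Python's standard
library ships a `unicodedata` module; the variable names are
illustrative and not part of any proposal in this thread.

```python
import unicodedata

# One codepoint: U+00E0 LATIN SMALL LETTER A WITH GRAVE (precomposed).
precomposed = "\u00e0"
# Two codepoints: 'a' followed by U+0300 COMBINING GRAVE ACCENT.
decomposed = "a\u0300"

# Compared codepoint by codepoint, the two strings differ.
same_raw = precomposed == decomposed

# After normalizing both to the same form, they compare equal.
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))
same_nfd = (unicodedata.normalize("NFD", precomposed)
            == unicodedata.normalize("NFD", decomposed))

lengths = (len(precomposed), len(decomposed))
```

Both strings render as the same accented letter, yet a naive
comparison treats them as different until they are normalized.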

Furthermore, Unicode editors are not required to distinguish these forms,
and may convert one to the other, and back, arbitrarily and without your

In fact, for most operations where it might matter, they are required
to *not* distinguish such forms, so that if you 'text search' for one
you should also find all instances of the other.  Editors and compilers
typically convert all Unicode text to a preferred normalization form
as soon as it is read.

There are four sanctioned choices: NFC, NFD, NFKC, and NFKD.  NFC and
NFD may be converted to and from each other without loss, as may NFKC
and NFKD; within each pair the forms are truly equivalent.  However,
NFC/NFD allow much finer-grained distinctions between similar strings
than NFKC/NFKD.  Identifiers that are distinct in the first two forms
may become identical in the latter two, and once converted to one of
the latter two forms they no longer contain sufficient information to
be converted back.
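
A small sketch of that difference, again in Python with the standard
`unicodedata` module.  The example identifiers and the helper name
`roundtrips_nfc_nfd` are made up for illustration; U+FB01 is the
LATIN SMALL LIGATURE FI, which has only a compatibility decomposition.

```python
import unicodedata

def roundtrips_nfc_nfd(s):
    """Check that NFC -> NFD -> NFC returns the same NFC string."""
    nfc = unicodedata.normalize("NFC", s)
    nfd = unicodedata.normalize("NFD", nfc)
    return unicodedata.normalize("NFC", nfd) == nfc

ligature_name = "\ufb01le"   # begins with the single 'fi' ligature codepoint
plain_name = "file"          # begins with 'f' followed by 'i'

# Under NFC the two identifiers stay distinct.
distinct_under_nfc = (unicodedata.normalize("NFC", ligature_name)
                      != unicodedata.normalize("NFC", plain_name))

# Under NFKC the ligature is replaced by 'f' + 'i', so they collide.
identical_under_nfkc = (unicodedata.normalize("NFKC", ligature_name)
                        == unicodedata.normalize("NFKC", plain_name))

# NFC and NFD convert back and forth without loss.
lossless = roundtrips_nfc_nfd("\u00e0") and roundtrips_nfc_nfd(ligature_name)
```

Once `ligature_name` has been through NFKC, nothing in the result
records that a ligature was ever there, which is the loss of
information described above.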

> Defining valid identifier syntax such that case folding of
> (unnormalized) identifier literals should be sufficient.

> What am I missing?

You're missing all the tools and utilities out there that are
programmed with the expectation and requirement that they can
arbitrarily impose or change normalization forms without changing the
text of the documents they handle.  There is no escaping this; even
Emacs and Notepad do it.
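
One way an implementation could protect itself from such tools is to
normalize every identifier before interning it.  The following is
only a sketch under stated assumptions: the function name
`intern_identifier`, the dictionary-based symbol table, and the choice
of NFC are all hypothetical, not something this thread has settled.

```python
import unicodedata

def intern_identifier(table, name):
    """Return a stable id for name, keyed on its NFC form (assumed choice)."""
    key = unicodedata.normalize("NFC", name)
    # The first spelling seen allocates the id; later spellings reuse it.
    return table.setdefault(key, len(table))

symbols = {}
# The same identifier as saved by two tools using different forms.
id_precomposed = intern_identifier(symbols, "caf\u00e9")
id_decomposed = intern_identifier(symbols, "cafe\u0301")
id_other = intern_identifier(symbols, "cafe")
```

With this approach a source file means the same thing whether or not
an editor has silently rewritten it into another normalization form.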

> Another note.  Characters are currently dealt with in a fairly abstract
> manner.  It would seem that in dealing with Unicode data as binary data
> (codepoints), R6RS/SRFI/... must define a binary IO API.

I think that's true.  While programmers should never have to care
about the internal representation of characters, they have to care
about writing them in a format acceptable to other systems and reading
different forms of them that are written by other systems.
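
As a rough illustration of why the external format matters, the same
abstract character becomes different byte sequences depending on the
encoding chosen at the I/O boundary.  This sketch uses Python's
built-in `str.encode` and `bytes.decode`; it is not a proposal for
what a Scheme binary I/O API should look like.

```python
# One abstract character: U+00E0 LATIN SMALL LETTER A WITH GRAVE.
text = "\u00e0"

# Two different on-the-wire representations of that character.
as_utf8 = text.encode("utf-8")         # two bytes: C3 A0
as_utf16be = text.encode("utf-16-be")  # two bytes: 00 E0

# A reader must know which encoding was used to recover the character.
from_utf8 = as_utf8.decode("utf-8")
from_utf16be = as_utf16be.decode("utf-16-be")
```

Inside the program both decoded values are the same string; only the
reader and writer at the boundary need to agree on the byte format.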