On Thursday 12 February 2004 11:07 pm, Bradd W. Szonye wrote:
> On Thu, Feb 12, 2004 at 02:10:18PM +0100, Ken Dickey wrote:
> > Ah! So a broken language (huge tables and complex processing) must be
> > defined to deal with broken tools which do not write out Unicode data
> > in a canonical format.
> ..., there's more
> than one canonical form. The "C" forms compose characters into the
> smallest number of code-points possible. The "D" forms decompose them
> into fully-general base+combining forms. Programs which disagree on the
> form of the I/O will need to translate between the two.
> > What about creating a tool which reads bizarre Unicode and writes it
> > out in a canonical format? Then requiring portable Scheme programs to
> > pass through it?
> That wouldn't help unless they agree to write the *same* canonical
> format. Besides, this is just separating part of the reader's job into
> an external program, and in an error-prone way.
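For readers who have not met the "C" and "D" forms described in the quoted text, here is a minimal sketch in Python using the standard `unicodedata` module. The helper names `to_nfc` and `to_nfd` are made up for illustration. It shows only that the same text can arrive as different code-point sequences and that two programs must pick the same form before comparing.

```python
import unicodedata

# The same visible character, written two ways.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (two code points)

def to_nfc(text):
    """Compose into the smallest number of code points (a "C" form)."""
    return unicodedata.normalize("NFC", text)

def to_nfd(text):
    """Decompose into base + combining sequences (a "D" form)."""
    return unicodedata.normalize("NFD", text)

# Raw comparison fails: the code-point sequences differ.
raw_equal = composed == decomposed

# Comparison succeeds once both sides are translated to the same form.
nfc_equal = to_nfc(composed) == to_nfc(decomposed)
nfd_equal = to_nfd(composed) == to_nfd(decomposed)
```

A writer that emits NFC and a reader that expects NFD would each have to run one of these translations, which is the disagreement the quoted text describes.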
I think there is again confusion between processing Unicode data and reading
Let's say that there is a Scheme SRFI (or even, *GASP*, a standard) which
picks a single canonical Unicode form (say the most compact one) and
requires, where Unicode is used, that Scheme programs be prepared in that
format. [And perhaps specify 'ascii/latin1/utf-8/ucs2/... parameters to open
the appropriate kind of input port].
This has essentially nothing to do with normalization and other processing of
This means that a Scheme reader can use a fairly simple case-folding algorithm
(compared to "slice-em-dice-em kitchen knife" normalization algorithms) which
is fairly compact [871 case-fold "exceptions" in Unicode 4] and hence leaves
implementations reasonably small.
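As a rough illustration of the split I have in mind, here is a sketch in Python (not Scheme, and not any particular reader's implementation). It assumes program text has already been prepared in one agreed canonical form, so comparing identifiers needs only case folding. The function name `same_identifier` is hypothetical, and Python's `str.casefold` uses the case-folding data of whatever Unicode version the interpreter ships with, not the Unicode 4 tables.

```python
def same_identifier(a, b):
    """Compare two identifiers using case folding only.

    Assumption: both strings are already in the single canonical
    form the program text is required to use, so no normalization
    step is performed here.
    """
    return a.casefold() == b.casefold()

# Case folding handles the ordinary cases and the "exceptions".
ascii_match = same_identifier("LAMBDA", "lambda")
sharp_s_match = same_identifier("Stra\u00dfe", "STRASSE")

# Case folding alone does not reconcile composed and decomposed text.
# That is the job pushed out to whatever prepares the source file.
mixed_forms_match = same_identifier("caf\u00e9", "cafe\u0301")
```

The last comparison comes out false. That is the point: the reader stays small because it is allowed to assume its input is in the agreed form.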
I do not buy the argument that "this is just separating part of the reader's
job into an external program, and in an error-prone way." I think that this
is keeping the reader manageable. Saying you have to swallow the ocean to
process a stream is silly (and dangerous!).