Re: Encodings.

This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 are here. Eventually, the entire history will be moved there, including any new messages.

On Thursday 12 February 2004 11:07 pm, Bradd W. Szonye wrote:
> On Thu, Feb 12, 2004 at 02:10:18PM +0100, Ken Dickey wrote:
> > Ah!  So a broken language (huge tables and complex processing) must be
> > defined to deal with broken tools which do not write out Unicode data
> > in a canonical format.
> ..., there's more
> than one canonical form. The "C" forms compose characters into the
> smallest number of code-points possible. The "D" forms decompose them
> into fully-general base+combining forms. Programs which disagree on the
> form of the I/O will need to translate between the two.
> > What about creating a tool which reads bizarre Unicode and writes it
> > out in a canonical format?  Then requiring portable Scheme programs to
> > pass through it?
> That wouldn't help unless they agree to write the *same* canonical
> format. Besides, this is just separating part of the reader's job into
> an external program, and in an error-prone way.
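
To make the quoted "C" versus "D" distinction concrete, here is a minimal sketch in Python (not Scheme), using only the standard `unicodedata` module. The helper name `is_canonical_source` is hypothetical, and the choice of NFC as the default form is an assumption made for illustration.

```python
import unicodedata

def is_canonical_source(text: str, form: str = "NFC") -> bool:
    """Return True if `text` is already in the given normalization form.

    Hypothetical helper: compares the text with its normalized copy.
    """
    return unicodedata.normalize(form, text) == text

composed = "\u00e9"      # U+00E9, one precomposed code point ("C" style)
decomposed = "e\u0301"   # U+0065 + U+0301 combining acute ("D" style)

# The two spellings denote the same character but are different
# code-point sequences, so tools must agree on one form or translate.
as_c = unicodedata.normalize("NFC", decomposed)
as_d = unicodedata.normalize("NFD", composed)
```

A pre-pass of this kind only helps if every tool in the chain agrees on the `form` argument, which is the point raised in the quoted reply.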

I think there is again confusion between processing Unicode data and reading 
Scheme programs.

Let's say that there is a Scheme SRFI (or even, *GASP*, a standard) which 
picks a single canonical Unicode form (say the most compact one) and 
requires, where Unicode is used, that Scheme programs be prepared in that 
format.  [And perhaps specify 'ascii/latin1/utf-8/ucs2/... parameters to open 
the appropriate kind of input port].

This has essentially nothing to do with normalization and other processing of 
Unicode data.

This means that a Scheme reader can use a fairly simple case-folding algorithm 
(compared to "slice-em-dice-em kitchen knife" normalization algorithms) which 
is fairly compact [871 case-fold "exceptions" in Unicode 4]  and hence leaves 
implementations reasonably small.
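
As a rough illustration of what such a table-driven case fold could look like, here is a sketch in Python rather than Scheme. The name `fold_identifier` and the three-entry `FOLD_EXCEPTIONS` table are hypothetical; a real reader would load the full exceptions list from the Unicode case-folding data, not this excerpt.

```python
# Hypothetical excerpt of a case-fold exceptions table; the real list
# is far longer. Each entry maps one code point to its folded form.
FOLD_EXCEPTIONS = {
    "\u00df": "ss",      # LATIN SMALL LETTER SHARP S
    "\u017f": "s",       # LATIN SMALL LETTER LONG S
    "\u03c2": "\u03c3",  # GREEK SMALL LETTER FINAL SIGMA
}

def fold_identifier(name: str) -> str:
    """Fold an identifier one code point at a time.

    Uses the exceptions table where an entry exists and plain
    lowercasing otherwise. No normalization is performed.
    """
    return "".join(FOLD_EXCEPTIONS.get(ch, ch.lower()) for ch in name)

# Folding alone does not unify composed and decomposed spellings,
# which is why the source is assumed to arrive in one agreed form.
precomposed = fold_identifier("\u00c9")    # precomposed E with acute
combining = fold_identifier("E\u0301")     # E + combining acute
```

The fold is a single table lookup per code point, so the reader carries one modest table and no normalization machinery.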

I do not buy the argument that "this is just separating part of the reader's 
job into an external program, and in an error-prone way."  I think that this 
is keeping the reader manageable.  Saying you have to swallow the ocean to 
process a stream is silly (and dangerous!).