Re: Encodings.

Paul Schlie wrote:
> I apologize if my tone was interpreted as being antagonistic.
> Although I may have abused the use of "canonical", my intent was to
> suggest that raw data I/O represents the fundamental basis required to
> support arbitrarily encoded data access ....

I agree. But have we actually advocated arbitrarily encoded data?
There are two levels here: How you encode the codepoints (UTF-8, UTF-16,
UTF-32, something else) and how you normalize the codepoints. The first
choice need not be arbitrary or even standardized. But no matter how you
do it, you'll need to deal with normalization if you're using Unicode.
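
To make the two levels concrete, here is a minimal sketch in Python (chosen
only for illustration; nothing in this thread or the SRFI depends on it). It
assumes the standard `unicodedata` module, and the variable names are
hypothetical.

```python
import unicodedata

# The same visible character written two ways:
# one precomposed codepoint, or "e" followed by a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

# Level 1: encoding. One codepoint sequence, three different byte sequences.
utf8 = composed.encode("utf-8")       # 2 bytes
utf16 = composed.encode("utf-16-be")  # 2 bytes
utf32 = composed.encode("utf-32-be")  # 4 bytes

# Level 2: normalization. The codepoint sequences differ regardless of
# which encoding carried them, until both are brought to the same form.
same_raw = composed == decomposed
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))

print(utf8, utf16, utf32)
print(same_raw, same_nfc)
```

Whichever encoding is picked at level 1, the comparison at level 2 still
needs a normalization step.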

> and in that respect, tried to suggest that null-encoding may be
> thought of as root canonical encoded form (where a null encoding
> transform does nothing, therefore lossless, fully preserving all the
> originally encoded data states in their native form) ....

That's not generally possible with Unicode. There is no single, standard
normalization form (Unicode defines several: NFC, NFD, NFKC and NFKD), and
applications must be prepared to deal with that. It's one of the
consequences of using Unicode. If you ignore it, you will not be able to
process text efficiently, because equivalent strings will not compare
equal. Fortunately, it's not a big deal to normalize graphemes.
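
As a rough illustration of why a pass-through of the raw data is not enough
on its own, the sketch below (Python, standard `unicodedata` module; the
helper `normalized_equal` and the sample strings are hypothetical) keeps the
bytes untouched and only normalizes at comparison time. It also shows that
the normalization forms do not all agree with each other.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalized_equal(a: bytes, b: bytes,
                     encoding: str = "utf-8", form: str = "NFC") -> bool:
    """Decode both byte strings, normalize to one form, then compare."""
    text_a = unicodedata.normalize(form, a.decode(encoding))
    text_b = unicodedata.normalize(form, b.decode(encoding))
    return text_a == text_b

# Two byte strings that a reader would consider the same word.
raw_a = "caf\u00e9".encode("utf-8")    # precomposed e-acute
raw_b = "cafe\u0301".encode("utf-8")   # "e" plus combining acute

bytes_match = raw_a == raw_b                 # raw data differs
text_match = normalized_equal(raw_a, raw_b)  # normalized text agrees

# The forms are not interchangeable: the compatibility forms rewrite the
# "fi" ligature (U+FB01) to two letters, the canonical forms leave it alone.
ligature = "\ufb01"
by_form = {form: unicodedata.normalize(form, ligature) for form in FORMS}

print(bytes_match, text_match)
print(by_form)
```

The raw bytes survive unchanged, which is what the lossless pass-through
gives you, but any comparison or search still has to pick a form first.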

> However, under no circumstances should Scheme I/O be presumed to be
> based on any particular character encoding which may be different from
> the host platform's presumption ....

Again, I don't think anyone has proposed this. The whole SRFI is
intended for platforms that *are* well-suited to using Unicode. But even
in that case, you need to deal with normalization issues.

Bradd W. Szonye