
Re: the "Unicode Background" section

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



From: Thomas Lord <lord@xxxxxxx>
Subject: Re: the "Unicode Background" section
Date: Fri, 22 Jul 2005 12:17:43 -0700

> I think it might be realistic to label ports not with
> the encoding scheme they want, but with the set of 
> code-values they can transmit -- in other words
> with their framing constraints.   In other words -- 
> a "UTF-8 port" (no such thing, really) and an "ASCII port"
> (no such thing, again) are *really* just "8-bit ports".
> A "UTF-16 port" is *really* just a "16-bit port".

Then you'll have difficulty reading a character from such ports.

Let's not mix CCS and CES.

Each CES covers one or more CCSs.  All Unicode CESs (utf-8, utf-16le,
etc.) cover the Unicode CCS.  EUC-JP covers ASCII, the JISX0201 GR
area, and JISX0213.  Shift_JIS covers JISX0201 and JISX0213 (but not
ASCII).

A Scheme implementation chooses one (or several) CCS as the character
set it supports.  It may be the Unicode CCS, a subset of it, or a
superset of it.  The implementation uses its internal CES to
represent characters in memory.  The outside world uses
some external CES.  The interface between them (it can be a port
or something else; I'll discuss it below) converts between the
internal CES and the external CES.

The important point is that we can standardize the behavior of
strings and characters at the CCS level, and leave the CES to the
implementation.
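
As an illustration of the CCS/CES split (a sketch in Python, chosen
only because its codecs are easy to run; the codec names are Python's
own and merely stand in for "some external CES"), one code point can
be pushed through several encoding schemes and recovered unchanged:

```python
# CCS level: one abstract character, identified by its code point.
ch = "\u3042"              # HIRAGANA LETTER A
code_point = ord(ch)       # 0x3042, independent of any byte encoding

# CES level: the same character under several encoding schemes.
# The codec names are Python's; they stand in for "some external CES".
schemes = ["utf-8", "utf-16-le", "euc_jp", "shift_jis"]
encoded = {name: ch.encode(name) for name in schemes}

# Decoding with the matching CES recovers the same CCS-level character.
decoded = {name: data.decode(name) for name, data in encoded.items()}

for name in schemes:
    print(f"{name:10} {encoded[name].hex()} -> U+{ord(decoded[name]):04X}")
```

The byte sequences differ per scheme, while the code point that string
and character operations see stays the same.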

The CES-conversion mechanism may reside either in ports (or in a
lower layer of ports, as srfi-68 suggests), or it may be
a function that converts between strings and binary vectors.
Both are handy in practice, but either one can be
implemented in terms of the other anyway.  (If you wish to generate
unpaired surrogates, that is the task of this layer.)
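
The two placements can be sketched side by side.  This is a Python
analogy, not srfi-68's API: the helper names are hypothetical, and
io.TextIOWrapper plays the role of a port with a conversion layer.

```python
import io

def encode_with_function(text, ces):
    """Function style: string in, byte vector out."""
    return text.encode(ces)

def encode_with_port(text, ces):
    """Port style: a text layer stacked on a binary sink converts on write."""
    sink = io.BytesIO()
    port = io.TextIOWrapper(sink, encoding=ces, newline="")
    port.write(text)
    port.flush()            # push buffered text through the encoder
    return sink.getvalue()

def decode_with_port(data, ces):
    """Port style in the other direction: convert while reading."""
    port = io.TextIOWrapper(io.BytesIO(data), encoding=ces, newline="")
    return port.read()

sample = "\u03bb-calculus"
same_bytes = encode_with_port(sample, "utf-8") == encode_with_function(sample, "utf-8")
round_trip = decode_with_port(encode_with_function(sample, "utf-16-le"), "utf-16-le")
```

Here the port version is built on an in-memory byte sink, which shows
one direction of "either can be implemented in terms of the other".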

The problem arises when the coverage of the internal CES and
that of the external CES don't match.  The conversion mechanism
has several options, and I see practical value in each of them:
it can signal an error, it can ignore the invalid
character, or it can replace the invalid character with
an alternative one (*1).  These are things the implementor should
think about, and they may be standardized by a port srfi or
something, but that's out of the scope of this srfi.
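
The three policies can be seen in a small sketch.  It uses Python's
codec error handlers ("strict", "ignore", "replace") as stand-ins; the
convert helper and its policy names are hypothetical, not anything a
port srfi defines.

```python
def convert(text, ces, policy):
    """Encode text into ces under one of three mismatch policies."""
    handlers = {"error": "strict", "ignore": "ignore", "replace": "replace"}
    return text.encode(ces, errors=handlers[policy])

text = "caf\u00e9"          # the last character is outside ASCII's coverage

# Policy 1: signal an error.
try:
    convert(text, "ascii", "error")
    signalled = False
except UnicodeEncodeError:
    signalled = True

# Policy 2: silently drop the character that cannot be represented.
ignored = convert(text, "ascii", "ignore")

# Policy 3: substitute an alternative character ("?" for this codec).
replaced = convert(text, "ascii", "replace")
```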

Now, (integer->char #xd800) is not in the Unicode CCS.  If srfi-75
is about Unicode strings, it can leave the behavior undefined.
An implementation may extend its CCS to support such a 'character'
and give it its own meaning.  Then the implementation also defines
its CES-conversion behavior for the case where the external CES
expects the Unicode CCS and the internal CES has #\ud800.  It _might_
also have its ports expect this extended CCS by default, so that if an
input port encounters an unpaired surrogate, read-char returns #\ud800.
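
One existing implementation that made this kind of choice is Python,
shown here only as an example of the design space and not as what a
Scheme should do: its strings admit a lone surrogate, its strict UTF-8
conversion rejects it, and an opt-in handler ("surrogatepass") lets it
through in both directions.

```python
lone = chr(0xD800)          # accepted: the internal CCS is extended

# The strict Unicode CES refuses a character outside the Unicode CCS.
try:
    lone.encode("utf-8")
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

# An explicit opt-in defines the extended conversion behavior.
passed = lone.encode("utf-8", errors="surrogatepass")

# Reading side: strict decoding rejects those bytes, the opt-in accepts.
try:
    passed.decode("utf-8")
    strict_read_rejected = False
except UnicodeDecodeError:
    strict_read_rejected = True

back = passed.decode("utf-8", errors="surrogatepass")
```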

Or another implementation may support Konjaku Mojikyo as its CCS
( http://www.mojikyo.org/html/abroad/abroad_top.html ), so it
may define characters above #\U0010ffff for its extended CCS.  Again,
if the implementation elects to do so, it should also provide
reasonable semantics for the CES-conversion mechanism with Unicode
CESs.

It is a matter of the implementation's choice, and I don't think
this srfi has anything to say about it, except that it may
explicitly allow such extensions and leave the behavior undefined
in the standard.


Footnote:

(*1): I did encounter practical needs for at least two of the cases.

- Raising an error: when I want to guarantee that what I'm writing
    to the port is actually written to the file, I need to be able
    to rely on this.  It's also handy if you don't know the document's
    CES beforehand, and want to try reading it in one candidate CES
    and fall back to another if that fails.

- Replacing with an alternative char: I was gathering statistics from
    huge e-mail archives, and there are tons of invalid characters
    (e.g. a message declares its charset as US-ASCII in its header but
    actually contains non-ASCII ISO8859 characters).  For the
    purpose of the application (which is a spam filter), I don't
    want errors to be signalled in these cases; just replacing
    those invalid characters suffices.
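
Both cases in this footnote can be sketched briefly.  The function
names are hypothetical and Python's codecs are assumed as the
conversion layer; this is not the code of the application mentioned
above.

```python
def decode_with_fallback(data, candidates):
    """Try each candidate CES strictly; return the first that succeeds."""
    for ces in candidates:
        try:
            return ces, data.decode(ces)      # strict: errors are raised
        except UnicodeDecodeError:
            continue                          # fall back to the next one
    raise ValueError("no candidate CES could decode the data")

def decode_lenient(data, declared_ces):
    """Never fail: undecodable bytes become U+FFFD REPLACEMENT CHARACTER."""
    return data.decode(declared_ces, errors="replace")

# Case 1: the document's CES is unknown, so try UTF-8 and then EUC-JP.
japanese = "\u65e5\u672c\u8a9e".encode("euc_jp")
ces, text = decode_with_fallback(japanese, ["utf-8", "euc_jp"])

# Case 2: a message declared as US-ASCII that contains an ISO8859 byte.
mislabeled = b"caf\xe9"
cleaned = decode_lenient(mislabeled, "ascii")
```

The strict path depends on errors being signalled, while the lenient
path depends on them being suppressed, which is why both options are
worth having.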


--shiro