This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
From: Thomas Lord <lord@xxxxxxx>
Subject: Re: the "Unicode Background" section
Date: Fri, 22 Jul 2005 12:17:43 -0700

> I think it might be realistic to label ports not with
> the encoding scheme they want, but with the set of
> code-values they can transmit -- in other words
> with their framing constraints.  In other words --
> a "UTF-8 port" (no such thing, really) and an "ASCII port"
> (no such thing, again) are *really* just "8-bit ports".
> A "UTF-16 port" is *really* just a "16-bit port".

Then you'll have difficulty reading a character from such ports.

Let's not mix CCS (coded character set) and CES (character encoding
scheme).

Each CES covers one or more CCSs.  All Unicode CESs (utf-8, utf-16le,
etc...) cover the Unicode CCS.  EUC-JP covers ASCII, the JISX0201 GR
area, and JISX0213.  Shift_JIS covers JISX0201 and JISX0213 (but not
ASCII).

A Scheme implementation chooses one (or some) CCS as the character set
it supports.  It may be the Unicode CCS, or a subset of it, or a
superset of it.

The implementation uses its internal CES to represent characters in
memory.  The outside world uses some external CES.  The interface
between them (it can be a port or something else, I'll discuss it
below) converts between the internal CES and the external CES.

The important point is that we can standardize the behavior of strings
and characters at the CCS level, and leave the CES to the
implementation.

The CES-conversion mechanism may reside either in ports (or in a lower
layer of ports, as srfi-68 suggests), or it may be a function that
converts between strings and binary vectors.  Both are handy in
practice, but either one can be implemented in terms of the other
anyway.  (If you wish to generate unpaired surrogates, it's the task
of this layer.)

The problem arises when the coverage of the internal CES and that of
the external CES don't match.
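
To make the CCS/CES split concrete, here is a minimal sketch in Python
rather than Scheme, chosen only because its standard codecs are easy to
run.  It assumes that Python's `str` plays the role of the internal
representation and that the codec names `utf-8`, `utf-16-le`, `euc_jp`
and `ascii` stand in for external CESs.  The helper names `round_trips`
and `covers` are made up for illustration and are not part of any srfi.

```python
# The same character sequence (CCS level) under several external
# encodings (CES level).  Assumption: Python's str is the "internal" form.
text = "A\u3042"  # LATIN CAPITAL LETTER A, HIRAGANA LETTER A


def round_trips(s, ces):
    """Encode s with the given CES and decode it back; True if nothing was lost."""
    return s.encode(ces).decode(ces) == s


def covers(s, ces):
    """True if every character of s can be represented in the given CES."""
    try:
        s.encode(ces)
        return True
    except UnicodeEncodeError:
        return False


# The byte sequences differ per CES, but the characters are the same.
for ces in ("utf-8", "utf-16-le", "euc_jp"):
    print(ces, text.encode(ces), round_trips(text, ces))

# Coverage mismatch: ASCII has no code for HIRAGANA LETTER A.
print("ascii covers text:", covers(text, "ascii"))
```

The string-level comparison never looks at bytes; only the conversion
layer knows which CES is in use.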

There are some options the conversion mechanism can take, and I see
practical value in each of them: it can signal an error, it can ignore
the invalid character, or it can replace the invalid character with an
alternative one (*1).  These are things the implementor should think
about, and they may be standardized by a port srfi or something, but
that's out of the scope of this srfi.

Now, (integer->char #xd800) is not in the Unicode CCS.  If srfi-75 is
about Unicode strings, it can leave the behavior undefined.

An implementation may extend its CCS to support such a 'character' and
give it its own meaning.  Then the implementation also defines its
CES-conversion behavior when the external CES expects the Unicode CCS
and the internal CES has #\ud800.  It _might_ also have its ports
expect this extended CCS by default, so that if an input port
encounters an unpaired surrogate, read-char returns #\ud800.

Or another implementation may support Konjaku Mojikyo as its CCS
( http://www.mojikyo.org/html/abroad/abroad_top.html ), so it may
define characters beyond #\U0010ffff for its extended CCS.  Again, if
the implementation elects to do so, it should also provide reasonable
semantics for the CES-conversion mechanism with Unicode CESs.

It is a matter of the implementation's choice, and I don't think this
srfi has anything to say about it, except that it may explicitly allow
such an extension and leave the behavior undefined in the standard.

Footnote:

(*1): I did encounter practical needs for at least two of the cases.

- Raising an error: when I want to guarantee that what I'm writing to
  the port is actually written to the file, I wish I could have this
  guarantee.  It's also handy if you don't know the document's CES
  beforehand, and want to try to read it in one candidate CES and
  fall back to another if that fails.

- Replacing with an alternative char: I was gathering statistics from
  huge e-mail archives, and there were tons of invalid characters
  (e.g. a message declares its charset as US-ASCII in its header but
  actually contains non-ASCII ISO8859 characters).
  For the purpose of the application (which is a spam filter), I don't
  want errors to be signalled in these cases; just altering those
  invalid characters suffices.

--shiro
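
The three conversion policies named in the message (signal an error,
ignore the invalid character, replace it) can be sketched with the
error handlers of Python's standard codecs.  This is only an
illustration under assumptions: Python stands in for a Scheme
implementation, the function names `convert` and `decode_with_fallback`
are hypothetical, and the sample bytes are invented to mimic a message
that is labelled US-ASCII but carries an ISO 8859-1 byte.

```python
def convert(data, ces, policy):
    """Decode external bytes; policy is 'strict', 'ignore', or 'replace'."""
    return data.decode(ces, errors=policy)


def decode_with_fallback(data, candidates):
    """Try each candidate CES strictly; return (ces, text) for the first that fits."""
    for ces in candidates:
        try:
            return ces, convert(data, ces, "strict")
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate CES could decode the data")


# Invented sample: declared as US-ASCII, but contains the byte 0xE9.
raw = b"caf\xe9"

# Ignore: the invalid byte is dropped.
print(repr(convert(raw, "ascii", "ignore")))
# Replace: the invalid byte becomes U+FFFD REPLACEMENT CHARACTER.
print(repr(convert(raw, "ascii", "replace")))
# Error plus fallback: strict decoding fails until a CES fits.
print(decode_with_fallback(raw, ["ascii", "utf-8", "iso-8859-1"]))

# Python's str happens to admit a lone surrogate (an "extended CCS"),
# and its strict UTF-8 conversion refuses to emit it.
lone = chr(0xD800)
try:
    lone.encode("utf-8")
except UnicodeEncodeError as err:
    print("strict utf-8 rejects unpaired surrogate:", err.reason)
```

The strict policy is what makes the fallback loop possible: a failed
decode is reported instead of being papered over, so the caller can
move on to the next candidate CES.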