
Re: the "Unicode Background" section

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



From: "John.Cowan" <jcowan@xxxxxxxxxxxxxxxxx>
Subject: Re: the "Unicode Background" section
Date: Fri, 22 Jul 2005 17:56:00 -0400

> I'm not saying that any Scheme system has to accept every possible
> encoding (though I do think at least ASCII, UTF-8, and UTF-16 should
> be mandatory; they are all trivial), but it needs to be possible
> to specify the encoding of a port when it is created.  (I don't think
> it's necessary to be able to change it on the fly, though.)

Changing the encoding of a port may come in handy in a couple of very
practical situations:

- Parsing RFC2822 and/or MIME messages (the header is ASCII, 
  and the content's charset is specified in the header)

- Parsing documents that have an encoding specification near the
  beginning (e.g. <?xml version="1.0" encoding="utf-8"?>,
  or the "coding: utf-8" magic comment that specifies the source-code
  encoding).

Both can be handled by layering ports, i.e. first you use an
ASCII port on top of a binary port to find the necessary info, then
create a new port with the desired encoding on top of the original
binary port to read the content.  You need to be careful about
buffering, though: the ASCII port must not have consumed bytes that
belong to the content.  And some may dislike the overhead of layering.
But that's outside the scope of this discussion.
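
To make the layering idea concrete, here is a minimal sketch in Python
(not Scheme, and not any SRFI 75 API).  It assumes a simplified
header / blank line / body layout rather than full RFC2822 parsing, and
the name read_message is hypothetical.  The header is read line by line
straight from the binary port, so no content bytes are buffered away
before the second port is created.

```python
import io
import re

def read_message(binary_port):
    """Return (charset, body_text) from a header / blank line / body stream.

    Hypothetical helper: a simplified layout, not a full RFC2822 parser.
    """
    charset = "ascii"  # assumed default when the header declares no charset
    while True:
        line = binary_port.readline()       # bytes up to and including b"\n"
        if line in (b"", b"\n", b"\r\n"):   # blank line (or EOF) ends the header
            break
        header = line.decode("ascii")       # the header is treated as ASCII
        match = re.search(r'charset="?([\w.:-]+)"?', header, re.IGNORECASE)
        if match:
            charset = match.group(1)
    # Second layer: a decoding port on top of the same binary port,
    # starting exactly where the header ended.
    body_port = io.TextIOWrapper(binary_port, encoding=charset)
    return charset, body_port.read()

raw = io.BytesIO(
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    + "caf\u00e9 \u4e2d\u6587".encode("utf-8")
)
charset, body = read_message(raw)
```

If the header were instead read through a buffered ASCII text port, that
port could read ahead into the body, which is the buffering problem
mentioned above.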

> Absolutely.  Or more specifically: attempt to write a character that's
> not in the repertoire associated with the encoding is an error.
> Allowing this to be lax is just asking for trouble.

I mentioned some other options in my reply to Tom Lord, but
there's one practical example:

Suppose I have a dynamic website which can store Unicode documents.
My CGI script uses a CES-conversion port for its output so that
it can send out a document in the CES specified by the web browser.
When an iso8859-1 browser asks for content which has Chinese
characters, it won't be very useful if the CGI script sends
an error page.  Usually replacing unmappable characters with '?'
or something similar would be better.
(Again, it can be done by a smart error handler that does the
user-friendly thing on an 'encoding not supported' error.  It is much
handier if the port can handle it, though.)
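
As an illustration of a port that handles this itself, here is a small
Python sketch (again not Scheme).  The function make_output_port and its
lax flag are hypothetical names; the replacement behaviour comes from
Python's standard errors="replace" codec option, which substitutes '?'
for characters the target encoding cannot represent.

```python
import io

def make_output_port(binary_port, charset, lax=True):
    """Wrap a binary port in an encoding port (hypothetical helper).

    lax=True  -> unmappable characters are written as '?'
    lax=False -> writing an unmappable character raises UnicodeEncodeError
    """
    errors = "replace" if lax else "strict"
    return io.TextIOWrapper(binary_port, encoding=charset, errors=errors)

sink = io.BytesIO()
port = make_output_port(sink, "iso-8859-1")
port.write("caf\u00e9 \u4e2d\u6587")   # Latin-1 text followed by two Chinese characters
port.flush()
sent = sink.getvalue()                 # the bytes the browser would receive
```

With lax=False the same write is an error, which is the strict behaviour
argued for in the quoted text; the choice is made once, when the port is
created.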

--shiro