
Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



From: "John.Cowan" <jcowan@xxxxxxxxxxxxxxxxx>
Subject: Re: Surrogates and character representation
Date: Sun, 24 Jul 2005 01:37:13 -0400

> but language/library designers (whose job it is to make corner cases
> unsurprising) do have to think about them.

Yes, but such a library works across different domains.
Suppose the library has a function ucs->utf8.  It accepts a character
and returns a sequence of octets, e.g.
  (ucs->utf8 #\u3042) => (#xe3 #x81 #x82)
If it returned (#\u00e3 #\u0081 #\u0082), I'd say something is
wrong with it: it mixes up the domain and the range.
The same is true of ucs->utf16: its type should be Char -> [Int16],
and unpaired surrogates appear as Int16 values, not as characters.
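
To make the domain/range point concrete, here is a minimal sketch of
the two procedures.  The names ucs->utf8 and ucs->utf16 are the
hypothetical ones from the example above, not part of any standard or
SRFI.  The sketch assumes that char->integer returns the Unicode
scalar value and that the implementation supports characters beyond
U+FFFF; neither is guaranteed everywhere.  The results are lists of
exact non-negative integers, never characters.

```scheme
;; Sketch only.  On strict R7RS systems, start the file with
;; (import (scheme base)).  Only quotient/modulo are used, so no
;; bitwise library is needed.

;; Char -> list of octets (integers in 0..255).
(define (ucs->utf8 ch)
  (let ((n (char->integer ch)))
    (cond ((< n #x80) (list n))
          ((< n #x800)
           (list (+ #xc0 (quotient n #x40))
                 (+ #x80 (modulo n #x40))))
          ((< n #x10000)
           (list (+ #xe0 (quotient n #x1000))
                 (+ #x80 (modulo (quotient n #x40) #x40))
                 (+ #x80 (modulo n #x40))))
          (else
           (list (+ #xf0 (quotient n #x40000))
                 (+ #x80 (modulo (quotient n #x1000) #x40))
                 (+ #x80 (modulo (quotient n #x40) #x40))
                 (+ #x80 (modulo n #x40)))))))

;; Char -> list of 16-bit code units (integers in 0..65535).
;; Surrogate values show up only here, in the output list.
(define (ucs->utf16 ch)
  (let ((n (char->integer ch)))
    (if (< n #x10000)
        (list n)
        (let ((m (- n #x10000)))
          (list (+ #xd800 (quotient m #x400))
                (+ #xdc00 (modulo m #x400)))))))

(ucs->utf8 (integer->char #x3042))    ; => (227 129 130)
(ucs->utf16 (integer->char #x10000))  ; => (55296 56320)
```

The first result is (#xe3 #x81 #x82) printed in decimal; the second
is (#xd800 #xdc00).  What ucs->utf16 should do when its argument is
itself a surrogate is deliberately left open in this sketch.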

An implementation can have #\ud800, as long as it defines the
behavior of expressions such as (ucs->utf16 #\ud800) or
(string-append "\ud800" "\udc00"), as well as I/O.   If we put
it in the standard, the standard should define those
expressions.   Do you think there's an agreeable and consistent
definition for handling these "characters"?  If not, it's better
to leave it unspecified.
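
One way to see why these cases need a definition is the sketch below.
It models code points as plain exact integers, so it runs even on
systems that refuse to create surrogate characters.  All procedure
names are made up for illustration, and the "naive" pass-through
behavior is only one of several possible choices, not a proposal.

```scheme
;; Sketch only.  On strict R7RS systems, start the file with
;; (import (scheme base)).

(define (high-surrogate? n) (<= #xd800 n #xdbff))
(define (low-surrogate? n)  (<= #xdc00 n #xdfff))

;; Naive encoder: a surrogate code point passes through as one unit.
(define (code-point->utf16 n)
  (if (< n #x10000)
      (list n)
      (let ((m (- n #x10000)))
        (list (+ #xd800 (quotient m #x400))
              (+ #xdc00 (modulo m #x400))))))

(define (code-points->utf16 ns)
  (apply append (map code-point->utf16 ns)))

;; Decoder: a high unit followed by a low unit becomes one code
;; point; a lone surrogate unit passes through unchanged.
(define (utf16->code-points units)
  (cond ((null? units) '())
        ((and (high-surrogate? (car units))
              (pair? (cdr units))
              (low-surrogate? (cadr units)))
         (cons (+ #x10000
                  (* #x400 (- (car units) #xd800))
                  (- (cadr units) #xdc00))
               (utf16->code-points (cddr units))))
        (else
         (cons (car units) (utf16->code-points (cdr units))))))

;; Two "characters" go out, one comes back:
(utf16->code-points (code-points->utf16 '(#xd800 #xdc00)))
;; => (65536)
```

With this choice, the two-element sequence produced by something like
(string-append "\ud800" "\udc00") does not survive a round trip
through UTF-16: it comes back as the single code point #x10000.  A
standard that admits such characters would have to say whether that
is an error, a merge, or something else.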

(BTW, I am using a weird Scheme system that allows such invalid
"characters" in a string; sometimes it is handy, but it is ugly.)

--shiro