This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
From: "John.Cowan" <jcowan@xxxxxxxxxxxxxxxxx>
Subject: Re: Surrogates and character representation
Date: Sun, 24 Jul 2005 01:37:13 -0400

> but language/library designers (whose job it is to make corner cases
> unsurprising) do have to think about them.

Yes, but such a library works across different domains. Suppose the
library has a function ucs->utf8. It accepts a character and returns
a sequence of octets, e.g.

    (ucs->utf8 #\u3042) => (#xe3 #x81 #x82)

If it returned (#\u00e3 #\u0081 #\u0082), I'd say there's something
wrong with it: it mixes up the domain and the range.

The same is true of ucs->utf16: its type should be Char -> [Int16],
and unpaired surrogates appear as Int16 values, not as characters.

An implementation can have #\ud800, as long as it defines the behavior
of expressions such as (ucs->utf16 #\ud800) or
(string-append "\ud800" "\udc00"), as well as I/O.

If we have it in the standard, the standard should give definitions
for those expressions. Do you think there's an agreeable and
consistent definition for handling these "characters"? If not, it's
better to leave it unspecified.

(BTW, I am using a weird Scheme system that allows such invalid
"characters" in a string, and sometimes it is handy, but it is ugly.)

--shiro
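
The domain/range point about ucs->utf8 can be sketched as follows. This is only an illustration, not a definition proposed in the thread: the body of ucs->utf8 and the helper name cont are assumptions, it uses only standard arithmetic (no bitwise library), and signalling an error for a surrogate code point is just one possible policy among those the message says would have to be specified.

```scheme
;; Hypothetical sketch: Char -> list of octets (exact integers, not chars).
(define (ucs->utf8 ch)
  (let ((cp (char->integer ch)))
    ;; Continuation byte 10xxxxxx taken from the 6 bits above `divisor`.
    (define (cont divisor)
      (+ #x80 (modulo (quotient cp divisor) 64)))
    (cond ((< cp #x80)
           (list cp))
          ((< cp #x800)
           (list (+ #xC0 (quotient cp 64)) (cont 1)))
          ;; Assumed policy: reject surrogates instead of encoding them.
          ;; Only reachable if the implementation allows such characters.
          ((<= #xD800 cp #xDFFF)
           (error "ucs->utf8: surrogate code point" cp))
          ((< cp #x10000)
           (list (+ #xE0 (quotient cp 4096)) (cont 64) (cont 1)))
          (else
           (list (+ #xF0 (quotient cp 262144))
                 (cont 4096) (cont 64) (cont 1))))))

(ucs->utf8 (integer->char #x3042))   ; => (227 129 130), i.e. (#xe3 #x81 #x82)
```

The result is a list of integers, so the domain (characters) and the range (octets) stay distinct, which is the property the message asks for.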