This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
At Thu, 21 Jul 2005 15:45:34 -0700, Thomas Lord wrote:

>> If CHARs are codepoints, more basic Unicode algorithms translate
>> into Scheme cleanly.

> I don't see what you mean. Can you provide an example?

How about: Emitting a UTF-16 encoded stream of the contents of a string? Doesn't that sound like an application for WRITE-CHAR? Or is that the kind of thing one shouldn't be able to do in portable Scheme?

>> What is gained by forcing surrogates to be unrepresentable as CHAR?

> Every string is representable in UTF-8, UTF-16, etc.

You are concerned about sequences containing isolated (unpaired) surrogates and their implications for string algebra. Your concerns are entirely reducible to a concern with UTF-16 -- in all other encodings, there is no ambiguity.

So... how can we represent a string containing an isolated surrogate in UTF-16? One idea is for an implementation to privately allocate a range of characters for that purpose. Stuffing an isolated surrogate into a string in such an implementation may result in storing 32 bits (a surrogate pair encoding an isolated surrogate) but so what? There are other techniques available too.

In fact, it would be a MINOR arbitrary limitation of a conforming implementation (according to your own standards of what's important, evidenced by the draft) if that implementation simply aborted when an attempt to read or form an isolated surrogate happened. Why, then, would the standard bother to forbid it?

>> What kind of code will I wind up with if I want to iterate over
>> a large range of CHAR values?

> Two loops: one from 0 to #xD7FF, and one from #xE000 to #x10FFFF.

I'm not sure what to say other than that I don't see why you are comfortable with that. Surely people will want to paper that over, and the net result will be what I suggested and you did not quote: we'll wind up with a separate set of APIs to cope with character arithmetic -- odd, since arithmetic is just arithmetic no matter how you spell it.

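To make the iteration question concrete, here is a minimal sketch of the two approaches being contrasted, written in Python rather than Scheme. The helper names `scalar_values` and `code_points` are hypothetical and exist only for this illustration; the range boundaries are the ones given in the message (0 to #xD7FF and #xE000 to #x10FFFF). Nothing here is taken from the SRFI draft or from any Scheme implementation.

```python
from itertools import chain

# Boundaries as stated in the message: surrogates occupy #xD800..#xDFFF.
LAST_BEFORE_SURROGATES = 0xD7FF
FIRST_AFTER_SURROGATES = 0xE000
LAST_CODE_POINT = 0x10FFFF


def scalar_values():
    """Yield every Unicode scalar value in ascending order.

    Hypothetical helper: it hides the "two loops" behind one iterator.
    """
    return chain(
        range(0, LAST_BEFORE_SURROGATES + 1),
        range(FIRST_AFTER_SURROGATES, LAST_CODE_POINT + 1),
    )


def code_points():
    """Yield every code point, surrogates included, as a single range."""
    return range(0, LAST_CODE_POINT + 1)


if __name__ == "__main__":
    n_scalar = sum(1 for _ in scalar_values())
    n_all = len(code_points())
    # The difference is the number of surrogate code points that are skipped.
    print(n_all - n_scalar)
```

If characters are scalar values, a wrapper like `scalar_values` is roughly the kind of extra layer that would be needed to iterate with one loop; if characters are code points, plain integer arithmetic over one range suffices, at the cost of admitting surrogates.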
>> It's not as if by excluding surrogates we arrive at a CHAR definition
>> that is significantly more "linguistic" than if we don't.

> True, but we arrive at a definition that is more standards-friendly,

I don't know what you mean by "standards-friendly" here.

> FWIW: MzScheme originally supported a larger set of characters, mainly
> because extra bits are available in my implementation. The resulting bad
> experience convinced me to define characters in terms of scalar
> values, instead.

I don't see your point. I don't see what "extra bits" have to do with surrogates. You also don't explain why a set of characters larger than "Unicode scalar values" caused a bad experience, and I don't take your word for it (maybe you guys made some other mistake that was the *real* cause of the problems you encountered -- maybe you misidentified the issues -- I can't tell from your account).

-t
--- Begin Message ---
- To: Thomas Lord <lord@xxxxxxx>
- Subject: Re: the "Unicode Background" section
- From: Matthew Flatt <mflatt@xxxxxxxxxxx>
- Date: Thu, 21 Jul 2005 17:52:28 -0600
- Cc: srfi-75@xxxxxxxxxxxxxxxxx
- Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
- In-reply-to: <1121985934.4501.46.camel@xxxxxxxxxxxxxx>
- List-help: <mailto:srfi-75-request@srfi.schemers.org?subject=help>
- List-post: <mailto:srfi-75@srfi.schemers.org>
- List-subscribe: <mailto:srfi-75-request@srfi.schemers.org?subject=subscribe>
- List-unsubscribe: <mailto:srfi-75-request@srfi.schemers.org?subject=unsubscribe>
- Old-return-path: <mflatt@xxxxxxxxxxx>
- References: <1121985934.4501.46.camel@xxxxxxxxxxxxxx>
- Resent-date: Fri, 22 Jul 2005 01:54:17 +0200 (DFT)
- Resent-from: srfi-75@xxxxxxxxxxxxxxxxx
- Resent-message-id: <I5orWB.A.o3H.nWD4CB@rotkohl>
- Resent-sender: srfi-75-request@xxxxxxxxxxxxxxxxx
At Thu, 21 Jul 2005 15:45:34 -0700, Thomas Lord wrote:

> If CHARs are codepoints, more basic Unicode algorithms translate
> into Scheme cleanly.

I don't see what you mean. Can you provide an example?

> What is gained by forcing surrogates to be unrepresentable as CHAR?

Every string is representable in UTF-8, UTF-16, etc.

> What kind of code will I wind up with if I want to iterate over
> a large range of CHAR values?

Two loops: one from 0 to #xD7FF, and one from #xE000 to #x10FFFF.

> It's not as if by excluding surrogates we arrive at a CHAR definition
> that is significantly more "linguistic" than if we don't.

True, but we arrive at a definition that is more standards-friendly, and that's part of the overall compromise.

FWIW: MzScheme originally supported a larger set of characters, mainly because extra bits are available in my implementation. The resulting bad experience convinced me to define characters in terms of scalar values, instead.

Matthew
--- End Message ---
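For readers following the UTF-16 argument in this exchange, the sketch below shows why an isolated surrogate is the problem case for that encoding. It is written in Python rather than Scheme, and `utf16_code_units` is a hypothetical helper introduced only for illustration. Raising an error for a lone surrogate is just one possible policy (the "simply aborted" option mentioned in the reply); it is not something the SRFI draft is known to prescribe.

```python
def utf16_code_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        # An isolated surrogate is not a scalar value, so standard UTF-16
        # has no encoding for it. This sketch chooses to reject it.
        raise ValueError("isolated surrogate has no UTF-16 encoding")
    if code_point < 0x10000:
        # Basic Multilingual Plane: one code unit, equal to the code point.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


if __name__ == "__main__":
    print([hex(u) for u in utf16_code_units(0x41)])
    print([hex(u) for u in utf16_code_units(0x1D11E)])
```

Every scalar value passes through this function, which is the sense in which "every string is representable" when characters are restricted to scalar values. An implementation that admitted surrogates as characters would have to replace the error branch with some private convention, such as the privately allocated range suggested in the reply.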