Re: the "Unicode Background" section

At Thu, 21 Jul 2005 20:28:14 -0700, Thomas Lord wrote:
> At Thu, 21 Jul 2005 15:45:34 -0700, Matthew Flatt wrote:
> >> If CHARs are codepoints, more basic Unicode algorithms translate
> >> into Scheme cleanly.   
> 
> > I don't see what you mean. Can you provide an example?
> 
> How about: Emitting a UTF-16 encoded stream of the contents
> of a string?   Doesn't that sound like an application for
> WRITE-CHAR?

I see that this is a missing part of the story so far: as I understand
it, a forthcoming R6RS SRFI will add `write-byte' and `read-byte' to
Scheme. So, I would say that it's a job for `write-byte'.
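
To make this concrete, here's a rough sketch of UTF-16BE output in
terms of `write-byte' (the name `write-string-utf16be' is mine, and
I'm assuming `char->integer' produces scalar values, as in the draft):

  (define (write-string-utf16be str port)
    (define (write-16 n)
      (write-byte (quotient n 256) port)   ; high byte first (big-endian)
      (write-byte (remainder n 256) port))
    (for-each
     (lambda (ch)
       (let ((sv (char->integer ch)))
         (if (< sv #x10000)
             ;; BMP scalar value: one 16-bit code unit. (If `ch' could
             ;; be a lone surrogate, this branch would emit it as a
             ;; bare code unit -- the ambiguity at issue.)
             (write-16 sv)
             ;; Above the BMP: encode as a surrogate pair.
             (let ((v (- sv #x10000)))
               (write-16 (+ #xD800 (quotient v #x400)))
               (write-16 (+ #xDC00 (remainder v #x400)))))))
     (string->list str)))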

> >> What is gained by forcing surrogates to be unrepresentable as CHAR?
> 
> > Every string is representable in UTF-8, UTF-16, etc.
> 
> You are concerned about sequences containing isolated (unpaired)
> surrogates and their implications for string algebra.  Your
> concerns are entirely reducible to a concern with UTF-16 --
> in all other encodings, there is no ambiguity.
> 
> So... how can we represent a string containing an isolated
> surrogate in UTF-16?   One idea is for an implementation
> to privately allocate a range of characters for that purpose.
> Stuffing an isolated surrogate into a string in such an 
> implementation may result in storing 32 bits (a surrogate
> pair encoding an isolated surrogate) but so what?  There
> are other techniques available too.

No, I didn't explain myself well. I'm not concerned about how to
implement characters internally. I'm concerned with how to communicate
with the rest of the world.

In particular, in many cases it will be necessary to marshal outgoing
characters as bytes, and unmarshal incoming bytes as characters.
Standards such as UTF-8 and UTF-16 fit the bill nicely.

However, if "\uD800" is a string, then there's no natural way to encode
it as UTF-8. We could make up some standard, but made-up encodings
won't really help interoperate with other programs and tools.
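
To illustrate, here's a sketch of a scalar-value-to-UTF-8 encoder (the
name `scalar->utf8-bytes' is made up, and `error' is SRFI 23-style).
The point is that the definition of UTF-8 has no branch at all for
#xD800 through #xDFFF:

  (define (scalar->utf8-bytes sv)
    (cond ((< sv #x80)                    ; 1 byte: 0xxxxxxx
           (list sv))
          ((< sv #x800)                   ; 2 bytes: 110xxxxx 10xxxxxx
           (list (+ #xC0 (quotient sv #x40))
                 (+ #x80 (remainder sv #x40))))
          ((and (<= #xD800 sv) (<= sv #xDFFF))
           ;; No well-formed UTF-8 for a surrogate code point.
           (error "not a scalar value" sv))
          ((< sv #x10000)                 ; 3 bytes
           (list (+ #xE0 (quotient sv #x1000))
                 (+ #x80 (remainder (quotient sv #x40) #x40))
                 (+ #x80 (remainder sv #x40))))
          (else                           ; 4 bytes
           (list (+ #xF0 (quotient sv #x40000))
                 (+ #x80 (remainder (quotient sv #x1000) #x40))
                 (+ #x80 (remainder (quotient sv #x40) #x40))
                 (+ #x80 (remainder sv #x40))))))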

> In fact, it would be a MINOR arbitrary limitation of a
> conforming implementation (according to your own standards
> of what's important, evidenced by the draft) if that implementation
> simply aborted when an attempt to read or form an isolated
> surrogate happened.  Why, then, would the standard bother
> to forbid it?

This is another facet of how I was unclear. I'm less worried about
rejecting ill-formed input than having to do something sensible on
output. If standard Scheme allows programmers to output a string
"\uD800", then my implementation will need to handle that case somehow.

> >> It's not as if by excluding surrogates we arrive at a CHAR definition
> >> that is significantly more "linguistic" than if we don't.
> 
> > True, but we arrive at a definition that is more standards-friendly,
> 
> I don't know what you mean by "standards-friendly" here.

Hopefully the above clarifies somewhat. I mean that it's obvious how to
read and write UTF-8, UTF-16, etc.

> > FWIW: MzScheme originally supported a larger set of characters, mainly
> > because extra bits are available in my implementation. The resulting bad
> > experience convinced me to define characters in terms of scalar 
> > values, instead.
> 
> I don't see your point.  I don't see what "extra bits" have to do with
> surrogates.  You also don't explain why a set of characters larger
> than "Unicode scalar values" caused a bad experience and I don't take
> your word for it (maybe you guys made some other mistake that was
> the *real* cause of the problems you encountered -- maybe you
> misidentified the issues -- I can't tell from your account).

I wasn't sure that the anecdotes would be of interest, and I'm happy to
elaborate.

I started by defining the set of characters to match the integer range
0 to #x7FFFFFFF: if I had to spend 21 bits per character for Unicode,
then my representation would have at least 31 bits available anyway. I
think this matches old-style UCS-4.

The first roadblock was dealing with encodings, along the lines
sketched above. It wasn't clear how to output things with surrogates in
them for UTF-16 output, in particular. (I agree that I could have made
something up, but for output, it didn't make sense to me even then.) I
didn't at first realize that UTF-8 was a problem, too.
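
Here's the concrete case: suppose a string could hold two lone
surrogates (using the draft's \u escapes, and a hypothetical
`string->utf16be-bytes' for illustration):

  (string->utf16be-bytes "\uD800\uDC00")  ; two "characters" in
  ;; => the bytes D8 00 DC 00

A conforming UTF-16 reader decodes D8 00 DC 00 as the single character
U+10000, so the two-character string can't be recovered: output and
input don't round-trip.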

In any case, I removed the surrogates, but left the range extended.
This first bit me when I started testing the GUI toolkit. For example,
MzScheme handed the Mac toolbox a "UTF-8" encoded string with the "code
point" #x10000000 in it, and the toolbox promptly complained, because
it wasn't well-formed UTF-8. Output remained a problem for the same
reason, of course, though that took me a little while longer to
discover.

The problem here isn't really about UTF-8, but the mismatch in
definitions of character. My choices seemed to be to define a subset of
strings that were allowed for GUI labels and such, or to fix the
definition of character. The former seemed error prone (it wasn't clear
how many places that would be necessary, both now and in the future),
so I went with the latter.

After fixing the definition of character, encoding issues have been
pretty clear, and I haven't run into the same sort of bugs in the GUI
connection or other tools.

So, that's my experience, for what it's worth. I actually tried what
you're suggesting, and I found it unsatisfactory. Of course, your
mileage may vary.

Matthew