
Re: the "Unicode Background" section



At Thu, 21 Jul 2005 15:45:34 -0700, Thomas Lord wrote:
>> If CHARs are codepoints, more basic Unicode algorithms translate
>> into Scheme cleanly.   

> I don't see what you mean. Can you provide an example?

How about: Emitting a UTF-16 encoded stream of the contents
of a string?   Doesn't that sound like an application for
WRITE-CHAR?   Or is that the kind of thing one shouldn't
be able to do in portable Scheme?
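
To make that example concrete, here is a rough sketch of the
per-code-point arithmetic involved, written in Python rather than
Scheme purely for illustration. The names utf16_units and
write_utf16_be are hypothetical and do not come from any draft or
library. Code points below #x10000 are emitted as a single code unit;
everything else is split into a surrogate pair.

```python
def utf16_units(code_points):
    """Yield 16-bit UTF-16 code units for an iterable of integer code points."""
    for cp in code_points:
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError(f"not a Unicode code point: {cp:#x}")
        if cp < 0x10000:
            # BMP code point: one code unit, emitted as-is.
            yield cp
        else:
            # Supplementary code point: split into a high/low surrogate pair.
            v = cp - 0x10000
            yield 0xD800 | (v >> 10)
            yield 0xDC00 | (v & 0x3FF)


def write_utf16_be(code_points, out):
    """Write the code units to a binary stream, big-endian, no byte-order mark."""
    for unit in utf16_units(code_points):
        out.write(unit.to_bytes(2, "big"))
```

In a Scheme where CHARs are code points, each yielded unit could
plausibly be handed to something like WRITE-CHAR directly; that
mapping is an assumption of this sketch, not something the draft
specifies.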

>> What is gained by forcing surrogates to be unrepresentable as CHAR?

> Every string is representable in UTF-8, UTF-16, etc.

You are concerned about sequences containing isolated (unpaired)
surrogates and their implications for string algebra.  Your
concerns are entirely reducible to a concern with UTF-16 --
in all other encodings, there is no ambiguity.

So... how can we represent a string containing an isolated
surrogate in UTF-16?   One idea is for an implementation
to privately allocate a range of characters for that purpose.
Stuffing an isolated surrogate into a string in such an
implementation may result in storing 32 bits (a surrogate
pair encoding an isolated surrogate) but so what?  There
are other techniques available too.
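
One way the private-range idea might look, again sketched in Python
for illustration only. Everything here is an assumption: the base
#x10F000 (inside Supplementary Private Use Area-B) is an arbitrary
choice, and to_units / from_units are hypothetical names. A real
implementation would also have to decide what happens when a program
stores a genuine character from the reserved range; this sketch
ignores that collision.

```python
PRIVATE_BASE = 0x10F000  # hypothetical: 2048 code points reserved by the implementation
SURR_LO, SURR_HI = 0xD800, 0xDFFF


def to_units(code_points):
    """Store code points as UTF-16 code units, escaping isolated surrogates."""
    units = []
    for cp in code_points:
        if SURR_LO <= cp <= SURR_HI:
            # Escape: remap the surrogate into the private range, so it is
            # stored as an ordinary (well-formed) surrogate pair.
            cp = PRIVATE_BASE + (cp - SURR_LO)
        if cp < 0x10000:
            units.append(cp)
        else:
            v = cp - 0x10000
            units.append(0xD800 | (v >> 10))
            units.append(0xDC00 | (v & 0x3FF))
    return units


def from_units(units):
    """Recover the code points; assumes the units were produced by to_units."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: combine with the following low surrogate.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
            if PRIVATE_BASE <= cp < PRIVATE_BASE + 0x800:
                # Undo the escape.
                cp = SURR_LO + (cp - PRIVATE_BASE)
        else:
            cp = u
            i += 1
        out.append(cp)
    return out
```

With this scheme an isolated surrogate costs two code units in
storage, and the stored sequence is always well-formed UTF-16.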

In fact, it would be a MINOR arbitrary limitation of a
conforming implementation (according to your own standards
of what's important, evidenced by the draft) if that implementation
simply aborted when an attempt to read or form an isolated
surrogate happened.  Why, then, would the standard bother
to forbid it?


>> What kind of code will I wind up with if I want to iterate over
>> a large range of CHAR values? 

> Two loops: one from 0 to #xD7FF, and one from #xE000 to #x10FFFF.

I'm not sure what to say other than that I don't see why you
are comfortable with that.  Surely people will want to paper
that over, and the net result will be what I suggested in the
part you did not quote: we'll wind up with a separate set of APIs
to cope with character arithmetic.  That is odd, since arithmetic
is just arithmetic no matter how you spell it.
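
For illustration, the kind of helper people would likely write to
paper over the gap might look like the following. It is sketched in
Python rather than Scheme, and scalar_values is a hypothetical name,
not part of any proposal. It is exactly the "two loops" answer
wrapped up so that callers see one range.

```python
from itertools import chain


def scalar_values(start=0, stop=0x110000):
    """Iterate over Unicode scalar values in [start, stop), skipping U+D800..U+DFFF."""
    below = range(start, min(stop, 0xD800))   # first loop: up to #xD7FF
    above = range(max(start, 0xE000), stop)   # second loop: from #xE000
    return chain(below, above)
```

For example, list(scalar_values(0xD7FE, 0xE002)) gives the four
values #xD7FE, #xD7FF, #xE000, #xE001, with the surrogate range
silently skipped.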


>> It's not as if by excluding surrogates we arrive at a CHAR definition
>> that is significantly more "linguistic" than if we don't.

> True, but we arrive at a definition that is more standards-friendly,

I don't know what you mean by "standards-friendly" here.

> FWIW: MzScheme originally supported a larger set of characters, mainly
> because extra bits are available in my implementation. The resulting bad
> experience convinced me to define characters in terms of scalar 
> values, instead.

I don't see your point.  I don't see what "extra bits" have to do with
surrogates.  You also don't explain why a set of characters larger
than "Unicode scalar values" caused a bad experience and I don't take
your word for it (maybe you guys made some other mistake that was
the *real* cause of the problems you encountered -- maybe you
misidentified the issues -- I can't tell from your account).

-t


--- Begin Message ---
At Thu, 21 Jul 2005 15:45:34 -0700, Thomas Lord wrote:
> If CHARs are codepoints, more basic Unicode algorithms translate
> into Scheme cleanly.   

I don't see what you mean. Can you provide an example?

> What is gained by forcing surrogates to be unrepresentable as CHAR?

Every string is representable in UTF-8, UTF-16, etc.

> What kind of code will I wind up with if I want to iterate over
> a large range of CHAR values? 

Two loops: one from 0 to #xD7FF, and one from #xE000 to #x10FFFF.

> It's not as if by excluding surrogates we arrive at a CHAR definition
> that is significantly more "linguistic" than if we don't.

True, but we arrive at a definition that is more standards-friendly,
and that's part of the overall compromise.

FWIW: MzScheme originally supported a larger set of characters, mainly
because extra bits are available in my implementation. The resulting bad
experience convinced me to define characters in terms of scalar values,
instead.

Matthew



--- End Message ---