Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



John.Cowan writes:
> All other undefined codepoints are potentially definable: they correspond
> to Unicode scalar values.  Surrogate codepoints are not definable and
> don't correspond to any Unicode scalar value.  The difference is
> architectural.

U+FFFE is, by architectural design, never going to be defined
either.

Surrogate codepoints have a character property (general category Cs).
They should be usable in a string, and each can individually be
considered a character. Most implementations won't see them: only
library code that reads or writes UTF-16 needs to worry about them in
any significant way. Application code should not see them; it will see
U+20069 as having the value 0x20069, not the surrogate pair
0xD840 0xDC69.
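
To make the U+20069 arithmetic concrete, here is a minimal sketch in
Python of the standard UTF-16 surrogate mapping. The function names
to_surrogates and from_surrogates are made up for illustration; they
stand in for whatever low-level UTF-16 reader/writer a library would
provide, and are not part of any SRFI or of Python's API.

```python
def to_surrogates(cp):
    """Split a supplementary code point into a UTF-16 (high, low) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a UTF-16 surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The UTF-16 layer deals in the pair ...
pair = to_surrogates(0x20069)             # (0xD840, 0xDC69)
# ... while application code only ever sees the code point.
code_point = from_surrogates(*pair)       # 0x20069
```

Only the encoder and decoder touch the pair; everything above that
layer works with 0x20069 directly.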

In other words, I guess I'm saying that surrogates don't need to be
special-cased, because the existing Unicode property model accounts
for them, and generating and interpreting them should be handled
at a lower level. Special-casing them just complicates everything for
everyone.

> > One question I've had: how are 8-bit (i.e., byte) strings handled
> > here? Is there no distinction between operations on raw bytes and
> > operations on characters?
> 
> Those things are not strings: they are vectors of unsigned 8-bit integers.

Of course. My Python hat is still on: there, 8-bit strings and Unicode
strings are different beasts, and 8-bit strings are used as
byte strings.
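
As a small illustration of that distinction, here is a sketch in
present-day Python 3, where the two types are called bytes and str.
That is an assumption about the reader's Python version; at the time of
this message the corresponding Python 2 types were str and unicode. The
variable names are arbitrary.

```python
text = "caf\u00e9"             # Unicode string: a sequence of characters
raw = text.encode("utf-8")     # byte string: a sequence of 8-bit integers

char_count = len(text)         # counts characters
byte_count = len(raw)          # counts bytes; U+00E9 takes two in UTF-8

last_char = text[3]            # indexing a str yields a one-character str
fourth_byte = raw[3]           # indexing bytes yields an int in 0..255
```

Seen this way, the byte string behaves like the "vector of unsigned
8-bit integers" John describes, not like a string of characters.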

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"