
Re: Surrogates and character representation



Alan Watson writes:
> Hmm. That would seem to prevent an implementation representing strings 
> internally using UTF-8. This is convenient in some contexts as Scheme 
> strings can be trivially converted to UTF-8 C strings.

You can encode surrogate values in UTF-8; the result is just
ill-formed.  A conformant (Unicode) implementation shouldn't generate
these, though one could argue that if you get garbage in, you get
garbage out.

Scenario 1: You have a text stream encoded in UTF-16. It contains a
valid surrogate pair <D840,DD9B>. This is converted to the Unicode
scalar value (USV) #x0002019B. If you represent the Unicode strings
internally as UTF-8, this gets converted to the byte sequence
#xF0 #xA0 #x86 #x9B. When writing the text stream, you pick the output
encoding and the USV gets written in that encoding.
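
To make the arithmetic in Scenario 1 concrete, here is a minimal
Python sketch. The helper name combine_surrogates is made up for
illustration; it is not from any library.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a Unicode scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits; the pair is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

usv = combine_surrogates(0xD840, 0xDD9B)
utf8 = chr(usv).encode("utf-8")

print(f"USV   = #x{usv:08X}")
print("UTF-8 =", " ".join(f"#x{b:02X}" for b in utf8))
```

The built-in codec should agree: decoding the bytes D8 40 DD 9B as
"utf-16-be" yields the same single code point.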

Scenario 2: You have a text stream encoded in UTF-16. It contains a
lone surrogate, <D840>. This is an invalid string. You have several
options:

 2a: reject the input as invalid.

 2b: replace the surrogate value with the replacement character
     U+FFFD (converted to #xEF #xBF #xBD in UTF-8 rep land)

 2c: keep the character, encode internally in UTF-8 (#xED #xA1
     #x80). On output this gets converted back.

 2d: ignore that value completely, not preserving it on input.

Of these, 2c is non-conforming and not recommended, but avoids data
loss in cases where that is important.
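
As a rough illustration, Python's codec error handlers happen to map
onto these four options fairly closely. This is a sketch using
CPython's built-in "utf-16-be" and "utf-8" codecs, not a statement
about what a Scheme implementation must do; the sample input (a lone
D840 followed by "A") is my own.

```python
# UTF-16BE input: lone high surrogate D840 followed by "A" (0041).
data = bytes.fromhex("D840 0041")

# 2a: reject the input as invalid (the default "strict" handler).
rejected = False
try:
    data.decode("utf-16-be", errors="strict")
except UnicodeDecodeError:
    rejected = True

# 2b: replace the lone surrogate with U+FFFD.
replaced = data.decode("utf-16-be", errors="replace")
replaced_utf8 = replaced.encode("utf-8")

# 2c: keep the surrogate; the internal UTF-8 form is ill-formed,
# so the encoder must be told to let it through as well.
kept = data.decode("utf-16-be", errors="surrogatepass")
kept_utf8 = kept.encode("utf-8", errors="surrogatepass")
round_trip = kept.encode("utf-16-be", errors="surrogatepass")

# 2d: drop the lone surrogate entirely.
dropped = data.decode("utf-16-be", errors="ignore")

print("2a rejected:", rejected)
print("2b:", " ".join(f"#x{b:02X}" for b in replaced_utf8))
print("2c:", " ".join(f"#x{b:02X}" for b in kept_utf8))
print("2d:", repr(dropped))
```

Only 2c gives back the original bytes on output (round_trip equals
data); 2b and 2d both lose the fact that a D840 was ever there.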

Representing strings internally in UTF-8 is a loss though, since you
lose constant-time random access to the string: characters occupy one
to four bytes, so finding the nth one means scanning from the start.
For some applications this isn't a big deal, but in general using
UTF-8 as an internal representation is a bad idea.
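
To show what losing random access means in practice, here is a small
Python sketch that finds the byte offset of the nth code point in a
UTF-8 buffer. The function name utf8_index is hypothetical, and the
code assumes the buffer is well-formed UTF-8.

```python
def utf8_index(buf, n):
    """Return the byte offset of the n-th code point (0-based).

    Assumes buf is well-formed UTF-8. The scan is linear in the
    length of buf because code points are 1 to 4 bytes wide.
    """
    count = 0
    for offset, byte in enumerate(buf):
        # Continuation bytes look like 10xxxxxx; anything else
        # starts a new code point.
        if (byte & 0xC0) != 0x80:
            if count == n:
                return offset
            count += 1
    raise IndexError("code point index out of range")

# "a" is 1 byte, U+00E9 is 2, U+2019B is 4, "z" is 1.
buf = "a\u00e9\U0002019Bz".encode("utf-8")
offsets = [utf8_index(buf, i) for i in range(4)]
print(offsets)
```

With a fixed-width representation the same lookup is a single
multiplication, which is the trade-off being described.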

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"