
Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

Alan Watson writes:
> Hmm. That would seem to prevent an implementation representing strings 
> internally using UTF-8. This is convenient in some contexts as Scheme 
> strings can be trivially converted to UTF-8 C strings.

You can create surrogate values in UTF-8; the result is just
ill-formed.  A conformant (Unicode) implementation shouldn't generate
these, though one could argue that if you get garbage in, you get
garbage out.

Scenario 1: You have a text stream encoded in UTF-16. It contains a
valid surrogate pair <D840,DD9B>. This is converted to the USV
#x0002019B. If you represent the Unicode strings internally as UTF-8,
this gets converted to the byte-sequence #xF0 #xA0 #x86 #x9B. When
writing the text stream you pick the encoding and the USV gets written
in that encoding.
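
As a rough sketch of Scenario 1, the arithmetic can be checked in a few
lines. Python is used here purely for illustration (the thread itself is
about Scheme), and the helper name combine_surrogates is made up for
this example:

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a Unicode scalar value (USV)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits; the pair addresses U+10000 and up.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


usv = combine_surrogates(0xD840, 0xDD9B)

# Internal UTF-8 representation of that scalar value.
utf8 = chr(usv).encode("utf-8")

# Writing the stream back out as UTF-16 (big-endian, no BOM)
# recovers the original pair.
utf16 = chr(usv).encode("utf-16-be")

print(hex(usv), utf8.hex(" "), utf16.hex(" "))
```

The scalar value comes out as #x2019B and the UTF-8 bytes as F0 A0 86 9B,
matching the figures above.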

Scenario 2: You have a text stream encoded in UTF-16. It contains a
lone surrogate, <D840>. This is an invalid string. You have a couple
of options:

 2a: reject the input as invalid.

 2b: replace the surrogate value with the replacement character
     U+FFFD (converted to #xEF #xBF #xBD in UTF-8 rep land)

 2c: keep the character, encode internally in UTF-8 (#xED #xA1
     #x80). On output this gets converted back.

 2d: ignore that value completely, not preserving it on input.

Of these, 2c is non-conforming and not recommended, but avoids data
loss in cases where that is important.
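
The four options can be tried out with Python's standard codec error
handlers. This is only an illustrative sketch, assuming big-endian
UTF-16 input without a BOM; the variable names are made up, and
"surrogatepass" is Python's own escape hatch for 2c, not anything the
Unicode standard blesses:

```python
# UTF-16BE input: "A", a lone high surrogate D840, then "B".
raw = b"\x00\x41\xd8\x40\x00\x42"

# 2a: reject the input as invalid (strict decoding is the default).
try:
    raw.decode("utf-16-be")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# 2b: replace the lone surrogate with U+FFFD.
replaced = raw.decode("utf-16-be", errors="replace")
internal_2b = replaced.encode("utf-8")  # U+FFFD becomes EF BF BD

# 2c: keep the surrogate and store it as (ill-formed) UTF-8.
kept = raw.decode("utf-16-be", errors="surrogatepass")
internal_2c = kept.encode("utf-8", errors="surrogatepass")  # ED A1 80
# On output the surrogate is converted back unchanged.
round_trip = (
    internal_2c.decode("utf-8", errors="surrogatepass")
    .encode("utf-16-be", errors="surrogatepass")
)

# 2d: silently drop the lone surrogate.
dropped = raw.decode("utf-16-be", errors="ignore")

print(rejected, repr(replaced), internal_2c.hex(" "), repr(dropped))
```

Note that a strict UTF-8 encoder refuses the string kept under 2c,
which is the "non-conforming" part in practice.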

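Representing strings internally in UTF-8 is a loss though, since you
lose constant-time random access to the string: a character takes
anywhere from one to four bytes, so finding the Nth character means
scanning from the start. For some applications this isn't a big deal,
but in general using UTF-8 as an internal representation is a bad idea.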
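
To make the random access point concrete, here is a small sketch of
what indexing into a UTF-8 buffer involves. The function name
utf8_char_offset is hypothetical, and the code assumes the buffer is
well-formed UTF-8:

```python
def utf8_char_offset(buf: bytes, index: int) -> int:
    """Return the byte offset of the index-th character in UTF-8 data.

    This is a linear scan: every byte before the target is inspected.
    """
    seen = 0
    for offset, byte in enumerate(buf):
        # Continuation bytes look like 10xxxxxx; anything else starts
        # a new character.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError(index)


# "a" (1 byte), U+00E9 (2 bytes), U+2019B (4 bytes), "z" (1 byte)
buf = "a\u00e9\U0002019Bz".encode("utf-8")
offsets = [utf8_char_offset(buf, i) for i in range(4)]
print(offsets)
```

With a fixed-width representation the same lookup is a single
multiplication; here the cost grows with the index.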


Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"