
Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: Alan Watson <a.watson@xxxxxxxxxxxxxxxx>
Subject: Re: Surrogates and character representation
From: Tom Emerson <tree@xxxxxxxxxxxxx>
Date: Sun, 24 Jul 2005 13:54:41 -0400
Cc: "John.Cowan" <jcowan@xxxxxxxxxxxxxxxxx>, Thomas Bushnell BSG <tb@xxxxxxxxxx>, srfi-75@xxxxxxxxxxxxxxxxx
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
In-reply-to: <42E3D086.90403@xxxxxxxxxxxxxxxx>
References: <1122002894.6607.29.camel@xxxxxxxxxxxxxx> <17120.28178.788826.533753@xxxxxxxxxxxxxxxxxxxxxx> <20050722040917.GB7576@NYCMJCOWA2> <17120.30080.768671.539970@xxxxxxxxxxxxxxxxxxxxxx> <878xzykn0y.fsf@xxxxxxxxxxxxxxxxx> <17122.31220.22073.72951@xxxxxxxxxxxxxxxxxxxxxx> <20050724053713.GM2784@NYCMJCOWA2> <42E3D086.90403@xxxxxxxxxxxxxxxx>
Reply-to: tree@xxxxxxxxxxxxx

Alan Watson writes:
> Hmm. That would seem to prevent an implementation representing strings 
> internally using UTF-8. This is convenient in some contexts as Scheme 
> strings can be trivially converted to UTF-8 C strings.

You can encode surrogate values in UTF-8; the result is just
ill-formed.  A conformant (Unicode) implementation shouldn't generate
these, though one could argue that if you get garbage in, you get
garbage out.

Scenario 1: You have a text stream encoded in UTF-16. It contains a
valid surrogate pair <D840,DD9B>. This is converted to the USV
#x0002019B. If you represent the Unicode strings internally as UTF-8,
this gets converted to the byte-sequence #xF0 #xA0 #x86 #x9B. When
writing the text stream you pick the encoding and the USV gets written
appropriately.
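
The arithmetic in Scenario 1 can be checked with a short sketch. This
is only an illustration in Python, not anything from the SRFI; the
helper name combine_surrogates is made up for the example.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a Unicode scalar value (USV)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 payload bits; the result is offset by #x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


usv = combine_surrogates(0xD840, 0xDD9B)
utf8 = chr(usv).encode("utf-8")  # internal UTF-8 representation of the USV

print(hex(usv))       # 0x2019b
print(utf8.hex(" "))  # f0 a0 86 9b
```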

Scenario 2: You have a text stream encoded in UTF-16. It contains a
lone surrogate, <D840>. This is an invalid string. You have a couple
of options:

 2a: reject the input as invalid.

 2b: replace the surrogate value with the replacement character
     U+FFFD (converted to #xEF #xBF #xBD in UTF-8 rep land)

 2c: keep the character, encode internally in UTF-8 (#xED #xA1
     #x80). On output this gets converted back.

 2d: ignore that value completely, not preserving it on input.

Of these, 2c is non-conforming and not recommended, but avoids data
loss in cases where that is important.
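
As a rough illustration of the four options, Python's codec error
handlers happen to map onto them fairly closely: "strict" for 2a,
"replace" for 2b, "surrogatepass" for 2c, and "ignore" for 2d. The
sample stream below (A, lone D840, B) is an assumption made up for the
example, and Python is only standing in for whatever the
implementation language would be.

```python
# UTF-16BE stream: 'A', a lone high surrogate D840, 'B'.
raw = b"\x00\x41\xd8\x40\x00\x42"

# 2a: reject the input as invalid.
try:
    raw.decode("utf-16-be")  # default handler is "strict"
    rejected = False
except UnicodeDecodeError:
    rejected = True

# 2b: substitute U+FFFD for the lone surrogate.
replaced = raw.decode("utf-16-be", errors="replace")

# 2c: keep the surrogate; the internal UTF-8 form is ill-formed.
kept = raw.decode("utf-16-be", errors="surrogatepass")
kept_utf8 = kept.encode("utf-8", errors="surrogatepass")

# 2d: drop the lone surrogate silently.
dropped = raw.decode("utf-16-be", errors="ignore")

print(rejected)                      # True
print(replaced.encode("utf-8"))      # b'A\xef\xbf\xbdB'
print(kept_utf8)                     # b'A\xed\xa1\x80B'
print(dropped)                       # AB
```

Only 2c round-trips: encoding kept back to UTF-16BE with the
"surrogatepass" handler reproduces the original bytes.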

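Representing strings internally in UTF-8 is a loss though, since you
lose constant-time random access to the characters of the string: a
code point takes anywhere from one to four bytes, so finding the nth
character means scanning from the start. For some applications this
isn't a big deal, but in general using UTF-8 as an internal
representation is a bad idea.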
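
To make the random-access cost concrete, here is a minimal sketch of
indexing by character position into a UTF-8 buffer. The function names
seq_len and char_at are hypothetical, and the sketch assumes
well-formed input; a real implementation might keep an index or cache
to soften the cost.

```python
def seq_len(lead):
    """Length in bytes of the UTF-8 sequence that starts with byte `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def char_at(buf, index):
    """Return the character at position `index`, scanning from the start."""
    pos = 0
    for _ in range(index):  # O(index): every earlier character is stepped over
        pos += seq_len(buf[pos])
    return buf[pos:pos + seq_len(buf[pos])].decode("utf-8")


buf = "a\u00e9\U0002019Bz".encode("utf-8")  # 1 + 2 + 4 + 1 = 8 bytes
print(len(buf))                  # 8
print(hex(ord(char_at(buf, 2)))) # 0x2019b
```

With a fixed-width representation the same lookup is a single array
access, which is the trade-off being described.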

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"
