
Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: srfi-75@xxxxxxxxxxxxxxxxx
Subject: Surrogates and character representation
From: Tom Emerson <tree@xxxxxxxxxxxxx>
Date: Thu, 21 Jul 2005 23:54:58 -0400
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
In-reply-to: <1122002894.6607.29.camel@xxxxxxxxxxxxxx>
References: <1122002894.6607.29.camel@xxxxxxxxxxxxxx>
Reply-to: tree@xxxxxxxxxxxxx

Just US$0.02 worth from the lurking depths.

Surrogates are no more than an elegant hack to extend the original
16-bit codespace to a 21-bit one (codepoints up to 0x10FFFF). This talk
of blocking out the surrogate blocks from the range of character values
is silly, IMHO.

The implementation should be concerned with codepoints, in the range
0x000000 to 0x10FFFF. How these get mapped to bytes or words is an
issue with whatever transcoder you have in place to generate a
printable form of the abstract character.
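
To make the codepoint/transcoder split concrete, here is a minimal sketch
in Python. The helper name encoded_forms is hypothetical, and the choice of
U+1D11E as the sample codepoint is mine; the point is only that one
abstract codepoint has several byte-level forms depending on the encoding
chosen.

```python
# Hypothetical helper: one abstract codepoint, several byte-level forms.
def encoded_forms(codepoint):
    # The codespace discussed above runs from 0x000000 to 0x10FFFF.
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    ch = chr(codepoint)
    # Each encoding plays the role of a "transcoder" for the same character.
    return {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

forms = encoded_forms(0x1D11E)  # MUSICAL SYMBOL G CLEF, chosen as an example
for name, data in forms.items():
    print(name, data.hex())
```

The program only ever handles the integer 0x1D11E; the surrogate pair
exists solely in the UTF-16 byte form.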

Looking at characters this way, any codepoint in the range 0xD800
through 0xDFFF is considered an invalid character. This conforms with
section 3.8 of TUS, D26a and D27. These codepoints only show up when
dealing with UTF-16. UCS-4, UTF-32, UTF-8, etc. don't use them.
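
As a rough illustration, assuming the view above that 0xD800 through 0xDFFF
are simply not characters, the validity check and the UTF-16 split might
look like this in Python. The function names are hypothetical and not taken
from the SRFI.

```python
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF

def is_valid_character(cp):
    # In the codespace, and not one of the surrogate code points.
    return 0 <= cp <= 0x10FFFF and not (SURROGATE_FIRST <= cp <= SURROGATE_LAST)

def utf16_code_units(cp):
    # Surrogates appear only here, as an artifact of the UTF-16 encoding form.
    if not is_valid_character(cp):
        raise ValueError("not a valid character: %#x" % cp)
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

print([hex(u) for u in utf16_code_units(0x1D11E)])
```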

If you treat the surrogates as undefined within the character range,
then you must (for consistency) treat all of the other undefined
abstract characters as holes. This just complicates processing.

From the programmer's perspective, I just want to deal with characters
as single entities (combining forms aside for the moment). It is up to
me to know whether my string has been normalized or not, and deal with
that situation. For most uses it doesn't matter.
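
A small sketch of what "knowing whether my string has been normalized"
can mean in practice, using Python's standard unicodedata module. The
choice of NFC as the target form is an assumption for the example; the
message does not name a normalization form.

```python
import unicodedata

decomposed = "u\u0308"  # LATIN SMALL LETTER U + COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)

def is_nfc(s):
    # A string is in NFC if normalizing it changes nothing.
    return unicodedata.normalize("NFC", s) == s

print(len(decomposed), len(composed))
print(is_nfc(decomposed), is_nfc(composed))
```

The two strings display the same but have different lengths in codepoints,
which is why the caller has to know which form they are holding.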

Using Unicode as the underlying character rep while using glyph
semantics at the program level is, to me, a recipe for complete
confusion. Then iteration over strings, and random string access,
become difficult: <0054 0073 0068 0075 0308 00DF> would then have
physical character indices at 0, 1, 2, 3, 5.
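
The index list above can be reproduced with a short Python sketch. It
assumes a deliberately simplified rule, namely that a glyph starts at every
codepoint whose canonical combining class is zero; real grapheme
segmentation has more rules than this. The function name is hypothetical.

```python
import unicodedata

def glyph_start_indices(codepoints):
    # Simplified: a codepoint with a nonzero combining class attaches to
    # the glyph before it; everything else starts a new glyph.
    starts = []
    for i, cp in enumerate(codepoints):
        if i == 0 or unicodedata.combining(chr(cp)) == 0:
            starts.append(i)
    return starts

sequence = [0x0054, 0x0073, 0x0068, 0x0075, 0x0308, 0x00DF]
print(glyph_start_indices(sequence))  # prints [0, 1, 2, 3, 5]
```

Six codepoints yield five glyphs, so the n-th glyph is no longer at
offset n, and random access by glyph needs a scan or a side index.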

One question I've had: how are 8-bit (i.e., byte) strings handled
here? Is there no distinction between operations on raw bytes and
operations on characters?

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"

Follow-Ups:
- Re: Surrogates and character representation
  - From: John.Cowan

References:
- Re: the "Unicode Background" section
  - From: Thomas Lord
