Surrogates and character representation



Just US$0.02 worth from the lurking depths.

Surrogates are no more than an elegant hack to extend the original
16-bit codespace to one that reaches 0x10FFFF. This talk of blocking
the surrogate blocks as the range of character values is silly, IMHO.

The implementation should be concerned with codepoints, in the range
0x000000 to 0x10FFFF. How these get mapped to bytes or words is an
issue with whatever transcoder you have in place to generate a
printable form of the abstract character.
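
To make the codepoint/transcoder split concrete, here is a small sketch
in Python (my choice for illustration, not anything the implementation
under discussion uses). The code point 0x10400 is an arbitrary example
outside the 16-bit range, and the variable names are made up.

```python
# One abstract code point, several serialized forms.
cp = 0x10400          # arbitrary example above 0xFFFF
ch = chr(cp)          # the character as the program sees it

# The transcoder decides how the code point becomes bytes.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(len(ch))        # 1
print(utf8.hex())     # f0909080
print(utf16.hex())    # d801dc00
print(utf32.hex())    # 00010400
```

The program only ever sees one character; the surrogate pair D801 DC00
exists solely in the UTF-16 byte form.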

Looking at characters this way, any codepoint in the range 0xD800
through 0xDFFF is considered an invalid character. This conforms with
section 3.8 of TUS, D26a and D27. These characters only show up when
dealing with UTF-16. UCS-4, UTF-32, UTF-8, etc. don't use them.
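
A rough sketch of what that looks like in code, again in Python and
again only an illustration: the function names are hypothetical, and a
real transcoder would do more checking than this.

```python
def is_surrogate(cp):
    """True for code points reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF

def is_valid_character(cp):
    """A code point in 0..0x10FFFF that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)

def to_utf16_units(cp):
    """Map one code point to its UTF-16 code units (one or two)."""
    if not is_valid_character(cp):
        raise ValueError("not a valid character: 0x%X" % cp)
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]

print([hex(u) for u in to_utf16_units(0x10400)])  # ['0xd801', '0xdc00']
print(is_valid_character(0xD800))                 # False
```

Surrogates appear only in the output of to_utf16_units; they are never
accepted as input characters.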

If you treat the surrogates as undefined within the character range,
then you must (for consistency) treat all of the other undefined
abstract characters as holes. This just complicates processing.

From the programmer's perspective, I just want to deal with characters
as single entities (combining forms aside for the moment.) It is up to
me to know whether my string has been normalized or not, and deal with
that situation. For most uses it doesn't matter.
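
For the cases where it does matter, here is a minimal sketch of the
check I have in mind, using Python's unicodedata module. The two sample
strings are my own.

```python
import unicodedata

decomposed = "u\u0308"   # u followed by COMBINING DIAERESIS
precomposed = "\u00FC"   # LATIN SMALL LETTER U WITH DIAERESIS

# Without normalization the two spellings compare unequal.
print(decomposed == precomposed)                                # False

# After normalizing to a common form they agree.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```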

Using Unicode as the underlying character rep while using glyph
semantics at the program level is, to me, a recipe for complete
confusion. Then iteration over strings, and random string access,
becomes difficult: <0054 0073 0068 0075 0308 00DF> would then have
physical character indices at 0, 1, 2, 3, 5.
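
Those indices can be reproduced with a short Python sketch. It is a
simplification: it treats any character with a nonzero canonical
combining class as attached to the preceding base, which is not a full
treatment of glyph boundaries. The function name is hypothetical.

```python
import unicodedata

def glyph_starts(s):
    """Indices of code points that begin a new base-plus-marks unit."""
    return [i for i, ch in enumerate(s)
            if i == 0 or unicodedata.combining(ch) == 0]

s = "\u0054\u0073\u0068\u0075\u0308\u00DF"

print(len(s))           # 6
print(glyph_starts(s))  # [0, 1, 2, 3, 5]
```

Six code points, five glyph-level positions, and finding the nth one
means scanning from the start of the string.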

One question I've had: how are 8-bit (i.e., byte) strings handled
here? Is there no distinction between operations on raw bytes and
operations on characters?

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"