Re: Surrogates and character representation

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



Thomas Bushnell BSG writes:
> This is exactly part of the reason why char=codepoint is such a lose.
> Most code doesn't *want* to see this kind of garbage; it's an encoding
> issue.  I want chars where the *computer* takes care of the coding.  I
> want chars that are fully-understood characters, not little pieces of
> a character.

Surrogates are a side-effect of UTF-16. Period. Application-level code
just doesn't see them. This entire discussion about whether or not a
CHAR should include surrogate code points is, IMHO, a waste of
everyone's talents here. It's much ado about nothing.

The only time you should see a surrogate value is if the input text is
malformed. Otherwise the lower-level transcoders should have converted
to the appropriate astral plane codepoint. If the text is malformed,
big deal. It is not difficult to handle this case.
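
To make the transcoding step concrete, here is a minimal sketch in
Python of what such a lower-level decoder might do. The function name
and the choice to substitute U+FFFD for an unpaired surrogate are
assumptions for illustration only; a real transcoder could just as
well raise an error or apply some other policy.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (assumed policy for bad input)

def decode_utf16_units(units):
    """Turn a sequence of 16-bit code units into a list of code points.

    Hypothetical helper: well-formed surrogate pairs are combined into a
    single supplementary-plane code point; unpaired surrogates become U+FFFD.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: only valid when a low surrogate follows it.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                lo = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
                i += 2
                continue
            out.append(REPLACEMENT)  # malformed: high surrogate with no partner
        elif 0xDC00 <= u <= 0xDFFF:
            out.append(REPLACEMENT)  # malformed: low surrogate with no preceding high
        else:
            out.append(u)            # ordinary BMP code unit, passed through as-is
        i += 1
    return out
```

With something like this in the input layer, code above it only ever
receives whole code points: decode_utf16_units([0x0041, 0xD840, 0xDC00])
gives [0x41, 0x20000], and a stray 0xD840 on its own comes out as 0xFFFD.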

FWIW, I've been working in Unicode since before UTF-16 was
developed. Most of my work is in Asian languages, where I would expect
to see characters outside the BMP. The reality is that they are just
not that common. You don't see them. The only time I do see them is
once in a while when dealing with texts from Hong Kong that are
encoded in UTF-16. But the transcoding layer makes these go away, and
I just have the full codepoint. If you are a developer and you lose
sleep over surrogates, I envy you.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"