Re: the discussion so far

bear scripsit:

> The particular example I'm thinking of is splitting strings
> between base codepoint and combining codepoint. You get two
> substrings, and the second one is syntactically invalid.

Please point to a place in the Unicode Standard where any sequence
of Unicode scalar values is said to be "syntactically invalid".

> If you print the first substring and then the second, the
> combining codepoint is usually printed as though it modified
> a space character that isn't actually there.

That's one possibility; it can also be rendered on top of
a dotted circle, which is what is done in the Unicode charts.
In any case, glyph rendering is not part of the Standard.

> If something
> normalizes the substrings first, the space may actually be
> added, although it wasn't present in the original string.

That turns out not to be the case.  The normalized form of
a string consisting of one combining character is itself.
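
This is easy to check for one such mark.  The following is only a
sketch, assuming Python's standard unicodedata module and using
U+0301 COMBINING ACUTE ACCENT as the example; the variable names
are illustrative.  It splits "e" + U+0301 between base and mark,
then normalizes the lone mark under each of the four forms.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT (U+0301), split between
# the base code point and the combining code point.
s = "e\u0301"
base, mark = s[:1], s[1:]

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Normalize the lone combining mark under each form.
lone_mark_results = {form: unicodedata.normalize(form, mark) for form in FORMS}

for form, result in lone_mark_results.items():
    # Show the code points so that any inserted space would be visible.
    print(form, [f"U+{ord(c):04X}" for c in result])
```

For this mark, each form should list only U+0301, with no U+0020
in front of it.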

> Gah.  Encodings, normalization forms, endianness, and all the
> rest of it.  When you want to write a "character" any of a dozen
> things can happen.

Blurring significant distinctions that have taken a long time to
nail down isn't very conducive to clear thinking.

Not to perambulate                 John Cowan <jcowan@xxxxxxxxxxxxxxxxx>    
    the corridors                  http://www.reutershealth.com
during the hours of repose         http://www.ccil.org/~cowan
    in the boots of ascension.       --Sign in Austrian ski-resort hotel