Re: the discussion so far
On Tue, 19 Jul 2005, John.Cowan wrote:
>> Right; substrings that aren't valid strings, or which combine into
>> something that isn't the original string, can result when you split
>> grapheme clusters. This happens when you take substrings on arbitrary
>> codepoint boundaries, or do buffered operations on arbitrary codepoint
>> boundaries, or any of a number of other things.
>These things turn out not to be the case. They are true if you split
>strings on arbitrary *octet* or *code unit* boundaries, but if you
>stick to *codepoint* boundaries, they are not true. Any sequence of
>codepoints is a valid string, and no amount of taking apart and putting
>back together can change the validity or the interpretation of the string.
The particular example I'm thinking of is splitting a string
between a base codepoint and its combining codepoint. You get
two substrings, and the second one is syntactically invalid.
If you print the first substring and then the second, the
combining codepoint is usually printed as though it modified
a space character that isn't actually there. If something
normalizes the substrings first, the space may actually be
added, although it wasn't present in the original string.
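
To make the example concrete, here is a minimal sketch in Python
(chosen only for illustration; the helper name split_at is made up).
It assumes "e" followed by U+0301 COMBINING ACUTE ACCENT as the test
string. It shows that the pieces concatenate back to the original,
but that normalizing the pieces separately does not give the same
result as normalizing the whole string. Note that Python's
unicodedata.normalize does not itself insert a space; that behaviour
belongs to whatever displays or repairs the isolated mark.

```python
import unicodedata

def split_at(text, index):
    """Split text on a codepoint boundary (hypothetical helper)."""
    return text[:index], text[index:]

# "e" + U+0301: two codepoints, displayed as one character.
original = "e\u0301"
head, tail = split_at(original, 1)

# Concatenating the pieces restores the original codepoint sequence.
roundtrip_ok = (head + tail == original)

# The second piece begins with a combining mark that has no base.
tail_starts_with_mark = unicodedata.combining(tail[0]) != 0

# NFC composes the whole string into U+00E9, but cannot compose
# the two pieces when each is normalized on its own.
whole_nfc = unicodedata.normalize("NFC", original)
pieces_nfc = (unicodedata.normalize("NFC", head)
              + unicodedata.normalize("NFC", tail))

print(roundtrip_ok, tail_starts_with_mark, whole_nfc == pieces_nfc)
```

So both observations hold at once: the split is lossless at the
codepoint level, yet any per-substring processing sees something
different from what a whole-string pass would see.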
>The description of grapheme clusters in Unicode makes it clear that they
>are neither correct nor complete in all circumstances, just yet another
>global definition that provides a fairly good approximation.
In my opinion, they provide a *MUCH* better approximation to
what a "character" actually is than codepoints do.
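
As a rough illustration of the difference, the sketch below counts
the same word both ways. The function rough_clusters is a made-up
name and only a crude stand-in for real grapheme cluster
segmentation: it merely attaches combining marks to the preceding
codepoint and ignores the other rules (Hangul jamo, joiners, and so
on).

```python
import unicodedata

def rough_clusters(text):
    """Group each codepoint with the combining marks that follow it.

    Only an approximation of grapheme clusters, for illustration.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # A combining mark joins the cluster before it.
            clusters[-1] += ch
        else:
            # Anything else starts a new cluster.
            clusters.append(ch)
    return clusters

# "resume" with both accents written as combining marks.
word = "re\u0301sume\u0301"

codepoint_count = len(word)                 # 8
cluster_count = len(rough_clusters(word))   # 6
print(codepoint_count, cluster_count)
```

A reader would call that a six-character word, which is the count
the cluster view gives.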
> You do realize that there is a countable infinity of different grapheme
Yes. They're like integers that way - a useful type.
>> This introduces a distinction between text ports (which read and write
>> characters, full-stop) and binary ports (which read and write octets).
>> If you want to read or write characters on a binary port, you *SHOULD*
>> have to state explicitly what encoding to use.
>Indeed. That, however, has to do with encodings, not normalization forms.
Gah. Encodings, normalization forms, endianness, and all the
rest of it. When you want to write a "character", any of a dozen
things can happen.
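
A small sketch of what I mean, again in Python and purely as an
illustration. The helper write_text is hypothetical; it stands for
a binary port that makes the caller state the normalization form
and the encoding explicitly. The same "character" (U+00E9 here, an
arbitrary choice) comes out as a different octet sequence for each
combination.

```python
import io
import unicodedata

def write_text(binary_port, text, encoding, form):
    """Write text to a binary stream, with the normalization form
    and encoding stated explicitly (hypothetical helper)."""
    data = unicodedata.normalize(form, text).encode(encoding)
    binary_port.write(data)
    return data

port = io.BytesIO()
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# One "character", several possible octet sequences.
nfc_utf8 = write_text(port, ch, "utf-8", "NFC")        # c3 a9
nfd_utf8 = write_text(port, ch, "utf-8", "NFD")        # 65 cc 81
nfc_utf16le = write_text(port, ch, "utf-16-le", "NFC") # e9 00
nfc_utf16be = write_text(port, ch, "utf-16-be", "NFC") # 00 e9

for data in (nfc_utf8, nfd_utf8, nfc_utf16le, nfc_utf16be):
    print(data.hex(" "))
```

None of these is wrong; they are just different answers to
questions the caller never got asked unless the interface forces
the choice.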