
Re: RESET [was Re: Encodings]




On Sat, 14 Feb 2004, Ken Dickey wrote:

> A Scheme implementation which properly reads the two files should
> end up with the identifier occurrences [stored in different
> encodings] denoted above represented by symbols which are eq?  (NB:
> _not_ eqv?) to each other.  If not, I term this "broken".

Yup.  Agreed.  Conforming unicode systems read different encodings
(sequences of bytes) or canonicalizations (sequences of codepoints)
and recognize them as being the _same_ string (sequence of abstract
characters).  This is the strongest single reason why I decided for my
own implementation that The Right Thing was to draw boundaries for
character operations at the character level rather than the codepoint
level.  Unicode forces the abstraction of "character" to a level
higher than representation or encoding, but each file is still a
proper sequence of characters, and each identifier that must be
equated is the same sequence of characters, even if not the same
sequence of codepoints.

So if in an NFD file, I get a sequence of codepoints that goes R, e,
combining accent grave, s, u, m, e, combining accent aigu, and in an
NFC file I read a sequence of codepoints that's R, e-with-grave, s, u,
m, e-with-aigu, then as a conforming implementation of unicode I *MUST*
recognize that these are the same sequence of characters and treat
them as the same sequence of characters.  In the case of scheme, that
means my compiler must understand that they are the same identifier.
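
As a rough illustration of the point above (in Python rather than scheme,
purely because its standard library exposes Unicode normalization), the
sketch below compares the two spellings.  The helper name same_identifier
and the choice of NFC as the comparison form are assumptions made for the
example, not something any scheme implementation prescribes.

```python
import unicodedata

# The same identifier as it might arrive from an NFD file and an NFC file.
decomposed = "Re\u0300sume\u0301"   # e + COMBINING GRAVE ... e + COMBINING ACUTE
precomposed = "R\u00e8sum\u00e9"    # E WITH GRAVE ... E WITH ACUTE, precomposed

def same_identifier(a, b, form="NFC"):
    """Hypothetical helper: compare two spellings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Codepoint by codepoint the two strings differ (8 codepoints versus 6) ...
codepoints_equal = decomposed == precomposed
# ... but as sequences of characters they are the same identifier.
characters_equal = same_identifier(decomposed, precomposed)
```

A reader that interns symbols on the normalized spelling would hand back
the same symbol object for both files, which is what eq? asks for.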


>[2]
>
>[In the absence of reflection] one should be able to consistently replace all
>occurrences of an identifier in the same scope without changing the meaning/
>behavior of a program.  If not, I term the situation "broken".

I'll say "right", but as you note above, there is always the
possibility of reflection, since scheme has symbol->string,
string->symbol, and eval.  Programs that don't use them will not,
generally, need to agree on a syntax for unicode identifiers other
than a simple escape mechanism that allows them to be written in
ascii.

>[3]
>
> There are many concepts which come in paired/binary parts: on/off, up/down, et
> cetera, which have no meaning without both parts.
> [...]
> So if a glyph/character does not have a case variant, considering it to be
> lower case makes no logical sense.  I view this as an abuse of terminology.
> Being outside of normal logic, I term this "bizarre" and if pressed, probably
> "broken" as well.

This happens in one case (eszett), for a singular reason: the uppercase
form of this *ONE* lowercase letter is *TWO* uppercase letters.

There are many other instances in Unicode in which a character's
lowercase and uppercase form must be represented by a different number
of codepoints, and if you regard codepoints as characters these
instances appear to have the same problem (isolated lowercase forms or
isolated uppercase forms).
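
A small sketch of the eszett case, again in Python only because its str
methods happen to apply the full Unicode case mappings; the variable names
are made up for the example.

```python
# LATIN SMALL LETTER SHARP S: one lowercase letter, two uppercase letters.
eszett = "\u00df"
upper = eszett.upper()        # the full case mapping yields "SS"

# The mapping does not round-trip: lowercasing "SS" gives "ss", not eszett.
round_trip = upper.lower()
lengths = (len(eszett), len(upper))
```

Seen codepoint by codepoint, eszett has no single uppercase partner; seen
as a string operation, the mapping is perfectly well defined.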

> So in all this discussion of multiple canonical forms (another
> misuse of terminology, IMHO) multiple normal forms, et cetera, I am
> looking for a description of how to keep [1] and [2] from being
> broken.

The set of Unicode codepoints is not a character set that has these
properties.  The set of characters that can be represented by
sequences of these codepoints is a character set that has these
properties.

> If satisfying the Unicode Standard means breaking [1], then I say
> "Don't do that!".

No.  Satisfying Unicode means, precisely, *NOT* breaking [1].
Regardless of the encoding of the file (sequences of bytes or
codepoints) Unicode requires the system to recognize that these
identifiers are in fact the same sequences of characters.

The multiple "canonicalizations" that people are worried about
(NFC/NFD versus NFKC/NFKD) can properly be regarded as two character
sets.  The NFC/NFD character set draws many distinctions finer than
the NFKC/NFKD character set can make, and there is a "standard"
mapping between the two character sets under which many distinct
NFC/NFD characters are mapped to the "same" NFKC/NFKD character.

For example, counting mathematical forms, there are about a dozen
unaccented lowercase latin letter A's in the NFC/NFD character set,
varying mainly by font.  All of these map to the same NFKC/NFKD
character.  Inappropriately converting a file in which the
distinctions are important is a lot like converting a text processor
document in which different fonts are important to plain ascii - it
loses information.  I think it is up to the implementor's discretion
whether his scheme regards its "character set" as the NFKC/NFKD
character set or the NFC/NFD character set.
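
To make the information loss concrete, here is a hedged Python sketch
using a handful of the lowercase A variants.  The five codepoints chosen
are just a sample, and the names variants, nfc_forms and nfkc_forms are
invented for the example.

```python
import unicodedata

# A few "unaccented lowercase latin letter A" variants, plus plain "a".
variants = [
    "a",           # LATIN SMALL LETTER A
    "\U0001d41a",  # MATHEMATICAL BOLD SMALL A
    "\U0001d44e",  # MATHEMATICAL ITALIC SMALL A
    "\U0001d5ba",  # MATHEMATICAL SANS-SERIF SMALL A
    "\U0001d68a",  # MATHEMATICAL MONOSPACE SMALL A
    "\uff41",      # FULLWIDTH LATIN SMALL LETTER A
]

# NFC keeps every variant distinct; NFKC folds them all onto plain "a".
nfc_forms = {unicodedata.normalize("NFC", v) for v in variants}
nfkc_forms = {unicodedata.normalize("NFKC", v) for v in variants}
```

An implementation that normalizes identifiers with NFKC would therefore
treat all six spellings as one identifier, while one that uses NFC would
see six different identifiers.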

Both character sets are, technically, infinite, but the NFKC/NFKD
character set is a proper subset of the NFC/NFD character set.

Hope this helps,

				Bear