Re: RESET [was Re: Encodings]

Ken Dickey wrote:
> Let's say there are two scheme source files, each of which uses the
> "same" identifier in the same global (module global) scope/context.
> We say that in an RNRS Scheme the identifier names or denotes the same
> value.
> 
> Let's say the two files are stored in different encodings (say utf-8
> and ucs-2) and processed by different but conforming Unicode systems
> (text editors, Scheme read/write, whatever) so that identifiers still
> appear the same when displayed but are stored in different encodings.
> 
> A Scheme implementation which properly reads the two files should end
> up with the identifier occurrences denoted above represented by
> symbols which are eq?  (NB: _not_ eqv?) to each other.  If not, I term
> this "broken".

That's the essence of the conformance requirement I quoted earlier. If a
process claims to support Unicode, UTF-8, and UCS-2, then all of the
many ways of encoding the same character or symbol *must* be recognized
as canonically identical (EQ? in Scheme terms).
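
To make that concrete, here's a rough sketch in Scheme. It assumes a
string-normalize-nfc procedure that returns its argument in
Normalization Form C; that name is just for illustration, not something
the current standard provides:

    ;; U+00E9 (precomposed é) and U+0065 U+0301 (e + combining acute)
    ;; are canonically equivalent spellings of the same character.
    (define composed   (string #\xE9))          ; precomposed é
    (define decomposed (string #\x65 #\x301))   ; e + combining acute

    ;; A raw code-point comparison sees two different strings:
    (string=? composed decomposed)              ; => #f

    ;; Normalizing before interning yields EQ? symbols:
    (eq? (string->symbol (string-normalize-nfc composed))
         (string->symbol (string-normalize-nfc decomposed)))
                                                ; => #t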

Also, that's the whole reason that the "normalization forms" exist: They
make it easier to compare text for canonical equivalence. They're very
similar to the C function strxfrm, which changes a string to a form that
is easier to compare and collate. (The major difference is that a normal
form is still directly printable; the result of strxfrm is not.)

A process *could* keep all text in its original format, with some
characters in fully-composed (NFC) UTF-8 and others in fully-decomposed
(NFD) UTF-32. Checking for canonical equivalence would involve some
complicated function that interprets and normalizes characters on the
fly (like using "strcoll" in C). But it's usually easier to transform
all strings into some "normal" form -- make them all NFC UTF-32 first --
and use a simple binary comparison (like using "strxfrm" then "strcmp"
in C).
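
In Scheme terms, that's a "normalize once at the boundary" pattern.
Here's a sketch, again assuming the hypothetical string-normalize-nfc;
intern-identifier is a made-up name for whatever the reader uses to
turn source text into symbols:

    ;; Normalize every identifier as it enters the system, so plain
    ;; EQ? works everywhere downstream (the strxfrm-then-strcmp style):
    (define (intern-identifier str)
      (string->symbol (string-normalize-nfc str)))

    ;; The on-the-fly alternative (the strcoll style) has to normalize
    ;; on every single comparison instead:
    (define (canonical-string=? a b)
      (string=? (string-normalize-nfc a)
                (string-normalize-nfc b)))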

Your earlier idea of using separate programs to convert and then process
characters was on the right track. The only flaw (and it's a big flaw)
was the idea of putting them into completely separate processes.

> So if a glyph/character does not have a case variant, considering it
> to be lower case makes no logical sense.  I view this as an abuse of
> terminology.  Being outside of normal logic, I term this "bizarre" and
> if pressed, probably "broken" as well.

In that case, the German language is "broken." The ß character is
lowercase, but its uppercase form is the two-letter sequence "SS", so a
round-trip "raise and lower case" process generally cannot recover it.

There are many situations where you need to know what words mean in
order to judge whether they're really the "same thing." Computers aren't
very good at that, so programmers need to be careful about (for example)
using homonyms as identifiers. If you have two identifiers named
"resume," you can't expect the computer to tell them apart just because
one of them means "restart" and the other means "curriculum vitae."

The trouble with ß is that the computer must understand the meaning of
words just to do "mechanical" transformations like changing case! Many
programmers assume that you can change case without understanding, but ß
is the classic example of why that is impossible. (It's not the only
one: what's the lowercase version of SMITH? In English, that depends on
whether it's a surname or a job title.)

I think this is getting off on a bit of a tangent from what you're
talking about, so please excuse me for going off on a rant! Programmers
-- at least English-speaking ones -- have historically over-simplified
the issues related to lettercasing, and some of the problems have made
their way into the Scheme standard.
-- 
Bradd W. Szonye
http://www.szonye.com/bradd