
Re: Encodings.

This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 contain all messages, not just those from before July 7th, 2015.

Bradd wrote:
>> *Any* program that handles Unicode data implements Unicode! That
>> includes Scheme compilers that support Unicode sources.

Ken Dickey wrote:
> Ok.  Pick an example.

Why? Any process that claims to support Unicode must conform to the
Unicode standard.

> According to the docs, Gambit 3.0 supports Unicode.  
> But..
> > (define great (string-ref "\x5927" 0)) ;; "(U+5927)"
> > great
> #\*** ERROR -- IO error on #<output-port (stdout)>

I have no idea whether that indicates conformance or not. Is "\x5927"
valid Gambit syntax for the Unicode codepoint U+5927? If not, then this
example is meaningless. Does the output port use a Unicode encoding by
default? If not, this example is meaningless.
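To illustrate the point (in Python rather than Gambit, since Gambit's reader syntax is exactly what's in question): whether an escape sequence denotes a codepoint, and whether that codepoint can then be written to a port, are two separate issues.

```python
# Illustration in Python, not Gambit. Python's \uXXXX escape is
# defined to denote the codepoint U+XXXX, so there is no ambiguity
# about what the string contains.
s = "\u5927"
assert s == chr(0x5927)   # the single codepoint U+5927
assert len(s) == 1        # one character, regardless of encoding

# Whether it can be *output* is a separate question that depends on
# the port's encoding. Encoding explicitly always succeeds:
assert s.encode("utf-8") == b"\xe5\xa4\xa7"
```

If Gambit's `"\x5927"` doesn't denote U+5927, or if its stdout port defaults to a non-Unicode encoding, the error shown above tells us nothing about conformance.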

>>> Who cares?

>> Anybody who wants to claim that his compiler supports Unicode. It's a
>> licensing issue. Unicode is a trademark, and you can't claim that you
>> "support" Unicode unless you actually conform to the standard.

> So does Gambit support Unicode or is the consortium going after
> somebody for non-compliance?  

They might. I don't know what their enforcement policy is. I don't even
know for certain whether they have one (although that's usually how it
works when you trademark the name of the standard).

> While Gambit reads unicode files, I don't believe it does normalization.

I don't think normalization is required, but "reading Unicode files"
does demand that it recognize when graphemes are canonically equivalent.
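A concrete sketch of what canonical equivalence means, using Python's `unicodedata` module (again an illustration, not Gambit):

```python
import unicodedata

# "é" has two canonically equivalent codepoint sequences:
composed   = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 + U+0301 COMBINING ACUTE ACCENT

# They are distinct as codepoint sequences...
assert composed != decomposed
# ...but denote the same grapheme; normalization makes that visible:
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

An implementation that compares these two sequences as unequal identifiers, while claiming Unicode support, is missing exactly this.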

> It does allow kanji identifiers
> ([kanji] 5) => 120
> Does Gambit conform?

That isn't nearly enough information to judge. And I don't know what
point you're trying to make here, but you're being extremely rude about
it. C'mon, you just asked a completely ridiculous question. You can't
judge conformance to a large standard from a small example like this,
unless the example demonstrates obvious *non*conformance. Why are you
being so antagonistic?

>> Normalization is not difficult or expensive in a batch program like a
>> compiler. 

> Huh?  There are plenty of small Scheme interpreters out there.  The
> binary for TinyScheme is ~100KB.  

Interpreters *are* compilers. They just target a software VM instead of
a hardware machine. See EOPL.

> There are plenty of interactive compilers out there.

"Batch" was a bad choice of words, perhaps. Anyway, processing Unicode
isn't any more difficult or expensive in an interactive process.

>> In particular, if you're carrying around the data for "Is this a
>> letter or a number?" it's trivial to also provide the canonical
>> compositions and decompositions. I don't know where you got the idea
>> that it's expensive.
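The point is that character classification and canonical decomposition come from the same Unicode Character Database tables; Python's `unicodedata` shows both lookups side by side (an illustration only):

```python
import unicodedata

ch = "\u00e9"  # é
# "Is this a letter?" -- the general-category lookup:
assert unicodedata.category(ch) == "Ll"   # lowercase letter
# The canonical decomposition comes from the same database:
assert unicodedata.decomposition(ch) == "0065 0301"   # e + combining acute
```

If an implementation already ships the classification data, the decomposition data rides along at little extra cost.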

> I think it is the "if you're carrying around the data for" part that I
> am worried about.  Blocks are one thing, but I see that the UniHan.txt
> file is 25 MB and I am worried that large tables could double or
> triple the size of a small Scheme implementation.

On many systems, the Scheme implementation doesn't need to carry the
data around. It's part of the operating system interface. If it isn't,
and that's a problem, then *don't implement Unicode.* But don't make a
half-assed implementation and claim that you "support" it.

Look, if a terminal claimed to support ANSI X3.64, but it didn't honor
the clear-screen function, you'd call it a crappy, non-conforming
implementation, wouldn't you? It's exactly the same with a compiler that
claims to support Unicode but doesn't recognize when two encodings are
canonically equivalent. I don't know whether you're upset that I
pooh-poohed your idea or what, but you're being unreasonable and rude.
Please stop.
Bradd W. Szonye