Re: case mappings

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

Alex Shinn <alexshinn@xxxxxxxxx> writes:

> It would be nice to provide at least a place-holder for locales, but
> this does open another can of worms.  What is a locale?  In the
> implementation I provide for Chicken and Gauche it's just a string,
> but some schemes might want locale objects.  Furthermore, there's
> probably a (current-locale).  

I don't think this has any worms at all.

(current-locale) or current-locale is fine; I'm not sure which is

But we don't need to standardize it.  So don't!  Just provide some
guaranteed standard locale values if you want them.

> Given that, does
>   (string-ci=? s1 s2)
> mean the same thing as
>   (string-ci=? s1 s2 (current-locale))
> or the same as
>   (string-ci=? s1 s2 (independent-locale))

It should be current-locale by default, without any doubt whatsoever.
I said this earlier in the thread, but I think it got lost.

I do not envision simultaneously using different encodings inside one
character set.  My vision of a fancy-ass Unicode compliant Scheme
system would have it that "character" is a unicode character.

ASCII characters are not unicode characters; they would probably be
just integers or octets or what-have-you.  

The problem here is *precisely* that people are thinking "operating on
a series of octets" is the same basic thing as "operating on text".
That's C Programmer thinking. :)

An incoming email message is not a series of characters.  An ISO
Latin-1 encoded file is not a series of characters.  Both of these are
series of *octets*, strings of *bytes*.  And there is a mapping
necessary to turn them into a series of characters.  In the case of
the email message, the headers are supposed to be in ASCII, with some
embedded mappings allowed, and the body is in a mapping specified by
a tag in the headers.  

Reading such a message is *not* a matter of taking the octets, turning
them into characters with integer->char, and then operating on the
resulting "string".  No.  It's a matter of taking the octets, and
*interpreting* them, indeed, *translating* them into strings.  And the
strings you get at the end of that operation are *unicode* strings.
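
As a rough illustration of the difference, here is a minimal sketch in
Python, used only because its bytes/str split happens to mirror the
octets-versus-characters distinction. The sample octets and variable
names are assumptions made for the example, not anything proposed in
the thread.

```python
# Octets as they might arrive from an ISO Latin-1 encoded file
# (hypothetical sample data).
latin1_octets = bytes([0x63, 0x61, 0x66, 0xE9])

# Interpreting the octets under their declared encoding yields text.
text = latin1_octets.decode("latin-1")

# The same text, transmitted as UTF-8, is a different series of octets.
utf8_octets = text.encode("utf-8")

# The integer->char approach: turn each octet into a character directly.
naive = "".join(chr(octet) for octet in utf8_octets)

# The interpreting approach: decode according to the encoding in use.
decoded = utf8_octets.decode("utf-8")

print(len(naive), len(decoded))
```

The naive result has one "character" per octet, so it is longer than
the decoded text and its last two characters are not the letter that
was sent.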

That's the kind of system I think I want.  I certainly don't expect it
to be mandated, but I don't want it prohibited either.  It gets
prohibited the instant you start requiring operations on *characters*
which only make sense for this or that *encoding*.

I imagine a function (ascii->string ....) which takes an array of
octets and returns a string.  A string of *characters*, each of which
is a Unicode character.  There can be (latin-1->string ...) and
(latin-2->string ...) and so forth too.  
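
A minimal sketch of what such a family of functions might look like,
again in Python. The function names are hypothetical transliterations
of the Scheme names above; the codec names are those of Python's
standard library.

```python
def ascii_to_string(octets: bytes) -> str:
    # Strict: any octet above 127 is not ASCII and raises UnicodeDecodeError.
    return octets.decode("ascii")


def latin_1_to_string(octets: bytes) -> str:
    # ISO 8859-1: every octet maps to the code point with the same number.
    return octets.decode("latin-1")


def latin_2_to_string(octets: bytes) -> str:
    # ISO 8859-2: octets above 127 map to Central European letters.
    return octets.decode("iso-8859-2")


# One and the same octet is a different character under each mapping.
octet = bytes([0xB1])
print(latin_1_to_string(octet), latin_2_to_string(octet))
```

Each function takes octets and returns a string of Unicode characters,
so everything downstream can ignore which encoding the data arrived in.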

If you want special functions to operate on ascii, that's fine.  But
ASCII is an encoding, so ASCII-operating functions should operate only
on encodings.  If you want (ascii-upcase ...) which takes an *integer*
and returns another *integer* I don't object, though I will request
language making clear that this is for specialized uses, that doing
things like (integer->char (ascii-upcase (char->integer FOO))) is
almost certainly wrong, and that people should use string-upcase
instead.
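
To make the pitfall concrete, here is a small Python sketch. The names
ascii_upcase and naive_upcase are hypothetical stand-ins for the Scheme
expressions above, and Python's str.upper plays the role of
string-upcase.

```python
def ascii_upcase(octet: int) -> int:
    # Integer in, integer out: only the ASCII range a-z is shifted.
    if 0x61 <= octet <= 0x7A:
        return octet - 0x20
    return octet


def naive_upcase(s: str) -> str:
    # The (integer->char (ascii-upcase (char->integer c))) pattern,
    # applied to every character of a string.
    return "".join(chr(ascii_upcase(ord(c))) for c in s)


word = "stra\u00dfe"          # German "strasse" written with a sharp s
print(naive_upcase(word))     # the sharp s is left untouched
print(word.upper())           # string-level upcasing expands it to "SS"
```

The string-level operation can change the length of the text, which a
character-by-character mapping through an encoding cannot express.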

This I think makes the most sense.  It's the Right Thing for Unicode
if you really want to go whole hog and do it all.  And Scheme
standards should not be written in such a way that it's essentially
impossible.  The chief obstacle is the tedious writing of functions
that "everyone wants" but which preclude the use of the character type
to represent Unicode characters.