Re: character strings versus byte strings



    > From: tb@xxxxxxxxxx (Thomas Bushnell, BSG)

    > Tom Lord <lord@xxxxxxx> writes:

    > >     > Wrong.  A Scheme character should be a codepoint.  The representation
    > >     > of code points as sequences of bytes should be under the hood.

    > > Misleading.

    > > It isn't obvious that Scheme characters should be _Unicode_
    > > codepoints.  For (much) more inclusive definitions of "codepoint",
    > > that characters should be codepoints is tautologically true.

    > Fair enough, though I think Unicode is the best choice at present.  It
    > might be perfectly fine to leave that agnostic too.  (If you don't
    > want to specify even Unicode, then you certainly can't specify UTF-8!)

You slightly misunderstand.

First of all, I agree that encoding schemes have no relation to the
char type.   There should be nothing, say, UTF-8- or UTF-16-specific
about the char type.

Second of all: I agree that Unicode is the best choice.  I'd say it is
the only realistic choice.  I'd even say that it is a pleasant choice
since Unicode is basically very well designed (excuse me a second
while I duck the rotten tomatoes). 

The problem is that _given_unicode_, there is _still_ no definition of
"character" that simultaneously makes sense for both the Scheme CHAR?
type and from a Unicode perspective.  It's a dainty task, at best, to
avoid reflecting that bogosity in the FFI.
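
To make the mismatch concrete, here is a minimal sketch in Python (not
Scheme, and not part of any proposal in this thread; the variable and
function names are just illustrative).  It shows two code point
sequences that a reader would both call the single character
"e with acute accent":

```python
import unicodedata

# Two spellings of what a reader perceives as one "character".
precomposed = "\u00e9"   # one code point: U+00E9
decomposed = "e\u0301"   # two code points: U+0065 followed by combining U+0301


def code_points(s):
    """Return the code points of s as a list of integers."""
    return [ord(c) for c in s]


def same_text(a, b):
    """Compare two strings after NFC normalization (canonical equivalence)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

If CHAR? means "code point", the decomposed form is a two-element
string; if it means "what the user sees", it is one.  Neither answer
falls out of Unicode for free.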



    > > There's a serious problem regarding Scheme and Unicode in that, for
    > > any sane definition of "character" in Unicode, the character type in
    > > R5RS is not sanely isomorphic.

    > I think there is a problem in that the R5RS character functions are
    > simply too simplistic, most notably in the case-mapping functions.

Right.  CHAR? necessarily has to come out as a very low-level type.  A
high-level interface is going to wind up being all about strings,
where some strings are kind of "character-like" in some way or other.

One problem I see is that implementations with different purposes will
want to make the CHAR? type quite different from one another.   For
reasons I won't detail here just yet, I think that
ultimately Scheme's CHAR? and STRING? types are doomed and that we're
going to have to leave them underspecified and eventually unimportant
(in favor of a new TEXT? type).


    > Case-mapping is a locale-dependent task;

Yes and no.  There is a locale-independent definition for it that is
useful.

    > however difficult that may make the world, it's a fact of the
    > world.  

If I detach that sentence fragment from its context, I think it would
serve well as an informal axiom for any discussion regarding Unicode.

    > Many many many computer systems could get away with
    > ignoring the locale-dependency of case-mapping, but now they can
    > no longer plead ignorance.  (Though the problems are hardly
    > obscure; even German causes problems.)

(I think that, being a culturally unbiased person, you mean that
German causes one _unique_ problem regarding case mapping.)
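
A minimal sketch of that problem, again in Python and only as an
illustration: Python's str.upper() applies a locale-independent full
case mapping, so it is a convenient way to see what a char-to-char
upcase operation cannot express.  The helper name upcase is
hypothetical.

```python
SHARP_S = "\u00df"  # LATIN SMALL LETTER SHARP S


def upcase(s):
    """Locale-independent uppercase, as provided by Python's str.upper()."""
    return s.upper()


def round_trips(s):
    """True if lowercasing the uppercased string gives back s."""
    return upcase(s).lower() == s
```

Here upcase(SHARP_S) is the two-character string "SS", so the result
of upcasing one character is not a character, and the mapping does not
round-trip.  Locale-dependent cases (Turkish dotted and dotless i, for
instance) are a separate matter that this sketch does not cover.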

    > I would like to see Scheme DTRT, which means not creating a
    > foolish oversimplification.  We have finally gotten away from
    > oversimplifying numbers; it's time to stop oversimplifying
    > characters too.

Hear, hear, cheers, and happy holidays.  Now, to what extent do we want
the SRFI-50 process to become that battleground vs. to what extent do
we want it to step lightly around the issue :-)

    > We are stuck with R5RS at present, but we should at least not make
    > things worse.

!

    > I am happy to let others hash out the actual topic of this SRFI.  My
    > concern is that the SRFI not start constraining Scheme in a bad
    > way,

!!

    > and if you start saying things like "Scheme strings are UTF-8", I
    > start to get *really* nervous that someone is going to start making a
    > single codepoint take up multiple elements in a Scheme string.

!!!
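
For anyone who wants the worry spelled out, a small sketch in Python
(illustrative names only): the same text indexed by code point and by
UTF-8 byte gives different lengths and different elements.

```python
text = "\u03bb x"            # GREEK SMALL LETTER LAMDA, space, x
utf8 = text.encode("utf-8")  # the UTF-8 byte representation of the same text


def element_counts(s):
    """Return (number of code points, number of UTF-8 bytes) for s."""
    return len(s), len(s.encode("utf-8"))
```

Indexing text by code point, element 0 is the whole lambda; indexing
utf8, element 0 is the byte 0xCE, which is only half of it.  A string
type whose elements are UTF-8 bytes would expose the second view.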

-t