
Re: character strings versus byte strings

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.



    > From: tb@xxxxxxxxxx (Thomas Bushnell, BSG)

    > Tom Lord <lord@xxxxxxx> writes:

    > >     > Wrong.  A Scheme character should be a codepoint.  The representation
    > >     > of code points as sequences of bytes should be under the hood.

    > > Misleading.

    > > It isn't obvious that Scheme characters should be _Unicode_
    > > codepoints.  For (much) more inclusive definitions of "codepoint",
    > > that characters should be codepoints is tautologically true.

    > Fair enough, though I think Unicode is the best choice at present.  It
    > might be perfectly fine to leave that agnostic too.  (If you don't
    > want specify even Unicode, then you certainly can't specify UTF-8!)  

You slightly misunderstand.

First of all, I agree that encoding schemes have no relation to the
char type.   There should be nothing, say, UTF-8- or UTF-16-specific
about the char type.

Second of all: I agree that Unicode is the best choice.  I'd say it is
the only realistic choice.  I'd even say that it is a pleasant choice
since Unicode is basically very well designed (excuse me a second
while I duck the rotten tomatoes). 

The problem is that _given_unicode_, there is _still_ no definition of
"character" that simultaneously makes sense for both the Scheme CHAR?
type and from a Unicode perspective.  It's a dainty task, at best, to
avoid reflecting that bogosity in the FFI.
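
A minimal sketch of that mismatch, written in Python purely for illustration (the helper name code_points is made up for this example). It assumes a "character" is taken to be a single Unicode code point, which is only one of several possible readings. Under that reading, the same user-visible letter can be one element or two:

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: one letter to a reader,
# two code points to the machine.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

def code_points(s):
    """Hypothetical helper: list the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

print(len(decomposed), code_points(decomposed))  # 2 ['U+0065', 'U+0301']
print(len(composed), code_points(composed))      # 1 ['U+00E9']

# The two strings render alike but do not compare equal element by element.
print(decomposed == composed)                    # False
```

Whichever of the two forms a CHAR?-like type is defined over, the other form shows up as something that is not a single char.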



    > > There's a serious problem regarding Scheme and Unicode in that, for
    > > any sane definition of "character" in Unicode, the character type in
    > > R5RS is not sanely isomorphic.

    > I think there is a problem in that the R5RS character functions are
    > simply too simplistic, most notably in the case-mapping functions.

Right.  CHAR? necessarily has to come out as a very low-level type.  A
high-level interface is going to wind up being all about strings,
where some strings are kind of "character-like" in some way or other.

One problem I see is that implementations with different purposes will
want to make the CHAR? type quite different from one another.   For
reasons I won't go into in detail here, I think that
ultimately Scheme's CHAR? and STRING? types are doomed and that we're
going to have to leave them underspecified and eventually unimportant
(in favor of a new TEXT? type).


    > Case-mapping is a locale-dependent task;

Yes and no.  There is a locale-independent definition for it that is
useful.

    > however difficult that may make the world, it's a fact of the
    > world.  

If I detach that sentence fragment from its context, I think it would
serve well as an informal axiom for any discussion regarding Unicode.

    > Many many many computer systems could get away with
    > ignoring the locale-dependency of case-mapping, but now they can
    > no longer plead ignorance.  (Though the problems are hardly
    > obscure; even German causes problems.)

(I think that, being a culturally unbiased person, you mean that
German causes one _unique_ problem regarding case mapping.)
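
A small sketch of the German case, in Python for illustration only. It relies on Python's str.upper(), which applies the locale-independent Unicode case mappings. The function upcase_per_char is a hypothetical stand-in for a CHAR-UPCASE-style operation that must return exactly one character per input character:

```python
def upcase_per_char(s):
    """Hypothetical char-to-char upcase: each input character must map to
    exactly one output character, so multi-character mappings are dropped."""
    out = []
    for c in s:
        u = c.upper()
        # Keep the original character when its uppercase form is not
        # a single character (as happens for U+00DF, sharp s).
        out.append(u if len(u) == 1 else c)
    return "".join(out)

word = "stra\u00dfe"               # "straße"

full = word.upper()                # string-level, locale-independent mapping
per_char = upcase_per_char(word)   # character-level mapping

print(full, len(full))             # STRASSE 7
print(per_char, len(per_char))     # STRAßE 6
```

The string-level result is longer than its input, which is why a useful case mapping ends up being an operation on strings rather than on individual characters.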

    > I would like to see Scheme DTRT, which means not creating a
    > foolish oversimplification.  We have finally gotten away from
    > oversimplifying numbers; it's time to stop oversimplifying
    > characters too.

Hear, hear, cheers, and happy holidays.  Now, to what extent do we want
the SRFI-50 process to become that battleground vs. to what extent do
we want it to step lightly around the issue :-)

    > We are stuck with R5RS at present, but we should at least not make
    > things worse.

!

    > I am happy to let others hash out the actual topic of this SRFI.  My
    > concern is that the SRFI not start constraining Scheme in a bad
    > way,
!!

    > and if you start saying things like "Scheme strings are UTF-8", I
    > start to get *really* nervous that someone is going to start making a
    > single codepoint take up multiple elements in a Scheme string.

!!!
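
The worry in that last quoted paragraph can be sketched as follows, in Python for illustration only. Python's str is indexed by code point and its bytes by octet, so the pair stands in here for the two possible meanings of "string element":

```python
text = "\u03bbx"              # GREEK SMALL LETTER LAMDA followed by "x"
utf8 = text.encode("utf-8")   # the same text as UTF-8 octets

# Indexed by code point: two elements, and element 0 is the whole letter.
print(len(text), text[0] == "\u03bb")      # 2 True

# Indexed by octet: three elements, and the letter occupies elements 0 and 1.
print(len(utf8), [hex(b) for b in utf8])   # 3 ['0xce', '0xbb', '0x78']

# Element 0 of the octet view is only a fragment of a code point and
# cannot be decoded on its own.
try:
    utf8[:1].decode("utf-8")
    fragment_is_valid = True
except UnicodeDecodeError:
    fragment_is_valid = False
print(fragment_is_valid)                   # False
```

If string elements were UTF-8 octets, a STRING-REF-style accessor would hand back fragments like the first element above.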

-t