
Re: character strings versus byte strings

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.

To: tb@xxxxxxxxxx
Subject: Re: character strings versus byte strings
From: Tom Lord <lord@xxxxxxx>
Date: Mon, 22 Dec 2003 14:59:57 -0800 (PST)
Cc: mflatt@xxxxxxxxxxx, srfi-50@xxxxxxxxxxxxxxxxx
Delivered-to: srfi-50@xxxxxxxxxxxxxxxxx
In-reply-to: <87hdzsjxxg.fsf@xxxxxxxxxxxxxxxxx> (tb@xxxxxxxxxx)
References: <20031222141633.829B7828@xxxxxxxxxxxxxxxxx> <87vfo8k3ef.fsf@xxxxxxxxxxxxxxxxx> <200312222230.OAA06693@xxxxxxxxxxxxxxxxxxxxxxx> <87hdzsjxxg.fsf@xxxxxxxxxxxxxxxxx>

    > From: tb@xxxxxxxxxx (Thomas Bushnell, BSG)

    > Tom Lord <lord@xxxxxxx> writes:

    > >     > Wrong.  A Scheme character should be a codepoint.  The representation
    > >     > of code points as sequences of bytes should be under the hood.

    > > Misleading.

    > > It isn't obvious that Scheme characters should be _Unicode_
    > > codepoints.  For (much) more inclusive definitions of "codepoint",
    > > that characters should be codepoints is tautologically true.

    > Fair enough, though I think Unicode is the best choice at present.  It
    > might be perfectly fine to leave that agnostic too.  (If you don't
    > want to specify even Unicode, then you certainly can't specify UTF-8!)

You slightly misunderstand.

First of all, I agree that encoding schemes have no relation to the
char type.   There should be nothing, say, UTF-8- or UTF-16-specific
about the char type.

Second of all: I agree that Unicode is the best choice.  I'd say it is
the only realistic choice.  I'd even say that it is a pleasant choice
since Unicode is basically very well designed (excuse me a second
while I duck the rotten tomatoes). 

The problem is that _given_unicode_, there is _still_ no definition of
"character" that simultaneously makes sense for both the Scheme CHAR?
type and from a Unicode perspective.  It's a dainty task, at best, to
avoid reflecting that bogosity in the FFI.
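
As a rough illustration of that mismatch, here is a minimal Python sketch (Python is used only for demonstration and is not part of the original discussion). It assumes a CHAR?-like type that holds exactly one Unicode code point, which is how Python's `str` elements behave, and shows that one user-perceived character may be one or two such elements.

```python
import unicodedata

# Two spellings of the same user-perceived character "é".
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The code point counts differ although both display as one character.
precomposed_len = len(precomposed)
decomposed_len = len(decomposed)

# A code-point-by-code-point comparison says the strings differ...
equal_raw = precomposed == decomposed

# ...while canonical normalization (NFC) makes them identical.
equal_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

Under the one-code-point assumption, the decomposed form cannot be a single character value at all, which is one way of seeing why neither the code point nor the user-perceived character maps cleanly onto CHAR?.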

    > > There's a serious problem regarding Scheme and Unicode in that, for
    > > any sane definition of "character" in Unicode, the character type in
    > > R5RS is not sanely isomorphic.

    > I think there is a problem in that the R5RS character functions are
    > simply too simplistic, most notably in the case-mapping functions.

Right.  CHAR? necessarily has to come out as a very low-level type.  A
high-level interface is going to wind up being all about strings,
where some strings are kind of "character-like" in some way or other.

One problem I see is that implementations with different purposes will
want to make the CHAR? type quite different from one another.   For
reasons I'm not yet getting into detail about here, I think that
ultimately Scheme's CHAR? and STRING? types are doomed and that we're
going to have to leave them underspecified and eventually unimportant
(in favor of a new TEXT? type).

    > Case-mapping is a locale-dependent task;

Yes and no.  There is a locale-independent definition for it that is
useful.
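
One way to see a locale-independent definition at work, sketched in Python purely as an illustration: Python's `str.upper()` and `str.lower()` apply Unicode's default case mappings and do not consult the process locale. The Turkish pairing shown for contrast is written out by hand as an assumption of what a locale-tailored mapping would return; it is not produced by any library call here.

```python
# Default (locale-independent) Unicode case mappings.
upper_i = "i".upper()
lower_I = "I".lower()

# A Turkish-tailored mapping would instead pair "i" with
# U+0130 (capital I with dot above) and "I" with U+0131 (dotless i).
# These two values are hand-written for comparison only.
turkish_upper_i = "\u0130"
turkish_lower_I = "\u0131"

# The default mapping does not match the Turkish-tailored one.
matches_turkish = (upper_i == turkish_upper_i, lower_I == turkish_lower_I)
```

The default mapping is useful precisely because it gives the same answer everywhere, even though it is not the answer a Turkish reader would expect.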

    > however difficult that may make the world, it's a fact of the
    > world.  

If I detach that sentence fragment from its context, I think it would
serve well as an informal axiom for any discussion regarding Unicode.

    > Many many many computer systems could get away with
    > ignoring the locale-dependency of case-mapping, but now they can
    > no longer plead ignorance.  (Though the problems are hardly
    > obscure; even German causes problems.)

(I think that, being a culturally unbiased person, you mean that
German causes one _unique_ problem regarding case mapping.)
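
The German case is presumably the sharp s, whose uppercase form under the full Unicode case mapping is the two-letter string "SS". A short Python sketch (illustrative only; the variable names are made up for this example) shows why a character-to-character function in the style of R5RS CHAR-UPCASE cannot express that mapping:

```python
sharp_s = "\u00df"            # LATIN SMALL LETTER SHARP S

# Uppercasing one code point yields a string of two code points.
upper = sharp_s.upper()
upper_len = len(upper)

# The mapping does not round-trip: lowercasing gives "ss", not the sharp s.
round_trip = upper.lower()

# At the string level the conversion is unproblematic.
word_upper = "stra\u00dfe".upper()
```

The result has a different length from the input, so the operation is naturally a string-to-string one, and the round trip through uppercase loses the original spelling.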

    > I would like to see Scheme DTRT, which means not creating a
    > foolish oversimplification.  We have finally gotten away from
    > oversimplifying numbers; it's time to stop oversimplifying
    > characters too.

Hear, hear, cheers, and happy holidays.  Now, to what extent do we want
the SRFI-50 process to become that battleground vs. to what extent do
we want it to step lightly around the issue? :-)

    > We are stuck with R5RS at present, but we should at least not make
    > things worse.

!

    > I am happy to let others hash out the actual topic of this SRFI.  My
    > concern is that the SRFI not start constraining Scheme in a bad
    > way,

!!

    > and if you start saying things like "Scheme strings are UTF-8", I
    > start to get *really* nervous that someone is going to start making a
    > single codepoint take up multiple elements in a Scheme string.

!!!
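
The worry in the quoted paragraph can be made concrete with a small Python sketch (an illustration only, not a statement about any Scheme implementation). It contrasts a string indexed by code point with the same text indexed by UTF-8 byte, where one code point occupies more than one element.

```python
text = "na\u00efve"               # "naïve": five code points
encoded = text.encode("utf-8")    # the same text as UTF-8 bytes

codepoint_count = len(text)
byte_count = len(encoded)

# Indexing by code point yields the whole character.
third_char = text[2]

# Indexing the bytes yields only the lead byte of a two-byte sequence.
third_byte = encoded[2]

# Both bytes are needed to recover the code point.
recovered = encoded[2:4].decode("utf-8")
```

If string elements were UTF-8 bytes, the third element would be a fragment rather than a character, which is exactly the outcome the quoted text fears.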

-t

Follow-Ups:
- Re: character strings versus byte strings
  - From: Thomas Bushnell, BSG
- Re: character strings versus byte strings
  - From: Shiro Kawai
- Re: character strings versus byte strings
  - From: bear

References:
- character strings versus byte strings
  - From: Matthew Flatt
- Re: character strings versus byte strings
  - From: Thomas Bushnell, BSG
- Re: character strings versus byte strings
  - From: Tom Lord
- Re: character strings versus byte strings
  - From: Thomas Bushnell, BSG
