Re: character strings versus byte strings

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.

To: Tom Lord <lord@xxxxxxx>
Subject: Re: character strings versus byte strings
From: tb@xxxxxxxxxx (Thomas Bushnell, BSG)
Date: 22 Dec 2003 14:21:47 -0800
Cc: mflatt@xxxxxxxxxxx, srfi-50@xxxxxxxxxxxxxxxxx
Delivered-to: srfi-50@xxxxxxxxxxxxxxxxx
In-reply-to: <200312222230.OAA06693@xxxxxxxxxxxxxxxxxxxxxxx>
References: <20031222141633.829B7828@xxxxxxxxxxxxxxxxx> <87vfo8k3ef.fsf@xxxxxxxxxxxxxxxxx> <200312222230.OAA06693@xxxxxxxxxxxxxxxxxxxxxxx>
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3

Tom Lord <lord@xxxxxxx> writes:

>     > Wrong.  A Scheme character should be a codepoint.  The representation
>     > of code points as sequences of bytes should be under the hood.
> 
> Misleading.
> 
> It isn't obvious that Scheme characters should be _Unicode_
> codepoints.  For (much) more inclusive definitions of "codepoint",
> that characters should be codepoints is tautologically true.

Fair enough, though I think Unicode is the best choice at present.  It
might be perfectly fine to leave that agnostic too.  (If you don't
want to specify even Unicode, then you certainly can't specify UTF-8!)

> There's a serious problem regarding Scheme and Unicode in that, for
> any sane definition of "character" in Unicode, the character type in
> R5RS is not sanely isomorphic.

I think there is a problem in that the R5RS character functions are
simply too simplistic, most notably in the case-mapping functions.

Case-mapping is a locale-dependent task; however difficult that may
make the world, it's a fact of the world.  Many, many computer
systems once got away with ignoring the locale-dependency of
case-mapping, but they can no longer plead ignorance.  (And the
problems are hardly obscure; even German causes problems.)
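
To make the German case concrete, here is a small sketch.  It is in
Python rather than Scheme, purely for illustration, and the names
(upcase_per_char and so on) are made up for this example.  It relies
on Python's built-in str.upper, which applies Unicode's default,
locale-independent mappings; it is not a proposal for how Scheme
should do this.

```python
# Illustrative sketch only: Python's str.upper() uses Unicode's default
# (locale-independent) case mappings, not any particular locale's rules.

def upcase_per_char(s):
    """Upcase each code point on its own, one result per input character."""
    return [ch.upper() for ch in s]

german = "stra\u00dfe"            # "strasse" spelled with U+00DF (sharp s)
per_char = upcase_per_char(german)
whole = german.upper()

# U+00DF has no single-code-point uppercase in the default mapping,
# so a character-to-character function cannot express the result.
sharp_s_upper = "\u00df".upper()
length_changed = len(whole) != len(german)

# Turkish pairs "i" with U+0130 (capital I with dot above); the default
# mapping pairs it with plain "I", so the right answer depends on locale.
default_upper_i = "i".upper()

print(per_char, whole, length_changed, default_upper_i)
```

The point of the sketch is only that a procedure from one character
to one character, in the style of R5RS char-upcase, has nowhere to
put the second "S", and that no locale-free mapping can be right for
both German and Turkish text.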

I would like to see Scheme do the right thing, which means not
creating a foolish oversimplification.  We have finally gotten away
from oversimplifying numbers; it's time to stop oversimplifying
characters too.

We are stuck with R5RS at present, but we should at least not make
things worse.

Ok, off that soapbox:

I am happy to let others hash out the actual topic of this SRFI.  My
concern is that the SRFI not start constraining Scheme in a bad way,
and if you start saying things like "Scheme strings are UTF-8", I
start to get *really* nervous that someone is going to start making a
single codepoint take up multiple elements in a Scheme string.
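
A small sketch of the distinction I am worried about, again in Python
only for illustration (the variable names are invented for this
example).  It contrasts a string indexed by codepoint with the UTF-8
byte sequence that might represent it under the hood.

```python
# Illustrative sketch only: the same text viewed as a sequence of
# code points and as a sequence of UTF-8 bytes.

text = "na\u00efve \u03bb"        # "naive" with U+00EF, a space, then U+03BB

code_points = [ord(ch) for ch in text]   # one element per code point
utf8_bytes = text.encode("utf-8")        # one element per byte

# Indexing the character string yields a whole code point.
third_char = text[2]

# Indexing the byte string yields only the first byte of that code point's
# two-byte UTF-8 encoding (0xC3 0xAF).
third_byte = utf8_bytes[2]

print(len(code_points), len(utf8_bytes), hex(third_byte))
```

If string-ref and string-length were defined over the second view, a
single codepoint would occupy two elements of the string, which is
exactly the outcome I would like the SRFI not to invite.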

Thomas

Follow-Ups:
- Re: character strings versus byte strings
  - From: Tom Lord

References:
- character strings versus byte strings
  - From: Matthew Flatt
- Re: character strings versus byte strings
  - From: Thomas Bushnell, BSG
- Re: character strings versus byte strings
  - From: Tom Lord
