Re: character strings versus byte strings
> From: tb@xxxxxxxxxx (Thomas Bushnell, BSG)
> Tom Lord <lord@xxxxxxx> writes:
> > > Wrong. A Scheme character should be a codepoint. The representation
> > > of code points as sequences of bytes should be under the hood.
> > Misleading.
> > It isn't obvious that Scheme characters should be _Unicode_
> > codepoints. For (much) more inclusive definitions of "codepoint",
> > that characters should be codepoints is tautologically true.
> Fair enough, though I think Unicode is the best choice at present. It
> might be perfectly fine to leave that agnostic too. (If you don't
> want to specify even Unicode, then you certainly can't specify UTF-8!)
You slightly misunderstand.
First of all, I agree that encoding schemes have no relation to the
char type. There should be nothing, say, UTF-8- or UTF-16-specific
about the char type.
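To make that concrete, here is a rough sketch in Python (not Scheme,
since R5RS gives us no portable Unicode operations to lean on). The
helper name encoded_lengths is my own invention for illustration. It
only shows that one abstract code point maps to byte sequences of
different lengths depending on the encoding form, which is why none of
those forms belongs in the definition of the char type.

```python
def encoded_lengths(ch):
    """Return how many bytes each encoding form needs for one code point."""
    assert len(ch) == 1, "expects exactly one code point"
    return {name: len(ch.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}

# An ASCII letter, a Latin-1 letter, and a code point outside the BMP
# (U+1D11E MUSICAL SYMBOL G CLEF, a surrogate pair in UTF-16).
for ch in ("A", "\u00e9", "\U0001D11E"):
    print("U+%04X" % ord(ch), encoded_lengths(ch))
```

In each case the string holds exactly one element at the language
level; only the under-the-hood byte counts differ.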
Second of all: I agree that Unicode is the best choice. I'd say it is
the only realistic choice. I'd even say that it is a pleasant choice
since Unicode is basically very well designed (excuse me a second
while I duck the rotten tomatoes).
The problem is that _given_unicode_, there is _still_ no definition of
"character" that simultaneously makes sense for both the Scheme CHAR?
type and from a Unicode perspective. It's a dainty task, at best, to
avoid reflecting that bogosity in the FFI.
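One illustration of the mismatch, sketched in Python rather than
Scheme (the helper code_points is hypothetical, made up for this
example): the same user-visible letter can be one code point or two,
and the two spellings only compare equal after normalization. A CHAR?
type that holds a single code point cannot represent the second
spelling as one "character".

```python
import unicodedata

def code_points(s):
    """List the code points of a string in U+XXXX notation."""
    return ["U+%04X" % ord(c) for c in s]

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(code_points(precomposed))
print(code_points(decomposed))

# The two spellings differ as code point sequences...
print(precomposed == decomposed)
# ...but agree once both are brought to the same normalization form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```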
> > There's a serious problem regarding Scheme and Unicode in that, for
> > any sane definition of "character" in Unicode, the character type in
> > R5RS is not sanely isomorphic.
> I think there is a problem in that the R5RS character functions are
> simply too simplistic, most notably in the case-mapping functions.
Right. CHAR? necessarily has to come out as a very low-level type. A
high-level interface is going to wind up being all about strings,
where some strings are kind of "character-like" in some way or other.
One problem I see is that implementations with different purposes will
want to make the CHAR? type quite different from one another. For
reasons I'm not yet getting into detail about here, I think that
ultimately Scheme's CHAR? and STRING? types are doomed and that we're
going to have to leave them underspecified and eventually unimportant
(in favor of a new TEXT? type).
> Case-mapping is a locale-dependent task;
Yes and no. There is a locale-independent definition for it that is
> however difficult that may make the world, it's a fact of the
If I detach that sentence fragment from its context, I think it would
serve well as an informal axiom for any discussion regarding Unicode.
> Many many many computer systems could get away with
> ignoring the locale-dependency of case-mapping, but now they can
> no longer plead ignorance. (Though the problems are hardly
> obscure; even German causes problems.)
(I think that, being a culturally unbiased person, you mean that
German causes one _unique_ problem regarding case mapping.)
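For anyone who hasn't run into it, the German case looks like this.
The sketch is in Python, whose str.upper applies Unicode's
locale-independent full case mappings; upcase_report is a name I made
up for the example. The point is that upcasing is a string-to-string
operation that can change length, so a char-to-char CHAR-UPCASE can't
express it.

```python
def upcase_report(s):
    """Return the uppercased string plus lengths before and after."""
    up = s.upper()
    return up, len(s), len(up)

# LATIN SMALL LETTER SHARP S uppercases to two letters.
print(upcase_report("\u00df"))
print(upcase_report("stra\u00dfe"))

# The mapping does not round-trip: lowercasing "SS" gives "ss".
print("\u00df".upper().lower() == "\u00df")

# str.upper ignores locale, so "i" always becomes "I", even though
# Turkish rules would call for U+0130 (capital I with dot above).
print("i".upper())
```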
> I would like to see Scheme DTRT, which means not creating a
> foolish oversimplification. We have finally gotten away from
> oversimplifying numbers; it's time to stop oversimplifying
> characters too.
Hear, hear, cheers, and happy holidays. Now, to what extent do we want
the SRFI-50 process to become that battleground vs. to what extent do
we want it to step lightly around the issue? :-)
> We are stuck with R5RS at present, but we should at least not make
> things worse.
> I am happy to let others hash out the actual topic of this SRFI. My
> concern is that the SRFI not start constraining Scheme in a bad
> and if you start saying things like "Scheme strings are UTF-8", I
> start to get *really* nervous that someone is going to start making a
> single codepoint take up multiple elements in a Scheme string.