Re: character strings versus byte strings

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 are here. Eventually, the entire history will be moved there, including any new messages.

    > From: tb@xxxxxxxxxx (Thomas Bushnell, BSG)

    > Matthew Flatt <mflatt@xxxxxxxxxxx> writes:

    > >  * For Scheme characters, pick a specific encoding, probably one of
    > >    UTF-16, UTF-32, UCS-2, or UCS-4 (but I don't know which is the right
    > >    choice).

    > Wrong.  A Scheme character should be a codepoint.  The representation
    > of code points as sequences of bytes should be under the hood.


It isn't obvious that Scheme characters should be _Unicode_
codepoints.  Under (much) more inclusive definitions of "codepoint",
the claim that characters should be codepoints is tautologically true.

There's a serious problem regarding Scheme and Unicode: for any sane
definition of "character" in Unicode, the character type in R5RS is
not sanely isomorphic to it.

I think that the best way to handle that in an FFI is to try to remain
agnostic about the range of the Scheme CHAR? type when mapped into C.
I _guess_ that the error-signalling-on-range-error property of
SCHEME_EXTRACT_CHARACTER satisfies this, but it could certainly be
rounded out and made more useful.
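
To make the range-error idea concrete, here is a rough sketch in C.
It is only an illustration under stated assumptions: the names
demo_scheme_char, demo_status and demo_extract_char are hypothetical
and are not the draft's actual interface, and a Scheme character is
assumed to be reducible to some unsigned integer value.  The point is
that the C side commits only to the range of its own target type and
reports an error for anything outside it, rather than assuming a
particular range for CHAR?.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for the integer value of a Scheme character.
   Its width is an assumption made for this sketch only. */
typedef uint_least32_t demo_scheme_char;

/* Hypothetical status codes for the extraction. */
enum demo_status { DEMO_OK = 0, DEMO_RANGE_ERROR = 1 };

/* Try to extract a Scheme character into a plain C char.
   Succeeds only if the value fits in 0..CHAR_MAX; otherwise it
   signals a range error and leaves *out untouched. */
enum demo_status demo_extract_char(demo_scheme_char c, char *out)
{
    if (c > (demo_scheme_char)CHAR_MAX)
        return DEMO_RANGE_ERROR;
    *out = (char)c;
    return DEMO_OK;
}
```

A caller that wants a wider C type would use a sibling function with
a larger bound; nothing in this shape requires the Scheme side to be
Unicode, or to be any particular size.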