
Re: Why are byte ports "ports" as such?



Per Bothner <per@bothner.com> writes:

> [...]

The argument between you two here is not about whether characters
exist; it's about how they are represented.

One side is arguing for a character represented by a separate data
type.

The other side is arguing for a character represented by a string
of length 1.


The argument for the latter is that, in Unicode, a "character" (a
vague term, as John Cowan repeatedly pointed out) might very well
be a number of code points, so you need to store something like a
string anyway. This is the idea that a "character" is a grapheme
cluster. It's of course trivial to provide an API for information
about the first (nth) grapheme cluster in a string, which an
editor can use to provide Emacs' C-x = feature.
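
A rough sketch of the grapheme-cluster view, in Python (chosen only
for illustration; nothing in this thread uses it). It shows one
user-perceived character spelled as two code points, and a
deliberately naive helper, first_cluster (a hypothetical name), that
groups a base code point with the combining marks that follow it.
Real segmentation follows Unicode Standard Annex #29 and covers many
more cases, so treat this as an approximation only.

```python
import unicodedata

def first_cluster(s):
    """Return the first code point plus any combining marks after it.

    Naive approximation (an assumption of this sketch): real grapheme
    cluster rules (UAX #29) also cover Hangul jamo, emoji sequences,
    and more.
    """
    if not s:
        return ""
    end = 1
    # A nonzero canonical combining class marks a combining character.
    while end < len(s) and unicodedata.combining(s[end]) != 0:
        end += 1
    return s[:end]

# 'e' followed by COMBINING ACUTE ACCENT, then "tude".
decomposed = "e\u0301tude"
cluster = first_cluster(decomposed)

# The code points that make up the first "character".
code_points = [f"U+{ord(c):04X}" for c in cluster]
```

Here cluster is a two-code-point string, which is the kind of
information an editor command like C-x = would want to report.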

The argument for the former is that Unicode does specify a
smallest component, a code point, and so far, the smallest
component of a "character set" has been called "character". That
is, a "character" is a "code point". This can also be seen as
being a bit "cleaner", implementation-wise: A string consists of
characters. We have data types for both. Contrast this to "a
string consists of a number of substrings of length 1".
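
The two representations can be put side by side in a short sketch.
Python is used here only as an illustration, and it happens to take
the "string of length 1" side: indexing a string yields another
string, and the code point is available separately as an integer.
Scheme, with its distinct character type, sits on the other side.

```python
s = "\u03bbx"  # GREEK SMALL LETTER LAMDA followed by "x"

# "Character as a string of length 1": indexing a str yields a str.
first = s[0]
same_type = type(first) is str and len(first) == 1

# "Character as a code point": ord() exposes the underlying integer,
# and chr() maps that integer back to a one-element string.
cp = ord(first)
round_trip = chr(cp) == first
```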

Note that arguing for grapheme clusters as a separate data type is
also possible, but somewhat problematic because a cluster's size
varies so widely.


I don't think this argument would exist at all if the procedure
we're discussing here were called READ-CODEPOINT (and "strings
consist of code points"). That name says clearly what the
procedure does, and avoids the ambiguous term "character". Of
course, SRFI 75 (R6RS Unicode data) has the following to say about
this:

| Despite this complexity, most things that a literate human would
| call a ``character'' can be represented by a single code point
| in Unicode (though there may exist code-point sequences that
| represent that same character). For example, Roman letters,
| Cyrillic letters, Hebrew consonants, and most Chinese characters
| fall into this category. Thus, the ``code point'' approximation
| of ``character'' works well for many purposes. It is thus
| appropriate to define Scheme characters as Unicode scalar values

So it is entirely appropriate in this context to call the
procedure READ-CHARACTER. The debate on whether "character" is an
appropriate name for a code point should be held on the SRFI-75
mailing list, not here.

Regards,
        -- Jorgen

-- 
((email . "forcer@forcix.cx") (www . "http://www.forcix.cx/")
 (gpg   . "1024D/028AF63C")   (irc . "nick forcer on IRCnet"))