This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.
Per Bothner <firstname.lastname@example.org> writes:

> [...]

The argument between you two here is not about whether characters exist or not, it's about how they are represented. One side is arguing for a character represented by a separate data type. The other side is arguing for a character represented by a string of length 1.

The argument for the latter is that, in Unicode, a "character" (a vague term, as John Cowan repeatedly pointed out) might very well be a number of code points, so you need to store something like a string anyway. This is the idea that a "character" is a grapheme cluster. It's of course trivial to provide an API for information about the first (nth) grapheme cluster in a string, which an editor can use to provide Emacs' C-x = feature.

The argument for the former is that Unicode does specify a smallest component, a code point, and so far, the smallest component of a "character set" has been called "character". That is, a "character" is a "code point". This can also be seen as being a bit "cleaner", implementation-wise: A string consists of characters. We have data types for both. Contrast this to "a string consists of a number of substrings of length 1".

Note that arguing for grapheme clusters as a separate data type is also possible, but somewhat problematic due to their highly variable size.

I don't think this argument would exist at all if the procedure we're discussing here were called READ-CODEPOINT (and "strings consist of code points"). It's clear what it does, and it does not use the ambiguous term "character".

Of course, SRFI 75 (R6RS Unicode data) has the following to say about this:

| Despite this complexity, most things that a literate human would
| call a ``character'' can be represented by a single code point
| in Unicode (though there may exist code-point sequences that
| represent that same character). For example, Roman letters,
| Cyrillic letters, Hebrew consonants, and most Chinese characters
| fall into this category. Thus, the ``code point'' approximation
| of ``character'' works well for many purposes. It is thus
| appropriate to define Scheme characters as Unicode scalar values

So it is entirely appropriate in this context to call the procedure READ-CHARACTER. The debate on whether "character" is an appropriate name for a code point should be held on the SRFI-75 mailing list, not here.

Regards,
        -- Jorgen

-- 
((email . "email@example.com")
 (www . "http://www.forcix.cx/")
 (gpg . "1024D/028AF63C")
 (irc . "nick forcer on IRCnet"))
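
As an illustration of the distinction drawn in the message above, here is a minimal sketch in Python (not Scheme, and not part of SRFI 91 or SRFI 75). Python happens to take the "string of length 1" view of characters, while exposing code points as integers. The helper names `code_points` and `simple_clusters` are hypothetical, and the clustering rule (a base code point plus any following combining marks) is only a rough approximation of real grapheme cluster segmentation.

```python
import unicodedata


def code_points(s):
    """Return the code points of s as a list of integers."""
    return [ord(ch) for ch in s]


def simple_clusters(s):
    """Group each code point with the combining marks that follow it.

    Assumption: this is a simplified stand-in for grapheme clusters.
    It only looks at the canonical combining class and ignores the
    other rules a full segmentation algorithm would apply.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining mark: attach it to the preceding cluster.
            clusters[-1] += ch
        else:
            # Base code point: start a new cluster.
            clusters.append(ch)
    return clusters


# 'e' followed by COMBINING ACUTE ACCENT: one user-visible character,
# but two code points.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), code_points(decomposed))  # 2 [101, 769]
print(len(composed), code_points(composed))      # 1 [233]
print(len(simple_clusters(decomposed)))          # 1
```

Under these assumptions, a reader that returns one code point at a time would deliver the decomposed form in two steps, whereas a cluster-based reader would deliver it as a single variable-length unit, which is the size problem the message mentions.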