This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.
Jorgen Schaefer scripsit:

> The argument for the latter is that, in Unicode, a "character" (a
> vague term, as John Cowan repeatedly pointed out) might very well
> be a number of code points, so you need to store something like a
> string anyways. This is the idea that a "character" is a grapheme
> cluster. It's of course trivial to provide an API for information
> about the first (nth) grapheme cluster in a string, which an
> editor can use to provide Emacs' C-x = feature.

The trouble is, it's far from clear whether grapheme clusters have
much to do with what users see as characters (assuming that users do
have a uniform vision on this point, which is far from certain).
Consider the Devanagari sequence usually transliterated "kshe".  This
is four codepoints (KA, VIRAMA, SHA, E), two grapheme clusters
(KA-VIRAMA, SHA-E), conceptually either two letters (KSHA, E) or
three (KA, SHA, E), and is rendered as a single glyph.  How many
"characters" does it consist of?

> The argument for the former is that Unicode does specify a
> smallest component, a code point, and so far, the smallest
> component of a "character set" has been called "character". That
> is, a "character" is a "code point". This can also be seen as
> being a bit "cleaner", implementation-wise: A string consists of
> characters. We have data types for both. Contrast this to "a
> string consists of a number of substrings of length 1".

Languages (varying from Basic to Q) that take the no-characters
perspective don't think a string *consists* of anything, any more
than 15 *consists* of 3 x 5, though that is its unique prime
factorization.  In Q, indeed, Character is a subtype of String, one
that can be informally characterized as "strings with only one
codepoint".

> | Despite this complexity, most things that a literate human would
> | call a ``character'' can be represented by a single code point
> | in Unicode (though there may exist code-point sequences that
> | represent that same character).
> | For example, Roman letters,
> | Cyrillic letters, Hebrew consonants, and most Chinese characters
> | fall into this category. Thus, the ``code point'' approximation
> | of ``character'' works well for many purposes. It is thus
> | appropriate to define Scheme characters as Unicode scalar values

Well, those examples are well-chosen from the fairly simple cases.
They gloss over, for example, the fact that Hindi is conceptualized
by people who read and write it as an alphabet, whereas the
structurally parallel Tamil is conceptualized as a syllabary (so
KA + U is two letters in Hindi, one in Tamil -- and two codepoints
and one default grapheme cluster and one glyph in both cases).

Also, while it's clear that Hebrew consonants are within the Hebrew
reader's notion of characters, it's not so clear about Hebrew vowel
signs, which traditionally -- except when writing the Bible -- are
treated as optional assistants.

It's not Unicode's *encoding* that makes it complicated; it's the
*repertoire*, which is complicated because the world of writing is
complicated.

--
Some people open all the Windows;       John Cowan
wise wives welcome the spring           email@example.com
by moving the Unix.                     http://www.ccil.org/~cowan
  --ad for Unix Book Units (U.K.)
        (see http://cm.bell-labs.com/cm/cs/who/dmr/unix3image.gif)
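[Archive note: the code point and grapheme cluster counts discussed above (four codepoints and two default grapheme clusters for Devanagari "kshe"; two codepoints and one cluster for KA + U) can be checked mechanically. The sketch below is Python, purely illustrative; `approx_grapheme_clusters` is a hypothetical helper that attaches combining marks to the preceding code point -- a far cruder rule than real UAX #29 default grapheme cluster segmentation, though it agrees with it on these inputs.]

```python
import unicodedata

def approx_grapheme_clusters(s):
    # Crude approximation of default grapheme clusters: start a new
    # cluster at every code point that is not a combining mark
    # (general categories Mn, Mc, Me).  A real UAX #29 segmenter has
    # many more rules; this suffices for the examples in this thread.
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# "kshe": KA, VIRAMA, SHA, vowel sign E
kshe = "\u0915\u094D\u0937\u0947"
print(len(kshe))                             # 4 codepoints
print(len(approx_grapheme_clusters(kshe)))   # 2 clusters: KA-VIRAMA, SHA-E

# Devanagari KA + vowel sign U ("ku")
ku = "\u0915\u0941"
print(len(ku))                               # 2 codepoints
print(len(approx_grapheme_clusters(ku)))     # 1 cluster
```

Whether each of those clusters is one "character" is, of course, exactly the question the prose above leaves open.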