This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.
Jonathan S. Shapiro scripsit:

> The underlying issue within UNICODE is the existence of the so-called
> "combining characters". There exist characters that have no single
> defining codepoint. These exist primarily in Asian languages, for
> example in the form of multiple code points that together form a single
> "glyph".

In fact they are all over the place: you cannot write such a very European language as Lithuanian, which uses the Latin script, without employing them. (Well, you can write memos or to-do lists, but not poetry or dictionaries.)

However, whether a "default grapheme cluster" (the Unicode name for a base character together with its combining characters) is a "character" in the non-technical sense depends on the culture. Is an "o" with a dot-above accent and a macron accent a single "character"? Sure. How about a Hindi consonant letter with associated vowel mark? Not at all: one sense of "character" in Hindi covers consonants and vowels separately just as in Latin, another sense is "run of consonants up to and including the next vowel." What about Korean? Is a Hangul syllable one character or 2-3? Depends on the context: sometimes one, sometimes the other.

"Character" is not a technical term in Unicode because it can't be; it would have to match too many contradictory expectations. The Unicode Glossary, which is not normative, says:

    Character.
    (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader's understanding.
    (2) Synonym for abstract character [defined as "A unit of information used for the organization, control, or representation of textual data."].
    (3) The basic unit of encoding for the Unicode character encoding.
    (4) The English name for the ideographic written elements of Chinese origin. (See ideograph(2).)

There *are* technical terms in Unicode, like code unit, code point, default grapheme cluster, and so on. Which of these should be mapped to a given programming culture's pre-existing concept of "characters" is a question which Unicode by itself cannot answer. So far, C has gone for the 8-bit code unit interpretation, Java for the 16-bit code unit interpretation, and XML for the code point interpretation.

(The Glossary is at http://www.unicode.org/glossary/ .)

-- 
John Cowan  cowan@ccil.org  http://www.ccil.org/~cowan
Andrew Watt on Microsoft: Never in the field of human computing
has so much been paid by so many to so few! (pace Winston Churchill)
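
The difference between the technical terms named in the message (code unit, code point, default grapheme cluster) can be made concrete by counting the same text in each unit. The following is only a rough sketch in Python, which is not a language the message discusses. The function names are hypothetical, and `approx_clusters` is a deliberate simplification that only attaches combining marks to the preceding character; it is not the full default grapheme cluster algorithm (it would not, for example, group decomposed Hangul jamo into one syllable).

```python
import unicodedata


def utf8_code_units(s: str) -> int:
    """Count 8-bit code units (the interpretation attributed to C)."""
    return len(s.encode("utf-8"))


def utf16_code_units(s: str) -> int:
    """Count 16-bit code units (the interpretation attributed to Java)."""
    # Each UTF-16 code unit occupies two bytes; "-le" avoids adding a BOM.
    return len(s.encode("utf-16-le")) // 2


def code_points(s: str) -> int:
    """Count code points (the interpretation attributed to XML)."""
    # A Python 3 str is a sequence of code points.
    return len(s)


def approx_clusters(s: str) -> list[str]:
    """Approximate grapheme clusters: a character followed by its combining marks.

    Assumption: any character with a nonzero canonical combining class
    attaches to whatever precedes it. Real segmentation has more rules.
    """
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "o" with a dot-above accent and a macron accent, written with combining marks.
o_dot_macron = "o\u0307\u0304"
# A character outside the Basic Multilingual Plane (MUSICAL SYMBOL G CLEF).
g_clef = "\U0001D11E"

for label, text in (("o + dot above + macron", o_dot_macron), ("G clef", g_clef)):
    print(
        label,
        utf8_code_units(text),
        utf16_code_units(text),
        code_points(text),
        len(approx_clusters(text)),
    )
```

For the accented "o" the four counts are 5, 3, 3 and 1; for the G clef they are 4, 2, 1 and 1. The same "character" in the non-technical sense therefore has a different length under each interpretation.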