This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.
John Cowan <firstname.lastname@example.org> writes:

> Consider the Devanagari sequence usually transliterated "kshe". This
> is four codepoints (KA, VIRAMA, SHA, E), two grapheme clusters
> (KA-VIRAMA, SHA-E), conceptually either two letters (KSHA, E) or
> three (KA, SHA, E), and is rendered as a single glyph.
>
> How many "characters" does it consist of?

It seems to me that within a single language, we count graphemes as abstract characters. But:

> Well, those examples are well-chosen from the fairly simple cases.
> They gloss over, for example, the fact that Hindi is conceptualized
> by people who read and write it as an alphabet, whereas the structurally
> parallel Tamil is conceptualized as a syllabary (so KA + U is two
> letters in Hindi, one in Tamil -- and two codepoints and one default
> grapheme cluster and one glyph in both cases). Also, while it's clear
> that Hebrew consonants are within the Hebrew reader's notion of characters,
> it's not so clear about Hebrew vowel signs, which traditionally --
> except when writing the Bible -- are treated as optional assistants.

Again, it seems to me that Hebrew is easy; the vowel signs are abstract characters, because we can (I hope) say that in a monolingual context a grapheme is an abstract character.

But alas, your example of Hindi and Tamil causes a difficulty. I would take an abstract character as an interlingual version of a grapheme, but making that work is not so simple.

Can it be the case that the same code points, standing for equivalent glyphs, are one grapheme in one language and multiple graphemes in another? I can't quite tell from your example. It seems to me that in the case you identify, both Hindi and Tamil agree that KA and U are distinct graphemes. The difference is that "move one character" in a Hindi editor will do something different than in a Tamil editor. Have I understood rightly? If this is the case, then there is certainly a difficulty with what I'm trying to sell.

One difficulty is the potential case that different languages parse the same code points into different numbers of graphemes. I don't think this ever happens, but I don't know for sure. If it happens, then a reduction of abstract characters to graphemes is impossible.

Another difficulty is the case that reducing abstract characters to graphemes makes sense in some linguistic contexts and not others. It would make dubious sense in Hebrew (if there are vowel points), and from your explanation it sounds like it would make no sense in Tamil. It would also make dubious sense in Korean, since my understanding is that Koreans expect "move one character" to advance a full hangul.

The first difficulty is unsurmountable; if it's a real case (but I don't think it is), then my argument collapses completely. The second difficulty is not unsurmountable; it produces a perfectly workable specification of abstract character, but one which does not match cleanly to what users in different languages think of as characters, with the result that the resulting concept is not of significant value in editing and other such things. I'm not sure what I think about this case.

What seems more and more clear to me is that since "character" is a loose and fuzzy concept, we are really better off *not using it at all*.

Thomas
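
For readers who want to check the counts quoted above, here is a minimal Python sketch. It is not part of the original exchange, and it makes some assumptions: the helper `approximate_clusters` is hypothetical, it simply attaches each combining mark to the preceding code point rather than implementing the full Unicode segmentation rules, and the letter the message calls SHA is assumed to be U+0937 (named SSA in Unicode). A complete grapheme cluster implementation may segment these strings differently, depending on the Unicode version it follows.

```python
import unicodedata


def approximate_clusters(text):
    """Group each code point with the combining marks that follow it.

    A deliberate simplification (an assumption of this sketch): any code
    point whose general category starts with "M" (Mn, Mc, Me) is attached
    to the previous cluster. This is not the full Unicode algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Devanagari "kshe": KA, VIRAMA, SSA (the message's "SHA"), VOWEL SIGN E
kshe = "\u0915\u094D\u0937\u0947"
# KA + vowel sign U in Devanagari (Hindi) and in Tamil
hindi_ku = "\u0915\u0941"
tamil_ku = "\u0B95\u0BC1"

for label, s in [("kshe", kshe), ("Hindi KA+U", hindi_ku), ("Tamil KA+U", tamil_ku)]:
    names = [unicodedata.name(ch) for ch in s]
    print(label, "code points:", len(s), "clusters:", len(approximate_clusters(s)))
    print("   ", names)
```

Under these assumptions the sketch reports four code points and two clusters for "kshe", and two code points and one cluster for KA + U in both scripts. The code has no way to say whether a Hindi or Tamil reader sees one letter or two, which is the gap the message is pointing at.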