This page is part of the web mail archives of SRFI 91 from before July 7th, 2015. The new archives for SRFI 91 contain all messages, not just those from before July 7th, 2015.
Thomas Bushnell BSG scripsit:

> Again, it seems to me that Hebrew is easy; the vowel signs are
> abstract characters, because we can (I hope) say that in a monolingual
> context a grapheme is an abstract character.

Why is yod-with-hiriq (which is a yod character with a dot below to
make it clear that it's a vowel not a consonant, the same dot used in
fully vocalized text to indicate the vowel "i") different in this
respect from i-with-acute, which you presumably think of as a single
abstract character and grapheme?  Both are written separately from the
base character, both merge with it conceptually.

> But alas, your example of Hindi and Tamil causes a difficulty.  I
> would take an abstract character as an interlingual version of a
> grapheme, but making that work is not so simple.  Can it be the case
> that the same code points, standing for equivalent glyphs, are one
> grapheme in one language and multiple graphemes in another?

I think so, but this is not the evidence for it.  Welsh "ch" is a
separate letter of the alphabet from "c" + "h", and the same is true
of Croatian "lj", as is shown by the three compound characters "LJ",
"Lj", and "lj" present in Unicode.

Hindi uses one script and Tamil another, so this is a matter of more
abstract equivalence between different Indic scripts and doesn't really
reflect my point well.

> I can't quite tell from your example.  It seems to me that in the case
> you identify, both Hindi and Tamil agree that KA and U are distinct
> graphemes.  The difference is that "move one character" in a Hindi
> editor will do something different than in a Tamil editor.  Have I
> understood rightly?

Probably, but that can be handled in your terms by making Devanagari
KA+U two abstract characters whereas Tamil KA+U is just one.  Since
they are distinct Unicode codepoints, that can be arranged.

> One difficulty is the potential case that different languages parse
> the same code points into different numbers of graphemes.  I don't
> think this ever happens, but I don't know for sure.  If it happens,
> then a reduction of abstract characters to graphemes is impossible.

Not impossible, just locale-dependent.

Another problem case is Vietnamese, where many Latin vowels have two
accents, one for vowel quality, one for tone.  The vowel-quality accents
are "horn" and circumflex and are thought of as part of the vowel; the
five tone accents are acute, grave, hook-above, tilde, and dot-below,
and are thought of separately.

This is of course quite different from the use of grave and acute in
European languages, though even there there are discrepancies: French
has just one "e" in its alphabet, though it can come with acute and
grave and circumflex decorations; Icelandic treats a-with-acute as
completely separate from a.  Same story with o-umlaut in German and
Swedish.

> It would also make dubious sense in Korean, since my understanding is
> that Koreans expect "move one character" to advance a full hangul.

Typically backspace in a Korean IM removes the last jamo if the current
syllable is incomplete, or the whole syllable if it is complete.  There
are ambiguous cases.

> What seems more and more clear to me is that since "character" is a
> loose and fuzzy concept, we are really better off *not using it at
> all*.

Well, then we are stuck with these R5RS datatype and procedures.
What's to become of them?

-- 
In my last lifetime,                    John Cowan
I believed in reincarnation;            http://www.ccil.org/~cowan
in this lifetime,                       email@example.com
I don't.  --Thiagi