This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
I have made my feelings on the matter of "what is a character" clear in several different discussion threads on Unicode and characters that have taken place on the SRFI lists. Now I see it coming up again. For the sake of posterity and the standardization process, I'll reiterate the outline once more, but I won't go into exhaustive detail about this again.

I feel that "Unicode default grapheme clusters" (DCGs) map more closely to what users call "characters" than codepoints do. In the interest of keeping the abstractions used by the programmer as close as possible to the abstractions used by ordinary users, I therefore support defining Scheme characters as DCGs. Consideration of other points only reinforces this opinion, because this approach has several advantages beyond letting users and programmers communicate clearly, and without mistakes, about what the other means.

The first technical advantage is that if the units are DCGs, then ordinary string operations that treat characters as atomic leave DCGs unseparated. That is, when I take substrings at arbitrary character indexes and append them to create a new string, I am in no danger of having a substring that begins with a combining codepoint which, when appended to another substring, may create a DCG that did not exist in either string. Nor is there danger of separating a combining codepoint from the end of the substring, resulting in a "substring" that ends with a DCG that did not exist in the original string. Considering DCGs as characters naturally gives string operations such as "substring" and "append" the Unicode-independent semantics I consider appropriate.

Another technical advantage is that adding an accent or other combining codepoint to a character is semantically different from creating a string of two characters, as it should be.

A third technical advantage is that, with the sole exceptions of eszett and the deprecated ligature characters, changes in case do not change string length.
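
The substring hazard and the eszett exception described above can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: a "cluster" here is one base codepoint plus any combining marks that follow it, which is much cruder than the full Unicode segmentation rules. The helper name `clusters` is hypothetical and is not part of any SRFI or Scheme proposal.

```python
import unicodedata

def clusters(text):
    """Split text into approximate grapheme clusters.

    Simplifying assumption of this sketch: a cluster is one base
    codepoint plus any following combining marks (general category M*).
    The full Unicode rules also cover Hangul syllables, joiner
    sequences, and more.
    """
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch          # attach the mark to the preceding base
        else:
            out.append(ch)
    return out

word = "re\u0301sume\u0301"        # "resume" with two combining acute accents

# Codepoint indexing: the cut falls between 'e' and its accent.
head, tail = word[:2], word[2:]
assert tail[0] == "\u0301"         # substring begins with a bare combining mark
glued = "o" + tail                 # appending yields an accented 'o' never present in `word`
assert clusters(glued)[0] == "o\u0301"

# Cluster indexing: the accent stays with its base.
units = clusters(word)
assert len(word) == 8 and len(units) == 6
assert "".join(units[:2]) == "re\u0301"

# Casing: eszett is the exception that changes length either way.
assert len(clusters("stra\u00dfe")) == 6
assert len(clusters("stra\u00dfe".upper())) == 7   # "STRASSE"
```

With codepoint units, the slice produces a dangling accent that attaches to whatever precedes it after an append; with cluster units, the same slice boundaries cannot fall inside a character.
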
Furthermore, by use of the "Ligating joiner" character to form altercase ligatures, even the deprecated ligature characters can be converted in case with preservation of string length. This means that more than 99% of the world never has to deal with the possibility that a string will change length on casing operations, which helps to minimize how often this source of errors occurs.

A fourth technical advantage is that it's "future proof." There is still dispute about Unicode's appropriateness, particularly for Asian scripts, and it is reasonable to presume that Unicode is no more the Last Encoding Ever than ASCII was. Unicode has several disadvantages, such as the use of elephantine tables for simple operations and the interspersal of dissimilar character types throughout the codespaces. Indeed, it appears to be accumulated rather than designed, which is the mark of a "second system" standard that eventually gets overturned by something more deeply consistent. There is still good reason to use encoding systems that are not Unicode in many places, there are still millions of Asian characters (mostly proper names for places and things) that Unicode cannot and will not represent, and the use of other encodings besides Unicode is inevitable.

I do not want the semantics of the programming language tied to the idiosyncrasies of Unicode's particular encoding and representation. The character-as-grapheme-cluster is more nearly an abstraction of "character," the concept that people actually use, than an abstraction of the means we use to represent characters. In other words, it supports a concept of "character" that is vastly more portable among different encodings and vastly more amenable to the kind of string handling that people working in languages not well served by Unicode will inevitably do anyway.

The fifth technical advantage is where the burden of implementation lies. If all the grapheme-cluster handling is part of the language, the implementor has to do it once.
If all the language supports is codepoints, then application programmers have to do it dozens or hundreds of times. And every line of code is first an opportunity to make a mistake, second a duplication of effort, and third a source of code-level incompatibilities when some routines use codepoint strings and other routines assume DCG strings.

Scheme already has a history of abstracting objects of non-uniform length; Scheme code does not, for example, have to care whether a particular integer is a bignum or not. I cannot think of a good reason to back away from this approach when dealing with characters.

Anyway, if you want to look at other opinions from me about Unicode, just check the SRFI archives. Whatever objections you want to raise, I've probably answered them several times already, and I'm just not going to go there again. This message is a summary and also a notice that the topic has already been thrashed out in other threads.

It may come down to the simple fact that we disagree about what is valuable in character handling routines. That's okay. We can disagree.

Bear