This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 contain all messages, not just those from before July 7th, 2015.
First, a few notes on terminology.

"Letter," by all standard definitions and consistent with Unicode usage, specifically refers to an element of an alphabet. It therefore would not apply to syllabic or ideographic characters.

"Ideograph" applied to all Han characters is technically incorrect. Linguists prefer the term "sinogram," which refers to Chinese-derived characters. "Sinogram" fits all uses being applied to the term "ideograph" in these discussions (at least until Unicode adds hieroglyphs). Since the usage of "ideograph" is fairly ubiquitous, however, it may not be worth fighting it.

The character property suggested by char-letter? as the union of alphabetic, syllabic and ideographic characters seems roughly equivalent to the natural-language (non-computer-encoded) notion of "character." It should probably be named something like char-linguistic-character?. This is vague and will almost certainly be handled by lookup tables of Unicode data - I don't think we need this for basic Scheme text processing.

The concept of case is orthogonal to being alphabetic. There are alphabetic characters with no case, and (Unicode-classified) symbols which are given case mappings, such as Circled-A (U+24B6).

Defining anything in terms of character-level case procedures seems like a bad idea, since any individual character can map to 0-3 characters (German eszett, although the most famous, is not by any means the only exception here). However, because Scheme itself and many formats and protocols make use of basic ASCII case operations, it is worthwhile to include these in the Scheme core. A possible way to break these up is:

  char-*   => core Scheme character case-mapping (ASCII-only)
  string-* => SRFI-13 string case-mapping (ASCII-only)
  text-*   => SRFI-XX full linguistic string case-mapping w/ locale

For Schemes that wish to provide a full linguistic folding of identifiers, you definitely want some sort of locale-neutral folding. I posted the general possible combinations on c.l.s. earlier.
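
To make the case-mapping points above concrete, here is a minimal sketch. It is written in Python rather than Scheme purely for illustration, and the helper names ascii_upcase and ascii_downcase are hypothetical, not part of any SRFI or proposal. It contrasts an ASCII-only character mapping, which always returns exactly one character, with Python's built-in full Unicode mappings, which can expand a single character into several.

```python
import unicodedata


def ascii_upcase(ch: str) -> str:
    """ASCII-only upcase (hypothetical helper): one character in, one out."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 32)
    return ch  # everything outside a-z is left untouched


def ascii_downcase(ch: str) -> str:
    """ASCII-only downcase (hypothetical helper): one character in, one out."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 32)
    return ch


# Full Unicode case mapping can turn one character into two or three.
eszett_upper = "\u00df".upper()   # "SS"  (German eszett)
ffi_upper = "\ufb03".upper()      # "FFI" (LATIN SMALL LIGATURE FFI)

# Case is orthogonal to being alphabetic:
circled_a = "\u24b6"                                    # CIRCLED LATIN CAPITAL LETTER A
circled_a_category = unicodedata.category(circled_a)    # "So", a symbol, not a letter
circled_a_lower = circled_a.lower()                     # "\u24d0", yet it has a case mapping

alef = "\u05d0"                                         # HEBREW LETTER ALEF, a caseless letter
alef_has_case = alef.upper() != alef or alef.lower() != alef  # False

# Locale-neutral full case folding, as provided by Python's str.casefold().
folded = "Stra\u00dfe".casefold()                       # "strasse"
```

The ASCII helpers fit a character-to-character signature of the kind char-upcase has; the full mappings only make sense on strings, which is roughly the char-* / string-* / text-* split suggested above.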

Unicode does define locale-neutral case-foldings which are a subset of the combinations I posted - they break it down into whether or not you unify Turkish i (and ignore other accent marks), and whether or not to allow folding to more than one character (as an optimization). The "one-character" folding seems fairly arbitrary and undesirable if you're going the whole hog anyway.

Regardless of the folding, I like the string->symbol-name idea.

Core Unicode character properties can be provided as SRFI-14 char-sets. Additional properties may be better provided as introspection on the UCD (Unicode Character Database).

The question remains how to handle R5RS character predicates related to these values:

  * char-alphabetic? char
  * char-numeric? char
  * char-whitespace? char   ; rename to char-white-space? please!
  * char-upper-case? char   ; rename to char-uppercase? please!
  * char-lower-case? char   ; rename to char-lowercase? please!

As mentioned, it can be useful to have these functioning on pure ASCII for use in parsers and tools for common protocols. Moreover, the Unicode equivalents are often very expensive (if not in time then in space). Should a Scheme that wants to provide the full Unicode equivalents of these extend the core procedures, or should we define disjoint procedures such as

  * char-unicode-alphabetic?

--
Alex
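
As an illustration of the question raised in the message above, the following sketch shows how ASCII-only predicates and Unicode-aware ones disagree on the same characters. It is in Python rather than Scheme, the ascii_* names are hypothetical, and Python's str.isalpha, str.isdigit and str.isspace are used only as stand-ins for Unicode-aware predicates (str.isalpha tests the general category, which approximates but is not identical to the Unicode Alphabetic property).

```python
def ascii_alphabetic(ch: str) -> bool:
    """ASCII-only analogue of char-alphabetic? (hypothetical helper)."""
    return "a" <= ch <= "z" or "A" <= ch <= "Z"


def ascii_numeric(ch: str) -> bool:
    """ASCII-only analogue of char-numeric? (hypothetical helper)."""
    return "0" <= ch <= "9"


def ascii_whitespace(ch: str) -> bool:
    """ASCII-only analogue of char-whitespace? (hypothetical helper)."""
    return ch in " \t\n\r\f\v"


# Characters on which the ASCII and Unicode-aware answers differ.
e_acute = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE
nbsp = "\u00a0"           # NO-BREAK SPACE

comparison = {
    "alphabetic": (ascii_alphabetic(e_acute), e_acute.isalpha()),
    "numeric": (ascii_numeric(arabic_three), arabic_three.isdigit()),
    "whitespace": (ascii_whitespace(nbsp), nbsp.isspace()),
}
# Each pair is (ASCII-only answer, Unicode-aware answer): (False, True) in all three cases.
```

The ASCII versions need only a few comparisons, while the Unicode-aware versions depend on property tables, which is the time and space cost the message refers to.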