This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 contain all messages, not just those from before July 7th, 2015.
On Tue, 10 Feb 2004, Alex Shinn wrote:

>First, a few notes on terminology.
>
>  "Letter," by all standard definitions and consistent with Unicode
>  usage, specifically refers to an element of an alphabet.  It therefore
>  would not apply to syllabic or ideographic characters.
>
>  "Ideograph" applied to all Han characters is technically incorrect.
>  Linguists prefer the term "sinogram" which refers to Chinese-derived
>  characters.  "Sinogram" fits all uses being applied to the term
>  "ideograph" in these discussions (at least until Unicode adds
>  hieroglyphs).  Since the usage of ideograph is fairly ubiquitous,
>  however, it may not be worth fighting it.

Hm, okay.  Duly noted, I will try to stop misusing the term.

>The concept of case is orthogonal to being alphabetic.  There are
>alphabetic characters with no case, and (Unicode-classified) symbols
>which are given case mappings such as Circled-A (U+24B6).

Argh.  Yes.  Thank you.

>Defining anything in terms of character level case procedures seems like
>a bad idea, since any individual character can map to 0-3 characters
>(German eszett, although the most famous, is not by any means the only
>exception here).

Characters which casemap to characters outside the single-codepoint
character set are not a problem for me since my characters aren't
limited to a single codepoint.  I'm mostly here trying to avoid
getting the Right Thing defined out of existence in favor of kluges
and hacks designed to accommodate the shortcomings of single-codepoint
character sets.

Eszett is unique (and the only case where I share this problem with
schemes having only a single-codepoint character set) in that it
case maps not just to a multi-codepoint character, but to multiple
separate characters!

>  For Schemes that wish to provide a full linguistic folding of
>  identifiers, you definitely want some sort of locale-neutral
>  folding.  I posted the general possible combinations on
>  c.l.s. earlier.  Unicode does define locale-neutral case-foldings
>  which are a subset of those combinations - they break it down into
>  whether or not you unify Turkish i (and ignore other accent marks),
>  and whether or not to allow folding to more than one character (as
>  an optimization).  The "one-character" folding seems fairly
>  arbitrary and undesirable if you're going the whole hog anyway.

True.  The fundamental relationship that must hold seems to be that
two symbols foo and bar will be read as the same identifier if and
only if:

   (string=? (symbol->string foo) (symbol->string bar)) => #t

Looks so simple, doesn't it?  It turns out we've got a lot more going
on in terms of dependent properties and definitions.

>Core Unicode character properties can be provided as SRFI-14 char-sets.

Agreed.  I'd recommend adding one char-set to the list, charset:1code,
the set of characters that can be represented as single Unicode
codepoints.

>(Unicode Character Database).  The question remains how to handle R5RS
>character predicates related to these values:
>  * char-alphabetic? char
>  * char-numeric? char
>  * char-whitespace? char ; rename to char-white-space? please!
>  * char-upper-case? char ; rename to char-uppercase? please!
>  * char-lower-case? char ; rename to char-lowercase? please!

:-)  So that bugs you, too, huh?  I agree, those predicates ought to
be renamed.

I also agree that there's some question about how to handle the
predicates.  I think the correct response is to simply drop the
requirement of char-alphabetic having anything to do with the case
predicates.  Bear