This page is part of the web mail archives of SRFI 52 from before July 7th, 2015. The new archives for SRFI 52 contain all messages, not just those from before July 7th, 2015.
<previously posted to srfi-50. See discussion there.>

On Mon, 9 Feb 2004, Tom Lord wrote:

> > From: bear <bear@xxxxxxxxx>
>
> > The result of case-mapping via char-ci=? only on cased characters is
> > that distinct identifiers written using these characters remain
> > distinct no matter what the preferred case of the implementation
> > is. That's the desirable, crucial property that I was trying to
> > capture with the distinction between cased and uncased characters.
>
> I don't see why that property is crucial.
>
> Here's where I am on these things: please have a look at the
> "References" section of the "Unicode Identifiers" draft. The
> consortium has made recommendations for case-insensitive programming
> languages. I think we should follow those and I don't think that
> they're consistent with what you are advocating.

They are, very slightly, a superset of the Unicode consortium's
recommendations. A scheme implementing that property will be
consistent with Unicode's recommendations, but Unicode's
recommendations are not entirely adequate to describe that property.

I believe that the additional restrictions become necessary because
the consortium's recommendations are not entirely adequate for the
needs of programming languages in which identifiers can be manipulated
as strings or result from calculation on strings, and where operations
such as string-ci=? are expected to be able to detect identifiers
which are "the same" identifier. Scheme is such a language.

> > I chose the properties of characters I called "cased" and "uncased"
> > carefully; the distinctions they make are necessary and sufficient to
> > allow implementations to detect which characters can safely be
> > regarded as cased characters in the normal sense,
>
> I assume that you mean (Scheme) programs, not implementations.
>
> Programs can already detect which characters are naively cased in the
> sense of your terms. That you are able to define your CHAR-CASED in
> a few lines of R5RS illustrates that.
While the definitions are implementable in a few lines of R5RS, the
point is that developers realizing they need such a predicate are
likely to implement it as you implemented your version of
char-alphabetic? - without realizing that the "simplest" definition
does not in fact describe characters having a one-to-one
correspondence between lowercase and uppercase characters and
therefore is not sufficient to preserve the portability of their
programs from harm. It is better to provide this definition in a
standard and nail its meaning down rather than allowing its necessity
to drive the creation of many incompatible and/or buggy versions.

> consider me to have written:
>
>     For example, a Unicode STRING->SYMBOL _may_ wish to not
>     canonicalize .....
>
> and my point stands.
>
> You'll want to take up this issue separately, in response to "Scheme
> Characters as (Extended) Unicode Codepoints", I think.
>
> > IOW, because Macron and Cedilla are in different combining
> > classes, the sequences A, Macron, Cedilla and A, Cedilla, Macron
> > ought to be regarded as equal in a string comparison.
>
> Not by STRING=? in a Scheme in which the strings are regarded as
> codepoint sequences, since STRING=? is the equivalence relation
> induced by CHAR=?.

no...

    (Char=? #\A:Macron:Cedilla #\A:Cedilla:Macron) => #t

    (= (char->integer #\A:Macron:Cedilla)
       (char->integer #\A:Cedilla:Macron)) => #t

    (String=? "arf\(U+41:Macron:Cedilla)arf"
              "arf\(U+41:Cedilla:Macron)arf") => #t

string=? is in fact the equivalence relation induced by char=?.

You are, I expect, running into a problem I don't experience because
you prefer a representation that requires individual combining
codepoints to occupy separate, distinguishable locations in a string,
and as a result you are setting up a situation in which
autocanonicalization cannot be done transparently.
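For readers who want to check the combining-class point concretely,
here is a minimal sketch. It is written in Python rather than Scheme
only because Python's standard unicodedata module makes it runnable as
is; the helper name canonical_equal is hypothetical and is not part of
either proposal. It shows that the two codepoint sequences differ when
compared codepoint by codepoint but agree once both are put into a
Unicode normalization form.

```python
import unicodedata

MACRON = "\u0304"   # COMBINING MACRON, canonical combining class 230
CEDILLA = "\u0327"  # COMBINING CEDILLA, canonical combining class 202

# The same base letter with the two marks in opposite orders.
s1 = "arf" + "A" + MACRON + CEDILLA + "arf"
s2 = "arf" + "A" + CEDILLA + MACRON + "arf"


def canonical_equal(x, y):
    """Compare two strings after canonical decomposition and reordering (NFD).

    Hypothetical helper: one way to get canonical equivalence on top of
    a codepoint-sequence string representation.
    """
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)


print(s1 == s2)                 # False: the codepoint sequences differ
print(canonical_equal(s1, s2))  # True: canonically equivalent
```

A representation that canonicalizes on construction would make the
first comparison true as well; one that stores raw codepoints needs the
extra normalization step shown in canonical_equal.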
Implementations that conform to your proposal will need to take extra
steps (canonicalization, etc.) to conform to the consortium's
definition of string equality.

> And, incidentally, although that STRING=? is not the linguistically
> sensitive string-equality relation that Unicode defines, it _is_ a
> useful procedure to have around for _implementing_ Unicode text
> processes.

Please humor me by not banning schemes in which string=? can be both.

> _IF_ it were possible to define CHAR-ALPHABETIC? in a way which was
> both linguistically correct _and_ upwards compatible with R5RS then
> perhaps that would be almost a good idea. I say "almost" because
> CHAR-IDEOGRAPHIC? and CHAR-SYLLABIC? add bloat and those plus
> CHAR-ALPHABETIC? fail to be a complete enumeration of letter
> types....
>
> But CHAR-ALPHABETIC? is just a botch. It can not be rescued. All
> of these character classes belong elsewhere, with different names --
> in a "Linguistic Text Processing" SRFI.

If you don't care to rescue it, then at least try to avoid abusing it
further. I'd rather drop it altogether than force onto it a
case-mapping property that doesn't belong with it.

Char-alphabetic? is properly, and should be, of exactly the same
stature as char-ideographic? or char-syllabic? or (just remembered
this) char-phonemic? and maybe a few others. If any of these don't
belong in the standard, then none of these belong in the standard.
They can be reintroduced as library procedures in a language-handling
library, if and when that becomes necessary. Maybe char-letter?,
completely devoid of case requirements, is in fact all that the
standard needs.

> A predicate to detect "cased" characters can be trivially
> synthesized from CHAR-UPCASE, CHAR-DOWNCASE, and CHAR=?. I see no
> need for it to be required by R6RS.

It can be trivially synthesized, but more than half of the people who
do it will do it with slightly different semantics if a precise
definition is not given.
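To illustrate how two "trivial" syntheses can disagree, here is a
minimal sketch in Python (chosen only because its built-in Unicode
case tables make the example runnable). Both function names are
hypothetical, and reciprocal_cased is just one possible reading of "a
reciprocal 1-to-1 case mapping"; it is not the CHAR-CASED? definition
from the earlier message, which is not reproduced here.

```python
def naive_cased(c):
    """'Simplest' definition: the upper- and lowercase mappings differ."""
    return c.upper() != c.lower()


def reciprocal_cased(c):
    """Stricter reading (an assumption for illustration): c is one half
    of a pair of single characters that map to each other in both
    directions."""
    up, down = c.upper(), c.lower()
    return (len(up) == 1 and len(down) == 1  # no one-to-many mappings
            and up != down                   # there really are two forms
            and c in (up, down)              # c is one of the two forms
            and up.lower() == down           # upper -> lower round-trips
            and down.upper() == up)          # lower -> upper round-trips


# LATIN SMALL LETTER DOTLESS I upcases to "I", which downcases to "i".
dotless_i = "\u0131"
print(naive_cased(dotless_i))       # True
print(reciprocal_cased(dotless_i))  # False
```

The two predicates agree on the portable character set and part
company only on extended characters such as dotless i, sharp s
(U+00DF) and long s (U+017F), which is why the difference is easy to
miss when testing with ASCII input.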
> Breaking CHAR-ALPHABETIC? in the way that you propose will not break
> correct portable programs whose _input_data_ consists only of
> portable characters, but it can break correct portable programs
> whose input data includes extended characters. There is no
> particular reason to introduce that breakage.

Can you give an example?

> You are thinking that I am trying to make CHAR-ALPHABETIC?
> linguistically useful. What I'm actually trying to do is to
> minimize the degree to which CHAR-ALPHABETIC? is linguistically
> useless. The invariant above is in that spirit.
>
> The requirements in R5RS for CHAR-ALPHABETIC? already make it
> linguistic nonsense. There's no hope for it. Deprecating it is the
> best thing.

You may be right; I'd prefer to see it excised completely from the
standard rather than preserved with this bizarre case requirement. I
consider it nonsensical to say "this character fails to behave
according to these expectations for cased characters and therefore we
will call it non-alphabetic even though it is part of an alphabet."

Even if previous editions of the standard presumed that all alphabetic
characters were cased, this is breakage. You need to identify the set
of characters that behave as previous editions of the standard assumed
"alphabetic" characters behaved, but "alphabetic" is not the right
word to describe those characters.

"Char-alphabetic?" should be simply clarified NOT to be a description
of case properties, although there are no counterexamples to such a
reading in the portable character set. As a general description of
characters having these case properties, a properly-named predicate
should be introduced instead. This permits "alphabetic" to retain its
case semantics over at least the portable character set (which is all
that portable programs have ever relied on) without abandoning its
linguistic meaning.
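As a small illustration that "is a letter" and "has case" are separate
properties once extended characters are admitted, here is a sketch in
Python (again only a stand-in with built-in Unicode tables). The names
is_letter and has_case_pair are hypothetical; is_letter relies on
str.isalpha, which tests Unicode letter categories and says nothing
about case.

```python
def is_letter(c):
    """Letter test with no case requirement attached."""
    return c.isalpha()


def has_case_pair(c):
    """True when the character's upper- and lowercase mappings differ."""
    return c.upper() != c.lower()


samples = {
    "a": "LATIN SMALL LETTER A",
    "\u05d0": "HEBREW LETTER ALEF",
    "\u3042": "HIRAGANA LETTER A",
    "\u4e2d": "CJK ideograph U+4E2D",
}

# Print each sample with the two independent properties side by side.
for ch, name in samples.items():
    print(name, is_letter(ch), has_case_pair(ch))
```

Only the Latin letter satisfies both predicates. Alef belongs to an
alphabet and still has no case, which is the situation a
case-dependent definition of "alphabetic" would misclassify.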
> > Further, your definition does not capture the full range of what you
> > need to express when checking for this property; characters such as
> > dotless-i will be char-alphabetic? according to the definition above
> > while still capable of causing bugs with char-ci=? and case-blind
> > identifiers because they are not the preferred lowercase mappings of
> > their own preferred uppercase mappings.
>
> I'm following the letter of the (deprecated, stupid) law. R5RS does
> _not_ require, _even_for_ CHAR-ALPHABETIC? _characters_, that:
>
>     (char=? (char-downcase c) (char-downcase (char-upcase c)))
>     => #t
>
> Amazing but true.

It does not require it explicitly but it depends on it for the correct
reading of identifiers which are not in the implementation's preferred
case. Amazing but true.

> There is no need to introduce the (linguistically random) notion of
> "cased character". With the invariant I gave for CHAR-ALPHABETIC?,
> correct, portable R5RS programs remain so.

The invariant you gave for CHAR-ALPHABETIC? is not merely
linguistically random. As applied to an extended character set, it is
linguistically wrong. It is incorrect. It is false.

Moreover, it allows merging of identifiers which should not be merged
when those identifiers contain CHAR-ALPHABETIC? (your definition)
characters which are not CHAR-CASED? (my definition) and their case
mapping properties interact badly with the implementation's preferred
case. Therefore, it does not have the properties you claim for it for
all possible character sets. It happens to have those properties for
the portable character set, but its definition is not adequate to
assure them. If you desire those properties, you will have to use a
definition like the one I proposed for CHAR-CASED?, whatever you
choose to call it.

> R6RS should not attempt to provide comprehensive facilities for
> Unicode text processing. It should attempt to provide a minimum of
> upward compatible character and string facilities which are a useful
> _subset_ of Unicode text processing, close in informal meaning to what
> the R5RS versions say. My proposal does that.

I do not believe that it does. Setting aside for the moment the fact
that attaching the case invariants to char-alphabetic? is incorrect,
you have not identified the correct set of case invariants needed for
case-insensitive identifiers to remain distinct in correct, portable
programs.

> The CHAR-ALPHABETIC? invariant that I gave is consistent with an
> implementation that defines it for truly alphabetic characters that
> are "cased" in the sense you have been using. It's consistent with
> R5RS. It's a hopeless cause to try to require more from
> CHAR-ALPHABETIC? than that and deprecating CHAR-ALPHABETIC? is
> necessary.

The invariant you gave is necessary, but not sufficient. It
identifies characters which have both lowercase and uppercase forms,
but it does not identify characters which are part of a reciprocal
1-to-1 case mapping.

                                Bear