Re: Permitting and Supporting Extended Character Sets: response.
> From: bear <bear@xxxxxxxxx>
I'll mostly answer your points in order but the last one is the most
> I think that if we have the new procedures char-cased? and
> char-uncased? we do not need the proposed char-letter?
(I argue below that your definition of "cased" characters is
problematic but that's not the main point here.)
A while back, tb argued that the case-mapping procedures of R5RS could
simply be dropped. There's something to that.
In fact, R6RS could go further -- it could:
(case and classes) (type, order, integer isomorphism)
Why do that? I'm not convinced we should but the arguments for doing
so would include:
~ it would remove from R5RS all traces of the naive approach
to character case
~ it would remove from R5RS the culturally biased character
~ it would evade the tricky problem of defining "numeric" usefully
yet without cultural bias
~ those changes would leave only the class CHAR-WHITESPACE? which
seems particularly odd in isolation
~ the ability to write metacircular programs would still be
present -- and improved
~ the basic structure of the CHAR? type, a well-ordered set isomorphic
to a subset of the integers, would be retained
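The "well-ordered set isomorphic to a subset of the integers" property in the last item can be spot-checked with R5RS procedures alone. This is a minimal sketch; the names order-agrees? and round-trips? are made up for illustration and are not part of any proposal.

```scheme
;; Returns #t when CHAR<? on (a, b) agrees with < on their integer images.
(define (order-agrees? a b)
  (eq? (char<? a b)
       (< (char->integer a) (char->integer b))))

;; Returns #t when INTEGER->CHAR inverts CHAR->INTEGER for c.
(define (round-trips? c)
  (char=? c (integer->char (char->integer c))))
```

R5RS requires both (order-agrees? #\a #\b) and (round-trips? #\a) to hold; neither check depends on any notion of case or character class.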
Why not do it?
~ pedagogical reasons -- for the portable character set, the
metacircularity procedures can be defined using the dropped
~ practical reason -- it wouldn't leave enough standard machinery in
Scheme to parse simple formats like "whitespace separated fields"
~ practical reason -- implementors will want to provide all of the
procedures in the DROP column for years to come, at least. Useful
libraries will continue to rely on them. It is worthwhile to
(continue to) say what they should mean.
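As an illustration of the "whitespace separated fields" case mentioned above, here is a minimal sketch that relies only on CHAR-WHITESPACE? and other R5RS procedures. The name split-fields is hypothetical.

```scheme
;; Splits LINE into a list of maximal runs of non-whitespace characters.
(define (split-fields line)
  (let loop ((chars (string->list line)) (field '()) (fields '()))
    ;; Push the field collected so far (if any) onto FIELDS.
    (define (flush)
      (if (null? field)
          fields
          (cons (list->string (reverse field)) fields)))
    (cond ((null? chars) (reverse (flush)))
          ((char-whitespace? (car chars)) (loop (cdr chars) '() (flush)))
          (else (loop (cdr chars) (cons (car chars) field) fields)))))
```

For example, (split-fields "  alpha  beta gamma ") yields ("alpha" "beta" "gamma"). Without a standard CHAR-WHITESPACE?, even this much would need an implementation-specific predicate.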
But, on to the proposed revisions to the proposed revisions to the
>> It should say:
>> Returns the name of symbol as a string. [...] will be in the
>> implementation's preferred standard case [...]
>> will prefer upper case, others lower case. If the symbol was
>> returned by string->symbol, [....] string=? to
>> the string that was passed to string->symbol. [....]
> I would propose instead:
> Returns the name of symbol as a string. [...] all cased
> characters in the identifier (see the definition of char-cased?
> for a precise definition of cased and uncased characters) will
> be in the implementation's preferred standard case [....]. If
> the symbol was returned by string->symbol, the case of the
> characters in the string returned will be the same as the case
> in the string that was passed to string->symbol. [....]
> Rationale; I think it's simply clearer. The above wording
> specifically permits uncased characters (ie, characters which do not
> conform to "normal" expectations of cased characters) to be present
> in lowercase in identifiers even if the preferred case is uppercase,
> and presumably vice versa.
Huh. I thought that my wording permitted that already. I mostly
dislike your wording.
> all cased characters in the identifier [...] will be in the
> implementation's preferred standard case
seems too strong to me. I'd be willing to accept it if we (a) nail down
a good STRING->SYMBOL-NAME definition for the "Unicode Identifiers"
draft and (b) prove that the property you named is true for that
STRING->SYMBOL-NAME and for all future versions of Unicode.
> If the symbol was returned by string->symbol, the case of the
> characters in the string returned will be the same as the case
> in the string that was passed to string->symbol.
is too weak. The two strings must be STRING=?. For example, a
Unicode STRING->SYMBOL must not canonicalize its argument (and
STRING=? is a codepoint-wise comparison).
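The STRING=? requirement can be written as a one-line check. This is only a sketch over R5RS procedures; the name symbol-name-round-trips? is invented here.

```scheme
;; Returns #t when the name of the symbol made from S is STRING=? to S,
;; i.e. STRING->SYMBOL did not fold case or otherwise rewrite its argument.
(define (symbol-name-round-trips? s)
  (string=? s (symbol->string (string->symbol s))))
```

R5RS already requires (symbol-name-round-trips? "Malvina") to be true; the stronger wording asks that the same hold for every string in a Unicode implementation, including strings that a canonicalizing STRING->SYMBOL would have altered.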
>> With regard to character class predicates such as char-alphabetic?
>> The procedure char-alphabetic? is deprecated. New programs should
>> usually use char-letter? (see below) instead. char-alphabetic? has a
>> precise definition in terms of char-letter?:
>>   (define (char-alphabetic? c)
>>     (and (char-letter? c)
>>          (char-upper-case? (char-upcase c))
>>          (char-lower-case? (char-downcase c))))
> This is not how linguists use the term "alphabetic." Please do
> not propose "alphabetic" as a procedure to use to mean this, as
> it will frustrate and confuse people.
It's true that that is not how linguists use the term "alphabetic".
It's also true that not all "letters", in the sense of Unicode, are
alphabetic characters. For example, ideographic characters are
categorized in Unicode as "letters"; syllabaries are classified as
In a Unicode implementation, a linguistic definition of
CHAR-ALPHABETIC? would be a subset of letters generally and would
include both characters which are not cased (U+13A0 ("CHEROKEE LETTER
A")) and characters with no single-character case-mappings (U+00DF
("LATIN SMALL LETTER SHARP S")).
That would, in some sense, be an interesting procedure to have
around -- but really it belongs in a general library for linguistic
text processing (along with many other procedures).
Worse, a linguistically proper definition of CHAR-ALPHABETIC? would be
upwards incompatible with R5RS which requires that alphabetic
characters have upper and lowercase forms (which are themselves
When thinking about how to handle this situation, I reasoned this way:
1) One use for the R5RS character classes is to write programs which
process s-expressions (e.g. source text) over the portable
character set. This use should be preserved.
2) Another use for the R5RS character classes is to write programs
which parse other simple kinds of syntax. For example, parsing
a line of text into white-space separated fields. This use should
be preserved and expanded. For example, CHAR-LETTER? allows for a
field of letters which are not alphabetic characters or which
are alphabetic but not case-mapped in the naive way.
3) The R5RS character classes have never been well suited for
linguistic processing over anything but the portable character
set. Their use for such purposes for extended characters is
4) Upward compatibility with R5RS is desirable.
5) The specifications for the character classes defined in R6RS
should be consistent with definitions that satisfy the
usual expectations of a Unicode programmer. In other words,
in a Unicode-based implementation, these procedures should
function as a useful subset of a comprehensive library for
Unicode text processing.
So, I proposed: adding CHAR-LETTER? which is (consistent with being)
the generalization of CHAR-ALPHABETIC? to all "letters" (in the
Unicode sense); deprecating CHAR-ALPHABETIC? (which is esoteric at
best, nonsense at worst); and defining the class of CHAR-ALPHABETIC?
characters to be the largest subset of CHAR-LETTER? which is
consistent with the R5RS definition.
Now, having said all of that, the definition of CHAR-ALPHABETIC? could
be improved: the possibility of non-alphabetic letters with both
upper and lowercase forms seems plausible to me (are there any in
Unicode already?). So, instead of that definition of CHAR-ALPHABETIC?
I would agree to:
CHAR-ALPHABETIC? must be defined in such a way that
this is true of all characters:
    (or (not (char-alphabetic? c))
        (and (char-letter? c)
             (char-upper-case? (char-upcase c))
             (char-lower-case? (char-downcase c))))
Note: this requirement is necessary for a combination of upward
compatibility with earlier versions of the Revised Report and
consistency with the new CHAR-LETTER?, yet it is also
linguistically undesirable. This is the reason that
CHAR-ALPHABETIC? is described as "deprecated" -- new programs
should avoid using this procedure and should, in most cases, use
CHAR-LETTER? instead. Programmers should be aware that the
class CHAR-LETTER? may include letters such as syllables and
ideographs which are not, in any sense, "alphabetic". It can
also include alphabetic characters which are neither upper nor
lowercase, lowercase letters with no uppercase form, uppercase
letters with no lowercase form, lowercase characters which are
not returned by CHAR-DOWNCASE of their CHAR-UPCASE mapping, and
uppercase characters which are not returned by CHAR-UPCASE of
their CHAR-DOWNCASE mapping. Programmers should also be aware
that in some situations, a string may contain a letter followed
by non-letters -- the sequence being "what a user would think of
as a single letter" -- a fact which limits the utility of even
CHAR-LETTER? unless additional facilities for text processing
are provided by an implementation. Yet at the same time, for
the portable character set and for many extended characters,
none of these peculiar circumstances apply -- programmers not
trying to write "fully general" text processing algorithms can
often ignore these complexities. Programmers wanting to
write "fully general" text algorithms, on the other hand, can
define additional procedures which complement the standard
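The invariant in the proposed text above can be spot-checked on a sample of the portable character set. This is a rough sketch, not a definition: CHAR-LETTER? does not exist in R5RS, so the stand-in sketch-char-letter? below simply assumes that, on the portable set, the letters are exactly the alphabetic characters. All names here are hypothetical.

```scheme
;; ASSUMPTION: stand-in for the proposed CHAR-LETTER?, valid only for the
;; portable character set, where every letter is alphabetic.
(define (sketch-char-letter? c) (char-alphabetic? c))

;; The invariant from the proposed text, applied to one character.
(define (alphabetic-invariant-holds? c)
  (or (not (char-alphabetic? c))
      (and (sketch-char-letter? c)
           (char-upper-case? (char-upcase c))
           (char-lower-case? (char-downcase c)))))

;; Returns #t when the invariant holds for every character of S.
(define (invariant-holds-on-string? s)
  (let loop ((i 0))
    (or (= i (string-length s))
        (and (alphabetic-invariant-holds? (string-ref s i))
             (loop (+ i 1))))))

(define portable-sample
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 +-*/<=>!?")
```

On an R5RS implementation, (invariant-holds-on-string? portable-sample) should be true. The interesting cases are the extended characters, where an implementation has to choose CHAR-ALPHABETIC? so that the check keeps holding.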
> These procedures return #t if their arguments are alphabetic,
> numeric, whitespace, uppercase, or lowercase characters, respectively.
> Otherwise they return #f. The characters a..z and A..Z are required to
> be alphabetic. The digits 0..9 must be numeric. The space, newline, and
> tab characters must be whitespace. The characters a..z are required to
> be lowercase. The characters A..Z are required to be uppercase. No
> character may be both uppercase and lowercase.
That's consistent with my proposed revisions. I think CHAR-LETTER?
ought to be added and CHAR-ALPHABETIC? either dropped entirely or
mentioned as deprecated. If it is mentioned as deprecated, the
invariant shown above should be stated here. The corresponding
sentence in the definition of CHAR-UPCASE and CHAR-DOWNCASE should be
> Char-cased? returns #t if its argument is a character which conforms to
> "normal" case expectations, (see below) and #f otherwise. [....]
> Rationale: This allows char-lower-case?, char-upper-case?, and
> char-alphabetic? to go on meaning the same thing with respect to the
> 96-character portable character set and meaning the same thing
> linguists mean when they use these terms. This will reduce confusion
> in the long run. This particular notion of cased and uncased
> characters is also useful in other parts of the standard for saying
> exactly which characters case requirements should apply to. It leaves
> implementors free to not sweat about what to do with identifiers
> containing eszett, regardless of what they do with calls to
> (char-upcase #\eszett).
Among the rationales: I think this one is false (see above):
> This particular notion of cased and uncased characters is also
> useful in other parts of the standard for saying exactly which
> characters case requirements should apply to.
The other rationales are good reasons to say _something_ but I
don't think two new procedures are needed. Instead, the possibility
of oddly-cased characters can be explicitly mentioned in the
definitions of CHAR-LOWER-CASE?, CHAR-UPPER-CASE?, and CHAR-LETTER?.
(Additionally, CASED and UNCASED seem like poor names for the classes
of characters they describe.)
>> With regard to [...] char-upcase and char-downcase
>> It should say
>> [....] char-upcase must map a..z to A..Z and
>> char-downcase must map A..Z to a..z.
> I would propose instead:
> [...] if char is alphabetic and cased, then the result of
> char-upcase is upper case and the result of char-downcase is
> lower case.
I'm not sure I see any value to the stronger requirement, especially
since CHAR-ALPHABETIC? should be deprecated and there is otherwise no
need to introduce the concept of a "cased" character. Your
alternative is implied by the definition of CHAR-ALPHABETIC? I gave in
the draft -- but you've earlier convinced me to weaken that
>> The introduction to strings [....] should say:
>> Some of the procedures that operate on strings ignore the difference
>> between strings in which upper and lower case variants of the same
>> character occur in corresponding positions. The versions that ignore
>> case have ``-ci'' (for ``case insensitive'') embedded in their names.
> I would propose instead:
> Some of the procedures that operate on strings ignore the difference
> between upper and lower case cased characters. The versions that
> ignore case in cased characters have ``-ci'' (for ``case
> insensitive'') embedded in their names.
I believe that this should be true:
(char=? #\dotless-i #\U+0131) => #t
(char-ci=? #\I #\dotless-i) => #t
and that STRING-CI=? is just the string equivalence induced by CHAR-CI=?.
However, #\dotless-i is not "cased" as you have defined it. Are you
saying that #\dotless-i and #\I are not CHAR-CI=? or that STRING-CI=?
is not the equivalence induced by CHAR-CI=?? Either way: why in the
world do that?
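To make "the string equivalence induced by CHAR-CI=?" concrete, here is a minimal sketch in R5RS Scheme. The name induced-string-ci=? is hypothetical; the point is only that the string comparison is a position-by-position application of CHAR-CI=?.

```scheme
;; Two strings are equivalent when they have the same length and their
;; characters are pairwise CHAR-CI=?.
(define (induced-string-ci=? s1 s2)
  (let ((n (string-length s1)))
    (and (= n (string-length s2))
         (let loop ((i 0))
           (or (= i n)
               (and (char-ci=? (string-ref s1 i) (string-ref s2 i))
                    (loop (+ i 1))))))))
```

On the portable character set, (induced-string-ci=? "Scheme" "sCHEME") is true and (induced-string-ci=? "Scheme" "Schema") is false. Under this definition, whatever CHAR-CI=? decides for a pair such as #\I and #\dotless-i carries over to strings automatically.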