Alex raises this question:
> The question remains how to handle R5RS character predicates
> related to these values:
> * char-alphabetic? char
> * char-numeric? char
> * char-whitespace? char ; rename to char-white-space? please!
> * char-upper-case? char ; rename to char-uppercase? please!
> * char-lower-case? char ; rename to char-lowercase? please!
> As mentioned, it can be useful to have these functioning on
> pure ASCII for use in parsers and tools for common protocols.
> Moreover, the Unicode equivalents are often very expensive (if
> not in time then in space). Should a Scheme that wants to
> provide the full Unicode equivalents of these extend the core
> procedures or should we define disjoint procedures such as
> * char-unicode-alphabetic?
First, while I like the suggested renames (e.g., CHAR-LOWER-CASE? ->
CHAR-LOWERCASE?), I think it is not the place of _this_ SRFI to propose
those changes. They would do nothing to help implementations provide
Unicode support while also conforming to R6RS.
Second, I think it is essential that, regardless of any changes
proposed in this SRFI, those procedures must have the same behavior
they have in R5RS when applied to the "portable Scheme character
set". The portable character set is not quite ASCII (the integer
mappings are not specified and not all ASCII characters are included, even
abstractly) -- but it can be regarded as a subset of the abstract
characters encoded in ASCII.
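
As a rough illustration of a predicate that is fixed on the portable
character set, here is a minimal sketch that tests membership in an
explicit list of letters instead of relying on any particular
character-to-integer mapping. The names PORTABLE-LETTER? and
CHAR-IN-STRING? are hypothetical; they come from neither R5RS nor the
draft SRFI.

```scheme
;; Hypothetical helpers, not part of R5RS or of the draft SRFI.
(define portable-lower-case "abcdefghijklmnopqrstuvwxyz")
(define portable-upper-case "ABCDEFGHIJKLMNOPQRSTUVWXYZ")

;; #t if CHAR occurs somewhere in STR, compared with CHAR=?.
(define (char-in-string? char str)
  (let loop ((i 0))
    (cond ((= i (string-length str)) #f)
          ((char=? char (string-ref str i)) #t)
          (else (loop (+ i 1))))))

;; #t only for the 52 letters of the portable character set,
;; whatever else the implementation's character set contains.
(define (portable-letter? char)
  (or (char-in-string? char portable-lower-case)
      (char-in-string? char portable-upper-case)))
```

A predicate written this way gives the same answers on the portable
characters whether or not the implementation extends CHAR-ALPHABETIC?
to a larger repertoire.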
Third, should a Unicode Scheme extend those predicates? or define new
ones? In a weak sense, that's not a question for this SRFI. I've
specified the answer I prefer in another draft ("Scheme Characters as
(Extended) Unicode Codepoints",
http://regexps.srparish.net/srfi-drafts/unicode-chars.srfi) but those
answers should not be presumed by this SRFI.
What _are_ questions for this SRFI are: should an implementation be
_permitted_ to extend those predicates? If so, should it be permitted
to extend them in the "most natural" way for Unicode characters? (The
other draft I just mentioned explains what I think the "most natural"
way is.)
The strict letter of the law in R5RS says (by implication):
~ Yes, implementations _may_ extend those predicates.
(They are, indeed, expected to do so.)
~ No, implementations _may_not_ use the most natural Unicode
definitions. In particular, R5RS requires that alphabetic
characters must return an upper case equivalent from CHAR-UPCASE
and a lower case equivalent from CHAR-DOWNCASE. So the
specifications for all of these procedures:
are "intertwingled" in an unfortunate way: not all Unicode
characters that ought to be considered "alphabetic" satisfy
the case-mapping requirements of R5RS.
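
The R5RS constraint described above can be written down as a small
check. This is only a sketch: R5RS-CASE-CONSTRAINT-HOLDS? and
EVERY-CHAR? are hypothetical names used for illustration, and the
sample string is limited to portable characters.

```scheme
;; Hypothetical checker, not a procedure from R5RS or the draft SRFI.
;; #t when CHAR satisfies the R5RS constraint that an alphabetic
;; character has an upper case CHAR-UPCASE result and a lower case
;; CHAR-DOWNCASE result. Non-alphabetic characters pass trivially.
(define (r5rs-case-constraint-holds? char)
  (or (not (char-alphabetic? char))
      (and (char-upper-case? (char-upcase char))
           (char-lower-case? (char-downcase char)))))

;; #t when PRED holds for every character of STR.
(define (every-char? pred str)
  (let loop ((i 0))
    (or (= i (string-length str))
        (and (pred (string-ref str i))
             (loop (+ i 1))))))

(define portable-sample-ok?
  (every-char? r5rs-case-constraint-holds? "abcxyzABCXYZ019 "))
```

On the portable characters the check should succeed. In an
implementation that extends CHAR-ALPHABETIC? to Unicode, a letter from
a script that has no case distinction would presumably fail it, which
is the conflict at issue.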
The situation is worsened by the relationship between those
procedures, STRING-CI=?, identifier equivalence, and the relationship
between a literal symbol name in a program text and the string
returned by SYMBOL->STRING for that symbol. For example, R5RS says
(by implication) that the SYMBOL->STRING value can be formed from an
identifier name by applying one of (depending on the implementation's
preferred case) CHAR-UPCASE or CHAR-DOWNCASE to each character of the
identifier. Unicode defines a (fairly complicated) algorithm defining
"case-insensitive identifier equivalence" -- but it has little resemblance
to the naive algorithm implied by R5RS.
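
For concreteness, the naive algorithm amounts to something like the
following sketch. It assumes an implementation whose preferred case is
lower case; NAIVE-CANONICAL-NAME is a hypothetical name, not a
procedure from R5RS.

```scheme
;; Hypothetical name. Sketches the character-by-character technique
;; that R5RS implies, assuming lower case is the preferred case.
(define (naive-canonical-name identifier)
  (list->string (map char-downcase (string->list identifier))))

(naive-canonical-name "Hello-World")   ; => "hello-world"
```

Because it maps one character at a time, this technique can never
change the length of a name, which is one way it falls short of case
mappings that are not one-to-one.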
My opinion is that R5RS is wrong to forbid the "most natural" Unicode
extensions of these standard procedures. Some of the revisions
proposed in this SRFI are aimed at removing that restriction.
In designing the proposed revisions, I reasoned this way:
~ Incompatible Changes Must Not Be Made.
Specifically, the unfortunate "intertwingling" of the procedures
listed above all hinges on CHAR-ALPHABETIC?. By happy coincidence,
with a global character set, CHAR-ALPHABETIC? is a poor choice of
name for the concept that procedure is intended to capture --
CHAR-LETTER? (that's "Letter" in the broad sense of the Unicode
standard) is a better name.
Rather than undo the intertwingling by changing _any_ of the
procedures in an incompatible way, it is simpler to leave
CHAR-ALPHABETIC? in its damaged state, deprecate it, and introduce a
new procedure -- CHAR-LETTER? -- defined in a way that doesn't
perpetuate the problems.
~ Implementations Must Provide the Identifier -> Symbol Mapping
The naive process of applying CHAR-UPCASE (or CHAR-DOWNCASE) to
every character in an identifier to yield its canonical symbol name
is far removed from reality.
The Unicode process for canonicalizing a symbol name is quite
complicated and would require a great deal of more primitive
machinery to implement in Scheme.
Finally, this SRFI is _not_ intended to be Unicode-specific: only
to be Unicode-permissive. So it is not the place of this SRFI to
specify a canonicalization algorithm.
Therefore, I have proposed that the revised report just give general
guidance (that case distinctions are ignored; that implementations
have a preferred case) -- but that they must also provide their
canonicalization algorithm as a required procedure. Rather than
trying to "casemap identifiers" themselves, programs should use the
new STRING->SYMBOL-NAME procedure.
~ The Portable Character Set Must Retain Its Simple Structure
For example, if an identifier name is spelled using only the
portable character set, then the CHAR-UPCASE (or DOWNCASE) technique
for canonicalizing that identifier name should continue to work.
Really, this requirement is a kind of "corollary" of the earlier
one that "Incompatible Changes Must Not Be Made" but it is worth
stating separately.
In the draft SRFI, I have ensured that the portable character set
retains its simple structure by including explicit language to that
effect. For example, CHAR-UPCASE and CHAR-DOWNCASE are described:
These procedures return a character CHAR2 such that
(CHAR-CI=? CHAR CHAR2). In addition, CHAR-UPCASE must
map a..z to A..Z and CHAR-DOWNCASE must map A..Z to a..z.
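
The requirement in that wording can be checked mechanically. The
following is a minimal sketch, with hypothetical names
(LOWER-LETTERS, UPPER-LETTERS, PORTABLE-CASE-MAPPING-OK?), of a test
that an implementation keeps the simple a..z / A..Z structure.

```scheme
;; Hypothetical names, used only for this illustration.
(define lower-letters "abcdefghijklmnopqrstuvwxyz")
(define upper-letters "ABCDEFGHIJKLMNOPQRSTUVWXYZ")

;; #t when, for each of the 26 letter pairs, CHAR-UPCASE maps the
;; lower case letter to the upper case one, CHAR-DOWNCASE maps it
;; back, and the two are equal under CHAR-CI=?.
(define (portable-case-mapping-ok?)
  (let loop ((i 0))
    (or (= i (string-length lower-letters))
        (let ((lo (string-ref lower-letters i))
              (up (string-ref upper-letters i)))
          (and (char=? (char-upcase lo) up)
               (char=? (char-downcase up) lo)
               (char-ci=? lo up)
               (loop (+ i 1)))))))
```

An implementation that extends the case-mapping procedures beyond the
portable character set should still return #t from this check.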