Unicode, case-mapping, comparison & the Java spec
I have sorted through the internationalisation issues, and have a fairly
simple proposal for them, which essentially follows the Java spec, and punts
complex handling of these properties to another "text" or collation SRFI. I
believe this is the last really major issue outstanding on SRFI 13.
* What Java Does
I should like to recommend that interested parties take the time to
read the specs for Java's string libs. They strike me as being very carefully
thought out. Here are some relevant links:
java.lang.String: (immutable strings)
java.lang.StringBuffer: (mutable strings)
Here are some notes summarising what these specs contain.
- Java characters are Unicode. Period.
SRFI-13 does not require this.
- Java provides a string hash routine. I consider this to be a checklist item;
I am adding one to SRFI-13.
Java gives a precise definition of the string-hash operation.
Unfortunately, it has changed over time. Here is the earlier spec:
If n is the length of the string, then
  n <= 15:   sum(i=0,n-1, s[i] * 37^i)
  otherwise: sum(i=0,m, s[i*k] * 39^i) for k=floor(n/8), m=ceil(n/k)
which has the property that it only samples 8 or 9 chars from the
string, when the string is long.
Here is the later spec, which uses every char in the string:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
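As a sanity check, the later spec transcribes directly into code; on a
modern JVM it should agree with the built-in String.hashCode. A sketch
(the class and method names are mine, not part of any spec):

```java
public class HashDemo {
    // Later-spec hash: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1],
    // computed with ordinary 32-bit int wraparound.
    static int specHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);   // Horner's rule over the chars
        return h;
    }

    public static void main(String[] args) {
        String s = "internationalisation";
        System.out.println(specHash(s));    // spec formula
        System.out.println(s.hashCode());   // JDK built-in; should match
    }
}
```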
Specifying the hash function has the benefit that one can write out
hash values and have them be invariant across implementations. This
presumably is required by Java's write-once/run-anywhere mandate.
The downside is that one loses implementation flexibility, of course.
I do *not* plan to specify a specific hash function in SRFI-13; I've
left it open to the implementation. I am willing to consider requiring
a specific hash, e.g., the Java hash, if there is wide support for this.
- Java provides simple default-locale case-mapping operations that
are defined in terms of 1-1 character case mapping. So
+ the individual character transforms are context independent, and
+ the result string is guaranteed to be the same length as the input string.
- Java *also* provides case-mapping operations that take a locale parameter.
These may return strings that differ in length from the input string.
- Java provides string comparison and a simple case-insensitive comparison.
Case-insensitive comparison is simply
(compare (lower (upper s1)) (lower (upper s2)))
Note that it has *no* locale-specific processing.
Java *also* provides a case-insensitive string equality predicate,
which has *different* semantics -- it's
(and (= (string-length s1) (string-length s2))
     (every (lambda (c1 c2) (or (char=? c1 c2)
                                (char=? (char-upcase c1) (char-upcase c2))
                                (char=? (char-downcase c1) (char-downcase c2))))
            (string->list s1) (string->list s2)))
Could this differ from the comparison function? I'm not sure; it does
seem like a minor ugliness.
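To make the two definitions concrete, here is a sketch of both in Java,
following the descriptions above; on a modern JDK they should agree with
compareToIgnoreCase and equalsIgnoreCase for ordinary strings (the helper
names are mine):

```java
public class IgnoreCaseDemo {
    // Case-insensitive comparison: char-by-char comparison on
    // downcase(upcase(c)), with no locale-specific processing.
    static int compareIC(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char c1 = Character.toLowerCase(Character.toUpperCase(a.charAt(i)));
            char c2 = Character.toLowerCase(Character.toUpperCase(b.charAt(i)));
            if (c1 != c2) return c1 - c2;
        }
        return a.length() - b.length();
    }

    // Case-insensitive equality: same length, and each char pair equal
    // directly, or after upcasing, or after downcasing.
    static boolean equalsIC(String a, String b) {
        if (a.length() != b.length()) return false;
        for (int i = 0; i < a.length(); i++) {
            char c1 = a.charAt(i), c2 = b.charAt(i);
            if (c1 != c2
                && Character.toUpperCase(c1) != Character.toUpperCase(c2)
                && Character.toLowerCase(c1) != Character.toLowerCase(c2))
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(compareIC("Hello", "hELLO") == 0);  // true
        System.out.println(equalsIC("Hello", "hELLO"));        // true
    }
}
```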
- There are separate text and collator classes
that provide much more complex operations on strings of text, such as
locale-specific collation. These are beyond the scope of SRFI-13.
- Java's "index" methods search for the occurrence of a char or a substring
within a string. Java also has prefix? and suffix? ops.
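For reference, the corresponding Java calls look like this (a quick
illustration, not SRFI code):

```java
public class SearchDemo {
    public static void main(String[] args) {
        String s = "srfi-13-strings";
        System.out.println(s.indexOf('-'));        // first occurrence of a char: 4
        System.out.println(s.indexOf("13"));       // first occurrence of a substring: 5
        System.out.println(s.startsWith("srfi"));  // prefix? -> true
        System.out.println(s.endsWith("strings")); // suffix? -> true
    }
}
```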
- Java's string class provides a set of primitive parsers & unparsers for base
types such as ints, bools & floats.
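These live partly on the String class and partly on the wrapper classes;
a few representative calls:

```java
public class ParseDemo {
    public static void main(String[] args) {
        // Parsers: string -> base type.
        int i = Integer.parseInt("42");
        boolean b = Boolean.parseBoolean("true");
        double d = Double.parseDouble("3.5");
        // Unparser: base type -> string.
        String s = String.valueOf(42);
        System.out.println(i + " " + b + " " + d + " " + s);
    }
}
```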
* What SRFI-13 Does
Having considered Java's solutions, I am doing the following for SRFI-13:
Like Java, this library treats strings simply as sequences of characters or
"code points." It supports simple char-at-a-time, context-independent case
mapping and case-insensitive operations. There are no locale parameters;
case-mapping ops *are*, however, sensitive to some "default" locale (which
could be dynamically bound by an extra-SRFI-13 facility).
Like Java, and as Mikael has been strongly suggesting, we punt more complex
functionality to a "text" or collation library. The simple operations defined
in SRFI-13 are suitable for processing file names or program symbols. True
text processing would want to use "text process" procedures.
- *No* locales
This library does not have locale parameters, or mechanisms for
dynamically binding a default locale. These features are beyond the
scope of this SRFI, and are postponed to a separate collation or text
SRFI.
Case-mapping and case-folding operators *are* defined to be sensitive,
in a limited fashion, to a "default locale," if the Scheme system
provides such a thing.
- Case mapping
Case mapping is context-independent, char-by-char. It is locale-sensitive
to the default locale.
STRING-UPCASE!, STRING-DOWNCASE! and STRING-TITLECASE! are back. As in Java,
they and their pure STRING-UPCASE, STRING-DOWNCASE and STRING-TITLECASE
variants do 1-1, context-insensitive character case mapping, sensitive to the
default locale. This means, for example, that the German sharp-s character
(ß) does *not* upcase to "SS." It maps to itself.
The simple rules for 1-1 char case mapping are laid out by the Unicode
standards and also by the Java specs.
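The ß example can be checked directly in Java: the char-level mapping is
1-1 and leaves the character alone, while the string-level mapping is free
to change the length. A quick illustration:

```java
public class SharpSDemo {
    public static void main(String[] args) {
        char sharpS = '\u00DF';  // German sharp s, ß
        // 1-1 char mapping: the simple mapping has no single uppercase ß,
        // so the char maps to itself.
        System.out.println(Character.toUpperCase(sharpS) == sharpS);  // true
        // Full string-level mapping may expand the string:
        System.out.println("stra\u00DFe".toUpperCase(java.util.Locale.ROOT));  // STRASSE
    }
}
```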
- String comparison
STRING-COMPARE STRING< STRING<= STRING>= STRING> STRING= STRING<>
are locale-blind, and work purely in terms of "code points" -- the individual
chars of the string. In a Unicode Scheme, then, the e-accent-acute character
would not compare equal to the e character followed by the zero width
accent-acute character. A kana character would not compare equal to
its half-width variant. And so forth.
The case-insensitive versions of these ops are sensitive to the default
locale for case-mapping (but *not* for character collation order), and are
defined to do a char-by-char code-point comparison on
(char-downcase (char-upcase c)).
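Java has exactly these code-point semantics, so it can illustrate the
e-accent-acute example: the precomposed character and the two-code-point
spelling compare unequal (U+0301 is the combining acute accent).

```java
public class CodePointDemo {
    public static void main(String[] args) {
        String precomposed = "\u00E9";   // e-acute as a single code point
        String decomposed  = "e\u0301";  // e followed by combining acute accent
        System.out.println(precomposed.equals(decomposed));     // false
        System.out.println(precomposed.compareTo(decomposed));  // nonzero
    }
}
```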
More sophisticated string comparison belongs in a separate "text" or
collation library, as Java does and Mikael has been suggesting. Such
a library would compute sort/collation keys, case mapping, text
normalisation, and operations that are blind to or fold away case,
accents/diacritical marks, ligatures, etc.
I will modify the SRFI to reflect these decisions.