This page is part of the web mail archives of SRFI 13 from before July 7th, 2015. The new archives for SRFI 13 contain all messages, not just those from before July 7th, 2015.
I have sorted through the internationalisation issues, and have a fairly
simple proposal for them, which essentially follows the Java spec, and punts
complex handling of these properties to another "text" or collation SRFI.

I believe this is the last really major issue outstanding on SRFI 13.

* What Java Does
----------------
I should like to recommend to interested parties that you take the time to
read the specs for Java's string libs. They strike me as being very carefully
thought out. Here are some relevant links:

For characters
    http://www.unicode.org/
    http://java.sun.com/docs/books/jls/html/javalang.doc4.html#14345
    http://java.sun.com/products/jdk/1.2/docs/api/java/lang/Character.html

java.lang.String: (immutable strings)
    http://java.sun.com/docs/books/jls/html/javalang.doc11.html#14460
    http://java.sun.com/products/jdk/1.2/docs/api/java/lang/String.html

java.lang.StringBuffer: (mutable strings)
    http://java.sun.com/docs/books/jls/html/javalang.doc12.html#14461
    http://java.sun.com/products/jdk/1.2/docs/api/java/lang/StringBuffer.html

Here are some notes summarising what these specs contain.

- Java characters are Unicode. Period. SRFI-13 does not require this.

- Java provides a string hash routine. I consider this to be a checklist
  item; I am adding one to SRFI-13.

  Java gives a precise definition of the string-hash operation.
  Unfortunately, it has changed over time. Here is the earlier spec:
    If n is the length of the string, then
      n <= 15: sum(i=0,n-1, s[i] * 37^i)
      otw:     sum(i=0,m, s[i*k]*39^i) for k=floor(n/8), m=ceil(n/k)
  which has the property that it only samples 8 or 9 chars from the string,
  when the string is long. Here is the later spec, which uses every char
  in the string:
      s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

  Specifying the hash function has the benefit that one can write out
  hash values and have them be invariant across implementations. This
  presumably is required by Java's write-once/run-anywhere mandate.
  The downside is that one loses implementation flexibility, of course.

  I do *not* plan to specify a specific hash function in SRFI-13; I've left
  it open to the implementation. I am willing to consider requiring a
  specific hash, e.g., the Java hash, if there is wide support for this.

- Java provides simple default-locale case-mapping operations that are
  defined in terms of 1-1 character case mapping. So
  + the individual character transforms are context independent, and
  + the result string is guaranteed to be the same length as the input
    string.

- Java *also* provides case-mapping operations that take a locale parameter.
  These may return strings that differ in length from the input string.

- Java provides string comparison and a simple case-insensitive comparison.
  Case-insensitive comparison is simply
      (compare (lower (upper s1)) (lower (upper s2)))
  Note that it has *no* locale-specific processing.

  Java *also* provides a case-insensitive string equality predicate,
  which has *different* semantics -- it's
      (and (= (length s1) (length s2))
           (every (lambda (c1 c2)
                    (or (char=? c1 c2)
                        (char=? (upcase c1) (upcase c2))
                        (char=? (downcase c1) (downcase c2))))
                  s1 s2))
  Could this be different from the comparison function? I'm not sure;
  it does seem like a minor ugliness.

- There are separate text and collator classes
      http://java.sun.com/products/jdk/1.2/docs/api/java/text/Collator.html
      http://java.sun.com/products/jdk/1.2/docs/api/java/text/package-summary.html
  that provide much more complex operations on strings of text, such as
  locale-specific collation. These are beyond the scope of SRFI-13.

- Java's "index" methods search for the occurrence of a char or a substring
  within a string. Java also has prefix? and suffix? ops.

- Java's string class provides a set of primitive parsers & unparsers for
  base types such as ints, bools & floats.
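To make the hash and case-insensitive notes above concrete, here is a minimal Java sketch. The class and method names (JavaStringNotes, hash31, fold, compareFolded, equalFolded) are made up for illustration and are not part of any Java API. The code transcribes the formulas as summarised in this message; it is not the JDK's actual source.

```java
// Illustrative sketch only: hand-written versions of the operations
// summarised above. All names here are hypothetical.
final class JavaStringNotes {
    private JavaStringNotes() {}

    // Later hash spec: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1],
    // evaluated by Horner's rule in wrapping 32-bit int arithmetic.
    static int hash31(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // The 1-1, context-independent fold: lower(upper(c)).
    static char fold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // (compare (lower (upper s1)) (lower (upper s2))), char by char.
    // Negative, zero, or positive, like String.compareTo.
    static int compareFolded(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char x = fold(a.charAt(i));
            char y = fold(b.charAt(i));
            if (x != y) {
                return x - y;
            }
        }
        return a.length() - b.length();
    }

    // The equality predicate: same length, and each char pair matches
    // directly, after upcasing both, or after downcasing both.
    static boolean equalFolded(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            boolean same = x == y
                || Character.toUpperCase(x) == Character.toUpperCase(y)
                || Character.toLowerCase(x) == Character.toLowerCase(y);
            if (!same) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The hand-rolled hash should agree with the built-in one.
        System.out.println(hash31("SRFI-13") == "SRFI-13".hashCode());
        System.out.println(compareFolded("Hello", "hELLO"));
        System.out.println(equalFolded("Hello", "hELLO"));
    }
}
```

Because the hash is fully specified, hash31 can be checked against String.hashCode on any input. That is the cross-implementation invariance mentioned above.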
* What SRFI-13 Does
-------------------
Having considered Java's solutions, I am doing the following for SRFI-13:

Like Java, this library treats strings simply as sequences of characters or
"code points." It supports simple char-at-a-time, context-independent case
mapping and case-insensitive operations. There are no locale parameters;
case-mapping ops *are*, however, sensitive to some "default" locale (which
could be dynamically bound by an extra-SRFI-13 facility).

Like Java, and as Mikael has been strongly suggesting, we punt more complex
functionality to a "text" or collation library. The simple operations defined
in SRFI-13 are suitable for processing file names or program symbols. True
text processing would want to use "text process" procedures.

- *No* locales
  This library does not have locale parameters, or mechanisms for
  dynamically binding a default locale. These features are beyond the scope
  of this SRFI, and are postponed for a separate collation or text library.

  Case-mapping and case-folding operators *are* defined to be sensitive,
  in a limited fashion, to a "default locale," if the Scheme system provides
  such a thing.

- Case mapping
  Case mapping is context-independent, char-by-char. It is locale-sensitive
  to the default locale.

  STRING-UPCASE! STRING-DOWNCASE! and STRING-TITLECASE! are back. As in
  Java, they and their pure STRING-UPCASE STRING-DOWNCASE and
  STRING-TITLECASE variants do 1-1 character context-insensitive case
  mapping, sensitive to the default locale. This means, for example, that
  the German sharp-s character (as in "strasse") does *not* upcase to "SS."
  It maps to itself.

  The simple rules for 1-1 char case mapping are laid out by the Unicode
  standards and also by the Java specs.

- String comparison
  STRING-COMPARE
  STRING< STRING<= STRING>= STRING> STRING= STRING<>
  are locale-blind, and work purely in terms of "code points" -- the
  individual chars of the string.
  In a Unicode Scheme, then, the e-accent-acute character would not compare
  equal to the e character followed by the zero width accent-acute
  character. A kana character would not compare equal to its half-width
  variant. And so forth.

  The case-insensitive versions of these ops are sensitive to the default
  locale for case-mapping (but *not* for character collation order), and
  are defined to do a char-by-char code-point comparison on
      (char-downcase (char-upcase c)).

  More sophisticated string comparison belongs in a separate "text" or
  collation library, as Java does and Mikael has been suggesting. Such a
  library would compute sort/collation keys, case mapping, text
  normalisation, and operations that are blind to or fold away case,
  accents/diacritical marks, ligatures, etc.

I will modify the SRFI to reflect these decisions.
    -Olin
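As a rough illustration of the code-point and 1-1 case-mapping behaviour described above, here is a short sketch in Java rather than Scheme, since Java's char-level operations follow the same simple rules. The class and method names (CodePointNotes, upcase1to1, upcaseFull) are hypothetical. Locale.ROOT is used only to keep the locale-aware call deterministic; this message does not discuss it.

```java
// Illustrative sketch only; all names here are hypothetical.
final class CodePointNotes {
    private CodePointNotes() {}

    // Context-independent, char-by-char upcase. Each char maps to exactly
    // one char, so the result always has the same length as the input.
    static String upcase1to1(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            out.append(Character.toUpperCase(s.charAt(i)));
        }
        return out.toString();
    }

    // Locale-parameterised upcase, which may change the string's length.
    static String upcaseFull(String s) {
        return s.toUpperCase(java.util.Locale.ROOT);
    }

    public static void main(String[] args) {
        String strasse = "stra\u00dfe";      // sharp s, U+00DF
        // 1-1 mapping leaves the sharp s alone; the full mapping expands it.
        System.out.println(upcase1to1(strasse).length());
        System.out.println(upcaseFull(strasse));

        String precomposed = "\u00e9";       // e with acute, one code point
        String decomposed = "e\u0301";       // e + combining acute accent
        // Code-point comparison sees these as different strings.
        System.out.println(precomposed.equals(decomposed));

        String kana = "\u30ab";              // katakana KA
        String halfWidth = "\uff76";         // half-width katakana KA
        System.out.println(kana.equals(halfWidth));
    }
}
```

Treating such pairs as equal would need normalisation or collation keys, which is the job of the separate "text" library proposed above.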