Case-mapping, Unicode & internationalisation
I would like SRFI-13 to take advantage of the opportunity to tackle
the issues arising from internationalisation and Unicode, and do a
proper job. My design criteria for SRFI-13 are these:
- The SRFI-13 spec is independent of the implementation chosen for
representing characters -- one should be able to use SRFI-13 procedures
in Schemes that use ASCII, Latin-1, Unicode or other encodings for chars.
- The spec *is* designed to allow string-processing code to be portable
across different character encodings. This means that we include string
primitives (such as string comparison, case mapping) which cannot be
portably implemented using simple character primitives for Unicode
Schemes. For example, lower-casing a string requires more than mapping
CHAR-DOWNCASE over the string -- see below for the subtleties involved
when dealing with the full spectrum of Unicode.
In other words, I don't want to put in Unicode-specific ops, but I want all
the ops to make sense in a Unicode world. This is similar to my design
criteria for shared-text substrings.
Ben Goettner has been advising me on the subtleties of Unicode and case. The
good news is that there is a whole tech report from the Unicode people on this
issue. The bad news is that the possibility of Unicode does have impact on the
design of basic string operations.
The issues of case-mapping are laid out in Unicode Tech Report 21, which is
short, clear and available on the Web:
(It can be easily read in a few minutes.)
The short summary is that we are dropping two procedures (STRING-UPCASE!
and STRING-DOWNCASE!) and reinstating WORD-CAPITALIZE with a new name
(STRING-TITLECASE) and new semantics.
Here are the issues and their impact on SRFI-13.
- Case-mapping requires surrounding context
In Unicode, you can't actually do case-mapping on a single char in
isolation. In a few cases, it requires surrounding context info. For
example, Greek capital sigma downcases to two different chars depending upon
whether it is the final character of a word or not.
STRING-UPCASE & STRING-DOWNCASE use context in a Unicode Scheme. However,
context does not extend beyond the limits of the start/end indices, when
these are supplied.
CHAR-UPCASE and CHAR-DOWNCASE are not in the purview of SRFI-13. However,
this SRFI recommends that these two functions simply choose a reasonable
default for these cases (e.g., the NON_FINAL mapping).
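As a rough illustration of the context rule (not part of the SRFI), here is a sketch in Python, whose built-in str.lower() applies the Unicode final-sigma rule. The helper name string_downcase and its start/end handling are assumptions made for this example; they mimic the behaviour described above by downcasing only the selected slice, so no context outside the indices is consulted.

```python
# Illustrative only: Python's str.lower() stands in for a Unicode-aware
# STRING-DOWNCASE. The helper below is hypothetical, not SRFI-13 itself.

SIGMA_CAPITAL = "\u03a3"  # GREEK CAPITAL LETTER SIGMA
SIGMA_SMALL = "\u03c3"    # GREEK SMALL LETTER SIGMA (non-final form)
SIGMA_FINAL = "\u03c2"    # GREEK SMALL LETTER FINAL SIGMA


def string_downcase(s, start=0, end=None):
    """Downcase s[start:end]; context never reaches outside the slice."""
    return s[start:end].lower()


# Five Greek capitals: omicron, delta, omicron, sigma, alpha.
# The sigma is NOT the last letter of the word.
word = "\u039f\u0394\u039f\u03a3\u0391"

whole = string_downcase(word)          # sigma is followed by alpha
prefix = string_downcase(word, 0, 4)   # slice ends at the sigma
lone = SIGMA_CAPITAL.lower()           # a single char, no context at all
```

Here `whole` contains the non-final sigma, `prefix` ends in the final sigma because the slice boundary is treated as the end of the text, and `lone` falls back to the non-final form, which matches the default suggested above for CHAR-DOWNCASE.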
- Titlecase <> uppercase
Unicode defines three kinds of case mapping: lowercase, uppercase, and
titlecase. The difference between uppercasing and titlecasing a character
or character sequence can be seen in compound characters (that is,
a single character that represents a compound of two characters).
For example, in Unicode, character U+01F3 is LATIN SMALL LETTER DZ. (Let us
write this compound character using ASCII as "dz".) This character
uppercases to character U+01F1, LATIN CAPITAL LETTER DZ. (Which is
basically "DZ".) But it titlecases to character U+01F2, LATIN CAPITAL
LETTER D WITH SMALL LETTER Z. (Which we can write "Dz".)
    character   uppercase   titlecase
    ---------   ---------   ---------
    dz          DZ          Dz
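The three code points can be checked with a few lines of Python, used here only as a convenient Unicode-aware stand-in. On a one-character string, str.upper() applies the uppercase mapping and str.title() applies the titlecase mapping; the variable names are invented for the example.

```python
import unicodedata

dz_small = "\u01f3"        # LATIN SMALL LETTER DZ

upper = dz_small.upper()   # uppercase mapping
title = dz_small.title()   # titlecase mapping (one-character string)

# Print code point and Unicode name for each of the three forms.
for ch in (dz_small, upper, title):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```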
Scheme needs CHAR-TITLECASE and CHAR-TITLECASE? functions, but this is not
in the purview of SRFI-13, which handles strings, not chars.
STRING-CAPITALIZE is required to do the right thing with compound characters
in a Unicode implementation.
We also add STRING-TITLECASE, which uses the Unicode definition
of titlecasing a text string: every character not preceded by a
cased character is titlecased. All other characters are lowercased. E.g.
    (string-titlecase "olin g. sHIVERS") => "Olin G. Shivers"
    (string-titlecase "Laurence McCullough") => "Laurence Mccullough"
    (string-titlecase "3com mAkes ROUTERS.") => "3Com Makes Routers."
(This is essentially the task handled by the old CAPITALIZE-WORDS function,
which was dropped a few rounds ago.) If the optional start index is given,
it is treated as the beginning of the string. E.g.:
    (string-titlecase "jamie clark" 2) => "Mie Clark"
To recap, STRING-CAPITALIZE titlecases the *initial* character of a string.
STRING-TITLECASE processes the entire string.
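The STRING-TITLECASE rule is simple enough to sketch directly. The following Python version is only an approximation written for this note: is_cased is a hypothetical helper that estimates the Unicode "cased" property from Python's character predicates, and lowercasing one character at a time ignores context such as the Greek final sigma, which a real implementation would have to handle.

```python
def is_cased(ch):
    """Approximate the Unicode 'cased' property for a single character."""
    return ch.isupper() or ch.islower() or ch.istitle()


def string_titlecase(s, start=0, end=None):
    """Titlecase s[start:end]; `start` is treated as the string's beginning."""
    out = []
    prev_cased = False  # nothing precedes the first character
    for ch in s[start:end]:
        # A character not preceded by a cased character is titlecased
        # (str.title() on one character applies the titlecase mapping);
        # every other character is lowercased.
        out.append(ch.lower() if prev_cased else ch.title())
        prev_cased = is_cased(ch)
    return "".join(out)


print(string_titlecase("3com mAkes ROUTERS."))
print(string_titlecase("jamie clark", 2))
```

Note that the digit in "3com" is uncased, so the "c" after it is titlecased, exactly as in the example above.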
- A single lowercase char can upcase into multiple chars
For example, German eszett (U+00DF, LATIN SMALL LETTER SHARP S) upcases to "SS".
This is a problem for CHAR-UPCASE; STRING-UPCASE and STRING-TITLECASE can
handle it properly, and are required to do so in a Latin-1 or Unicode Scheme.
STRING-UPCASE! and STRING-DOWNCASE! are being dropped: a case-mapped string
can be longer than the original, so they cannot be guaranteed to operate
in place. (Bummer.)
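To make the length problem concrete, here is a small Python sketch (Python's str.upper() applies the full Unicode mapping). The function upcase_in_place is a made-up helper that imitates a destructive, slot-by-slot update on a list of characters and gives up when a mapping does not fit in one slot.

```python
word = "stra\u00dfe"   # "strasse" written with U+00DF (sharp s), 6 chars
upper = word.upper()   # full mapping: the sharp s becomes two characters


def upcase_in_place(chars):
    """Try to upcase a mutable list of characters slot by slot.

    Returns False as soon as one character's uppercase mapping does not
    fit in a single slot; earlier slots may already have been modified.
    """
    for i, ch in enumerate(chars):
        mapped = ch.upper()
        if len(mapped) != 1:
            return False
        chars[i] = mapped
    return True


buf = list(word)
fits = upcase_in_place(buf)   # fails at the sharp s
```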
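- Turkish has different case mappings.
In Turkish, lowercase "i" upcases to a dotted capital I (U+0130), and capital
"I" downcases to a dotless small i (U+0131). Case-mapping functions are
therefore sensitive to external environment settings in ways not defined by
this SRFI, e.g., the current $LC locale in Unix.
Note that Turkish is the only language in the Unicode set with this problem.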
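A sketch of what locale sensitivity means in practice, again in Python. Python's own case mapping is not locale-aware, so the Turkish behaviour is emulated with two translation tables; the `turkish` flag and both helper functions are assumptions made for the example, standing in for whatever environment setting an implementation would actually consult.

```python
# Hypothetical helpers: a real implementation would consult the locale
# rather than take a flag.
TURKISH_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})
TURKISH_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


def string_upcase(s, turkish=False):
    """Upcase s, optionally with the Turkish dotted/dotless i pairing."""
    if turkish:
        s = s.translate(TURKISH_UPPER)
    return s.upper()


def string_downcase(s, turkish=False):
    """Downcase s, optionally with the Turkish dotted/dotless i pairing."""
    if turkish:
        s = s.translate(TURKISH_LOWER)
    return s.lower()


default_up = string_upcase("istanbul")
turkish_up = string_upcase("istanbul", turkish=True)
turkish_down = string_downcase("ISPARTA", turkish=True)
```

The same input string yields different results under the two settings, which is why the SRFI leaves this behaviour to the environment.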
- CHAR-UPCASE and CHAR-DOWNCASE
These functions are not in the purview of SRFI-13. However,
this SRFI recommends that these two functions
- pass through unchanged characters whose case-mapping expands them into
multi-character sequences, such as when upcasing the Latin-1 German
eszett to "SS". This will allow old code to continue to work, and is
consistent with what modern Unicode OSes do (e.g., Windows 2000) --
hence implementations can use the native OS case-mapping facilities,
- return a reasonable default when asked to case-map a character
that has multiple possible results depending upon context (such as
downcasing the Greek capital sigma).
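The two recommendations could look roughly like this. It is a Python sketch, with Python's full Unicode mappings standing in for an implementation's case tables; the function names mirror the Scheme ones but are otherwise invented here.

```python
def char_upcase(ch):
    """Upcase one character, always returning exactly one character."""
    up = ch.upper()
    # Pass the character through unchanged when its mapping expands
    # into a multi-character sequence (e.g. sharp s -> "SS").
    return up if len(up) == 1 else ch


def char_downcase(ch):
    """Downcase one character, always returning exactly one character."""
    down = ch.lower()
    # With no surrounding context, Python picks the non-final sigma for
    # Greek capital sigma, which serves as the "reasonable default".
    return down if len(down) == 1 else ch


sharp_s = char_upcase("\u00df")   # mapping would expand, so unchanged
sigma = char_downcase("\u03a3")   # context-dependent, default chosen
```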
- This SRFI additionally recommends
- the standard functions that map between characters and integers be
required to use the Unicode/Latin-1/ASCII numeric codes. This
allows programmers to write portable code.
- CHAR-TITLECASE be added to CHAR-UPCASE and CHAR-DOWNCASE
- CHAR-TITLECASE? be added to CHAR-UPCASE? and CHAR-DOWNCASE?
- Title/up/down-case functions might be added to the character-processing
suite which return immutable string values. Note that the context issue
(e.g., properly downcasing Greek Sigma) is not resolved by these functions.
These recommendations are not a part of the SRFI-13 spec. Note also that
requiring a Unicode/Latin-1/ASCII interface to integer/char mapping
functions does not imply anything about the actual underlying encodings of
characters.
(upcase-string string [start end]) -> string
(downcase-string string [start end]) -> string
(titlecase-string string [start end]) -> string
- Char function recommendations:
(char-upcase char) -> char
(char-downcase char) -> char
(char-titlecase char) -> char
(char-upcase? char) -> boolean
(char-downcase? char) -> boolean
(char-titlecase? char) -> boolean
(upcase-char->string char) -> immutable-string
(downcase-char->string char) -> immutable-string
(titlecase-char->string char) -> immutable-string
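For contrast with the char -> char functions, here is a sketch of the three char -> string functions in Python. Python strings are already immutable, which loosely matches the "immutable-string" result type; the names are transliterations of the Scheme ones and are not part of any library.

```python
def upcase_char_to_string(ch):
    """Full uppercase mapping of one character, as a string."""
    return ch.upper()


def downcase_char_to_string(ch):
    """Full lowercase mapping of one character, as a string."""
    # No surrounding context is available, so a Greek capital sigma
    # always gets the non-final form here.
    return ch.lower()


def titlecase_char_to_string(ch):
    """Full titlecase mapping of one character, as a string."""
    return ch.title()


expanded = upcase_char_to_string("\u00df")     # sharp s expands
titled = titlecase_char_to_string("\u00df")    # titlecase also expands
sigma = downcase_char_to_string("\u03a3")      # context issue remains
```

Because the result is a string, the multi-character expansions are representable, but as noted above the context problem is not solved by these functions.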
- Other internationalisation issues
Case mapping is not the only tricky issue in a rich character world
like Unicode. I'll deal with the following issues in later notes.
- Procedures to find word boundaries and line-break opportunities portably.
- String comparison: collation order, case-folding, normalisation