Title

R6RS Unicode data

Authors

Matthew Flatt and Marc Feeley

Status

This SRFI is currently in withdrawn status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-75@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Received: 2005-07-11
Draft: 2005-07-12--2005-09-08
Revised: 2005-07-21
Withdrawn: 2006-05-27

This SRFI is being submitted by members of the Scheme Language Editor's Committee as part of the R6RS Scheme standardization process. The purpose of such ``R6RS SRFIs'' is to inform the Scheme community of features and design ideas under consideration by the editors and to allow the community to give the editors some direct feedback that will be considered during the design process.

At the end of the discussion period, this SRFI will be withdrawn. When the R6RS specification is finalized, the SRFI may be revised to conform to the R6RS specification and then resubmitted with the intent to finalize it. This procedure aims to avoid the situation where this SRFI is inconsistent with R6RS. An inconsistency between R6RS and this SRFI could confuse some users. Moreover, it could pose implementation problems for R6RS-compliant Scheme systems that aim to support this SRFI. Note that departures from the SRFI specification by the Scheme Language Editor's Committee may occur due to other design constraints, such as design consistency with other features that are not under discussion as SRFIs.

Abstract

Unicode is a widespread universal character code that supports most of the world's (natural) languages. The extensions to Scheme specified in this SRFI concern the support of Unicode in Scheme's character, string, and symbol datatypes. This SRFI does not (fully) specify how I/O of Unicode data is performed or how Scheme source code is encoded in files; these aspects are left for other SRFIs to specify.

Issues

The extension of symbol syntax to include all non-whitespace characters above Unicode 127 may be too liberal. At the same time, it does not extend the set of symbols to include sensible ASCII combinations, such as ->. Finally, it may not be necessary to place symbols in one-to-one correspondence with strings (which motivates the new explicitly-quoted syntax for symbol literals).
C and many other languages (including some Scheme implementations) support octal notation within strings and characters. Octal notation is not included in this draft because the notation seems no longer as popular as other formats (with the notable exception of \0), and the variable-width encoding is potentially confusing.
This draft includes both #\newline and #\linefeed as character constants. The former is compatible with R5RS, but the latter is arguably preferable. Maybe we should pick one.
The syntax for numerical scalar values in character and string literals --- using \x, \u, and \U --- avoids the variable-length encoding of C's \x, but it's an ad hoc mixture of various standards. Another possibility would be to use a delimited Scheme number within a string, as in Gambit.
Here strings appeared in an earlier draft, but they have been removed, perhaps to reappear in a future SRFI.
The \<linefeed><intraline-whitespace> escape sequence may not be necessary. Meanwhile, unescaped newlines perhaps should be prohibited in strings.
The current draft provides no locale-specific operations on strings. Such operations probably belong in another standard, but a few placeholders might be useful here.
The string-titlecase procedure is not the same as Unicode's titlecase conversion for strings, because Unicode defines a more elaborate word-breaking algorithm. More generally, it's not clear that titlecase operations are useful enough to include in the standard.

Revision History

Second draft:
- Removed here strings.
- Added char-foldcase and string-foldcase and redefined the case-insensitive operations in terms of these.
- Added string-upcase, string-downcase, and string-titlecase.
- Changed the definition of char-lower-case? and char-upper-case? to better match Unicode instead of SRFI-14.
- Removed char-symbolic?, char-punctuation?, char-graphic?, char-blank?, and char-iso-control?.
- Removed \' and \? string/symbol escapes.
- Added examples for character and string operations.
- Expanded rationale.

Rationale

The manipulation of text is a fundamental information processing task, and software increasingly must process text in a variety of natural languages, possibly multiple languages in the same document. The Unicode standard specifies how the textual data of most of the world's languages is represented and handled. Several operating systems, programming languages, libraries, and software tools have now embraced the Unicode standard. Adding Unicode support to Scheme, as specified by this SRFI, will allow

multilingual text processing;
internationalization (adaptation of software to the linguistic preference of its users);
improved interoperability with operating systems, programming languages, libraries, and software tools that support Unicode; and
improved portability of programs between R6RS Scheme implementations.

The SRFI mandates a specific set of values for characters, a specific definition of strings in terms of characters, and a specific definition of operations like char<? and string-ci=? in terms of those characters and strings. The goal of such mandates is to dramatically increase portability of Scheme programs through well-understood (if imperfect) concepts and definitions.

For some implementations of Scheme, such as those that are targeted to small devices, the mandates of this SRFI ask too much. For other implementations of Scheme, such as those that can support a more sophisticated definition of "character", this SRFI interferes by requiring specifically less. In the long run, a module system for Scheme should support a considerably broader range of conformant implementations, allowing implementations to support certain modules and not support others. The purpose of this SRFI, however, is to establish a baseline from which we can define simplifications for "smaller" Schemes and elaborations for "larger" Schemes.

Unicode Background

Unicode defines a standard mapping between sequences of code points (integers in the range 0 to #x10FFFF in the latest version of the standard) and human-readable ``characters.'' More precisely, Unicode distinguishes between glyphs, which are printed for humans to read, and characters, which are abstract entities that map to glyphs (sometimes in a way that's sensitive to surrounding characters). Furthermore, different sequences of code points sometimes correspond to the same character. The relationships among code points, characters, and glyphs are subtle and complex.

Despite this complexity, most things that a literate human would call a ``character'' can be represented by a single code point in Unicode (though there may exist code-point sequences that represent that same character). For example, Roman letters, Cyrillic letters, Hebrew consonants, and most Chinese characters fall into this category. Thus, the ``code point'' approximation of ``character'' works well for many purposes. It is therefore appropriate to define Scheme characters as Unicode scalar values, which include all code points except those designated as surrogates. A surrogate is a code point in the range #xD800 to #xDFFF that is used in pairs in the UTF-16 encoding to encode a supplementary character (whose code is in the range #x10000 to #x10FFFF).
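
To make the surrogate mechanism concrete, the following sketch computes the UTF-16 surrogate pair for a supplementary code point. The procedure name utf16-surrogate-pair is hypothetical and is not part of this SRFI; the sketch uses only R5RS integer arithmetic.

```scheme
;; Illustrative only: utf16-surrogate-pair is a hypothetical helper,
;; not a procedure defined by this SRFI.
;; Given a code point in [#x10000, #x10FFFF], return a list holding the
;; high and low surrogate code points that encode it in UTF-16.
(define (utf16-surrogate-pair code-point)
  (let ((offset (- code-point #x10000)))           ; a 20-bit value
    (list (+ #xD800 (quotient offset #x400))       ; high surrogate: top 10 bits
          (+ #xDC00 (remainder offset #x400)))))   ; low surrogate: low 10 bits

(utf16-surrogate-pair #x10000)   ; => (55296 56320), i.e. #xD800 #xDC00
(utf16-surrogate-pair #x10FFFF)  ; => (56319 57343), i.e. #xDBFF #xDFFF
```

Every value produced this way lies in the range #xD800 to #xDFFF, which is why those code points carry no meaning as characters on their own.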

String and Symbol Literals

Many programming languages use a lexical syntax for strings that is similar to the one used by the C language. In particular, Java has extended C's notation for Unicode. Adopting a similar syntax for Scheme has the advantage of making it easier to learn and remember, particularly by programmers accustomed to other languages.

R5RS specifies that the escape sequences \\ and \" can be used in string literals to denote the backslash and doublequote characters respectively. This SRFI introduces new escape sequences so that any Scheme string can be expressed using the ASCII subset of Unicode. Also, most C string literals have the same meaning as a Scheme string literal.

This SRFI also extends the lexical syntax of symbols, and it puts symbols in one-to-one correspondence with immutable strings. In the revised lexical syntax, most Unicode characters can be used directly as symbol characters. Furthermore, an explicitly quoted form for symbols supports an arbitrary sequence of characters in a symbol literal.

Locales

Besides printing and reading characters, humans also compare character strings, and humans perform operations such as changing characters to uppercase. To make programs geographically portable, humans must agree to compare or upcase characters consistently, at least in certain contexts. The Unicode standard provides such standard case mappings on scalar values.

In other contexts, global agreement is unnecessary, and the user's culture should determine a string operation, such as when sorting a list of file names, perhaps case-insensitively. A locale captures information about a user's culture-specific interpretation of character sequences. In particular, a locale determines how strings are sorted, how a lowercase character is converted to an uppercase character, and how strings are compared without regard to case.

String operations such as string-ci=? are not sensitive to the current locale, because they should be portable. A future SRFI might define operations like string-locale-ci=? to produce results that are consistent with the current locale as determined by an implementation.

Not Addressed in this SRFI

This SRFI does not address locales, and it does not address encoding issues, such as how a sequence of bytes in a file is to be decoded into a sequence of characters, or how a filesystem path is encoded as a string.

Specification

Types

This SRFI extends or re-defines the standard types character, string, and symbol.

Character Type

The Scheme character type corresponds to the set of Unicode scalar values. Specifically, each character corresponds to a number in the range [0, #xD7FF] union [#xE000, #x10FFFF], and properties of the character are as defined for the corresponding Unicode scalar value.

The integer->char procedure takes a Unicode scalar value as an exact integer, and it produces the corresponding character. The char->integer procedure takes a character and produces the corresponding scalar value. It is an error to call integer->char with an integer that is not in the range [0, #xD7FF] union [#xE000, #x10FFFF].

Examples:

      (integer->char 32) => #\space
      (char->integer (integer->char 5000)) => 5000
      (integer->char #xD800) => *error*
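
The range restriction on integer->char can be sketched as a predicate. The name unicode-scalar-value? is hypothetical and is not defined by this SRFI; it only restates the two intervals given above.

```scheme
;; Illustrative only: unicode-scalar-value? is a hypothetical helper,
;; not a procedure defined by this SRFI.
;; Returns #t exactly for the exact integers that lie in
;; [0, #xD7FF] union [#xE000, #x10FFFF].
(define (unicode-scalar-value? n)
  (and (integer? n)
       (exact? n)
       (or (<= 0 n #xD7FF)
           (<= #xE000 n #x10FFFF))))

(unicode-scalar-value? 32)        ; => #t
(unicode-scalar-value? #xD800)    ; => #f (surrogate)
(unicode-scalar-value? #x110000)  ; => #f (beyond the Unicode range)
```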

String Type

As in R5RS, a Scheme string is a sequence of Scheme characters. The procedure call (string-ref str i) returns the Scheme character at index i in the string str. The procedure call (string-set! str i char) stores the Scheme character char at index i in the string str, and an unspecified value is returned.

Symbol Type

A symbol is defined by a sequence of characters. The string->symbol procedure works on any string, and symbol->string works on any symbol.

Lexical Syntax

The syntax of Scheme is defined in terms of characters, not bytes, and remains largely unchanged with the refined definition of character. This SRFI defines and extends only the lexical syntax of characters, strings, and symbols.

Character Lexical Syntax

R5RS specifies two lexical syntaxes for characters: named characters, e.g. #\space, and plain characters, e.g. #\[. For consistency with the escape sequences of strings and symbols, the following set of named characters is defined by this SRFI:

#\nul : Unicode 0
#\alarm : Unicode 7
#\backspace : Unicode 8
#\tab : Unicode 9
#\linefeed : Unicode 10
#\newline : Unicode 10 (as in R5RS)
#\vtab : Unicode 11
#\page : Unicode 12
#\return : Unicode 13
#\esc : Unicode 27
#\space : Unicode 32 (as in R5RS)
#\delete : Unicode 127

To allow denoting any Unicode character using the ASCII subset of Unicode and for consistency with the string escape sequences, this SRFI specifies the following escape character syntaxes:

#\x<x><x> : where <x> is a hexadecimal digit and the sequence of two <x>s forms a hexadecimal number between 0 and #xFF
#\u<x><x><x><x> : where <x> is a hexadecimal digit and the sequence of four <x>s forms a hexadecimal number between 0 and #xFFFF excluding the range [#xD800, #xDFFF]
#\U<x><x><x><x><x><x><x><x> : where <x> is a hexadecimal digit and the sequence of eight <x>s forms a hexadecimal number between 0 and #x10FFFF excluding the range [#xD800, #xDFFF]; the range restriction implies that the first two <x>s are 0

In short, \x specifies a character using 2 hex digits, \u using 4 hex digits, and \U using 8 hex digits.

All character syntaxes are case-sensitive, except that <x> can be an uppercase or lowercase hexadecimal digit. Unlike R5RS, every character datum must be followed by a delimiter.
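
As a rough illustration of the hexadecimal character escapes, the sketch below takes the text that follows #\ and returns the scalar value it denotes, or #f when the text is not a valid hex escape. All procedure names here are hypothetical; a real reader would also have to handle named and plain characters and check for a following delimiter.

```scheme
;; Illustrative only: these helpers are hypothetical names, not
;; procedures defined by this SRFI.

;; Value of one hexadecimal digit (either case), or #f.
(define (hex-digit-value c)
  (cond ((and (char>=? c #\0) (char<=? c #\9))
         (- (char->integer c) (char->integer #\0)))
        ((and (char>=? c #\a) (char<=? c #\f))
         (+ 10 (- (char->integer c) (char->integer #\a))))
        ((and (char>=? c #\A) (char<=? c #\F))
         (+ 10 (- (char->integer c) (char->integer #\A))))
        (else #f)))

;; Number of hex digits required after each escape letter, or #f.
(define (escape-digit-count letter)
  (cond ((char=? letter #\x) 2)
        ((char=? letter #\u) 4)
        ((char=? letter #\U) 8)
        (else #f)))

;; Integer denoted by a string of hex digits, or #f if any character
;; is not a hex digit.
(define (hex-string->integer str)
  (let loop ((i 0) (acc 0))
    (if (= i (string-length str))
        acc
        (let ((v (hex-digit-value (string-ref str i))))
          (and v (loop (+ i 1) (+ (* 16 acc) v)))))))

;; body is the text following #\ , for example "u03BB".
(define (hex-escape->scalar-value body)
  (and (> (string-length body) 1)
       (let ((count (escape-digit-count (string-ref body 0)))
             (digits (substring body 1 (string-length body))))
         (and count
              (= (string-length digits) count)
              (let ((n (hex-string->integer digits)))
                (and n
                     (or (<= 0 n #xD7FF) (<= #xE000 n #x10FFFF))
                     n))))))

(hex-escape->scalar-value "xFF")        ; => 255
(hex-escape->scalar-value "u03BB")      ; => 955
(hex-escape->scalar-value "U00006587")  ; => 25991
(hex-escape->scalar-value "u006587")    ; => #f (wrong digit count)
(hex-escape->scalar-value "uD800")      ; => #f (excluded range)
```

A body of just "x" yields #f here because #\x on its own is the plain character x rather than a hex escape.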

Examples:

  #\xFF       ; Unicode 255
  #\u03BB     ; Unicode 955
  #\U00006587 ; Unicode 25991
  #\λ         ; Unicode 955

  #\u006587   ; parse error
  #\λx        ; parse error
  #\alarmx    ; parse error
  #\alarm x   ; Unicode 7 followed by x
  #\Alarm     ; parse error
  #\alert     ; parse error
  #\xFF       ; Unicode 255
  #\xff       ; Unicode 255
  #\x ff      ; Unicode 120 followed by another datum, ff
  #\x(ff)     ; Unicode 120 followed by another datum, a parenthesized ff
  #\(x)       ; parse error
  #\((x)      ; Unicode 40 followed by another datum, parenthesized x
  #\U00110000 ; parse error (out of range)
  #\uD800     ; parse error (in excluded range)

String Lexical Syntax

As in R5RS, a string datum is enclosed between double quotes ("), where a backslash adjusts the meaning of a character within the double quotes. The set of escape sequences is as follows:

\a : alarm, Unicode 7
\b : backspace, Unicode 8
\t : tab, Unicode 9
\n : linefeed, Unicode 10
\v : vertical tab, Unicode 11
\f : formfeed, Unicode 12
\r : return, Unicode 13
\" : doublequote, Unicode 34
\\ : backslash, Unicode 92
\| : vertical bar, Unicode 124
\<linefeed><intraline-whitespace> : nothing, where <linefeed> is Unicode 10, and <intraline-whitespace> is a sequence of non-linefeed whitespace characters (where whitespace is defined in SRFI-14)
\<space> : space, Unicode 32, where <space> is the character Unicode 32 (useful for terminating the previous escape sequence before continuing with whitespace)
\x<x><x> : where <x> is a hexadecimal digit and the sequence of two <x>s forms a hexadecimal number between 0 and #xFF
\u<x><x><x><x> : where <x> is a hexadecimal digit and the sequence of four <x>s forms a hexadecimal number between 0 and #xFFFF excluding the range [#xD800, #xDFFF]
\U<x><x><x><x><x><x><x><x> : where <x> is a hexadecimal digit and the sequence of eight <x>s forms a hexadecimal number between 0 and #x10FFFF excluding the range [#xD800, #xDFFF]; the range restriction implies that the first two <x>s are 0

These escape sequences are case-sensitive, except that <x> can be an uppercase or lowercase hexadecimal digit. As in character constants, \x specifies a character using 2 hex digits, \u using 4 hex digits, and \U using 8 hex digits.

Any other character in a string after a backslash is an error. Any character outside of an escape sequence and not a doublequote stands for itself in the string literal. For example the single-character string "λ" (double quote, a lowercase lambda, double quote) denotes the same string literal as "\u03bb".

Note that the meaning of \u<x><x><x><x> is slightly different in Java, because Java handles this escape sequence in the escape-processing phase, which can be viewed as preprocessing that precedes lexical analysis.
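
The escape rules for strings can be sketched as a small decoder. It takes the raw text between the double quotes and returns the list of scalar values that the literal denotes, or #f for a parse error. All names are hypothetical, the \<linefeed><intraline-whitespace> escape is omitted for brevity, and the sketch assumes that char->integer returns the scalar value of each unescaped character.

```scheme
;; Illustrative only: decode-string-body and its helpers are
;; hypothetical names, not procedures defined by this SRFI.

;; Value of one hexadecimal digit (either case), or #f.
(define (hex-digit-value c)
  (cond ((and (char>=? c #\0) (char<=? c #\9))
         (- (char->integer c) (char->integer #\0)))
        ((and (char>=? c #\a) (char<=? c #\f))
         (+ 10 (- (char->integer c) (char->integer #\a))))
        ((and (char>=? c #\A) (char<=? c #\F))
         (+ 10 (- (char->integer c) (char->integer #\A))))
        (else #f)))

;; Scalar value of a single-character escape, or #f.
(define (simple-escape letter)
  (let ((entry (assv letter
                     '((#\a . 7) (#\b . 8) (#\t . 9) (#\n . 10)
                       (#\v . 11) (#\f . 12) (#\r . 13) (#\" . 34)
                       (#\\ . 92) (#\| . 124) (#\space . 32)))))
    (and entry (cdr entry))))

;; Number of hex digits required after \x, \u, or \U, or #f.
(define (hex-escape-length letter)
  (case letter ((#\x) 2) ((#\u) 4) ((#\U) 8) (else #f)))

;; Read count hex digits of body starting at index start.
;; Returns the number, or #f if digits are missing or malformed.
(define (read-hex body start count)
  (and (<= (+ start count) (string-length body))
       (let loop ((i start) (acc 0))
         (if (= i (+ start count))
             acc
             (let ((v (hex-digit-value (string-ref body i))))
               (and v (loop (+ i 1) (+ (* 16 acc) v))))))))

;; body holds the raw characters between the double quotes.
(define (decode-string-body body)
  (let loop ((i 0) (acc '()))
    (cond ((= i (string-length body)) (reverse acc))
          ((not (char=? (string-ref body i) #\\))
           (loop (+ i 1) (cons (char->integer (string-ref body i)) acc)))
          ((= (+ i 1) (string-length body)) #f)   ; dangling backslash
          (else
           (let* ((letter (string-ref body (+ i 1)))
                  (simple (simple-escape letter))
                  (count (hex-escape-length letter)))
             (cond (simple (loop (+ i 2) (cons simple acc)))
                   (count
                    (let ((n (read-hex body (+ i 2) count)))
                      (and n
                           (or (<= 0 n #xD7FF) (<= #xE000 n #x10FFFF))
                           (loop (+ i 2 count) (cons n acc)))))
                   (else #f)))))))

;; Each backslash is doubled below so that an R5RS reader passes a
;; literal backslash through to the decoder.
(decode-string-body "\\x41bc")   ; => (65 98 99)
(decode-string-body "\\u41bc")   ; => (16828)
(decode-string-body "\\x0041")   ; => (0 52 49)
(decode-string-body "\\u41 bc")  ; => #f
(decode-string-body "\\uD800")   ; => #f
```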

Examples:

 "abc"        ; Unicode sequence 97, 98, 99
 "\x41bc"     ; "Abc", which is Unicode sequence 65, 98, 99
 "\x41 bc"    ; "A bc", which is Unicode sequence 65, 32, 98, 99
 "\u41bc"     ; Unicode sequence 16828
 "\u41 bc"    ; parse error
 "\u41"       ; parse error
 "\x0041"     ; Unicode sequence 0, 52, 49
 "\u0041"     ; "A", which is Unicode sequence 65
 "\U0041"     ; parse error
 "\U00000041" ; "A", which is Unicode sequence 65
 "\U0010FFFF" ; Unicode sequence #x10FFFF
 "\U00110000" ; parse error (out of range)
 "\uD800"     ; parse error (in excluded range)

Symbol Lexical Syntax

The syntax of symbols extends R5RS in three ways:

Symbols are case-sensitive.
Where R5RS allows a <letter>, this SRFI allows any character whose scalar value is greater than 127 and that is not considered whitespace according to SRFI-14.
A vertical bar (|) begins a quoted symbol. Like the double-quote character for strings, the vertical-bar character indicates both the start and end of a symbol. The characters between the two vertical bars denote the symbol's constituent characters, and they are parsed as for strings, including the treatment of escape sequences. Unlike strings, double-quote characters that are part of the symbol need not be escaped, whereas vertical-bar characters in the symbol must be escaped.

Examples:

      'Hello => Hello

      'λ => λ

      '|Hello| => Hello

      (symbol->string '|a "b\" \|c\| \n|) => "a \"b\" |c| \n"
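
Going in the other direction, a printer could produce a quoted symbol literal from a symbol's name. The sketch below is one minimal way to do so; quoted-symbol-text is a hypothetical name, and the sketch escapes only vertical bars and backslashes, leaving every other character (including double quotes) as is.

```scheme
;; Illustrative only: quoted-symbol-text is a hypothetical helper, not
;; a procedure defined by this SRFI.
;; Given a symbol's name as a string, return the text of a quoted
;; symbol literal for it, escaping vertical bars and backslashes.
(define (quoted-symbol-text name)
  (let loop ((chars (string->list name))
             (acc (list #\|)))             ; opening vertical bar
    (cond ((null? chars)
           (list->string (reverse (cons #\| acc))))   ; closing bar
          ((or (char=? (car chars) #\|) (char=? (car chars) #\\))
           (loop (cdr chars) (cons (car chars) (cons #\\ acc))))
          (else
           (loop (cdr chars) (cons (car chars) acc))))))

(quoted-symbol-text "Hello")  ; => the 7-character text |Hello|
(quoted-symbol-text "a|b")    ; => the 6-character text |a\|b|
```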

Procedures

This SRFI defines and extends procedures for characters and strings. Programmers should recognize that these procedures may not produce results that an end-user would consider sensible with a particular locale. This SRFI defines no locale-sensitive operations.

Character Procedures

Character-comparison procedures are defined as follows:

   ;; char-comparator itself is not part of this SRFI; it is
   ;; used only to define other procedures
   (define (char-comparator num-comp)
     (lambda (a-char b-char)
       (num-comp (char->integer a-char) (char->integer b-char))))

   (define char=? (char-comparator =))
   (define char<? (char-comparator <))
   (define char>? (char-comparator >))
   (define char<=? (char-comparator <=))
   (define char>=? (char-comparator >=))

Unicode defines locale-independent mappings from scalar values to scalar values for upcase, downcase, titlecase, and case-folding operations. (These mappings can be extracted from UnicodeData.txt and CaseFolding.txt from the Unicode Consortium.) The following Scheme procedures map characters consistent with the Unicode specification:

char-upcase
char-downcase
char-titlecase
char-foldcase

These procedures take a character argument and return a character result. If the argument is an uppercase or titlecase character, and if there is a single character which is its lowercase form, then char-downcase returns that character. If the argument is a lowercase or titlecase character, and if there is a single character which is its uppercase form, then char-upcase returns that character. Otherwise, the character returned is the same as the argument. Note that this is an incomplete approximation to case conversion, even ignoring the user's locale; in general, case mappings require the context of a string, both in arguments and in result. See string-upcase and string-downcase for more general case-conversion procedures.

Case-insensitive character-comparison procedures are defined as follows:

   ;; char-ci-comparator itself is not part of this SRFI; it is
   ;; used only to define other procedures
   (define (char-ci-comparator cs-comp)
     (lambda (a-char b-char)
       (cs-comp (char-foldcase a-char) (char-foldcase b-char))))

   (define char-ci=? (char-ci-comparator char=?))
   (define char-ci<? (char-ci-comparator char<?))
   (define char-ci>? (char-ci-comparator char>?))
   (define char-ci<=? (char-ci-comparator char<=?))
   (define char-ci>=? (char-ci-comparator char>=?))

Among the following predicates, the first three are as defined by SRFI-14; the last three are defined as scalar values having the Unicode "Uppercase" property, the "Lowercase" property, and the "Lt" general category, respectively.

char-alphabetic?
char-numeric?
char-whitespace?
char-upper-case?
char-lower-case?
char-title-case?

Examples:

  (char<? #\z #\ß) => #t
  (char<? #\z #\Z) => #f
  (char-ci<? #\z #\Z) => #f
  (char-ci=? #\z #\Z) => #t
  (char-ci=? #\ς #\σ) => #t

  (char-upcase #\i) => #\I
  (char-downcase #\i) => #\i
  (char-titlecase #\i) => #\I
  (char-foldcase #\i) => #\i

  (char-upcase #\ß) => #\ß
  (char-downcase #\ß) => #\ß
  (char-titlecase #\ß) => #\ß
  (char-foldcase #\ß) => #\ß

  (char-upcase #\Σ) => #\Σ
  (char-downcase #\Σ) => #\σ
  (char-titlecase #\Σ) => #\Σ
  (char-foldcase #\Σ) => #\σ

  (char-upcase #\ς) => #\Σ
  (char-downcase #\ς) => #\ς
  (char-titlecase #\ς) => #\Σ
  (char-foldcase #\ς) => #\σ

  (char-alphabetic? #\a) => #t
  (char-numeric? #\1) => #t
  (char-whitespace? #\space) => #t
  (char-whitespace? #\u00A0) => #t
  (char-upper-case? #\Σ) => #t
  (char-lower-case? #\σ) => #t
  (char-lower-case? #\u00AA) => #t
  (char-title-case? #\I) => #f
  (char-title-case? #\u01C5) => #t

String Procedures

The following string procedures are defined as in R5RS, which means that they are defined by pointwise operation on the string's characters:

string<?
string>?
string=?
string<=?
string>=?
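
R5RS describes the string orderings as the lexicographic extension of the corresponding character orderings. Combined with the scalar-value definition of char<? given earlier, string<? can be sketched as follows; string-less? is a hypothetical name used here to avoid redefining the standard procedure.

```scheme
;; Illustrative only: string-less? is a hypothetical stand-in for
;; string<?, written in terms of char<?.
(define (string-less? a b)
  (let loop ((i 0))
    (cond ((= i (string-length b)) #f)   ; b exhausted: a is not smaller
          ((= i (string-length a)) #t)   ; a is a proper prefix of b
          ((char<? (string-ref a i) (string-ref b i)) #t)
          ((char<? (string-ref b i) (string-ref a i)) #f)
          (else (loop (+ i 1))))))

(string-less? "z" "zz")     ; => #t
(string-less? "z" "Z")      ; => #f
(string-less? "abc" "abc")  ; => #f
```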

The following string operations are not defined in terms of character-by-character conversions. Instead, they are defined in terms of Unicode's locale-independent string mappings from scalar-value sequences to scalar-value sequences. (These mappings can be extracted from UnicodeData.txt, SpecialCasing.txt, and CaseFolding.txt from the Unicode Consortium.) In particular, the length of the result string can be different from the length of the input string:

string-upcase
string-downcase
string-titlecase
string-foldcase

These procedures take a string argument and return a string result. The string-upcase procedure converts a string to uppercase, string-downcase converts a string to lowercase, and string-foldcase converts each character in the string to its case-folded representative(s). The string-titlecase procedure converts the first character to titlecase in each contiguous sequence of cased characters within the string, and it downcases all other cased characters; for the purposes of detecting cased-character sequences, case-ignorable characters are ignored (i.e., they do not interrupt the sequence). Since each of these procedures is locale-independent, the results are still suboptimal for some locales, but this SRFI defines no locale-sensitive operations.
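
The titlecase rule can be approximated for ASCII text with the sketch below. It is not the Unicode algorithm: simple-titlecase is a hypothetical name, alphabetic characters stand in for cased characters, the apostrophe is the only character treated as case-ignorable, and char-upcase is used in place of char-titlecase.

```scheme
;; Illustrative only: simple-titlecase is a hypothetical, ASCII-oriented
;; approximation of string-titlecase, under the assumptions stated above.
(define (simple-titlecase str)
  (let loop ((chars (string->list str))
             (in-word? #f)     ; #t while inside a cased-character sequence
             (acc '()))
    (cond ((null? chars) (list->string (reverse acc)))
          ((char-alphabetic? (car chars))
           (loop (cdr chars)
                 #t
                 (cons (if in-word?
                           (char-downcase (car chars))
                           (char-upcase (car chars)))
                       acc)))
          ((char=? (car chars) #\')
           ;; case-ignorable: copy it without interrupting the sequence
           (loop (cdr chars) in-word? (cons (car chars) acc)))
          (else
           ;; any other character ends the current sequence
           (loop (cdr chars) #f (cons (car chars) acc))))))

(simple-titlecase "kNock KNoCK")   ; => "Knock Knock"
(simple-titlecase "who's there?")  ; => "Who's There?"
(simple-titlecase "r6rs")          ; => "R6Rs"
```

The digit in "r6rs" is not cased, so it ends one sequence and the following letter starts another, while the apostrophe in "who's" does not.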

Case-insensitive string-comparison procedures are defined as follows:

   ;; string-ci-comparator itself is not part of this SRFI; it is
   ;; used only to define other procedures
   (define (string-ci-comparator cs-comp)
     (lambda (a-string b-string)
       (cs-comp (string-foldcase a-string) (string-foldcase b-string))))

   (define string-ci=? (string-ci-comparator string=?))
   (define string-ci<? (string-ci-comparator string<?))
   (define string-ci>? (string-ci-comparator string>?))
   (define string-ci<=? (string-ci-comparator string<=?))
   (define string-ci>=? (string-ci-comparator string>=?))

Examples:

  (string<? "z" "ß") => #t
  (string<? "z" "zz") => #t
  (string<? "z" "Z") => #f
  (string=? "Straße" "Strasse") => #f

  (string-upcase "Hi") => "HI"
  (string-downcase "Hi") => "hi"
  (string-foldcase "Hi") => "hi"

  (string-upcase "Straße") => "STRASSE"
  (string-downcase "Straße") => "straße"
  (string-foldcase "Straße") => "strasse"
  (string-downcase "STRASSE")  => "strasse"

  (string-upcase "ΧΑΟΣ") => "ΧΑΟΣ"
  (string-downcase "Σ") => "σ"
  (string-downcase "ΧΑΟΣ") => "χαος"
  (string-downcase "ΧΑΟΣΣ") => "χαοσς"
  (string-downcase "ΧΑΟΣ Σ") => "χαος σ"
  (string-foldcase "ΧΑΟΣΣ") => "χαοσσ"
  (string-upcase "χαος") => "ΧΑΟΣ"
  (string-upcase "χαοσ") => "ΧΑΟΣ"
  
  (string-titlecase "kNock KNoCK") => "Knock Knock"
  (string-titlecase "who's there?") => "Who's There?"
  (string-titlecase "r6rs") => "R6Rs"
  (string-titlecase "R6RS") => "R6Rs"

  (string-ci<? "z" "Z") => #f
  (string-ci=? "z" "Z") => #t
  (string-ci=? "Straße" "Strasse") => #t
  (string-ci=? "Straße" "STRASSE") => #t
  (string-ci=? "ΧΑΟΣ" "χαοσ") => #t

Reference Implementation

MzScheme version 299.108 and up implements the character and string operations described in this SRFI. A portable implementation using R5RS Scheme (to implement a distinct character type) will accompany a future draft of this SRFI.

The file r6rs-reader.ss implements a MzScheme-specific reader for the character, string, and symbol syntax described in this SRFI. This reader uses readtable support from MzScheme 299.103 and up. It does not fully restrict unquoted symbols to the syntax of R5RS; for example, -> is parsed as a symbol. The reference reader does, however, disable MzScheme's backslash escape for symbols (making an unquoted backslash illegal). The file r6rs-reader-test.ss contains a test suite.

Acknowledgements

This SRFI was written in consultation with the full set of R6RS editors: Will Clinger, Kent Dybvig, Marc Feeley, Matthew Flatt, Manuel Serrano, Michael Sperber, and Anton van Straaten. In addition, Ray Dillinger supplied part of the text, and many other helpful suggestions from the SRFI-75 mailing list have been incorporated.

References

The Unicode Consortium publishes the Unicode standard and several Unicode related documents on their web page.
Unicode Demystified by Richard Gillam, Addison-Wesley Professional, 2002.

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Editor: Mike Sperber