Title

R6RS Unicode data

Authors

Matthew Flatt and Marc Feeley

Status

This SRFI is being submitted by members of the Scheme Language Editor's Committee as part of the R6RS Scheme standardization process. The purpose of such ``R6RS SRFIs'' is to inform the Scheme community of features and design ideas under consideration by the editors and to allow the community to give the editors some direct feedback that will be considered during the design process.

At the end of the discussion period, this SRFI will be withdrawn. When the R6RS specification is finalized, the SRFI may be revised to conform to the R6RS specification and then resubmitted with the intent to finalize it. This procedure aims to avoid the situation where this SRFI is inconsistent with R6RS. An inconsistency between R6RS and this SRFI could confuse some users. Moreover it could pose implementation problems for R6RS compliant Scheme systems that aim to support this SRFI. Note that departures from the SRFI specification by the Scheme Language Editor's Committee may occur due to other design constraints, such as design consistency with other features that are not under discussion as SRFIs.

This SRFI is currently in ``draft'' status. To see an explanation of each status that a SRFI can hold, see here. It will remain in draft status until 2005/09/08, or as amended. To provide input on this SRFI, please mailto:srfi-75@srfi.schemers.org. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.

Received: 2005/07/11
Draft: 2005/07/12 - 2005/09/08

Abstract

Unicode is a widespread universal character code that supports most of the world's (natural) languages. The extensions to Scheme specified in this SRFI concern the support of Unicode in Scheme's character, string and symbol datatypes. This SRFI does not (fully) specify how I/O of Unicode data is performed or how Scheme source code is encoded in files. These aspects are left for other SRFIs to specify.

Issues

The extension of symbol syntax to include all non-whitespace characters above Unicode 127 may be too liberal. At the same time, it does not extend the set of symbols to include sensible ASCII combinations, such as ->. Finally, it may not be necessary to place symbols in one-to-one correspondence with strings (which motivates the new explicitly-quoted syntax for symbol literals).
C and many other languages (including some Scheme implementations) support octal notation within strings and characters. Octal notation is not included in this draft because the notation seems no longer as popular as other formats (with the notable exception of \0), and the variable-width encoding is potentially confusing.
This draft includes both #\newline and #\linefeed as character constants. The former is compatible with R5RS, but the latter is arguably preferable. Maybe we should pick one.
The syntax for numerical scalar values in character and string literals --- using \x, \u, and \U --- avoids the variable-length encoding of C's \x, but it's an ad hoc mixture of various standards. Another possibility would be to use a delimited Scheme number within a string, as in Gambit.

Here strings might be generalized as in Perl, where multiple #<< can appear on a line, and the text completing each one follows on later lines:

        (string-append #<<ONE #<<TWO)
        a b c
        d e f
        ONE
        g h i
        j k l
        TWO
        => "a b c\nd e fg h i\nj k l"

The \<newline><intraline-whitespace> may not be necessary. Meanwhile, unescaped newlines perhaps should be prohibited in strings.
The current draft provides no locale-specific operations on strings. Such operations probably belong in another standard, but a few placeholders might be useful here.

Rationale

The manipulation of text is a fundamental information processing task. Increasingly it is important for software to process text in a variety of natural languages, possibly multiple languages in the same document. The Unicode standard specifies how the textual data of most of the world's languages is represented and handled. Several operating systems, programming languages, libraries, and software tools have now embraced the Unicode standard. Adding Unicode support to Scheme, as specified by this SRFI, will allow

multilingual text processing
internationalization (adaptation of software to the linguistic preference of its users)
improved interoperability with operating systems, programming languages, libraries, and software tools that support Unicode
improved portability of programs between R6RS Scheme implementations

Unicode Background

Unicode defines a standard mapping between sequences of code points (integers in the range 0 to #x10FFFF in the latest version of the standard) and human-readable ``characters.'' More precisely, Unicode distinguishes between glyphs, which are printed for humans to read, and characters, which are abstract entities that map to glyphs (sometimes in a way that's sensitive to surrounding characters). Furthermore, different sequences of code points sometimes correspond to the same character. The relationships among code points, characters, and glyphs are subtle and complex.

Despite this complexity, most things that a literate human would call a ``character'' can be represented by a single code point in Unicode (though there may exist code-point sequences that represent that same character). For example, Roman letters, Cyrillic letters, Hebrew consonants, and most Chinese characters fall into this category. Thus, the ``code point'' approximation of ``character'' works well for many purposes. It is thus appropriate to define Scheme characters as Unicode scalar values, which includes all code points except those designated as surrogates. A surrogate is a code point in the range #xd800 to #xdfff that is used in pairs in the UTF-16 encoding to encode a supplementary character (whose code is in the range #x10000 to #x10ffff).
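
To make the relationship between supplementary code points and surrogates concrete, the following sketch computes the UTF-16 surrogate pair for a code point in the range #x10000 to #x10ffff, and the inverse. The procedure names code-point->surrogates and surrogates->code-point are hypothetical and are not part of this SRFI; the arithmetic is the standard UTF-16 rule. The sketch works on exact integers only, since surrogates are not Scheme characters under this SRFI.

```scheme
;; Hypothetical helper, not part of this SRFI: split a supplementary
;; code point (#x10000 to #x10ffff) into its UTF-16 surrogate pair,
;; returned as a list (high low) of exact integers.
(define (code-point->surrogates cp)
  (if (or (< cp #x10000) (> cp #x10ffff))
      (error "not a supplementary code point:" cp)
      (let ((offset (- cp #x10000)))                  ; a 20-bit value
        (list (+ #xd800 (quotient offset #x400))      ; high surrogate
              (+ #xdc00 (remainder offset #x400)))))) ; low surrogate

;; Hypothetical inverse: recombine a high and a low surrogate.
(define (surrogates->code-point high low)
  (+ #x10000
     (* (- high #xd800) #x400)
     (- low #xdc00)))

(code-point->surrogates #x1d11e)        ; => (55348 56606), i.e. #xd834 #xdd1e
(surrogates->code-point #xd834 #xdd1e)  ; => 119070, i.e. #x1d11e
```

Both surrogates fall inside the excluded range #xd800 to #xdfff, which is why they can be told apart from scalar values in a UTF-16 stream.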

String and Symbol Literals

Many programming languages use a lexical syntax for strings that is similar to the one used by the C language. In particular, Java has extended C's notation for Unicode. Adopting a similar syntax for Scheme has the advantage of making it easier to learn and remember, particularly by programmers accustomed to other languages.

R5RS specifies that the escape sequences \\ and \" can be used in string literals to denote the backslash and doublequote characters respectively. This SRFI introduces new escape sequences so that any Scheme string can be expressed using the ASCII subset of Unicode. Also, most C string literals have the same meaning as a Scheme string literal.

This SRFI also extends the lexical syntax of symbols, and it puts symbols in one-to-one correspondence with immutable strings. In the revised lexical syntax, most Unicode characters can be used directly as symbol characters. Furthermore, an explicitly quoted form for symbols supports an arbitrary sequence of characters in a symbol literal.

Locales

Besides printing and reading characters, humans also compare character strings, and humans perform operations such as changing characters to uppercase. To make programs geographically portable, humans must agree to compare or upcase characters consistently, at least in certain contexts. The Unicode standard provides such standard case mappings on scalar values.

In other contexts, global agreement is unnecessary, and the user's culture should determine a string operation, such as when sorting a list of file names, perhaps case-insensitively. A locale captures information about a user's culture-specific interpretation of character sequences. In particular, a locale determines how strings are sorted, how a lowercase character is converted to an uppercase character, and how strings are compared without regard to case.

String operations such as string-ci=? are not sensitive to the current locale, because they should be portable. A future SRFI might define operations like string-locale-ci=? to produce results that are consistent with the current locale as determined by an implementation.

Not Addressed in this SRFI

This SRFI does not address locales, and it does not address encoding issues, such as how a sequence of bytes in a file is to be decoded into a sequence of characters, or how a filesystem path is encoded as a string.

Specification

Types

This SRFI extends or re-defines the standard types character, string, and symbol.

Character Type

The Scheme character type corresponds to the set of Unicode scalar values. Specifically, each character corresponds to a number in the range [0, #xd7ff] union [#xe000, #x10ffff], and properties of the character are as defined for the corresponding Unicode scalar value.

The integer->char procedure takes a Unicode scalar value as an exact integer, and it produces the corresponding character. The char->integer procedure takes a character and produces the corresponding scalar value. It is an error to call integer->char with an integer that is not in the range [0, #xd7ff] union [#xe000, #x10ffff].

Examples:

      (integer->char 32) => #\space
      (char->integer (integer->char 5000)) => 5000
      (integer->char #xd800) => *error*
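
Because calling integer->char outside the scalar-value range is an error, a program that receives integers from an untrusted source may want to test them first. The sketch below is one possible way to do that; scalar-value? and integer->char/checked are hypothetical names, not procedures defined by this SRFI.

```scheme
;; Hypothetical predicate: is n an exact integer in
;; [0, #xd7ff] union [#xe000, #x10ffff]?
(define (scalar-value? n)
  (and (integer? n)
       (exact? n)
       (or (<= 0 n #xd7ff)
           (<= #xe000 n #x10ffff))))

;; Hypothetical wrapper: return the character for n, or #f when n is
;; not a scalar value (instead of signalling an error).
(define (integer->char/checked n)
  (if (scalar-value? n)
      (integer->char n)
      #f))

(scalar-value? 32)               ; => #t
(scalar-value? #xd800)           ; => #f (a surrogate)
(scalar-value? #x110000)         ; => #f (beyond the Unicode range)
(integer->char/checked #xd800)   ; => #f
```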

String Type

As in R5RS, a Scheme string is a fixed-length array of Scheme characters. The procedure call (string-ref str i) returns the Scheme character at index i in the string str. The procedure call (string-set! str i char) stores the Scheme character char at index i in the string str, and an unspecified value is returned.

Symbol Type

A symbol is defined by a sequence of characters. The string->symbol procedure works on any string, and symbol->string works on any symbol.

Lexical Syntax

The syntax of Scheme is defined in terms of characters, not bytes, and remains largely unchanged with the refined definition of character. This SRFI defines and extends only the lexical syntax of characters, strings, and symbols.

Character Lexical Syntax

R5RS specifies two lexical syntaxes for characters: named characters, e.g. #\space, and plain characters, e.g. #\[. For consistency with the escape sequences of strings and symbols, the following set of named characters is defined by this SRFI:

#\nul : Unicode 0
#\alarm : Unicode 7
#\backspace : Unicode 8
#\tab : Unicode 9
#\linefeed : Unicode 10
#\newline : Unicode 10 (as in R5RS)
#\vtab : Unicode 11
#\page : Unicode 12
#\return : Unicode 13
#\esc : Unicode 27
#\space : Unicode 32 (as in R5RS)
#\rubout : Unicode 127

Note that the linefeed character has two external representations: #\linefeed and #\newline.

To allow denoting any Unicode character using the ASCII subset of Unicode and for consistency with the string escape sequences, this SRFI specifies these escape character syntaxes:

#\x<x><x> : where <x> is a hexadecimal digit and the sequence of two <x>s forms a hexadecimal number between 0 and #xff
#\u<x><x><x><x> : where <x> is a hexadecimal digit and the sequence of four <x>s forms a hexadecimal number between 0 and #xffff excluding the range [#xd800, #xdfff]
#\U<x><x><x><x><x><x><x><x> : where <x> is a hexadecimal digit and the sequence of eight <x>s forms a hexadecimal number between 0 and #x10ffff excluding the range [#xd800, #xdfff]; the range restriction implies that the first two <x>s are 0

All character syntaxes are case-sensitive, except that <x> can be an uppercase or lowercase hexadecimal digit. Unlike R5RS, every character datum must be followed by a delimiter.

Examples:

  #\xFF       ; Unicode 255
  #\u03BB     ; Unicode 955
  #\U00006587 ; Unicode 25991
  #\λ         ; Unicode 955

  #\u006587   ; parse error
  #\λx        ; parse error
  #\alarmx    ; parse error
  #\alarm x   ; Unicode 7 followed by x
  #\Alarm     ; parse error
  #\alert     ; parse error
  #\xFF       ; Unicode 255
  #\xff       ; Unicode 255
  #\x ff      ; Unicode 120 followed by ff
  #\x(ff)     ; Unicode 120 followed by a parenthesized ff
  #\(x)       ; parse error
  #\((x)      ; Unicode 40 followed by a parenthesized x
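
The digit-count and range rules for the three hexadecimal escapes can be sketched as a small validator. This is only an illustration written with R5RS procedures, not the reference reader; parse-hex-escape and hex-digit-value are hypothetical names. The argument is the text that follows #\ in the datum, and the caller is assumed to have already split the datum at its delimiter.

```scheme
;; Hypothetical helper: numeric value of one hexadecimal digit, or #f.
(define (hex-digit-value c)
  (let ((n (char->integer c)))
    (cond ((and (char>=? c #\0) (char<=? c #\9))
           (- n (char->integer #\0)))
          ((and (char>=? c #\a) (char<=? c #\f))
           (+ 10 (- n (char->integer #\a))))
          ((and (char>=? c #\A) (char<=? c #\F))
           (+ 10 (- n (char->integer #\A))))
          (else #f))))

;; Hypothetical validator: TEXT is what follows "#\", e.g. "u03BB".
;; Returns the scalar value, or #f if TEXT is not a well-formed
;; x/u/U escape with exactly 2/4/8 digits denoting a scalar value.
(define (parse-hex-escape text)
  (define (expected-digits marker)      ; the marker is case-sensitive
    (cond ((char=? marker #\x) 2)
          ((char=? marker #\u) 4)
          ((char=? marker #\U) 8)
          (else #f)))
  (if (= (string-length text) 0)
      #f
      (let ((count (expected-digits (string-ref text 0))))
        (and count
             (= (string-length text) (+ count 1))
             (let loop ((i 1) (value 0))
               (if (> i count)
                   ;; all digits consumed: reject surrogates and
                   ;; values beyond #x10ffff
                   (and (or (<= value #xd7ff)
                            (<= #xe000 value #x10ffff))
                        value)
                   (let ((d (hex-digit-value (string-ref text i))))
                     (and d (loop (+ i 1) (+ (* value 16) d))))))))))

(parse-hex-escape "xFF")        ; => 255
(parse-hex-escape "u03BB")      ; => 955
(parse-hex-escape "U00006587")  ; => 25991
(parse-hex-escape "u006587")    ; => #f (six digits after u)
(parse-hex-escape "uD800")      ; => #f (surrogate)
```

A result of #f for "x" alone does not mean the datum is invalid; it only means the text is not a hexadecimal escape, so #\x still denotes the letter x.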

String Lexical Syntax

A string can be written in either of two forms: using R5RS-style double quotes, or using a new here-string style.

Quoted Strings

As in R5RS, a string datum can be enclosed between double quotes ("), where a backslash adjusts the meaning of a character within the double quotes. The set of escape sequences is as follows:

\a : alarm, Unicode 7
\b : backspace, Unicode 8
\t : tab, Unicode 9
\n : linefeed, Unicode 10
\v : vertical tab, Unicode 11
\f : formfeed, Unicode 12
\r : return, Unicode 13
\" : doublequote, Unicode 34
\' : quote, Unicode 39
\? : question mark, Unicode 63
\\ : backslash, Unicode 92
\| : vertical bar, Unicode 124
\<newline><intraline-whitespace> : nothing, where <newline> is Unicode 10, and <intraline-whitespace> is a sequence of non-newline whitespace characters (where whitespace is defined in SRFI-14)
\<space> : space, Unicode 32, where <space> is the character Unicode 32 (useful for terminating the previous escape sequence before continuing with whitespace)
\x<x><x> : where <x> is a hexadecimal digit and the sequence of two <x>s forms a hexadecimal number between 0 and #xff
\u<x><x><x><x> : where <x> is a hexadecimal digit and the sequence of four <x>s forms a hexadecimal number between 0 and #xffff excluding the range [#xd800, #xdfff]
\U<x><x><x><x><x><x><x><x> : where <x> is a hexadecimal digit and the sequence of eight <x>s forms a hexadecimal number between 0 and #x10ffff excluding the range [#xd800, #xdfff]; the range restriction implies that the first two <x>s are 0

These escape sequences are case-sensitive, except that <x> can be an uppercase or lowercase hexadecimal digit.

Any other character in a string after a backslash is an error. Any character outside of an escape sequence and not a doublequote stands for itself in the string literal. For example the single-character string "λ" (double quote, a lowercase lambda, double quote) denotes the same string literal as "\u03bb".
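
As an illustration of how any string can be expressed with ASCII characters only, the sketch below renders a string as a quoted literal using the escapes listed above. The names hex-pad, char->ascii-escape, and string->ascii-literal are hypothetical and not part of this SRFI, and the choice of which escape to emit for each character is just one reasonable policy. The sketch assumes an implementation whose characters cover the full scalar-value range.

```scheme
;; Hypothetical helper: hexadecimal text of n, zero-padded to width.
(define (hex-pad n width)
  (let loop ((s (number->string n 16)))
    (if (< (string-length s) width)
        (loop (string-append "0" s))
        s)))

;; Hypothetical helper: the literal text for one character.
(define (char->ascii-escape c)
  (let ((n (char->integer c)))
    (cond ((char=? c #\") "\\\"")                 ; doublequote
          ((char=? c #\\) "\\\\")                 ; backslash
          ((= n 10) "\\n")                        ; linefeed
          ((or (< n 32) (= n 127))                ; other control characters
           (string-append "\\x" (hex-pad n 2)))
          ((< n 128) (string c))                  ; printable ASCII
          ((<= n #xffff) (string-append "\\u" (hex-pad n 4)))
          (else (string-append "\\U" (hex-pad n 8))))))

;; Hypothetical procedure: an ASCII-only literal denoting str.
(define (string->ascii-literal str)
  (let loop ((i 0) (parts '()))
    (if (= i (string-length str))
        (apply string-append
               (cons "\"" (reverse (cons "\"" parts))))
        (loop (+ i 1)
              (cons (char->ascii-escape (string-ref str i)) parts)))))

;; A string holding "a" followed by a lowercase lambda is rendered as
;; the eight characters "a\u03bb" (the case of the hexadecimal digits
;; depends on the implementation's number->string).
(string->ascii-literal (string #\a (integer->char 955)))
```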

Note that the meaning of \u<x><x><x><x> is slightly different in Java, because Java handles this escape sequence in the escape-processing phase, which can be viewed as preprocessing that precedes lexical analysis.

Here Strings

Here strings are a new syntax for denoting string literals that eliminates the quoting problem inherent in the traditional string syntax. The sequence #<< marks the start of a here string. The part of the line after the #<< up to and including the newline character (Unicode 10) is the key. The first line afterward that matches the key marks the end of the here string. The string contains all the characters between the start key and the end key, with the exception of the newline before the end key. The end key can be terminated by an end-of-file instead of a newline.

Example:

      #<<THE-END
      printf("hello\n");
      THE-END
        => "printf(\"hello\\n\");"
      
      #<<THE-END
      printf("hello\n");
      
      THE-END
        => "printf(\"hello\\n\");\n"

Here strings are particularly useful for including verbatim the text of some other document that has its own syntax, which might conflict with the traditional Scheme string syntax, for example a source code fragment or a TeX document.

The here string syntax of this SRFI is based on the line-delimited here strings of Scsh (see Scsh Reference Manual), which in turn resemble ``here documents'' in shells such as bash and csh.
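
The end-key rule can be sketched on input that has already been split into lines. This is a simplified model rather than a reader: here-string-body and join-lines are hypothetical names, the key is passed without its trailing newline, and lines is assumed to be the list of lines that follow the #<< line, each without its newline.

```scheme
;; Hypothetical helper: join lines with a linefeed between each pair.
(define (join-lines lines)
  (if (null? lines)
      ""
      (let loop ((rest (cdr lines)) (result (car lines)))
        (if (null? rest)
            result
            (loop (cdr rest)
                  (string-append result (string #\newline) (car rest)))))))

;; Hypothetical procedure: the contents of a here string whose key is
;; KEY, or #f if no later line matches the key.
(define (here-string-body key lines)
  (let loop ((rest lines) (collected '()))
    (cond ((null? rest) #f)                      ; end key never found
          ((string=? (car rest) key)             ; first matching line ends it
           (join-lines (reverse collected)))
          (else (loop (cdr rest) (cons (car rest) collected))))))

;; One body line: no trailing newline in the result.
(here-string-body "THE-END"
                  (list "printf(\"hello\\n\");" "THE-END"))
;; => "printf(\"hello\\n\");"

;; A blank line before the key contributes one newline.
(here-string-body "THE-END"
                  (list "printf(\"hello\\n\");" "" "THE-END"))
;; => "printf(\"hello\\n\");\n"
```

Joining the collected lines with a linefeed between them, rather than after each one, is what drops the newline that precedes the end key.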

Symbol Lexical Syntax

The syntax of symbols extends R5RS in three ways:

Symbols are case-sensitive.
Where R5RS allows a <letter>, this SRFI allows any character whose scalar value is greater than 127 and that is not considered whitespace according to SRFI-14.
A vertical bar (|) begins a quoted symbol. Like the double-quote character for strings, the vertical-bar character indicates both the start and end of a symbol. The characters between the two vertical bars denote the symbol's constituent characters, and they are parsed as for strings, including the treatment of escape sequences. Unlike strings, double-quote characters that are part of the symbol need not be escaped, whereas vertical-bar characters in the symbol must be escaped.

Examples:

      'Hello => Hello

      'λ => λ

      '|Hello| => Hello

      (symbol->string '|a "b\" \|c\| \n|) => "a \"b\" |c| \n"
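
Going in the other direction, a printer could produce the quoted form by escaping only vertical bars and backslashes. The sketch below shows that idea on the symbol's name as a string; string->quoted-symbol-text is a hypothetical name, and a real printer would presumably also choose escapes for control characters.

```scheme
;; Hypothetical procedure: the quoted-symbol text for a symbol whose
;; name is str. Vertical bars and backslashes are escaped with a
;; backslash; double quotes are left as they are.
(define (string->quoted-symbol-text str)
  (let loop ((i 0) (parts (list "|")))
    (if (= i (string-length str))
        (apply string-append (reverse (cons "|" parts)))
        (let ((c (string-ref str i)))
          (loop (+ i 1)
                (cons (if (or (char=? c #\|) (char=? c #\\))
                          (string #\\ c)
                          (string c))
                      parts))))))

(string->quoted-symbol-text "Hello")         ; => "|Hello|"
(string->quoted-symbol-text "a \"b\" |c|")   ; => "|a \"b\" \\|c\\||"
```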

Procedures

This SRFI defines and extends procedures for characters and strings.

Character Procedures

Character-comparison procedures are defined as follows:

   ;; char-comparator itself is not part of this SRFI; it is
   ;; used only to define other procedures
   (define (char-comparator num-comp)
     (lambda (a-char b-char)
       (num-comp (char->integer a-char) (char->integer b-char))))

   (define char=? (char-comparator =))
   (define char<? (char-comparator <))
   (define char>? (char-comparator >))
   (define char<=? (char-comparator <=))
   (define char>=? (char-comparator >=))

Unicode defines locale-independent mappings from scalar values to scalar values for upcase, downcase, and titlecase operations. (These mappings can be extracted from UnicodeData.txt from the Unicode Consortium.) The following Scheme procedures map characters consistently with the Unicode specification:

char-upcase
char-downcase
char-titlecase

Case-insensitive character-comparison procedures are defined as follows:

   ;; char-ci-comparator itself is not part of this SRFI; it is
   ;; used only to define other procedures
   (define (char-ci-comparator cs-comp)
     (lambda (a-char b-char)
       (cs-comp (char-downcase a-char) (char-downcase b-char))))

   (define char-ci=? (char-ci-comparator char=?))
   (define char-ci<? (char-ci-comparator char<?))
   (define char-ci>? (char-ci-comparator char>?))
   (define char-ci<=? (char-ci-comparator char<=?))
   (define char-ci>=? (char-ci-comparator char>=?))

Programmers should recognize that these procedures may not produce results that an end-user would consider sensible with a particular locale. This SRFI defines no locale-sensitive comparisons for characters.

The following predicates are as defined by SRFI-14:

char-alphabetic?
char-lower-case?
char-upper-case?
char-title-case?
char-numeric?
char-symbolic?
char-punctuation?
char-graphic?
char-whitespace?
char-blank?
char-iso-control?

String Procedures

The following string procedures are defined as in R5RS, which means that they are defined by pointwise operation on the string's characters:

string<?
string>?
string=?
string<=?
string>=?
string-ci<?
string-ci>?
string-ci=?
string-ci<=?
string-ci>=?
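
A pointwise definition of this kind might look like the following sketch of string<?, which reads the R5RS wording as lexicographic ordering and compares characters with char<? and char>? (and therefore by scalar value). The name scalar-string<? is hypothetical; it is used here only to avoid redefining the standard procedure.

```scheme
;; Hypothetical sketch of a pointwise string<?: compare character by
;; character; a proper prefix sorts before the longer string.
(define (scalar-string<? a b)
  (let ((len-a (string-length a))
        (len-b (string-length b)))
    (let loop ((i 0))
      (cond ((= i len-b) #f)              ; b exhausted: a is not less
            ((= i len-a) #t)              ; a is a proper prefix of b
            ((char<? (string-ref a i) (string-ref b i)) #t)
            ((char>? (string-ref a i) (string-ref b i)) #f)
            (else (loop (+ i 1)))))))

(scalar-string<? "abc" "abd")     ; => #t
(scalar-string<? "ab" "abc")      ; => #t
(scalar-string<? "Zebra" "apple") ; => #t, since Z (90) precedes a (97)
```

The last example shows why such comparisons may surprise an end-user: the ordering follows scalar values, not dictionary order in any particular locale.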

Reference Implementation

MzScheme version 299.100 and up implements the character and string operations described in this SRFI. A portable implementation using R5RS Scheme (to implement a distinct character type) will accompany the next draft of this SRFI.

The file r6rs-reader.ss implements a MzScheme-specific reader for the character, string, and symbol syntax described in this SRFI. This reader uses readtable support from MzScheme 299.103 and up. It does not fully restrict unquoted symbols to the syntax of R5RS; for example, -> is parsed as a symbol. The reference reader does, however, disable MzScheme's backslash escape for symbols (making an unquoted backslash illegal). The file r6rs-reader-test.ss contains a test suite.

Acknowledgements

This SRFI was written in consultation with the full set of R6RS editors: Will Clinger, Kent Dybvig, Marc Feeley, Matthew Flatt, Manuel Serrano, Michael Sperber, and Anton van Straaten.

References

The Unicode Consortium publishes the Unicode standard and several Unicode related documents on their web page.
Unicode Demystified by Richard Gillam, Addison-Wesley Professional, 2002.
Scsh Reference Manual by Olin Shivers, Brian D. Carlstrom, Martin Gasbichler, and Mike Sperber, 2004.

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Editor: Mike Sperber