This SRFI is currently in ``draft'' status. To see an explanation of each status that a SRFI can hold, see here. It will remain in draft status until 2005/09/08, or as amended. To provide input on this SRFI, please mailto:srfi-75@srfi.schemers.org. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.
This SRFI is being submitted by members of the Scheme Language Editor's Committee as part of the R6RS Scheme standardization process. The purpose of such ``R6RS SRFIs'' is to inform the Scheme community of features and design ideas under consideration by the editors and to allow the community to give the editors some direct feedback that will be considered during the design process.
At the end of the discussion period, this SRFI will be withdrawn. When the R6RS specification is finalized, the SRFI may be revised to conform to the R6RS specification and then resubmitted with the intent to finalize it. This procedure aims to avoid the situation where this SRFI is inconsistent with R6RS. An inconsistency between R6RS and this SRFI could confuse some users. Moreover it could pose implementation problems for R6RS compliant Scheme systems that aim to support this SRFI. Note that departures from the SRFI specification by the Scheme Language Editor's Committee may occur due to other design constraints, such as design consistency with other features that are not under discussion as SRFIs.
Unicode is a widespread universal character code that supports most of the world's (natural) languages. The extensions to Scheme specified in this SRFI concern the support of Unicode in Scheme's character, string, and symbol datatypes. This SRFI does not (fully) specify how I/O of Unicode data is performed or how Scheme source code is encoded in files; these aspects are left for other SRFIs to specify.
->
. Finally, it may
not be necessary to place symbols in one-to-one correspondence
(which motivates the new explicitly-quoted syntax for symbol
literals).
\0
), and the
variable-width encoding is potentially confusing.
#\newline
and
#\linefeed
as character constants. The former is
compatible with R5RS, but the latter is arguably preferable.
Maybe we should pick one.
\<linefeed><intraline-whitespace>
may not be necessary.
Meanwhile, unescaped newlines perhaps should be prohibited in strings.
string-titlecase
procedure is not the same as
Unicode's titlecase conversion for strings, because Unicode
defines a more elaborate word-breaking algorithm. More
generally, it's not clear that titlecase operations are useful
enough to include in the standard.
char-foldcase
and string-foldcase
and redefined the case-insensitive operations in terms of these.
string-upcase
, string-downcase
,
and string-titlecase
.
char-lower-case?
and
char-upper-case?
to better match Unicode instead of SRFI-14.
char-symbolic?
,
char-punctuation?
, char-graphic?
,
char-blank?
, and char-iso-control?
.
\'
and \?
string/symbol escapes.
The manipulation of text is a fundamental information processing task, and software increasingly must process text in a variety of natural languages, possibly multiple languages in the same document. The Unicode standard specifies how the textual data of most of the world's languages is represented and handled. Several operating systems, programming languages, libraries, and software tools have now embraced the Unicode standard. Adding Unicode support to Scheme, as specified by this SRFI, will allow
The SRFI mandates a specific set of values for characters, a specific definition of strings in terms of characters, and a specific definition of operations like char<? and string-ci=? in terms of those characters and strings. The goal of such mandates is to dramatically increase portability of Scheme programs through well-understood (if imperfect) concepts and definitions.
For some implementations of Scheme, such as those that are targeted to small devices, the mandates of this SRFI ask too much. For other implementations of Scheme, such as those that can support a more sophisticated definition of "character", this SRFI interferes by requiring specifically less. In the long run, a module system for Scheme should support a considerably broader range of conformant implementations, allowing implementations to support certain modules and not support others. The purpose of this SRFI, however, is to establish a baseline from which we can define simplifications for "smaller" Schemes and elaborations for "larger" Schemes.
Unicode defines a standard mapping between sequences of code points (integers in the range 0 to #x10FFFF in the latest version of the standard) and human-readable ``characters.'' More precisely, Unicode distinguishes between glyphs, which are printed for humans to read, and characters, which are abstract entities that map to glyphs (sometimes in a way that's sensitive to surrounding characters). Furthermore, different sequences of code points sometimes correspond to the same character. The relationships among code points, characters, and glyphs are subtle and complex.
Despite this complexity, most things that a literate human would call a ``character'' can be represented by a single code point in Unicode (though there may exist code-point sequences that represent that same character). For example, Roman letters, Cyrillic letters, Hebrew consonants, and most Chinese characters fall into this category. Thus, the ``code point'' approximation of ``character'' works well for many purposes. It is therefore appropriate to define Scheme characters as Unicode scalar values, which include all code points except those designated as surrogates. A surrogate is a code point in the range #xD800 to #xDFFF that is used in pairs in the UTF-16 encoding to encode a supplementary character (whose code is in the range #x10000 to #x10FFFF).
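To make the surrogate mechanism concrete, the following sketch computes the UTF-16 surrogate pair for a supplementary code point. The procedure name code-point->surrogate-pair is hypothetical and not part of this SRFI, and the code assumes an R7RS-style system.

```scheme
(import (scheme base))

;; Hypothetical helper, not part of this SRFI: returns the UTF-16
;; surrogate pair (high . low) for a supplementary code point.
(define (code-point->surrogate-pair cp)
  (if (<= #x10000 cp #x10FFFF)
      (let ((offset (- cp #x10000)))                 ; 20-bit offset
        (cons (+ #xD800 (quotient offset #x400))     ; top 10 bits
              (+ #xDC00 (remainder offset #x400))))  ; low 10 bits
      (error "not a supplementary code point:" cp)))

(code-point->surrogate-pair #x10000)   ; => (55296 . 56320), i.e. (#xD800 . #xDC00)
```

Every value in the range #xD800 to #xDFFF is reserved for this pairing, so none of them denotes a character by itself.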
Many programming languages use a lexical syntax for strings that is similar to the one used by the C language. In particular, Java has extended C's notation for Unicode. Adopting a similar syntax for Scheme has the advantage of making it easier to learn and remember, particularly by programmers accustomed to other languages.
R5RS specifies that the escape sequences \\ and \" can be used in string literals to denote the backslash and doublequote characters respectively. This SRFI introduces new escape sequences so that any Scheme string can be expressed using the ASCII subset of Unicode. Also, most C string literals have the same meaning as a Scheme string literal.
This SRFI also extends the lexical syntax of symbols, and it puts symbols in one-to-one correspondence with immutable strings. In the revised lexical syntax, most Unicode characters can be used directly as symbol characters. Furthermore, an explicitly quoted form for symbols supports an arbitrary sequence of characters in a symbol literal.
Besides printing and reading characters, humans also compare character strings, and humans perform operations such as changing characters to uppercase. To make programs geographically portable, humans must agree to compare or upcase characters consistently, at least in certain contexts. The Unicode standard provides such standard case mappings on scalar values.
In other contexts, global agreement is unnecessary, and the user's culture should determine a string operation, such as when sorting a list of file names, perhaps case-insensitively. A locale captures information about a user's culture-specific interpretation of character sequences. In particular, a locale determines how strings are sorted, how a lowercase character is converted to an uppercase character, and how strings are compared without regard to case.
String operations such as string-ci=? are not sensitive to the current locale, because they should be portable. A future SRFI might define operations like string-locale-ci=? to produce results that are consistent with the current locale as determined by an implementation.
The Scheme character type corresponds to the set of Unicode scalar values. Specifically, each character corresponds to a number in the range [0, #xD7FF] union [#xE000, #x10FFFF], and properties of the character are as defined for the corresponding Unicode scalar value.
The integer->char procedure takes a Unicode scalar value as an exact integer, and it produces the corresponding character. The char->integer procedure takes a character and produces the corresponding scalar value. It is an error to call integer->char with an integer that is not in the range [0, #xD7FF] union [#xE000, #x10FFFF].
Examples:
(integer->char 32) => #\space
(char->integer (integer->char 5000)) => 5000
(integer->char #xD800) => *error*
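The range condition above can be sketched as a predicate. The names unicode-scalar-value? and checked-integer->char are hypothetical helpers for illustration, not procedures defined by this SRFI, and the code assumes an R7RS-style system.

```scheme
(import (scheme base))

;; Hypothetical helper: true when n is a Unicode scalar value, that is,
;; a code point in [0, #xD7FF] or [#xE000, #x10FFFF].
(define (unicode-scalar-value? n)
  (and (exact-integer? n)
       (or (<= 0 n #xD7FF)
           (<= #xE000 n #x10FFFF))))

;; Hypothetical wrapper that signals an error explicitly instead of
;; relying on how an implementation treats out-of-range arguments.
(define (checked-integer->char n)
  (if (unicode-scalar-value? n)
      (integer->char n)
      (error "not a Unicode scalar value:" n)))

(unicode-scalar-value? 32)       ; => #t
(unicode-scalar-value? #xD800)   ; => #f (surrogate)
```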
As in R5RS, a Scheme string is a sequence of Scheme characters.
The procedure call (string-ref str i) returns the Scheme character at index i in the string str. The procedure call (string-set! str i char) stores the Scheme character char at index i in the string str, and an unspecified value is returned.
A symbol is defined by a sequence of characters. The string->symbol procedure works on any string, and symbol->string works on any symbol.
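As a small illustration of this correspondence, the sketch below converts a string that is not a valid unquoted identifier into a symbol and back. The variable names are chosen for illustration only, and an R7RS-style system is assumed.

```scheme
(import (scheme base))

;; A string containing a space still names a symbol.
(define name "hello world")
(define sym (string->symbol name))

(symbol->string sym)               ; => "hello world"
(eq? sym (string->symbol name))    ; => #t, equal names give the same symbol
```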
R5RS specifies two lexical syntaxes for characters: named characters, e.g. #\space, and plain characters, e.g. #\[. For consistency with the escape sequences of strings and symbols, the following set of named characters is defined by this SRFI:
#\nul: Unicode 0
#\alarm: Unicode 7
#\backspace: Unicode 8
#\tab: Unicode 9
#\linefeed: Unicode 10
#\newline: Unicode 10 (as in R5RS)
#\vtab: Unicode 11
#\page: Unicode 12
#\return: Unicode 13
#\esc: Unicode 27
#\space: Unicode 32 (as in R5RS)
#\delete: Unicode 127
To allow denoting any Unicode character using the ASCII subset of Unicode and for consistency with the string escape sequences, this SRFI specifies the following escape character syntaxes:
#\x<x><x>: where <x> is a hexadecimal digit and the sequence of two <x>s forms a hexadecimal number between 0 and #xFF
#\u<x><x><x><x>: where <x> is a hexadecimal digit and the sequence of four <x>s forms a hexadecimal number between 0 and #xFFFF excluding the range [#xD800, #xDFFF]
#\U<x><x><x><x><x><x><x><x>: where <x> is a hexadecimal digit and the sequence of eight <x>s forms a hexadecimal number between 0 and #x10FFFF excluding the range [#xD800, #xDFFF]; the range restriction implies that the first two <x>s are 0
\x specifies a character using 2 hex digits, \u using 4 hex digits, and \U using 8 hex digits.
All character syntaxes are case-sensitive, except that <x> can be an uppercase or lowercase hexadecimal digit. Unlike R5RS, every character datum must be followed by a delimiter.
Examples:
#\xFF ; Unicode 255
#\u03BB ; Unicode 955
#\U00006587 ; Unicode 25991
#\λ ; Unicode 955
#\u006587 ; parse error
#\λx ; parse error
#\alarmx ; parse error
#\alarm x ; Unicode 7 followed by x
#\Alarm ; parse error
#\alert ; parse error
#\xFF ; Unicode 255
#\xff ; Unicode 255
#\x ff ; Unicode 120 followed by another datum, ff
#\x(ff) ; Unicode 120 followed by another datum, a parenthesized ff
#\(x) ; parse error
#\((x) ; Unicode 40 followed by another datum, parenthesized x
#\U00110000 ; parse error (out of range)
#\uD800 ; parse error (in excluded range)
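The digit-count and range rules for the x, u, and U forms can be sketched as a small validator. This is not the reference reader: decode-hex-escape and its helpers are hypothetical names, the procedure receives the escape letter and the digit string separately, it does not check for a following delimiter, and it returns #f where a reader would report a parse error. An R7RS-style system is assumed.

```scheme
(import (scheme base))

(define (hex-digit? c)
  (or (char<=? #\0 c #\9)
      (char<=? #\a c #\f)
      (char<=? #\A c #\F)))

(define (all-hex? s)
  (let loop ((i 0))
    (or (= i (string-length s))
        (and (hex-digit? (string-ref s i))
             (loop (+ i 1))))))

;; Number of hex digits required by each escape letter (case-sensitive).
(define (escape-width kind)
  (case kind
    ((#\x) 2)
    ((#\u) 4)
    ((#\U) 8)
    (else #f)))

;; Returns the denoted character, or #f for a malformed escape.
(define (decode-hex-escape kind digits)
  (let ((width (escape-width kind)))
    (and width
         (= (string-length digits) width)
         (all-hex? digits)
         (let ((n (string->number digits 16)))
           (and (or (<= 0 n #xD7FF) (<= #xE000 n #x10FFFF))
                (integer->char n))))))

(char->integer (decode-hex-escape #\u "03BB"))   ; => 955
(decode-hex-escape #\u "D800")                   ; => #f (excluded range)
```

The same rules apply to the \x, \u, and \U escapes inside string literals.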
As in R5RS, a string datum is enclosed between double quotes ("), where a backslash adjusts the meaning of a character within the double quotes. The set of escape sequences is as follows:
\a: alarm, Unicode 7
\b: backspace, Unicode 8
\t: tab, Unicode 9
\n: linefeed, Unicode 10
\v: vertical tab, Unicode 11
\f: formfeed, Unicode 12
\r: return, Unicode 13
\": doublequote, Unicode 34
\\: backslash, Unicode 92
\|: vertical bar, Unicode 124
\<linefeed><intraline-whitespace>: nothing, where <linefeed> is Unicode 10, and <intraline-whitespace> is a sequence of non-linefeed whitespace characters (where whitespace is defined in SRFI-14)
\<space>: space, Unicode 32, where <space> is the character Unicode 32 (useful for terminating the previous escape sequence before continuing with whitespace)
\x<x><x>: where <x> is a hexadecimal digit and the sequence of two <x>s forms a hexadecimal number between 0 and #xFF
\u<x><x><x><x>: where <x> is a hexadecimal digit and the sequence of four <x>s forms a hexadecimal number between 0 and #xFFFF excluding the range [#xD800, #xDFFF]
\U<x><x><x><x><x><x><x><x>: where <x> is a hexadecimal digit and the sequence of eight <x>s forms a hexadecimal number between 0 and #x10FFFF excluding the range [#xD800, #xDFFF]; the range restriction implies that the first two <x>s are 0
These escape sequences are case-sensitive, except that <x> can be an uppercase or lowercase hexadecimal digit. As in character constants, \x specifies a character using 2 hex digits, \u using 4 hex digits, and \U using 8 hex digits.
Any other character in a string after a backslash is an error. Any character outside of an escape sequence and not a doublequote stands for itself in the string literal. For example the single-character string "λ" (double quote, a lowercase lambda, double quote) denotes the same string literal as "\u03bb".
Note that the meaning of \u<x><x><x><x> is slightly different in Java, because Java handles this escape sequence in the escape-processing phase, which can be viewed as preprocessing that precedes lexical analysis.
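The single-character escapes in the list above amount to a lookup from the character after the backslash to a scalar value. The sketch below shows one way to express that table; simple-escapes and simple-escape->char are hypothetical names, the \<linefeed><intraline-whitespace> and hexadecimal forms are deliberately left out, and an R7RS-style system is assumed.

```scheme
(import (scheme base))

;; Character following the backslash, paired with the scalar value
;; it denotes.
(define simple-escapes
  '((#\a . 7) (#\b . 8) (#\t . 9) (#\n . 10) (#\v . 11)
    (#\f . 12) (#\r . 13) (#\" . 34) (#\\ . 92) (#\| . 124)
    (#\space . 32)))

;; Returns the denoted character, or #f when c does not begin a
;; single-character escape.
(define (simple-escape->char c)
  (let ((entry (assv c simple-escapes)))
    (and entry (integer->char (cdr entry)))))

(char->integer (simple-escape->char #\n))   ; => 10
(simple-escape->char #\q)                   ; => #f
```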
Examples:
"abc" ; Unicode sequence 97, 98, 99
"\x41bc" ; "Abc", which is Unicode sequence 65, 98, 99
"\x41 bc" ; "A bc", which is Unicode sequence 65, 32, 98, 99
"\u41bc" ; Unicode sequence 16828
"\u41 bc" ; parse error
"\u41" ; parse error
"\x0041" ; Unicode sequence 0, 52, 49
"\u0041" ; "A", which is Unicode sequence 65
"\U0041" ; parse error
"\U00000041" ; "A", which is Unicode sequence 65
"\U0010FFFF" ; Unicode sequence #x10FFFF
"\U00110000" ; parse error (out of range)
"\uD800" ; parse error (in excluded range)
The syntax of symbols extends R5RS in three ways:
<letter>
, this SRFI allows
any character whose scalar value is greater than 127 and that is
not considered whitespace according to SRFI-14.
Examples:
'Hello => Hello
'λ => λ
'|Hello| => Hello
(symbol->string '|a "b\" \|c\| \n|) => "a \"b\" |c| \n"
Character-comparison procedures are defined as follows:
;; char-comparator itself is not part of this SRFI; it is
;; used only to define other procedures
(define (char-comparator num-comp)
  (lambda (a-char b-char)
    (num-comp (char->integer a-char) (char->integer b-char))))
(define char=? (char-comparator =))
(define char<? (char-comparator <))
(define char>? (char-comparator >))
(define char<=? (char-comparator <=))
(define char>=? (char-comparator >=))
Unicode defines locale-independent mappings from scalar values to scalar values for upcase, downcase, titlecase, and case-folding operations. (These mappings can be extracted from UnicodeData.txt and CaseFolding.txt from the Unicode Consortium.) The following Scheme procedures map characters consistent with the Unicode specification:
char-upcase
char-downcase
char-titlecase
char-foldcase
If the argument is an uppercase or titlecase character, and if there is a single character which is its lowercase form, then char-downcase returns that character. If the argument is a lowercase or titlecase character, and if there is a single character which is its uppercase form, then char-upcase returns that character. Otherwise, the character returned is the same as the argument. Note that this is an incomplete approximation to case conversion, even ignoring the user's locale; in general, case mappings require the context of a string, both in arguments and in result. See string-upcase and string-downcase for more general case-conversion procedures.
Case-insensitive character-comparison procedures are defined as follows:
;; char-ci-comparator itself is not part of this SRFI; it is
;; used only to define other procedures
(define (char-ci-comparator cs-comp)
  (lambda (a-char b-char)
    (cs-comp (char-foldcase a-char) (char-foldcase b-char))))
(define char-ci=? (char-ci-comparator char=?))
(define char-ci<? (char-ci-comparator char<?))
(define char-ci>? (char-ci-comparator char>?))
(define char-ci<=? (char-ci-comparator char<=?))
(define char-ci>=? (char-ci-comparator char>=?))
Among the following predicates, the first three are as defined by SRFI-14; the last three are defined as scalar values having the Unicode "Uppercase" property, the "Lowercase" property, and the "Lt" general category, respectively.
char-alphabetic?
char-numeric?
char-whitespace?
char-upper-case?
char-lower-case?
char-title-case?
Examples:
(char<? #\z #\ß) => #t
(char<? #\z #\Z) => #f
(char-ci<? #\z #\Z) => #f
(char-ci=? #\z #\Z) => #t
(char-ci=? #\ς #\σ) => #t
(char-upcase #\i) => #\I
(char-downcase #\i) => #\i
(char-titlecase #\i) => #\I
(char-foldcase #\i) => #\i
(char-upcase #\ß) => #\ß
(char-downcase #\ß) => #\ß
(char-titlecase #\ß) => #\ß
(char-foldcase #\ß) => #\ß
(char-upcase #\Σ) => #\Σ
(char-downcase #\Σ) => #\σ
(char-titlecase #\Σ) => #\Σ
(char-foldcase #\Σ) => #\σ
(char-upcase #\ς) => #\Σ
(char-downcase #\ς) => #\ς
(char-titlecase #\ς) => #\Σ
(char-foldcase #\ς) => #\σ
(char-alphabetic? #\a) => #t
(char-numeric? #\1) => #t
(char-whitespace? #\space) => #t
(char-whitespace? #\u00A0) => #t
(char-upper-case? #\Σ) => #t
(char-lower-case? #\σ) => #t
(char-lower-case? #\u00AA) => #t
(char-title-case? #\I) => #f
(char-title-case? #\u01C5) => #t
The following string procedures are defined as in R5RS, which means that they are defined by pointwise operation on the string's characters:
string<?
string>?
string=?
string<=?
string>=?
The following string operations are not defined in terms of character-by-character conversions. Instead, they are defined in terms of Unicode's locale-independent string mappings from scalar-value sequences to scalar-value sequences. (These mappings can be extracted from UnicodeData.txt, SpecialCasing.txt, and CaseFolding.txt from the Unicode Consortium.) In particular, the length of the result string can be different from the length of the input string:
string-upcase
string-downcase
string-titlecase
string-foldcase
The string-upcase procedure converts a string to uppercase, string-downcase converts a string to lowercase, and string-foldcase converts each character in the string to its case-folded representative(s). The string-titlecase procedure converts the first character to titlecase in each contiguous sequence of cased characters within the string, and it downcases all other cased characters; for the purposes of detecting cased-character sequences, case-ignorable characters are ignored (i.e., they do not interrupt the sequence). Because these procedures are locale-independent, they remain suboptimal for some locales, but this SRFI defines no locale-sensitive operations.
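The titlecase rule can be sketched for simple inputs as follows. This is a deliberately simplified approximation, not the Unicode-based definition: the name simple-titlecase is hypothetical, "cased" is approximated by char-alphabetic?, char-upcase stands in for the titlecase mapping, and only the apostrophe is treated as case-ignorable. An R7RS-style system is assumed.

```scheme
(import (scheme base) (scheme char))

;; Assumption for this sketch: the apostrophe is the only
;; case-ignorable character.
(define (case-ignorable? c) (char=? c #\'))

;; Uppercase the first cased character of each run of cased characters
;; and downcase the rest; case-ignorable characters do not end a run.
(define (simple-titlecase str)
  (let loop ((chars (string->list str)) (in-word? #f) (acc '()))
    (cond ((null? chars) (list->string (reverse acc)))
          ((case-ignorable? (car chars))
           (loop (cdr chars) in-word? (cons (car chars) acc)))
          ((char-alphabetic? (car chars))
           (loop (cdr chars) #t
                 (cons (if in-word?
                           (char-downcase (car chars))
                           (char-upcase (car chars)))
                       acc)))
          (else (loop (cdr chars) #f (cons (car chars) acc))))))

(simple-titlecase "who's there?")   ; => "Who's There?"
(simple-titlecase "r6rs")           ; => "R6Rs"
```

The digit in "r6rs" is not cased, so it ends the first run and the following letter starts a new one.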
Case-insensitive string-comparison procedures are defined as follows:
;; string-ci-comparator itself is not part of this SRFI; it is
;; used only to define other procedures
(define (string-ci-comparator cs-comp)
  (lambda (a-string b-string)
    (cs-comp (string-foldcase a-string) (string-foldcase b-string))))
(define string-ci=? (string-ci-comparator string=?))
(define string-ci<? (string-ci-comparator string<?))
(define string-ci>? (string-ci-comparator string>?))
(define string-ci<=? (string-ci-comparator string<=?))
(define string-ci>=? (string-ci-comparator string>=?))
Examples:
(string<? "z" "ß") => #t
(string<? "z" "zz") => #t
(string<? "z" "Z") => #f
(string=? "Straße" "Strasse") => #f
(string-upcase "Hi") => "HI"
(string-downcase "Hi") => "hi"
(string-foldcase "Hi") => "hi"
(string-upcase "Straße") => "STRASSE"
(string-downcase "Straße") => "straße"
(string-foldcase "Straße") => "strasse"
(string-downcase "STRASSE") => "strasse"
(string-upcase "ΧΑΟΣ") => "ΧΑΟΣ"
(string-downcase "Σ") => "σ"
(string-downcase "ΧΑΟΣ") => "χαος"
(string-downcase "ΧΑΟΣΣ") => "χαοσς"
(string-downcase "ΧΑΟΣ Σ") => "χαος σ"
(string-foldcase "ΧΑΟΣΣ") => "χαοσσ"
(string-upcase "χαος") => "ΧΑΟΣ"
(string-upcase "χαοσ") => "ΧΑΟΣ"
(string-titlecase "kNock KNoCK") => "Knock Knock"
(string-titlecase "who's there?") => "Who's There?"
(string-titlecase "r6rs") => "R6Rs"
(string-titlecase "R6RS") => "R6Rs"
(string-ci<? "z" "Z") => #f
(string-ci=? "z" "Z") => #t
(string-ci=? "Straße" "Strasse") => #t
(string-ci=? "Straße" "STRASSE") => #t
(string-ci=? "ΧΑΟΣ" "χαοσ") => #t
MzScheme version 299.108 and up implements the character and string operations described in this SRFI. A portable implementation using R5RS Scheme (to implement a distinct character type) will accompany a future draft of this SRFI.
The file r6rs-reader.ss implements a MzScheme-specific reader for the character, string, and symbol syntax described in this SRFI. This reader uses readtable support from MzScheme 299.103 and up. It does not fully restrict unquoted symbols to the syntax of R5RS; for example, -> is parsed as a symbol. The reference reader does, however, disable MzScheme's backslash escape for symbols (making an unquoted backslash illegal).
The file r6rs-reader-test.ss contains a test suite.
This SRFI was written in consultation with the full set of R6RS editors: Will Clinger, Kent Dybvig, Marc Feeley, Matthew Flatt, Manuel Serrano, Michael Sperber, and Anton van Straaten. In addition, Ray Dillinger supplied part of the text, and many other helpful suggestions from the SRFI-75 mailing list have been incorporated.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.