This SRFI is currently in ``draft'' status. To see an explanation of each status that a SRFI can hold, see here. It will remain in draft status until 2005/09/08, or as amended. To provide input on this SRFI, please mailto:srfi-75@srfi.schemers.org. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.
This SRFI is being submitted by members of the Scheme Language Editor's Committee as part of the R6RS Scheme standardization process. The purpose of such ``R6RS SRFIs'' is to inform the Scheme community of features and design ideas under consideration by the editors and to allow the community to give the editors some direct feedback that will be considered during the design process.
At the end of the discussion period, this SRFI will be withdrawn. When the R6RS specification is finalized, the SRFI may be revised to conform to the R6RS specification and then resubmitted with the intent to finalize it. This procedure aims to avoid the situation where this SRFI is inconsistent with R6RS. An inconsistency between R6RS and this SRFI could confuse some users. Moreover, it could pose implementation problems for R6RS-compliant Scheme systems that aim to support this SRFI. Note that departures from the SRFI specification by the Scheme Language Editor's Committee may occur due to other design constraints, such as design consistency with other features that are not under discussion as SRFIs.
Unicode is a widespread universal character code that supports most of the world's (natural) languages. The extensions to Scheme specified in this SRFI concern the support of Unicode in Scheme's character, string and symbol datatypes. This SRFI does not (fully) specify how I/O of Unicode data is performed or how Scheme source code is encoded in files. These aspects are left for other SRFIs to specify.
->. Finally, it may not be necessary to place symbols in one-to-one correspondence (which motivates the new explicitly-quoted syntax for symbol literals).
\0), and the variable-width encoding is potentially confusing.
#\newline and #\linefeed as character constants. The former is compatible with R5RS, but the latter is arguably preferable. Maybe we should pick one.
(string-append #<<ONE #<<TWO)
a b c
d e f
ONE
g h i
j k l
TWO
=> "a b c\nd e fg h i\nj k l"
\<newline><intraline-whitespace> may not be necessary. Meanwhile, unescaped newlines perhaps should be prohibited in strings.
The manipulation of text is a fundamental information processing task. Increasingly it is important for software to process text in a variety of natural languages, possibly multiple languages in the same document. The Unicode standard specifies how the textual data of most of the world's languages is represented and handled. Several operating systems, programming languages, libraries, and software tools have now embraced the Unicode standard. Adding Unicode support to Scheme, as specified by this SRFI, will allow
Unicode defines a standard mapping between sequences of code points (integers in the range 0 to #x10FFFF in the latest version of the standard) and human-readable ``characters.'' More precisely, Unicode distinguishes between glyphs, which are printed for humans to read, and characters, which are abstract entities that map to glyphs (sometimes in a way that's sensitive to surrounding characters). Furthermore, different sequences of code points sometimes correspond to the same character. The relationships among code points, characters, and glyphs are subtle and complex.
Despite this complexity, most things that a literate human would call a ``character'' can be represented by a single code point in Unicode (though there may exist code-point sequences that represent that same character). For example, Roman letters, Cyrillic letters, Hebrew consonants, and most Chinese characters fall into this category. Thus, the ``code point'' approximation of ``character'' works well for many purposes. It is therefore appropriate to define Scheme characters as Unicode scalar values, which include all code points except those designated as surrogates. A surrogate is a code point in the range #xd800 to #xdfff that is used in pairs in the UTF-16 encoding to encode a supplementary character (whose code is in the range #x10000 to #x10ffff).
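The relationship between supplementary characters and surrogate pairs can be sketched with ordinary integer arithmetic. The following is only an illustration of the standard UTF-16 computation; the procedure names are hypothetical and are not part of this SRFI.

```scheme
;; Illustration only: UTF-16 surrogate arithmetic with R5RS integer
;; operations. All names here are hypothetical helpers.

;; True if n lies in the surrogate range [#xd800, #xdfff].
(define (surrogate? n)
  (and (>= n #xd800) (<= n #xdfff)))

;; Split a supplementary code point (#x10000 to #x10ffff) into its
;; high and low surrogates, returned as a two-element list.
(define (code-point->surrogates cp)
  (let ((offset (- cp #x10000)))
    (list (+ #xd800 (quotient offset #x400))
          (+ #xdc00 (remainder offset #x400)))))

;; Inverse operation: combine a high and a low surrogate into the
;; supplementary code point they encode.
(define (surrogates->code-point high low)
  (+ #x10000
     (* (- high #xd800) #x400)
     (- low #xdc00)))
```

For instance, (code-point->surrogates #x1d11e) evaluates to a list of #xd834 and #xdd1e, and passing those two values to surrogates->code-point gives back #x1d11e. Neither surrogate is a scalar value, which is why neither can be a Scheme character under this SRFI.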
Many programming languages use a lexical syntax for strings that is similar to the one used by the C language. In particular, Java has extended C's notation for Unicode. Adopting a similar syntax for Scheme has the advantage of making it easier to learn and remember, particularly by programmers accustomed to other languages.
R5RS specifies that the escape sequences \\ and \" can be used in string literals to denote the backslash and doublequote characters respectively. This SRFI introduces new escape sequences so that any Scheme string can be expressed using the ASCII subset of Unicode. Also, most C string literals have the same meaning as a Scheme string literal.
This SRFI also extends the lexical syntax of symbols, and it puts symbols in one-to-one correspondence with immutable strings. In the revised lexical syntax, most Unicode characters can be used directly as symbol characters. Furthermore, an explicitly quoted form for symbols supports an arbitrary sequence of characters in a symbol literal.
Besides printing and reading characters, humans also compare character strings, and humans perform operations such as changing characters to uppercase. To make programs geographically portable, humans must agree to compare or upcase characters consistently, at least in certain contexts. The Unicode standard provides such standard case mappings on scalar values.
In other contexts, global agreement is unnecessary, and the user's culture should determine a string operation, such as when sorting a list of file names, perhaps case-insensitively. A locale captures information about a user's culture-specific interpretation of character sequences. In particular, a locale determines how strings are sorted, how a lowercase character is converted to an uppercase character, and how strings are compared without regard to case.
String operations such as string-ci=? are not sensitive to the current locale, because they should be portable. A future SRFI might define operations like string-locale-ci=? to produce results that are consistent with the current locale as determined by an implementation.
The Scheme character type corresponds to the set of Unicode scalar values. Specifically, each character corresponds to a number in the range [0, #xd7ff] union [#xe000, #x10ffff], and properties of the character are as defined for the corresponding Unicode scalar value.
The integer->char procedure takes a Unicode scalar value as an exact integer and produces the corresponding character. The char->integer procedure takes a character and produces the corresponding scalar value. It is an error to call integer->char with an integer that is not in the range [0, #xd7ff] union [#xe000, #x10ffff].
Examples:
(integer->char 32) => #\space
(char->integer (integer->char 5000)) => 5000
(integer->char #xd800) => *error*
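The range restriction on integer->char can be checked ahead of time. The sketch below assumes an implementation whose characters cover all Unicode scalar values, as this SRFI specifies; scalar-value? and checked-integer->char are hypothetical helper names, not procedures defined by this SRFI.

```scheme
;; Hypothetical helpers; only integer->char and char->integer come
;; from the SRFI itself.

;; True if n is an exact integer in [0, #xd7ff] or [#xe000, #x10ffff].
(define (scalar-value? n)
  (and (integer? n)
       (exact? n)
       (or (<= 0 n #xd7ff)
           (<= #xe000 n #x10ffff))))

;; Like integer->char, but returns #f instead of signalling an error
;; when n is not a Unicode scalar value.
(define (checked-integer->char n)
  (if (scalar-value? n)
      (integer->char n)
      #f))
```

With these definitions, (checked-integer->char #xd800) returns #f rather than raising an error, while (char->integer (checked-integer->char 5000)) still returns 5000.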
As in R5RS, a Scheme string is a fixed-length array of Scheme characters. The procedure call (string-ref str i) returns the Scheme character at index i in the string str. The procedure call (string-set! str i char) stores the Scheme character char at index i in the string str, and an unspecified value is returned.
A symbol is defined by a sequence of characters. The string->symbol procedure works on any string, and symbol->string works on any symbol.
R5RS specifies two lexical syntaxes for characters: named characters, e.g. #\space, and plain characters, e.g. #\[. For consistency with the escape sequences of strings and symbols, the following set of named characters is defined by this SRFI:
#\nul: Unicode 0
#\alarm: Unicode 7
#\backspace: Unicode 8
#\tab: Unicode 9
#\linefeed: Unicode 10
#\newline: Unicode 10 (as in R5RS)
#\vtab: Unicode 11
#\page: Unicode 12
#\return: Unicode 13
#\esc: Unicode 27
#\space: Unicode 32 (as in R5RS)
#\rubout: Unicode 127
#\linefeed and #\newline.
To allow denoting any Unicode character using the ASCII subset of Unicode and for consistency with the string escape sequences, this SRFI specifies these escape character syntaxes:
#\x<x><x>: where <x> is a hexadecimal digit and the sequence of two <x>s forms a hexadecimal number between 0 and #xff
#\u<x><x><x><x>: where <x> is a hexadecimal digit and the sequence of four <x>s forms a hexadecimal number between 0 and #xffff excluding the range [#xd800, #xdfff]
#\U<x><x><x><x><x><x><x><x>: where <x> is a hexadecimal digit and the sequence of eight <x>s forms a hexadecimal number between 0 and #x10ffff excluding the range [#xd800, #xdfff]; the range restriction implies that the first two <x>s are 0
All character syntaxes are case-sensitive, except that <x> can be an uppercase or lowercase hexadecimal digit. Unlike R5RS, every character datum must be followed by a delimiter.
Examples:
#\xFF ; Unicode 255
#\u03BB ; Unicode 955
#\U00006587 ; Unicode 25991
#\λ ; Unicode 955
#\u006587 ; parse error
#\λx ; parse error
#\alarmx ; parse error
#\alarm x ; Unicode 7 followed by x
#\Alarm ; parse error
#\alert ; parse error
#\xFF ; Unicode 255
#\xff ; Unicode 255
#\x ff ; Unicode 120 followed by ff
#\x(ff) ; Unicode 120 followed by a parenthesized ff
#\(x) ; parse error
#\((x) ; Unicode 40 followed by a parenthesized x
\a: alarm, Unicode 7
\b: backspace, Unicode 8
\t: tab, Unicode 9
\n: linefeed, Unicode 10
\v: vertical tab, Unicode 11
\f: formfeed, Unicode 12
\r: return, Unicode 13
\": doublequote, Unicode 34
\': quote, Unicode 39
\?: question mark, Unicode 63
\\: backslash, Unicode 92
\|: vertical bar, Unicode 124
\<newline><intraline-whitespace>: nothing, where <newline> is Unicode 10, and <intraline-whitespace> is a sequence of non-newline whitespace characters (where whitespace is defined in SRFI-14)
\<space>: space, Unicode 32, where <space> is the character Unicode 32 (useful for terminating the previous escape sequence before continuing with whitespace)
\x<x><x>: where <x> is a hexadecimal digit and the sequence of two <x>s forms a hexadecimal number between 0 and #xff
\u<x><x><x><x>: where <x> is a hexadecimal digit and the sequence of four <x>s forms a hexadecimal number between 0 and #xffff excluding the range [#xd800, #xdfff]
\U<x><x><x><x><x><x><x><x>: where <x> is a hexadecimal digit and the sequence of eight <x>s forms a hexadecimal number between 0 and #x10ffff excluding the range [#xd800, #xdfff]; the range restriction implies that the first two <x>s are 0
<x> can be an uppercase or lowercase hexadecimal digit.
Any other character in a string after a backslash is an error. Any character outside of an escape sequence and not a doublequote stands for itself in the string literal. For example, the single-character string "λ" (double quote, a lowercase lambda, double quote) denotes the same string literal as "\u03bb".
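As a rough illustration of how the hexadecimal escapes (\x, \u, and \U) could be decoded, the sketch below checks the digit count for each escape kind, converts the digits, and rejects surrogates and out-of-range values. All procedure names are hypothetical; a real reader would work on a character stream and signal a parse error rather than return #f.

```scheme
;; Hypothetical sketch of decoding the \x, \u, and \U escapes.

;; Value of a hexadecimal digit character (either case), or #f.
(define (hex-digit-value c)
  (let ((n (char->integer c)))
    (cond ((and (char>=? c #\0) (char<=? c #\9))
           (- n (char->integer #\0)))
          ((and (char>=? c #\a) (char<=? c #\f))
           (+ 10 (- n (char->integer #\a))))
          ((and (char>=? c #\A) (char<=? c #\F))
           (+ 10 (- n (char->integer #\A))))
          (else #f))))

;; Integer denoted by a string of hex digits, or #f on a bad digit.
(define (hex-string->integer s)
  (let loop ((i 0) (acc 0))
    (if (= i (string-length s))
        acc
        (let ((d (hex-digit-value (string-ref s i))))
          (and d (loop (+ i 1) (+ (* acc 16) d)))))))

;; Number of hex digits required after each escape letter.
(define (escape-digit-count kind)
  (cond ((char=? kind #\x) 2)
        ((char=? kind #\u) 4)
        ((char=? kind #\U) 8)
        (else #f)))

;; Character denoted by the escape letter `kind` followed by
;; `digits`, or #f if the escape is malformed or out of range.
(define (hex-escape->char kind digits)
  (let ((count (escape-digit-count kind)))
    (and count
         (= (string-length digits) count)
         (let ((n (hex-string->integer digits)))
           (and n
                (or (<= 0 n #xd7ff) (<= #xe000 n #x10ffff))
                (integer->char n))))))
```

Under these assumptions, (hex-escape->char #\u "03bb") yields the lowercase lambda character, while (hex-escape->char #\u "d800") and (hex-escape->char #\u "3bb") both yield #f.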
Note that the meaning of \u<x><x><x><x> is slightly different in Java, because Java handles this escape sequence in the escape-processing phase, which can be viewed as preprocessing that precedes lexical analysis.
Here strings are a new syntax for denoting string literals that eliminates the quoting problem inherent in the traditional string syntax. The sequence #<< marks the start of a here string. The part of the line after the #<< up to and including the newline character (Unicode 10) is the key. The first line afterward that matches the key marks the end of the here string. The string contains all the characters between the start key and the end key, with the exception of the newline before the end key. The end key can be terminated by an end-of-file instead of a newline.
Example:
#<<THE-END
printf("hello\n");
THE-END
=> "printf(\"hello\\n\");"

#<<THE-END
printf("hello\n");

THE-END
=> "printf(\"hello\\n\");\n"
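The end-key matching rule can be sketched as a procedure over lines that have already been split. This is a simplified model, not the reference reader: here-string-contents and join-lines are hypothetical names, the input is assumed to be the list of lines that follow the #<< line (without their trailing newlines), and only a single key per line is handled.

```scheme
;; Simplified model of here-string parsing over pre-split lines.

(define newline-string (string #\newline))

;; Join a list of strings with a newline between adjacent elements.
(define (join-lines lines)
  (if (null? lines)
      ""
      (let loop ((rest (cdr lines)) (result (car lines)))
        (if (null? rest)
            result
            (loop (cdr rest)
                  (string-append result newline-string (car rest)))))))

;; Collect lines up to the first one equal to `key`. Returns the
;; contents of the here string, or #f if no line matches the key.
;; The newline before the end key is not part of the result.
(define (here-string-contents key lines)
  (let loop ((rest lines) (acc '()))
    (cond ((null? rest) #f)
          ((string=? (car rest) key) (join-lines (reverse acc)))
          (else (loop (cdr rest) (cons (car rest) acc))))))
```

Applied to the two examples above, the first input has one content line before THE-END and produces a string with no trailing newline; the second has an additional empty line before THE-END, so the result ends with a newline.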
Here strings are particularly useful for including verbatim the text of some other document that has its own syntax, which might conflict with the traditional Scheme string syntax, for example a source code fragment or a TeX document.
The here string syntax of this SRFI is based on the line-delimited here strings of Scsh (see Scsh Reference Manual), which in turn resemble ``here documents'' in shells such as bash and csh.
The syntax of symbols extends R5RS in three ways:
<letter>, this SRFI allows any character whose scalar value is greater than 127 and that is not considered whitespace according to SRFI-14.
Examples:
'Hello => Hello
'λ => λ
'|Hello| => Hello
(symbol->string '|a "b\" \|c\| \n|) => "a \"b\" |c| \n"
Character-comparison procedures are defined as follows:
;; char-comparator itself is not part of this SRFI; it is
;; used only to define other procedures
(define (char-comparator num-comp)
  (lambda (a-char b-char)
    (num-comp (char->integer a-char) (char->integer b-char))))

(define char=? (char-comparator =))
(define char<? (char-comparator <))
(define char>? (char-comparator >))
(define char<=? (char-comparator <=))
(define char>=? (char-comparator >=))
Unicode defines locale-independent mappings from scalar values to scalar values for upcase, downcase, and titlecase operations. (These mappings can be extracted from UnicodeData.txt from the Unicode Consortium.) The following Scheme procedures map characters consistent with the Unicode specification:
char-upcase
char-downcase
char-titlecase
Case-insensitive character-comparison procedures are defined as follows:
;; char-ci-comparator itself is not part of this SRFI; it is
;; used only to define other procedures
(define (char-ci-comparator cs-comp)
  (lambda (a-char b-char)
    (cs-comp (char-downcase a-char) (char-downcase b-char))))

(define char-ci=? (char-ci-comparator char=?))
(define char-ci<? (char-ci-comparator char<?))
(define char-ci>? (char-ci-comparator char>?))
(define char-ci<=? (char-ci-comparator char<=?))
(define char-ci>=? (char-ci-comparator char>=?))

Programmers should recognize that these procedures may not produce results that an end-user would consider sensible with a particular locale. This SRFI defines no locale-sensitive comparisons for characters.
The following predicates are as defined by SRFI-14:
char-alphabetic?
char-lower-case?
char-upper-case?
char-title-case?
char-numeric?
char-symbolic?
char-punctuation?
char-graphic?
char-whitespace?
char-blank?
char-iso-control?
The following string procedures are defined as in R5RS, which means that they are defined by pointwise operation on the string's characters:
string<?
string>?
string=?
string<=?
string>=?
string-ci<?
string-ci>?
string-ci=?
string-ci<=?
string-ci>=?
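To make the pointwise definition concrete, the sketch below builds a lexicographic comparison and a case-insensitive equality test from the character procedures. The names pointwise-string<? and pointwise-string-ci=? are hypothetical and chosen to avoid redefining the standard procedures; the case-insensitive version downcases each character, mirroring the definition of char-ci=? given earlier.

```scheme
;; Hypothetical illustrations of pointwise string comparison.

;; Lexicographic "less than" built from char<?.
(define (pointwise-string<? a b)
  (let ((len-a (string-length a))
        (len-b (string-length b)))
    (let loop ((i 0))
      (cond ((= i len-b) #f)   ; b exhausted: a is not smaller
            ((= i len-a) #t)   ; a is a proper prefix of b
            ((char<? (string-ref a i) (string-ref b i)) #t)
            ((char<? (string-ref b i) (string-ref a i)) #f)
            (else (loop (+ i 1)))))))

;; Case-insensitive equality: compare downcased characters pairwise.
(define (pointwise-string-ci=? a b)
  (and (= (string-length a) (string-length b))
       (let loop ((i 0))
         (or (= i (string-length a))
             (and (char=? (char-downcase (string-ref a i))
                          (char-downcase (string-ref b i)))
                  (loop (+ i 1)))))))
```

Because each step looks only at individual characters, these comparisons are independent of the locale, at the cost of sometimes disagreeing with the ordering an end-user would expect.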
MzScheme version 299.100 and up implements the character and string operations described in this SRFI. A portable implementation using R5RS Scheme (to implement a distinct character type) will accompany the next draft of this SRFI.
The file r6rs-reader.ss implements a MzScheme-specific reader for the character, string, and symbol syntax described in this SRFI. This reader uses readtable support from MzScheme 299.103 and up. It does not fully restrict unquoted symbols to the syntax of R5RS; for example, -> is parsed as a symbol. The reference reader does, however, disable MzScheme's backslash escape for symbols (making an unquoted backslash illegal).
The file r6rs-reader-test.ss contains a test suite.
This SRFI was written in consultation with the full set of R6RS editors: Will Clinger, Kent Dybvig, Marc Feeley, Matthew Flatt, Manuel Serrano, Michael Sperber, and Anton van Straaten.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.