This SRFI is currently in withdrawn status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-52@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.
This SRFI describes how to modify the Revised Report (R5RS) in order to enable conforming implementations to use an extended character set such as (but not limited to) Unicode.
Changes to some requirements of the report are recommended. Currently, the Revised Report contains requirements which are difficult or impossible to satisfy with some extended character sets.
New required procedures are proposed, specified, and included in the reference implementation. These procedures enable portable Scheme programs to manipulate Scheme source texts and source data accurately, even in implementations using extended character sets.
This SRFI concludes with some suggestions for implementors interested in providing good Unicode support, using these suggestions to illustrate how the proposed changes to the Revised Report can "play out" in Unicode-based Scheme.
This SRFI does not attempt to provide a comprehensive library for global text processing. For example, one issue in global text processing is the need for linguistically-sensitive, locale-sensitive procedures for sorting strings. Such procedures are beyond the scope of this SRFI. On the other hand, by making Scheme compatible with extended character sets, this SRFI is a step in the direction of permitting global text processing standard libraries to be developed in a form portable across all conforming implementations.
This SRFI does not propose that implementations be required to support Unicode or any other extended character set. It does not specify a representation for Unicode characters or strings. It does revise the specifications of the report so that char? values may be Unicode (or other) characters.
The reference implementation included should prove to be easily ported to and effective for all ASCII-only implementations and for many implementations using an 8-bit character set which is an extension of ASCII (it will require very minor modifications for each particular implementation). Other implementations may need to use a different implementation.
The current edition of the Revised Report effectively defines a portable character set for Scheme. Portable programs should be expressed using only these characters in their source text, character constants, and string constants:
alphabetic letters: a..z A..Z
digits: 0..9
punctuation: ( ) # ' ` , @ . " ; $ % & * / : + - ^ _ ~ \ < = > ?
whitespace: newline space
In what follows, we will often be considering what happens if a particular implementation permits additional characters. Most importantly, are we able to write portable Scheme programs that behave reasonably even when running on an implementation that uses an extended character set?
char?, string?, and symbol?
The Revised Report imposes some structural requirements on the char?, string?, and symbol? types, relating these to the syntax of Scheme data representations. For example, it contains requirements about case-mappings of alphabetic characters. The primary importance of the structural requirements in the context of the report is that they enable portable, "metacircular" Scheme programs. For example, the structural requirements make it possible to write a portable Scheme program which can accurately implement a version of the procedure read which is able to read data written using only the portable character set.
There are problems with the structural requirements.
The case-mapping problem: The Revised Report's requirements for case mapping cannot be satisfied for some extended character sets. For example, the requirements state that for a char-alphabetic? character the procedure char-upcase must return an uppercase character. Yet for a character set containing the alphabetic character eszett (or "lowercase sharp S"), that requirement cannot necessarily be satisfied in a pleasing way (if at all).
The portable reader problem: The existing character class predicates such as char-alphabetic? are insufficient for recognizing which extended characters may be part of an identifier.
The identifier equality and canonicalization problem: The case mapping and string comparison functions provided by the Revised Report are insufficient for computing whether two identifier names differ only by case distinctions. They are not suitable for converting an identifier name into the name of the symbol that would result if the identifier were read by the native read procedure of an implementation using an extended character set.
The identifier concatenation problem: The Revised Report provides only string-append for deriving a new identifier name by concatenating two or more existing identifier names. Unfortunately, string-append is an inappropriate operation for concatenating identifiers which may use an extended character set (as, for example, when sandhi rules apply).
The character and string constant problem: The Revised Report provides a syntax for character and string constants. However, it does not specify how that syntax should be extended for larger character sets, and it does not provide a sufficient mechanism for a program to convert a character or string source form to an internal representation if the source form contains or refers to extended characters.
Several of those problems (portable reader, identifier equality and canonicalization, identifier concatenation, and string and character constants) could be grouped into a larger, more general category: the metacircularity problem. This SRFI is based in part on the presumption that one should be able to write a portable Scheme program which can accurately read and manipulate source texts in any implementation, even if those source texts contain characters specific to that implementation.
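The case-mapping problem described above can be probed from within a program. The following is a minimal sketch, not part of this SRFI: the name upcase-violations is hypothetical, and the procedure simply collects every alphabetic character in a list whose char-upcase result is not upper case. On the portable character set the result should be empty; what it reports for an extended character such as eszett depends entirely on the host implementation.

```scheme
;; Hypothetical helper (not defined by this SRFI or the Revised Report).
;; Returns the characters in CHARS that are alphabetic but whose
;; char-upcase result is not an upper case character.
(define (upcase-violations chars)
  (let loop ((cs chars) (bad '()))
    (cond ((null? cs) (reverse bad))
          ((and (char-alphabetic? (car cs))
                (not (char-upper-case? (char-upcase (car cs)))))
           (loop (cdr cs) (cons (car cs) bad)))
          (else (loop (cdr cs) bad)))))

;; Portable characters satisfy the R5RS requirement, so this is empty.
(upcase-violations (string->list "abcXYZ019"))
```

Running the same procedure over whatever extended characters an implementation accepts would show where the existing requirement cannot be met.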
The specification is divided into two parts.
The first part, Changes to the Revised Report, describes how the report should be modified to permit extended character sets.
The second part, New Procedures, specifies the new procedures defined by this report and included in the reference implementation. Because these procedures can not be implemented in a way that is portable to all systems using extended character sets, and because they are essential for solving the metacircularity problem, the author recommends that these procedures be included in future editions of the Revised Report as required procedures.
Rather than:
Upper and lower case forms of a letter are never distinguished except within character and string constants. For example, Foo is the same identifier as FOO, and #x1AB is the same number as #X1ab.
say:
Case distinctions are not significant except within character and string constants. For example, Foo is the same identifier as FOO, and #x1AB is the same number as #X1ab.
Rationale: The corrected text is consistent with what was apparently intended. However, it is more appropriate for extended character sets, because in some systems using extended character sets, ignoring distinctions between upper and lower forms of the letters in a string is not the same thing as ignoring case distinctions in the string.
The specification of symbol->string says:
Returns the name of symbol as a string. If the symbol was part of an object returned as the value of a literal expression (section 4.1.2) or by a call to the read procedure, and its name contains alphabetic characters, then the string returned will contain characters in the implementation's preferred standard case -- some implementations will prefer upper case, others lower case. If the symbol was returned by string->symbol, the case of characters in the string returned will be the same as the case in the string that was passed to string->symbol. It is an error to apply mutation procedures like string-set! to strings returned by this procedure.
It should say:
Returns the name of symbol as a string. If the symbol was part of an object returned as the value of a literal expression (section 4.1.2) or by a call to the read procedure, its name will be in the implementation's preferred standard case -- some implementations will prefer upper case, others lower case. If the symbol was returned by string->symbol, the string returned will be string=? to the string that was passed to string->symbol. It is an error to apply mutation procedures like string-set! to strings returned by this procedure.
Rationale: (see previous).
The specification of character syntax says:
[...] If <character> in #\<character> is alphabetic, then the character following <character> must be a delimiter character such as a space or parenthesis.
It should say:
[...] If <character> in #\<character> is not one of the characters:
digits: 0..9
punctuation: ( ) # ' ` , @ . " ; $ % & * / : + - ^ _ ~ \ < = > ?
whitespace: newline space
then the character following <character> must be a delimiter character such as a space or parenthesis.
Rationale: For the portable character set, this change to the Revised Report makes no difference -- the meaning is the same. However, this change makes it possible for a character name to consist of more than one ideographic (hence non-alphabetic) character without creating an ambiguous syntax.
The specification of char<? and related procedures says:
- The upper case characters are in order. For example, (char<? #\A #\B) returns #t.
- The lower case characters are in order. For example, (char<? #\a #\b) returns #t.
- The digits are in order. For example, (char<? #\0 #\9) returns #t.
- Either all the digits precede all the upper case letters, or vice versa.
- Either all the digits precede all the lower case letters, or vice versa.
It should say:
- The upper case characters A..Z are in order. For example, (char<? #\A #\B) returns #t. However, implementations may provide additional upper case letters which are not in order.
- The lower case characters a..z are in order. For example, (char<? #\a #\b) returns #t. However, implementations may provide additional lower case letters which are not in order.
- The digits 0..9 are in order. For example, (char<? #\0 #\9) returns #t. However, implementations may provide additional digits which are not in order.
- Either all the digits 0..9 precede the upper case letters A..Z, or vice versa.
- Either all the digits 0..9 precede the lower case letters a..z, or vice versa.
Rationale: The changes permit implementations to use the "natural" ordering of an extended character set so long as that order is consistent with the order required for the small set of portable characters. For example, a Unicode implementation might order characters by their assigned codepoint values -- but that would result in (extended character set) upper case letters that follow a..z while A..Z precede a..z.
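The revised ordering requirements constrain only the portable characters, so a program can verify them directly. The sketch below is illustrative only; the names ascending?, all-before?, and portable-order-ok? are hypothetical and are not proposed by this SRFI. It uses nothing beyond char<? and standard list operations.

```scheme
;; Hypothetical helpers for checking the ordering rules on A..Z, a..z, 0..9.
(define upper  (string->list "ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
(define lower  (string->list "abcdefghijklmnopqrstuvwxyz"))
(define digits (string->list "0123456789"))

;; #t if each character is char<? its successor in the list.
(define (ascending? chars)
  (or (null? chars)
      (null? (cdr chars))
      (and (char<? (car chars) (cadr chars))
           (ascending? (cdr chars)))))

;; #t if every character of XS precedes every character of YS.
(define (all-before? xs ys)
  (let outer ((xs xs))
    (or (null? xs)
        (and (let inner ((ys ys))
               (or (null? ys)
                   (and (char<? (car xs) (car ys))
                        (inner (cdr ys)))))
             (outer (cdr xs))))))

;; Checks the five ordering rules, restricted to the portable characters.
(define (portable-order-ok?)
  (and (ascending? upper)
       (ascending? lower)
       (ascending? digits)
       (or (all-before? digits upper) (all-before? upper digits))
       (or (all-before? digits lower) (all-before? lower digits))))
```

A conforming implementation should answer #t to (portable-order-ok?) whether or not it also orders additional, extended characters by codepoint.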
With regard to character class predicates such as char-alphabetic? the Revised Report says:
These procedures return #t if their arguments are alphabetic, numeric, whitespace, upper case, or lower case characters, respectively, otherwise they return #f. The following remarks, which are specific to the ASCII character set, are intended only as a guide: The alphabetic characters are the 52 upper and lower case letters. The numeric characters are the ten decimal digits. The whitespace characters are space, tab, line feed, form feed, and carriage return.
It should instead say:
These procedures return #t if their arguments are alphabetic, numeric, whitespace, upper case, or lower case characters, respectively, otherwise they return #f. The characters a..z and A..Z must be alphabetic. The digits 0..9 must be numeric. Space and newline must be whitespace.
The procedure read, the syntax accepted by a particular implementation, and the procedure char-whitespace? must all agree about whitespace characters. For example, if a character causes char-whitespace? to return #t, then that character must serve as a delimiter.
Rationale: Most of the guidance formerly provided regarding ASCII should really apply to the portable character set. This enables portable Scheme programs to use these procedures in a parser for Scheme data that consists only of portable characters. The new requirement for char-whitespace? allows a portable Scheme program to recognize the same set of whitespace delimiters as its host implementation.
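As an illustration of why agreement between read and char-whitespace? matters, here is a small sketch of a portable splitter that relies only on char-whitespace? to find token boundaries. The name split-on-whitespace is hypothetical. Under the proposed requirement, any extra whitespace characters recognized by the host would be treated as boundaries here as well, without changes to the code.

```scheme
;; Hypothetical helper: split S into maximal runs of non-whitespace
;; characters, using the host's own notion of whitespace.
(define (split-on-whitespace s)
  (let loop ((cs (string->list s)) (word '()) (words '()))
    ;; Add the word collected so far (if any) to the result list.
    (define (flush)
      (if (null? word)
          words
          (cons (list->string (reverse word)) words)))
    (cond ((null? cs) (reverse (flush)))
          ((char-whitespace? (car cs)) (loop (cdr cs) '() (flush)))
          (else (loop (cdr cs) (cons (car cs) word) words)))))

(split-on-whitespace "foo  bar baz")   ; => ("foo" "bar" "baz")
```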
With regard to case-mapping, the specification of char-upcase and char-downcase says:
These procedures return a character char2 such that (char-ci=? char char2). In addition, if char is alphabetic, then the result of char-upcase is upper case and the result of char-downcase is lower case.
It should say:
These procedures return a character char2 such that (char-ci=? char char2). In addition, char-upcase must map a..z to A..Z and char-downcase must map A..Z to a..z.
Rationale: In some extended character sets, not all lowercase alphabetic characters have a corresponding uppercase character and not all uppercase alphabetic characters have a corresponding lowercase letter. This change recognizes that while preserving the required behavior of these procedures for the portable character set.
With regard to case-insensitive string procedures, the Revised Report says:
Some of the procedures that operate on strings ignore the difference between upper and lower case. The versions that ignore case have "-ci" (for "case insensitive") embedded in their names.
It should say:
Some of the procedures that operate on strings ignore the difference between strings in which upper and lower case variants of the same character occur in corresponding positions. The versions that ignore case have "-ci" (for "case insensitive") embedded in their names.
Rationale: The string ordering predicates, in general, are based on a lexical ordering induced by the constituent characters and their order of appearance within the strings. At the same time, "case insensitive string comparison" has a different meaning linguistically -- a character-based lexical ordering is not appropriate. This change simply makes it clear that the simple character-wise lexical ordering is the one intended. For example, this change emphasizes that string-ci=? may be portably and correctly implemented in terms of char-ci=?.
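The character-wise reading of case-insensitive comparison can be sketched as follows. This is one possible definition, under the assumption that two strings are equal exactly when they have the same length and char-ci=? holds at every position; the name my-string-ci=? is hypothetical and is used only to avoid shadowing the built-in procedure.

```scheme
;; Hypothetical name; a character-wise definition of case-insensitive
;; string equality built only from char-ci=?.
(define (my-string-ci=? s1 s2)
  (and (= (string-length s1) (string-length s2))
       (let loop ((i 0))
         (or (= i (string-length s1))
             (and (char-ci=? (string-ref s1 i) (string-ref s2 i))
                  (loop (+ i 1)))))))

(my-string-ci=? "Scheme" "SCHEME")   ; => #t
(my-string-ci=? "Scheme" "Schema")   ; => #f
```

Because the comparison proceeds position by position, strings of different lengths are never equal under this definition, which is precisely where it departs from linguistic notions of case-insensitive matching.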
This SRFI proposes the addition of a new section to the Revised Report, 6.6.5 Parsing Scheme Data, requiring the functions specified below.
procedure: (string->character string)
If the string formed by (string-append "#\\" string), if suitably delimited, would be read by read as a character constant, then return the character it denotes. Otherwise, return #f.
procedure: (string->string string)
If the string formed by (string-append "\"" string "\"") would be read by read as a string constant, then return a string which is string=? to the string it denotes. Otherwise, return #f.
procedure: (string->symbol-name string)
If string would, if suitably delimited, be read by read as an identifier, then return a string which is string=? to what would be returned by symbol->string for that symbol. Otherwise, return #f.
procedure: (form-identifier string1 ...)
The arguments must be valid identifiers (see below). This procedure should return a valid identifier (conceptually) formed by concatenating the arguments, then making any adjustments necessary to form a valid identifier.
form-identifier must preserve the following invariant for all arguments for which it is defined:
(string=? (apply form-identifier (map string->symbol-name s1 ...))
          (string->symbol-name (form-identifier s1 ...)))
=> #t
A valid identifier for these purposes is any string for which string->symbol-name would not return false.
procedure: (char-delimiter? char)
Return #t if read would treat char as a delimiter, #f otherwise.
New Procedures Rationale: These procedures enable programs to parse Scheme data that may use an extended character set. Absent these or equivalent procedures, portable programs can parse only Scheme data written using the portable character set.
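To make the form-identifier invariant concrete, the sketch below checks it against simple stand-in definitions. These stand-ins are assumptions for illustration only: they handle just the portable character set, they assume an implementation whose preferred case is lower case, and they do not validate their arguments. The name invariant-holds? is hypothetical.

```scheme
;; Stand-in definitions (ASCII only, lower case preferred; assumptions
;; made for this example, not the definitions required by the SRFI).
(define (string->symbol-name s)
  (list->string (map char-downcase (string->list s))))

(define (form-identifier . strings)
  (apply string-append strings))

;; Hypothetical checker: canonicalizing the parts and then combining them
;; must agree with combining first and canonicalizing the result.
(define (invariant-holds? . names)
  (string=? (apply form-identifier (map string->symbol-name names))
            (string->symbol-name (apply form-identifier names))))

(invariant-holds? "Make-" "Widget")   ; => #t
```

With plain concatenation and character-wise case folding the two sides trivially agree; the invariant becomes a real constraint only when an implementation's form-identifier must adjust characters at the join.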
Let us suppose that one wanted to make a Scheme implementation with two properties:
Standard Scheme: The implementation should meet the requirements of the latest edition of the Revised Report.
Global Scheme: The implementation should allow users to write character constants, string constants, and identifier names in their native language. For example, German speaking users should be free to use eszett in their identifier names and Chinese speaking users should be free to use identifier names composed of ideographs.
How might we accomplish this? Let's assume that the changes to the Revised Report recommended above have been made. This sketch isn't the only way to do it -- just a reasonable way.
Characters as Unicode codepoints: One natural approach to take is to make each Unicode codepoint representable as a Scheme char? value. Absent the changes to the Revised Report we could not easily do this -- for example, R5RS requirements for the character ordering and case-mapping procedures would be difficult to satisfy. With the proposed changes, there is no problem.
Unicode Best Practices for Identifier Equivalence: Some of the authors of the Unicode standard and related technical reports have thought very hard about how to decide identifier equivalence in programming languages which ignore distinctions of case but allow people to write identifier names in their native languages. (For example, see Annex 7 ("Programming Language Identifiers") of Unicode Technical Report 15 ("Unicode Normalization Forms").)
We can adopt those best practices for our Scheme fairly directly. Most especially, the new procedure string->symbol-name reifies our concept of identifier equality in a form that will allow portable Scheme programs to access the identifier equality relation used by our implementation.
Character Constants and String Constants: We can choose whatever syntax we like for our extended character set Scheme, just so long as it is consistent with the requirements for delimiters. Portable programs can use procedures such as string->character to access our extended namespace of characters.
Ambitious implementations using extended character sets may need to use a different implementation entirely.
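;;; SRFI-?? reference implementation
;;; WARNING UNTESTED CODE

;; Return the character named by S (the text following "#\"), or #f.
(define (string->character s)
  (cond ((= 1 (string-length s)) (string-ref s 0))
        ((string-ci=? "newline" s) #\newline)
        ((string-ci=? "space" s) #\space)
        (else #f)))

;; Return the string denoted by S, the body of a string constant, or #f
;; if S is ill formed.  Only the escapes \\ and \" are recognized.
(define (string->string s)
  (let loop ((todo (string->list s))
             (rev-output '()))
    (cond ((null? todo)
           (list->string (reverse rev-output)))
          ;; An unescaped double quote would end the constant early.
          ((char=? #\" (car todo)) #f)
          ((not (char=? #\\ (car todo)))
           (loop (cdr todo) (cons (car todo) rev-output)))
          ;; A backslash must be followed by a backslash or a double quote.
          ((or (null? (cdr todo))
               (not (memv (cadr todo) '(#\\ #\"))))
           #f)
          (else
           (loop (cddr todo) (cons (cadr todo) rev-output))))))

;; Fold S to the implementation's preferred case.  This version does not
;; check that S is a valid identifier, so it never returns #f.
(define (string->symbol-name s)
  (if (char=? #\a (string-ref (symbol->string 'A) 0))
      (list->string (map char-downcase (string->list s)))
      (list->string (map char-upcase (string->list s)))))

;; Plain concatenation suffices for the portable character set.
(define (form-identifier . strings)
  (apply string-append strings))

;; #t if C is whitespace or one of ( ) " ;
(define (char-delimiter? c)
  (or (char-whitespace? c)
      (if (memv c '(#\( #\) #\" #\;)) #t #f)))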
Copyright (C) Thomas Lord (2004). All Rights Reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.