Title

Permitting and Supporting Extended Character Sets

Author

Thomas Lord

Status

This SRFI is currently in withdrawn status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-52@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Received: 2004-01-25
Draft: 2004-02-09--2004-05-09
Withdrawn: 2004-06-17

Abstract

This SRFI describes how to modify the Revised Report (R5RS) in order to enable conforming implementations to use an extended character set such as (but not limited to) Unicode.

Changes to some requirements of the report are recommended. Currently, the Revised Report contains requirements which are difficult or impossible to satisfy with some extended character sets.

New required procedures are proposed, specified, and included in the reference implementation. These procedures enable portable Scheme programs to manipulate Scheme source texts and source data accurately, even in implementations using extended character sets.

This SRFI concludes with some suggestions for implementors interested in providing good Unicode support, using these suggestions to illustrate how the proposed changes to the Revised Report can "play out" in Unicode-based Scheme.

This SRFI does not attempt to provide a comprehensive library for global text processing. For example, one issue in global text processing is the need for linguistically-sensitive, locale-sensitive procedures for sorting strings. Such procedures are beyond the scope of this SRFI. On the other hand, by making Scheme compatible with extended character sets, this SRFI is a step in the direction of permitting global text processing standard libraries to be developed in a form portable across all conforming implementations.

This SRFI does not propose that implementations be required to support Unicode or any other extended character set. It does not specify a representation for Unicode characters or strings. It does revise the specifications of the report so that char? values may be Unicode (or other) characters.

The reference implementation included should prove to be easily ported to and effective for all ASCII-only implementations and for many implementations using an 8-bit character set which is an extension of ASCII (it will require very minor modifications for each particular implementation). Other implementations may need to use a different implementation.

Issues

The reference implementation is currently untested.

Rationale

The current edition of the Revised Report effectively defines a portable character set for Scheme. Portable programs should be expressed using only these characters in their source text, character constants, and string constants:

        alphabetic letters:    a..z  A..Z

        digits:                0..9

        punctuation:           ( ) # ' ` , @ . " 
                               ; $ % & * / : + -
                               ^ _ ~ \ < = > ?

        whitespace:            newline space

In what follows, we will often be considering what happens if a particular implementation permits additional characters. Most importantly, are we able to write portable Scheme programs that behave reasonably even when running on an implementation that uses an extended character set?

Problems With char?, string?, and symbol?

The Revised Report imposes some structural requirements on the char?, string?, and symbol? types, relating these to the syntax of Scheme data representations. For example, it contains requirements about case-mappings of alphabetic characters. The primary importance of the structural requirements in the context of the report is that they enable portable, "metacircular" Scheme programs. For example, the structural requirements make it possible to write a portable Scheme program which can accurately implement a version of the procedure read which is able to read data written using only the portable character set.

There are problems with the structural requirements.

The case-mapping problem: The Revised Report's requirements for case mapping can not be satisfied for some extended character sets. For example, the requirements state that for a char-alphabetic? character the procedure char-upcase must return an uppercase character. Yet for a character set containing the alphabetic character eszett (or "lowercase sharp S"), that requirement can not necessarily be satisfied in a pleasing way (if at all).

The portable reader problem: The existing character class predicates such as char-alphabetic? are insufficient for recognizing which extended characters may be part of an identifier.

The identifier equality and canonicalization problem: The case mapping and string comparison functions provided by the Revised Report are insufficient for computing whether two identifier names differ only by case distinctions. They are not suitable for converting an identifier name into the name of the symbol that it would yield if read by the native read procedure of an implementation using an extended character set.

The identifier concatenation problem: The Revised Report provides only string-append for deriving a new identifier name by concatenating two or more existing identifier names. Unfortunately, string-append is an inappropriate operation for concatenating identifiers which may use an extended character set (as, for example, when sandhi rules apply).

The character and string constant problem: The Revised Report provides a syntax for character and string constants; however, it does not specify how that syntax should be extended for larger character sets, and it does not provide a sufficient mechanism for a program to convert a character or string source form to an internal representation if the source form contains or refers to extended characters.

Several of those problems (portable reader, identifier equality and canonicalization, identifier concatenation, and string and character constants) could be grouped into a larger, more general category: the metacircularity problem. This SRFI is based in part on the presumption that one should be able to write a portable Scheme program which can accurately read and manipulate source texts in any implementation, even if those source texts contain characters specific to that implementation.
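As a rough illustration of the portable reader problem, the following sketch classifies identifier characters using only R5RS procedures. The names portable-initial? and portable-subsequent? are hypothetical, and the character lists are taken from the R5RS identifier grammar. The sketch is only reliable for the portable character set: on an implementation with extended characters, char-alphabetic? gives no assurance about which of them the host reader accepts in identifiers.

```scheme
;; Hypothetical helpers, not part of R5RS or of this SRFI.

;; True if c may begin an identifier under the R5RS grammar
;; (<letter> or <special initial>).
(define (portable-initial? c)
  (or (char-alphabetic? c)
      (if (memv c '(#\! #\$ #\% #\& #\* #\/ #\: #\< #\= #\> #\? #\^ #\_ #\~))
          #t
          #f)))

;; True if c may continue an identifier
;; (<initial>, <digit>, or <special subsequent>).
(define (portable-subsequent? c)
  (or (portable-initial? c)
      (char-numeric? c)
      (if (memv c '(#\+ #\- #\. #\@)) #t #f)))
```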

Specification

The specification is divided into two parts.

The first part, Changes to the Revised Report, describes how the report should be modified to permit extended character sets.

The second part, New Procedures, specifies the new procedures defined by this report and included in the reference implementation. Because these procedures can not be implemented in a way that is portable to all systems using extended character sets, and because they are essential for solving the metacircularity problem, the author recommends that these procedures be included in future editions of the Revised Report as required procedures.

Changes to the Revised Report

Chapter 2, Introduction:

Rather than:

Upper and lower case forms of a letter are never distinguished except within character and string constants. For example, Foo is the same identifier as FOO, and #x1AB is the same number as #X1ab.

say:

Case distinctions are not significant except within character and string constants. For example, Foo is the same identifier as FOO, and #x1AB is the same number as #X1ab.

Rationale: The corrected text is consistent with what was apparently intended; however, it is more appropriate for extended character sets because, in some systems using extended character sets, ignoring distinctions between upper and lower case forms of the letters in a string is not the same thing as ignoring case distinctions in the string.

Section 6.3.3, Symbols:

The specification of symbol->string says:

Returns the name of symbol as a string. If the symbol was part of an object returned as the value of a literal expression (section 4.1.2) or by a call to the read procedure, and its name contains alphabetic characters, then the string returned will contain characters in the implementation's preferred standard case -- some implementations will prefer upper case, others lower case. If the symbol was returned by string->symbol, the case of characters in the string returned will be the same as the case in the string that was passed to string->symbol. It is an error to apply mutation procedures like string-set! to strings returned by this procedure.

It should say:

Returns the name of symbol as a string. If the symbol was part of an object returned as the value of a literal expression (section 4.1.2) or by a call to the read procedure, its name will be in the implementation's preferred standard case -- some implementations will prefer upper case, others lower case. If the symbol was returned by string->symbol, the string returned will be string=? to the string that was passed to string->symbol. It is an error to apply mutation procedures like string-set! to strings returned by this procedure.

Rationale: (see previous).
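The string=? guarantee in the revised wording can be exercised with a small check. This is only a sketch, and the helper name round-trips? is hypothetical.

```scheme
;; Hypothetical helper: does a name survive string->symbol followed by
;; symbol->string unchanged, as judged by string=??
(define (round-trips? name)
  (string=? name (symbol->string (string->symbol name))))

(round-trips? "K. Harper, M.D.")   ; => #t (case and punctuation preserved)
```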

Section 6.3.4, Characters:

character constant syntax

The specification of character syntax says:

[...] If <character> in #\<character> is alphabetic, then the character following <character> must be a delimiter character such as a space or parenthesis.

It should say:

[...] If <character> in #\<character> is not one of the characters:
        digits:                0..9

        punctuation:           ( ) # ' ` , @ . " 
                               ; $ % & * / : + -
                               ^ _ ~ \ < = > ?

        whitespace:            newline space
then the character following <character> must be a delimiter character such as a space or parenthesis.

Rationale: For the portable character set, this change to the Revised Report makes no difference -- the meaning is the same. However, this change makes it possible for a character name to consist of more than one ideographic (hence non-alphabetic) character without creating an ambiguous syntax.

character order

The specification of char<? and related procedures says:

The upper case characters are in order. For example, (char<? #\A #\B) returns #t.
The lower case characters are in order. For example, (char<? #\a #\b) returns #t.
The digits are in order. For example, (char<? #\0 #\9) returns #t.
Either all the digits precede all the upper case letters, or vice versa.
Either all the digits precede all the lower case letters, or vice versa.

It should say:

The upper case characters A..Z are in order. For example, (char<? #\A #\B) returns #t. However, implementations may provide additional upper case letters which are not in order.
The lower case characters a..z are in order. For example, (char<? #\a #\b) returns #t. However, implementations may provide additional lower case letters which are not in order.
The digits 0..9 are in order. For example, (char<? #\0 #\9) returns #t. However, implementations may provide additional digits which are not in order.
Either all the digits 0..9 precede the upper case letters A..Z, or vice versa.
Either all the digits 0..9 precede the lower case letters a..z, or vice versa.

Rationale: The changes permit implementations to use the "natural" ordering of an extended character set so long as that order is consistent with the order required for the small set of portable characters. For example, a Unicode implementation might order characters by their assigned codepoint values -- but that would result in (extended character set) upper case letters that follow a..z while A..Z precede a..z.
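The ordering requirements that remain after this change can be stated as a check over the portable characters alone. The following is a minimal sketch; chars-in-order? and the three list names are hypothetical.

```scheme
;; Hypothetical helper: are the characters strictly increasing under char<??
(define (chars-in-order? chars)
  (or (null? chars)
      (null? (cdr chars))
      (and (char<? (car chars) (cadr chars))
           (chars-in-order? (cdr chars)))))

(define upper  (string->list "ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
(define lower  (string->list "abcdefghijklmnopqrstuvwxyz"))
(define digits (string->list "0123456789"))

;; Each of these is required to hold, whatever order the implementation
;; gives to any additional letters or digits.
(chars-in-order? upper)
(chars-in-order? lower)
(chars-in-order? digits)
(or (char<? #\9 #\A) (char<? #\Z #\0))   ; digits before A..Z, or after
(or (char<? #\9 #\a) (char<? #\z #\0))   ; digits before a..z, or after
```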

character classes

With regard to character class predicates such as char-alphabetic? the Revised Report says:

These procedures return #t if their arguments are alphabetic, numeric, whitespace, upper case, or lower case characters, respectively, otherwise they return #f. The following remarks, which are specific to the ASCII character set, are intended only as a guide: The alphabetic characters are the 52 upper and lower case letters. The numeric characters are the ten decimal digits. The whitespace characters are space, tab, line feed, form feed, and carriage return.

It should instead say:

These procedures return #t if their arguments are alphabetic, numeric, whitespace, upper case, or lower case characters, respectively, otherwise they return #f. The characters a..z and A..Z must be alphabetic. The digits 0..9 must be numeric. Space and newline must be whitespace.
The procedure read, the syntax accepted by a particular implementation, and the procedure char-whitespace? must all agree about whitespace characters. For example, if a character causes char-whitespace? to return #t, then that character must serve as a delimiter.

Rationale: Most of the guidance formerly provided regarding ASCII should really apply to the portable character set. This enables portable Scheme programs to use these procedures in a parser for Scheme data that consists only of portable characters. The new requirement for char-whitespace? allows a portable Scheme program to recognize the same set of whitespace delimiters as its host implementation.

character case-mapping

With regard to case-mapping, the specification of char-upcase and char-downcase says:

These procedures return a character char₂ such that (char-ci=? char char₂). In addition, if char is alphabetic, then the result of char-upcase is upper case and the result of char-downcase is lower case.

It should say:

These procedures return a character char₂ such that (char-ci=? char char₂). In addition, char-upcase must map a..z to A..Z and char-downcase must map A..Z to a..z.

Rationale: In some extended character sets, not all lowercase alphabetic characters have a corresponding uppercase character and not all uppercase alphabetic characters have a corresponding lowercase letter. This change recognizes that while preserving the required behavior of these procedures for the portable character set.
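What the revised text still requires can be written as a check over the portable letters. This is a sketch only; case-mapping-ok? is a hypothetical name, and it assumes its two arguments have the same length and list corresponding letters in the same positions.

```scheme
;; Hypothetical check of the revised char-upcase/char-downcase requirements.
;; lower and upper are equal-length strings of corresponding letters.
(define (case-mapping-ok? lower upper)
  (let loop ((ls (string->list lower))
             (us (string->list upper)))
    (cond
      ((null? ls) #t)
      ((and (char=? (char-upcase (car ls)) (car us))
            (char=? (char-downcase (car us)) (car ls))
            (char-ci=? (car ls) (car us)))
       (loop (cdr ls) (cdr us)))
      (else #f))))

(case-mapping-ok? "abcdefghijklmnopqrstuvwxyz"
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
```

Nothing in the check constrains letters outside a..z and A..Z, which is the latitude the change is meant to give.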

Section 6.3.5, Strings:

The introduction to strings says:

Some of the procedures that operate on strings ignore the difference between upper and lower case. The versions that ignore case have ``-ci'' (for ``case insensitive'') embedded in their names.

It should say:

Some of the procedures that operate on strings ignore the difference between strings in which upper and lower case variants of the same character occur in corresponding positions. The versions that ignore case have ``-ci'' (for ``case insensitive'') embedded in their names.

Rationale: The string ordering predicates, in general, are based on a lexical ordering induced by the constituent characters and their order of appearance within the strings. At the same time, "case insensitive string comparison" has a different meaning linguistically -- a character-based lexical ordering is not appropriate. This change simply makes it clear that the simple character-wise lexical ordering is the one intended. For example, this change emphasizes that string-ci=? may be portably and correctly implemented in terms of char-ci=?.
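A minimal sketch of the character-wise definition mentioned in the rationale, built only from char-ci=?. The name simple-string-ci=? is hypothetical and is used to avoid shadowing the built-in procedure.

```scheme
;; Hypothetical name: case-insensitive string equality defined
;; position by position in terms of char-ci=?.
(define (simple-string-ci=? a b)
  (let ((n (string-length a)))
    (and (= n (string-length b))
         (let loop ((i 0))
           (or (= i n)
               (and (char-ci=? (string-ref a i) (string-ref b i))
                    (loop (+ i 1))))))))

(simple-string-ci=? "Foo" "fOO")    ; => #t
(simple-string-ci=? "Foo" "Food")   ; => #f (lengths differ)
```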

New Procedures

This SRFI proposes the addition of a new section to the Revised Report, 6.6.5 Parsing Scheme Data, requiring the functions specified below.

procedure:(string->character string)

If the string formed by (string-append "#\\" string), if suitably delimited, would be read by read as a character constant, then return the character it denotes. Otherwise, return #f.

procedure:(string->string string)

If the string formed by (string-append "\"" string "\"") would be read by read as a string constant, then return a string which is string=? to the string it denotes. Otherwise, return #f.

procedure:(string->symbol-name string)

If string would, if suitably delimited, be read by read as an identifier, then return a string which is string=? to what would be returned by symbol->string for that symbol. Otherwise, return #f.

procedure:(form-identifier string₁ ...)

The arguments must be valid identifiers (see below).
This procedure should return a valid identifier (conceptually) formed by concatenating the arguments, then making any adjustments necessary to form a valid identifier.
form-identifier must preserve the following invariant for all arguments for which it is defined:
    (string=? (form-identifier (string->symbol-name s₁) ...)
              (string->symbol-name (form-identifier s₁ ...)))
    => #t
A valid identifier for these purposes is any string for which string->symbol-name would not return false.
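For an implementation restricted to the portable character set, the invariant can be checked directly. The sketch below assumes definitions modeled on this SRFI's reference implementation (written with the R5RS procedures symbol->string and list->string); invariant-holds? is a hypothetical helper.

```scheme
;; Portable-character-set definitions, modeled on the reference
;; implementation; no identifier validation is attempted.
(define (string->symbol-name s)
  (if (char=? #\a (string-ref (symbol->string 'A) 0))
      (list->string (map char-downcase (string->list s)))
      (list->string (map char-upcase (string->list s)))))

(define (form-identifier . strings)
  (apply string-append strings))

;; Hypothetical helper: canonicalizing before or after concatenation
;; must give the same identifier name.
(define (invariant-holds? . names)
  (string=? (apply form-identifier (map string->symbol-name names))
            (string->symbol-name (apply form-identifier names))))

(invariant-holds? "Foo" "Bar")
```

With plain string-append and character-wise case folding the two sides trivially agree; the invariant becomes a real constraint only when form-identifier has to adjust characters at the join.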

procedure:(char-delimiter? char)

Return #t if read would treat char as a delimiter, #f otherwise.

New Procedures Rationale: These procedures enable programs to parse Scheme data that may use an extended character set. Absent these or equivalent procedures, portable programs can only parse Scheme data written only using the portable character set.
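As an illustration of how a portable parser might use char-delimiter?, the sketch below extracts the leading token of a string. The definition of char-delimiter? follows the reference implementation for the portable character set; leading-token is a hypothetical helper.

```scheme
;; char-delimiter? for the portable character set, following the
;; reference implementation.
(define (char-delimiter? c)
  (or (char-whitespace? c)
      (if (memv c '(#\( #\) #\" #\;)) #t #f)))

;; Hypothetical helper: the leading run of non-delimiter characters of s,
;; that is, the text of the first token when s begins with one.
(define (leading-token s)
  (let loop ((i 0))
    (if (or (= i (string-length s))
            (char-delimiter? (string-ref s i)))
        (substring s 0 i)
        (loop (+ i 1)))))

(leading-token "lambda (x) x")   ; => "lambda"
```

On an implementation with an extended character set, only the definition of char-delimiter? would need to change; leading-token itself would stay the same.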

Illustrating a Unicode-based Scheme

Let us suppose that one wanted to make a Scheme implementation with two properties:

Standard Scheme: The implementation should meet the requirements of the latest edition of the Revised Report.

Global Scheme: The implementation should allow users to write character constants, string constants, and identifier names in their native language. For example, German speaking users should be free to use eszett in their identifier names and Chinese speaking users should be free to use identifier names composed of ideographs.

How might we accomplish this? Let's assume that the changes to the Revised Report recommended above have been made. This sketch isn't the only way to do it --- just a reasonable way.

Characters as Unicode codepoints: One natural approach to take is to make each Unicode codepoint representable as a Scheme char? value. Absent the changes to the Revised Report we could not easily do this -- for example, R5RS requirements for the character ordering and case-mapping procedures would be difficult to satisfy. With the proposed changes, there is no problem.

Unicode Best Practices for Identifier Equivalence: Some of the authors of the Unicode standard and related technical reports have thought very hard about how to decide identifier equivalence in programming languages which ignore distinctions of case but allow people to write identifier names in their native languages. (For example, see Annex 7 ("Programming Language Identifiers") of Unicode Technical Report 15 ("Unicode Normalization Forms").)

We can adopt those best practices for our Scheme fairly directly. Most especially, the new procedure string->symbol-name reifies our concept of identifier equality in a form that will allow portable Scheme programs to access the identifier equality relation used by our implementation.

Character Constants and String Constants: We can choose whatever syntax we like for our extended character set Scheme, just so long as it is consistent with the requirements for delimiters. Portable programs can use procedures such as string->character to access our extended namespace of characters.

Implementation

The enclosed implementation is suitable for a hypothetical Scheme which supports only the portable character set. The odds are good that it will run correctly with only very minor modifications on most other implementations.

Ambitious implementations using extended character sets may need to use a different implementation entirely.


;;; SRFI-?? reference implementation

;;; WARNING UNTESTED CODE


(define (string->character s)
  (cond
    ((= 1 (string-length s))    (string-ref s 0))
    ((string-ci=? "newline" s)  #\newline)
    ((string-ci=? "space" s)    #\space)
    (#t                         #f)))


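(define (string->string s)
  ;; Decode the body of a string constant, or return #f if it is ill formed.
  (let loop ((todo       (string->list s))
             (rev-output '()))
    (cond
      ((null? todo)
       (list->string (reverse rev-output)))

      ;; An unescaped double quote would end the constant early.
      ((char=? #\" (car todo))
       #f)

      ((not (char=? #\\ (car todo)))
       (loop (cdr todo) (cons (car todo) rev-output)))

      ;; A backslash must be followed by a backslash or a double quote.
      ((or (null? (cdr todo))
           (not (memv (cadr todo) '(#\\ #\"))))
       #f)

      (else
       (loop (cddr todo) (cons (cadr todo) rev-output))))))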

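(define (string->symbol-name s)
  ;; Fold s to the implementation's preferred case, detected from how the
  ;; reader spelled the literal symbol A.  NOTE: this does not check that s
  ;; is a valid identifier, so it never returns #f.
  (if (char=? #\a (string-ref (symbol->string 'A) 0))
      (list->string (map char-downcase (string->list s)))
      (list->string (map char-upcase (string->list s)))))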

(define (form-identifier . strings)
  (apply string-append strings))

(define (char-delimiter? c)
  (or (char-whitespace? c)
      (not (not (memv c '(#\( #\) #\" #\;))))))

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Author: Thomas Lord

Editor: Francisco Solsona

Last modified: Sun Jan 28 13:40:36 MET 2007
