(Hopefully) final changes to SRFI-14 (character sets)

This page is part of the web mail archives of SRFI 14 from before July 7th, 2015. The new archives for SRFI 14 contain all messages, not just those from before July 7th, 2015.

As I prepare to conclude work on the SRFI-13 string library, I have reworked
the SRFI-14 character-set spec, principally to get it synced up with the
Unicode world. Mike S will presumably have the new draft available at
(It is also available at

A summary of the changes appears below. I have no further changes I wish
to make to this library. If review does not reveal any problems, we can
put this to bed.

- Added a function for hashing character sets.

- Uniformly extended the char-set constructor procedures to take an optional
  BASE-CS argument; when it is supplied, the procedure adds the requested
  characters to those already in BASE-CS. This allows convenient incremental
  construction of heterogeneous character sets, e.g.
      (predicate->char-set vowel?
        (list->char-set '(#\+ #\-)
          (string->char-set "13579")))
  or, more efficiently, with the "!" variants, which may reuse the storage
  of BASE-CS,
      (predicate->char-set! vowel?
        (list->char-set! '(#\+ #\-)
          (string->char-set "13579")))

- I removed the seventeen predicates
    char-lower-case?	char-upper-case?	char-title-case?
    char-letter?	char-digit?		char-letter+digit?
    char-graphic?	char-printing?		char-whitespace?
    char-iso-control?	char-punctuation?	char-symbol?
    char-hex-digit?	char-blank?		char-ascii?
    char-empty?		char-full?
  They belong in a *character* library, not a char-set library.

- I have made pervasive changes to the SRFI to bring it into alignment with
  Unicode concepts:

  - Changed the name ASCII-RANGE->CHAR-SET to the more modern 
    UCS-RANGE->CHAR-SET, and provided a full specification in terms
    of UCS/Unicode.

  - Changed "alphabetic" and "numeric" to Unicode terms "letter" and "digit."

  - Split "symbols" out from "punctuation" characters, in conformance with 

  - Renamed CHAR-SET:CONTROL to CHAR-SET:ISO-CONTROL, to make clear that
    weirdo Unicode control codes are excluded. (This is in alignment with 


  - Specified what the standard character sets are in Unicode, Latin-1
    and ASCII implementations. These definitions are almost completely
    compatible with Java's. (The only real incompatibility is the definition
    of whitespace.) The ASCII/Latin-1/Unicode specs are compatible, so
    that code written using these sets has a good chance of being portable
    across implementations with different underlying character representations.

    Being compatible with Java is occasionally challenging, as the Java
    definitions are not internally consistent. There is discussion of the
    specifics where relevant.
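
As a rough illustration of the constructor and hashing changes summarized
above, here is a minimal sketch. It assumes an R7RS-style implementation that
provides this library as (srfi 14). It also assumes the procedure names and
argument orders of the final SRFI: UCS-RANGE->CHAR-SET taking a half-open
[lower, upper) range of code points, plus CHAR-SET-HASH and
CHAR-SET-CONTAINS?. This message does not spell those out, and the draft may
differ. The names DIGITS and NUMERIC-CHARS are made up for the example.

```scheme
(import (scheme base) (scheme write) (srfi 14))

;; Half-open code-point range [48, 58): the ASCII digits 0 through 9.
(define digits (ucs-range->char-set 48 58))

;; Optional BASE-CS argument: the result holds #\+ and #\- in addition
;; to every character already in DIGITS. DIGITS itself is not modified.
(define numeric-chars (list->char-set '(#\+ #\-) digits))

(display (char-set-contains? numeric-chars #\7)) (newline)  ; digit, from DIGITS
(display (char-set-contains? numeric-chars #\-)) (newline)  ; added by list->char-set
(display (char-set-contains? numeric-chars #\a)) (newline)  ; never added

;; Sets with the same members should hash to the same value,
;; regardless of how they were built.
(display (= (char-set-hash numeric-chars)
            (char-set-hash (string->char-set "+-0123456789"))))
(newline)
```

Because the range is given as code points rather than characters, the same
call should select the same digits on ASCII, Latin-1, and Unicode
implementations, which is the portability point made above.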