This document provides modern definitions of the standard SRFI 14 character sets in terms of Unicode properties. The definitions in the published SRFI were taken from Java 1.0, which means they reflect data and interpretations as of Unicode 2.0. The Unicode version at the time of writing (2019-11-13) is 12.1, so some corrections and expansions are in order.
Unicode publishes many properties of characters online in the Unicode Character Database. This document recommends that Scheme implementers who wish to provide Unicode versions of SRFI 14 use these to write a program generating a current list of Unicode characters that fit into each of the standard sets. The program can then provide the result in some way that the implementation can use, either as S-expressions, as binary files, or in some other way. This program should be re-run whenever a new version of Unicode is published. It is very rare to unheard-of for any character to be removed from any character set, but as new characters are added to Unicode (no characters have been removed since Unicode 2.0), the standard character sets grow over time.
The Unicode files we need are found on the unicode.org web site. The UnicodeData.txt file contains the General Category, a 2-letter code that assigns every Unicode character to one of thirty classes. For example, Lu means "upper case letter" and Sm means "mathematical symbol". The PropList.txt and DerivedCoreProperties.txt files provide various properties of either single Unicode characters or ranges of them. For example, the property Deprecated applies to the characters whose existence is the result of a mistake or whose use is strongly discouraged. Characters can, and generally do, have more than one property.
UnicodeData.txt is a very straightforward file with multiple fields separated by ; characters. Field 1 is the Unicode codepoint in hex: four, five, or six characters. Field 2 is normally the official name of the character. Field 3 is the General Category.
However, there is one special convention to make the file shorter. If the content of field 2 begins with < and ends with First>, then the line represents the first codepoint in a range of characters that all have the same properties and whose names are generated algorithmically. Every such line is immediately followed by another special line, whose field 2 begins with < and ends with Last>, which specifies the last codepoint of the range.
For example, the consecutive lines
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
mean that all characters from U+4E00 to U+9FEF inclusive belong to category Lo.
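The range convention is easy to overlook, so a minimal Python sketch may help. It assumes the lines of UnicodeData.txt have already been read into a list of strings. The function name parse_unicode_data and the (first, last, category) triples it yields are illustrative choices, not part of Unicode or SRFI 14.

```python
def parse_unicode_data(lines):
    """Yield (first, last, category) triples from UnicodeData.txt lines.

    Hypothetical helper: single characters yield first == last, and a
    First>/Last> pair of lines yields one triple for the whole range.
    """
    range_start = None  # codepoint of a pending "<..., First>" line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        codepoint = int(fields[0], 16)  # field 1: hex codepoint
        name = fields[1]                # field 2: name or range marker
        category = fields[2]            # field 3: General Category
        if name.startswith("<") and name.endswith("First>"):
            range_start = codepoint
        elif name.startswith("<") and name.endswith("Last>"):
            if range_start is None:
                raise ValueError("Last> line without a preceding First> line")
            yield (range_start, codepoint, category)
            range_start = None
        else:
            # Ordinary line, including names such as <control>
            yield (codepoint, codepoint, category)


sample = [
    "0000;<control>;Cc;0;BN;;;;;N;NULL;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]
ranges = list(parse_unicode_data(sample))
```

Keeping ranges as pairs rather than expanding them avoids materializing tens of thousands of ideographs until a set actually needs them.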
The format of PropList.txt and DerivedCoreProperties.txt is more complicated but more flexible. Comments beginning with # may appear on any line and go to the end of the line; a line beginning with # is a comment. All such comments, as well as blank lines, should be completely ignored. Spaces within lines should also be discarded.
After that, each remaining line contains two fields separated by ; characters. Field 1 is either a single hex codepoint or else two hex codepoints separated by .. designating the first and last codepoints in the range. Field 2 is Unicode's standard name for the property.
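The cleanup and field rules above can be sketched in Python as follows. This is only an illustration under the assumption that the file has been read into a list of lines. The name parse_property_file is hypothetical, and the code reads only the first two fields of each line.

```python
def parse_property_file(lines):
    """Yield (first, last, property_name) triples from PropList-style lines."""
    for line in lines:
        line = line.split("#", 1)[0]   # drop a trailing or whole-line comment
        line = line.replace(" ", "")   # spaces within lines are discarded
        line = line.strip()            # drop the newline and other edge whitespace
        if not line:
            continue                   # blank or comment-only line
        codepoints, prop = line.split(";")[:2]
        if ".." in codepoints:
            first, last = codepoints.split("..")
        else:
            first = last = codepoints  # single codepoint: a one-element range
        yield (int(first, 16), int(last, 16), prop)


sample = [
    "# PropList sample",
    "",
    "0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>",
    "0020          ; White_Space # Zs       SPACE",
]
parsed = list(parse_property_file(sample))
```

Stripping the comment before looking for .. matters here, because the comments themselves often contain that sequence.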
Each set is defined as the union of specified general categories, properties, other character sets, single codepoints, and ranges of codepoints (using the .. notation of PropList.txt).
The definitions are based on comments in DerivedCoreProperties.txt, chapter 2 of the Unicode Standard, and other places, with some help from the C/C++/Posix definitions. Unicode uses the term graphic characters to include whitespace, but here we follow Posix and call them printable characters, restricting the former term to exclude whitespace.
The notation L* is not a specific category, but represents the union of all categories beginning with L, namely Ll Lu Lo Lt Lm, and similarly for all other category codes.
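One way to handle the wildcard notation in a generator program is to expand it against the list of General Category codes. The sketch below is illustrative only: GENERAL_CATEGORIES and expand_pattern are hypothetical names, and the list is the thirty two-letter codes defined by Unicode.

```python
# The thirty General Category codes, grouped by first letter.
GENERAL_CATEGORIES = [
    "Lu", "Ll", "Lt", "Lm", "Lo",
    "Mn", "Mc", "Me",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
    "Sm", "Sc", "Sk", "So",
    "Zs", "Zl", "Zp",
    "Cc", "Cf", "Cs", "Co", "Cn",
]


def expand_pattern(pattern):
    """Return the category codes named by a pattern such as 'L*' or 'Nd'."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return [c for c in GENERAL_CATEGORIES if c.startswith(prefix)]
    return [pattern]


letters = expand_pattern("L*")
digits = expand_pattern("Nd")
```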
char-set:lower-case = property Lowercase
char-set:upper-case = property Uppercase
char-set:title-case = category Lt
char-set:letter = property Alphabetic
char-set:digit = category Nd
char-set:letter+digit = property Alphabetic + category Nd
char-set:graphic = category L* + category N* + category M* + category S* + category P*
char-set:printing = char-set:graphic + char-set:whitespace
char-set:whitespace = property White_Space
char-set:iso-control = 0000..001F + 007F..009F
char-set:punctuation = category P*
char-set:symbol = category S*
char-set:hex-digit = 0030..0039 + 0041..0046 + 0061..0066
char-set:blank = category Zs + 0009
char-set:ascii = 0000..007F
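As a rough illustration of how such definitions might be evaluated, the Python sketch below builds a few of the sets as unions of categories and explicit ranges. The helper build_set is hypothetical, and sample_categories is a tiny hand-written stand-in for the triples a real program would read from UnicodeData.txt, so the category-based results are deliberately incomplete.

```python
def build_set(category_ranges, categories=(), extra_ranges=()):
    """Union of codepoints in the given categories plus explicit ranges.

    category_ranges: iterable of (first, last, category) triples.
    categories: exact General Category codes to include.
    extra_ranges: iterable of inclusive (first, last) codepoint pairs.
    """
    wanted = set(categories)
    result = set()
    for first, last, category in category_ranges:
        if category in wanted:
            result.update(range(first, last + 1))
    for first, last in extra_ranges:
        result.update(range(first, last + 1))
    return result


# Assumed sample data only; a real run would cover all of UnicodeData.txt.
sample_categories = [
    (0x0020, 0x0020, "Zs"),  # SPACE
    (0x0021, 0x0021, "Po"),  # EXCLAMATION MARK
    (0x0041, 0x0041, "Lu"),  # LATIN CAPITAL LETTER A
    (0x00A0, 0x00A0, "Zs"),  # NO-BREAK SPACE
]

# char-set:blank = category Zs + 0009
blank = build_set(sample_categories, categories=["Zs"],
                  extra_ranges=[(0x0009, 0x0009)])

# char-set:hex-digit = 0030..0039 + 0041..0046 + 0061..0066
hex_digit = build_set([], extra_ranges=[(0x0030, 0x0039),
                                        (0x0041, 0x0046),
                                        (0x0061, 0x0066)])

# char-set:iso-control = 0000..001F + 007F..009F
iso_control = build_set([], extra_ranges=[(0x0000, 0x001F),
                                          (0x007F, 0x009F)])
```

Sets defined in terms of other sets, such as char-set:printing, then reduce to ordinary set union of two results.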