SRFI 175

Title

ASCII character library

Author

Lassi Kortela

Status

This SRFI is currently in final status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-175@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Received: 2019-09-15
Draft #1 published: 2019-09-18
Draft #2 published: 2019-09-19
Draft #3 published: 2019-09-22
Draft #4 published: 2019-09-27
Draft #5 published: 2019-11-30
Draft #6 published: 2019-12-09
Finalized: 2019-12-20

Abstract

This SRFI defines ASCII-only equivalents to many of the character procedures in standard Scheme plus a few extra ones. Recent Scheme standards are based around Unicode but the significant syntactic elements in many file formats and network protocols are all ASCII. Such low-level code can run faster and its behavior can be easier to understand when it uses ASCII primitives.

Rationale
Specification
Examples
Implementation
Acknowledgements

Rationale

Procedures dealing with character objects have been included in standard Scheme since R²RS (1985) with identical arguments and return values. The early Scheme reports did not mandate any particular character set, though in practice most (perhaps all) implementations used extended ASCII. R⁶RS (2007) was the first standard to strongly favor Unicode.

Unicode is a fine choice for high-level work, but is overkill for most low-level work dealing with file formats and network protocols. ASCII-only procedures are much simpler to implement and their behavior is much easier to understand than their Unicode equivalents. They have shorter code paths with fewer and simpler failure modes, and need no lookup tables.

Characters as integers

Scheme has a standard character data type which is very useful for disambiguating between characters and integers. However, code dealing with low-level binary formats typically uses byte ports and bytevectors whose elements are small, exact nonnegative integers. It is convenient to treat those integers as if they were characters (which they often represent, as most binary formats also contain strings of text). For this reason, the procedures in this SRFI taking character objects also accept integers in their place.
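
As a rough illustration of accepting either kind of argument, the following R7RS sketch reduces a character object or an integer to a codepoint before testing it. The names codepoint-of and my-ascii-digit? are hypothetical helpers invented for this example; they are not procedures defined by this SRFI.

```scheme
(import (scheme base))

;; Hypothetical helper (not part of this SRFI): reduce a character
;; object or an exact integer to its integer codepoint.
(define (codepoint-of x)
  (if (char? x) (char->integer x) x))

;; Hypothetical predicate built on the helper: does X represent an
;; ASCII decimal digit?
(define (my-ascii-digit? x)
  (<= #x30 (codepoint-of x) #x39))

(my-ascii-digit? #\7)   ; => #t
(my-ascii-digit? #x37)  ; => #t  (the same digit as a byte value)
(my-ascii-digit? #\a)   ; => #f
```

The second call is the case that comes up when the value was read from a bytevector or a byte port.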

This SRFI has been designed with the assumption that codepoints 0..127 correspond to ASCII in the Scheme implementation's native character datatype. We could not come up with any implementations where this is not the case. The only non-ASCII-superset character set we could think of is EBCDIC, which is fringe enough that it does not seem worth worrying about.

Procedure equivalence

The following table lists all procedures defined in this SRFI that have direct equivalents in the Scheme RⁿRS standards.

This SRFI	RⁿRS	Since
ascii-char?	char?	R²RS
ascii-string?	string?	R²RS
ascii-ci=?	char-ci=?	R²RS
ascii-ci<?	char-ci<?	R²RS
ascii-ci>?	char-ci>?	R²RS
ascii-ci<=?	char-ci<=?	R²RS
ascii-ci>=?	char-ci>=?	R²RS
ascii-string-ci=?	string-ci=?	R²RS
ascii-string-ci<?	string-ci<?	R²RS
ascii-string-ci>?	string-ci>?	R²RS
ascii-string-ci<=?	string-ci<=?	R²RS
ascii-string-ci>=?	string-ci>=?	R²RS
ascii-alphabetic?	char-alphabetic?	R²RS
ascii-numeric?	char-numeric?	R²RS
ascii-whitespace?	char-whitespace?	R²RS
ascii-upper-case?	char-upper-case?	R²RS
ascii-lower-case?	char-lower-case?	R²RS
ascii-upcase	char-upcase	R²RS
ascii-downcase	char-downcase	R²RS
ascii-digit-value	digit-value	R⁷RS*

*Note that the ascii-digit-value procedure takes a limit argument that the standard digit-value procedure does not take.

The standard Scheme character procedures listed above require their arguments to be character objects. The equivalents in this SRFI accept integers in addition to character objects. However, ascii-char?, like the standard char?, returns #t only for character objects; it returns #f for integers, even those in the ASCII range.

Capsule history of ASCII

The ASCII (American Standard Code for Information Interchange) character set is standardized by ANSI (American National Standards Institute). The present ASCII standard was first published in 1967. The organization was not yet called ANSI back then; its name was the United States of America Standards Institute (USASI).

Most computers now deal with 8-bit bytes, and ASCII is often thought of as an 8-bit character set. However, it is actually only 7-bit. The 8th bit was left unused because 8-bit hardware was not yet ubiquitous in the sixties. Through the decades many applications have used the 8th bit as a parity or flag bit.

Once international character sets were created, most of them took the 7-bit ASCII code as a basis. 8-bit character sets for alphabets generally took ASCII as the first half, using the other half for national letters as well as typographic elements and more control characters. Multi-byte character sets for complex writing systems are also generally based on ASCII but encoding them into 8-bit bytes is more complex. UTF-8, the dominant encoding of Unicode, is a multi-byte character encoding where 8-bit bytes using only the low 7 bits represent ASCII characters.

More complete histories of ASCII are available on Wikipedia and in numerous other places. Of particular interest is that these histories explain why the allocation of character codes is almost perfectly logical but not quite.

ASCII character table

#x00 NUL  #x10 DLE  #x20 SP #x30 0  #x40 @  #x50 P  #x60 `  #x70 p
#x01 SOH  #x11 DC1  #x21 !  #x31 1  #x41 A  #x51 Q  #x61 a  #x71 q
#x02 STX  #x12 DC2  #x22 "  #x32 2  #x42 B  #x52 R  #x62 b  #x72 r
#x03 ETX  #x13 DC3  #x23 #  #x33 3  #x43 C  #x53 S  #x63 c  #x73 s
#x04 EOT  #x14 DC4  #x24 $  #x34 4  #x44 D  #x54 T  #x64 d  #x74 t
#x05 ENQ  #x15 NAK  #x25 %  #x35 5  #x45 E  #x55 U  #x65 e  #x75 u
#x06 ACK  #x16 SYN  #x26 &  #x36 6  #x46 F  #x56 V  #x66 f  #x76 v
#x07 BEL  #x17 ETB  #x27 '  #x37 7  #x47 G  #x57 W  #x67 g  #x77 w
#x08 BS   #x18 CAN  #x28 (  #x38 8  #x48 H  #x58 X  #x68 h  #x78 x
#x09 HT   #x19 EM   #x29 )  #x39 9  #x49 I  #x59 Y  #x69 i  #x79 y
#x0a LF   #x1a SUB  #x2a *  #x3a :  #x4a J  #x5a Z  #x6a j  #x7a z
#x0b VT   #x1b ESC  #x2b +  #x3b ;  #x4b K  #x5b [  #x6b k  #x7b {
#x0c FF   #x1c FS   #x2c ,  #x3c <  #x4c L  #x5c \  #x6c l  #x7c |
#x0d CR   #x1d GS   #x2d -  #x3d =  #x4d M  #x5d ]  #x6d m  #x7d }
#x0e SO   #x1e RS   #x2e .  #x3e >  #x4e N  #x5e ^  #x6e n  #x7e ~
#x0f SI   #x1f US   #x2f /  #x3f ?  #x4f O  #x5f _  #x6f o  #x7f DEL

ASCII character classes

#x00..#x1f  control                #x20        space
#x21..#x2f  punctuation/symbol     #x30..#x39  digit
#x3a..#x40  punctuation/symbol     #x41..#x5a  upper-case
#x5b..#x60  punctuation/symbol     #x61..#x7a  lower-case
#x7b..#x7e  punctuation/symbol     #x7f        control
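
The class table can be read as a handful of range checks. The sketch below is one possible R7RS rendering; ascii-class is a hypothetical helper made up for this illustration, and the class names it returns are simply the labels used in the table.

```scheme
(import (scheme base))

;; Hypothetical helper (not part of this SRFI): return the class of
;; the integer codepoint CC as a symbol, or #f if CC is outside ASCII.
(define (ascii-class cc)
  (cond ((not (<= #x00 cc #x7f)) #f)
        ((or (<= cc #x1f) (= cc #x7f)) 'control)
        ((= cc #x20) 'space)
        ((<= #x30 cc #x39) 'digit)
        ((<= #x41 cc #x5a) 'upper-case)
        ((<= #x61 cc #x7a) 'lower-case)
        ;; Everything left in #x21..#x7e is punctuation or a symbol.
        (else 'punctuation/symbol)))

(ascii-class #x41)  ; => upper-case
(ascii-class #x2f)  ; => punctuation/symbol
(ascii-class #x7f)  ; => control
(ascii-class #x80)  ; => #f
```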

Terminological problems

Graphic, printable and control characters

Intuitively, a graphic character is supposed to be any character that would cause a printer to plot ink on paper. A printable character (or printing character) would be any character that takes up space (whether or not it plots ink). Therefore graphic characters are letters, numbers, punctuation, symbols, etc. Printable characters are the same plus some subset of whitespace characters.

The practical interpretation is not that simple. Depending on who you ask, ASCII space counts as a graphic character, a control character, or neither.

Common Lisp has a standard graphic-char-p predicate that counts space as graphic. The standard also says newline is not a graphic character. It is silent on tab and other whitespace characters, but several implementations say those are not graphic either. Common Lisp does not have a separate notion of printable or control characters, but does talk about non-graphic characters.

Python's str.isprintable() predicate considers space printable but not any other ASCII whitespace. Python does not have a separate notion of graphic characters.

The C iscntrl(), isgraph() and isprint() predicates regard space as a printable character but not a graphic or control character. All other ASCII whitespace characters are regarded as control characters but not graphic or printable.

The distinction between graphic and printable characters is confusing to laypeople. Most of the world seems to want a predicate that checks for graphic characters as well as space, and within ASCII that set of characters is exactly the complement of the set of control characters. This SRFI therefore specifies an ascii-non-control? predicate as the least ambiguous choice. The predicate is not a simple complement of ascii-control?: that complement would include non-ASCII characters, whereas ascii-non-control? excludes them.
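
The difference between "not a control character" and "an ASCII non-control character" can be seen in a small R7RS sketch. The my- prefixed procedures are stand-ins written for this example, based on the class table; they are not this SRFI's sample implementation.

```scheme
(import (scheme base))

;; Hypothetical helper: character object or integer -> codepoint.
(define (codepoint-of x)
  (if (char? x) (char->integer x) x))

;; Stand-in for ascii-control?: #x00..#x1f and #x7f.
(define (my-ascii-control? x)
  (let ((cc (codepoint-of x)))
    (or (<= #x00 cc #x1f) (= cc #x7f))))

;; Stand-in for ascii-non-control?: #x20..#x7e only.
(define (my-ascii-non-control? x)
  (<= #x20 (codepoint-of x) #x7e))

(my-ascii-non-control? #\space)   ; => #t
(not (my-ascii-control? #xe9))    ; => #t  (#xe9 is not a control code...)
(my-ascii-non-control? #xe9)      ; => #f  (...but it is not ASCII either)
```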

Punctuation and symbol characters

The C standard library's ispunct() predicate considers all non-alphanumeric, non-whitespace ASCII graphic characters to be punctuation. However, Unicode makes a distinction between punctuation and symbol characters. The distinction is roughly that punctuation belongs to a given script whereas symbols are script-independent. Since this is esoteric to laypeople whereas punctuation is ambiguous to Unicode experts, this SRFI avoids both terms and opts for an ascii-other-graphic? predicate.

Horizontal whitespace

ASCII has only two horizontal whitespace characters: space and tab. This SRFI has an ascii-space-or-tab? predicate. While the name is somewhat clumsy, ascii-horizontal-whitespace? would be too verbose.

Letter and number transformations

Many letter and number tasks are naturally expressed by treating decimal digits and the Latin alphabet as integer ranges. Recall that characters themselves are just integer codes under the hood.

Hence by adding a (positive or negative) integer offset we can:

Map letters or digits to numeric values, and vice versa.
Map upper-case letters to lower-case letters and vice versa.
Map digits to letters and vice versa.

Converting letters from upper-case to lower-case or vice versa is a simple matter of checking whether a letter is in the opposite case, and if so, offsetting it onto the case we want.

Converting digits to numbers is a matter of checking that a character is in the ASCII digit range and then offsetting it to map it onto the integers 0..9. Vice versa for numbers to ASCII digits.

We can use only a part of the letter or digit range by specifying a limit. For example, to use the letters abcdef or ABCDEF for hex digits, we’d use a limit of 6 on the upper-case or lower-case range.

For tasks that mix letters and digits, or upper-case and lower-case letters, we have to chain multiple transforms together. Each transform checks the source character to find out whether it matches. If it does, the transformation is performed. Otherwise the job is deferred to the next transformation. In the case of hex conversion, we’d first check whether a character matches the ASCII digit range, and if not, defer to a 6-limited letter range.
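
The chaining described above might be sketched as follows in R7RS Scheme. The helper range-value and the procedure hex-value are hypothetical names invented for this illustration; the combinators this SRFI actually provides are specified in the Transformation procedures section.

```scheme
(import (scheme base))

;; Hypothetical helper: character object or integer -> codepoint.
(define (codepoint-of x)
  (if (char? x) (char->integer x) x))

;; Hypothetical transform: if X falls within the first LIMIT codes
;; starting at BASE, return OFFSET plus its position in that range.
;; Otherwise return #f so the caller can defer to the next transform.
(define (range-value x base offset limit)
  (let ((pos (- (codepoint-of x) base)))
    (and (<= 0 pos) (< pos limit) (+ offset pos))))

;; Hex digit value: try the digit range first, then the 6-limited
;; upper-case and lower-case letter ranges.
(define (hex-value x)
  (or (range-value x #x30 0 10)    ; 0..9 -> 0..9
      (range-value x #x41 10 6)    ; A..F -> 10..15
      (range-value x #x61 10 6)))  ; a..f -> 10..15

(hex-value #\7)  ; => 7
(hex-value #\c)  ; => 12
(hex-value #\F)  ; => 15
(hex-value #\g)  ; => #f
```

Because 0 counts as true in Scheme, the or chain stops correctly at the digit transform even for the character 0.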

To map letters to other letters, it is advantageous to treat the alphabet as a circular range that repeats infinitely in both directions. We can easily perform letter rotations by adding an arbitrary offset and taking the result modulo 26 (the count of letters in the alphabet).
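
A letter rotation of this kind could look like the R7RS sketch below. The procedure rotate-lower-case is a hypothetical helper written for this example and handles only lower-case letters; it is not one of this SRFI's combinators.

```scheme
(import (scheme base))

;; Hypothetical helper: rotate a lower-case ASCII letter by N places
;; (N may be negative), wrapping around the 26-letter alphabet.
;; Any other character is returned unchanged.
(define (rotate-lower-case char n)
  (let ((pos (- (char->integer char) #x61)))
    (if (<= 0 pos 25)
        (integer->char (+ #x61 (modulo (+ pos n) 26)))
        char)))

(rotate-lower-case #\a 13)  ; => #\n
(rotate-lower-case #\z 1)   ; => #\a
(rotate-lower-case #\a -1)  ; => #\z

;; ROT13 over a whole string.
(string-map (lambda (c) (rotate-lower-case c 13)) "hello")  ; => "uryyb"
```

The standard modulo procedure returns a result with the sign of the divisor, which is what makes negative offsets wrap around correctly.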

This SRFI wraps the above transformations into reusable combinators. They are specified in the Transformation procedures section. Since there are countless minor variations on real-world transformation tasks such as number parsing, this SRFI doesn’t provide any ready-made parsing procedures. Instead, the combinators have been designed with the goal of making it easy to roll your own. The Examples section will get you started.

To recap the above, each transform:

selects a particular letter or digit range
limits that range
tests whether the source character matches the (limited) range
takes the character’s position in the range and offsets it if it matched
defers to the next transform (if any) if the character did not match

The combinators ascii-upper-case-value and ascii-lower-case-value each do all of the above jobs. The ascii-digit-value combinator does all of them except offsetting, since that is less useful for digits than letters.

The combinators ascii-nth-upper-case and ascii-nth-lower-case do the opposite conversion from numeric values to characters, also handling alphabet rotations. The ascii-nth-digit combinator does not do rotations, since once again those are less useful on digits.
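
The opposite direction, from numeric values back to characters, might be sketched as follows. Both procedures are hypothetical stand-ins written for this illustration; in particular, returning #f for a digit value outside 0..9 is an assumption of this sketch.

```scheme
(import (scheme base))

;; Hypothetical stand-in: the N-th decimal digit character, with no
;; rotation. Returns #f when N is not in 0..9 (an assumption here).
(define (sketch-nth-digit n)
  (and (<= 0 n 9) (integer->char (+ #x30 n))))

;; Hypothetical stand-in: the N-th upper-case letter, treating the
;; alphabet as a circular range so that any integer N is acceptable.
(define (sketch-nth-upper-case n)
  (integer->char (+ #x41 (modulo n 26))))

(sketch-nth-digit 7)         ; => #\7
(sketch-nth-digit 10)        ; => #f
(sketch-nth-upper-case 0)    ; => #\A
(sketch-nth-upper-case 27)   ; => #\B
(sketch-nth-upper-case -1)   ; => #\Z
```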

Specification

ASCII and non-ASCII arguments

Callers may freely pass ASCII as well as non-ASCII characters to all procedures defined in this SRFI. The specification is written such that the result is well-defined in both cases.

Numerical limits

Let the char-fix range be an implementation-defined range of exact integer values such that:

The minimum char-fix value is at least as small as the minimum fixnum value.
The maximum char-fix value is at least as large as the maximum fixnum value or the maximum possible return value of char->integer (whichever is larger).

For every procedure in this SRFI:

Any argument named char or char1 or char2 is either a character object or an exact integer in the char-fix range.
Any argument named offset or limit or n is an exact integer in the char-fix range.
If the procedure takes both offset and limit arguments, then it is an error for the caller to pass values such that offset + limit - 1 lies outside the char-fix range.

Hence in a Scheme implementation where all character codepoints fit in a fixnum, the char-fix range can be identical to the fixnum range and this SRFI can be implemented using fast fixnum math. In particular, R⁶RS supplies standard fixnum procedures with the fx prefix. In a Scheme implementation where some codepoints are bigger than a fixnum, generic math has to be used.

Predicates to test for ASCII vs non-ASCII objects

(ascii-codepoint? obj)

Returns #t if obj is an exact integer in the inclusive range #x00..#x7f. Else returns #f.

(ascii-bytevector? obj)

Returns #t if obj is a bytevector and contains no byte value outside the inclusive range #x00..#x7f. Else returns #f.

A zero-length bytevector is considered an ASCII bytevector.

(ascii-char? obj)

Returns #t if obj is a character object whose codepoint lies in the inclusive range #x00..#x7f. Else returns #f.

(ascii-string? obj)

Returns #t if obj is a string and contains no character with a codepoint outside the inclusive range #x00..#x7f. Else returns #f.

A zero-length string is considered an ASCII string.
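
The four predicates above can be approximated in a few lines of R7RS Scheme. This is only a sketch of the stated behavior; the my- prefixed names are hypothetical and the code is not this SRFI's sample implementation.

```scheme
(import (scheme base))

;; Exact integers in #x00..#x7f only; character objects are rejected.
(define (my-ascii-codepoint? obj)
  (and (exact-integer? obj) (<= 0 obj #x7f)))

;; Character objects only; integers are rejected.
(define (my-ascii-char? obj)
  (and (char? obj) (< (char->integer obj) #x80)))

;; Every byte must be below #x80; an empty bytevector passes.
(define (my-ascii-bytevector? obj)
  (and (bytevector? obj)
       (let loop ((i 0))
         (or (= i (bytevector-length obj))
             (and (< (bytevector-u8-ref obj i) #x80)
                  (loop (+ i 1)))))))

;; Every character must be ASCII; an empty string passes.
(define (my-ascii-string? obj)
  (and (string? obj)
       (let loop ((i 0))
         (or (= i (string-length obj))
             (and (my-ascii-char? (string-ref obj i))
                  (loop (+ i 1)))))))

(my-ascii-codepoint? 65)                    ; => #t
(my-ascii-codepoint? #\A)                   ; => #f
(my-ascii-char? #\A)                        ; => #t
(my-ascii-char? 65)                         ; => #f
(my-ascii-bytevector? (bytevector 72 200))  ; => #f
(my-ascii-string? "")                       ; => #t
```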

Predicates to test for subsets of ASCII

(ascii-control? char)

Returns #t if char represents an ASCII character in the control class. Else returns #f.

Note that carriage return, line feed and tab are control characters but space is not.

(ascii-non-control? char)

Returns #t if char represents an ASCII character that is not in the control class. Else returns #f.

The point is that these characters are safe to write to a device that may not be able to sensibly interpret control characters or non-ASCII characters.

(ascii-space-or-tab? char)

Returns #t if char represents an ASCII character with the integer value #x09 (tab) or #x20 (space). Else returns #f.

The point is that space and tab are very often useful to distinguish from other whitespace characters, notably newlines.

(ascii-other-graphic? char)

Returns #t if char represents an ASCII character in the punctuation/symbol class. Else returns #f.

(ascii-alphanumeric? char)

Returns #t if char represents an ASCII character in the upper-case or lower-case or digit class. Else returns #f.

Subset predicates with standard Scheme equivalents

(ascii-alphabetic? char)

Returns #t if char represents an ASCII character in the upper-case or lower-case class. Else returns #f.

(ascii-numeric? char)

Returns #t if char represents an ASCII character in the digit class. Else returns #f.

(ascii-whitespace? char)

Returns #t if char represents an ASCII character with the integer value #x09 (tab) or #x0a (line feed) or #x0b (vertical tab) or #x0c (form feed) or #x0d (carriage return) or #x20 (space). Else returns #f.

Notice how the other whitespace characters form a contiguous range of control characters, but space stands alone as a separate non-control character.
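
That observation means a whitespace test needs only one range check plus one equality check. A minimal R7RS sketch, using the hypothetical name my-ascii-whitespace?:

```scheme
(import (scheme base))

;; Hypothetical stand-in for ascii-whitespace?: the contiguous control
;; range #x09..#x0d (tab through carriage return), plus space.
(define (my-ascii-whitespace? x)
  (let ((cc (if (char? x) (char->integer x) x)))
    (or (<= #x09 cc #x0d) (= cc #x20))))

(my-ascii-whitespace? #\tab)    ; => #t
(my-ascii-whitespace? #x0c)     ; => #t  (form feed)
(my-ascii-whitespace? #\space)  ; => #t
(my-ascii-whitespace? #\a)      ; => #f
```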

(ascii-upper-case? char)

Returns #t if char represents an ASCII character in the upper-case class. Else returns #f.

(ascii-lower-case? char)

Returns #t if char represents an ASCII character in the lower-case class. Else returns #f.

Case-insensitive character comparison procedures

(ascii-ci=? char1 char2)

(ascii-ci<? char1 char2)

(ascii-ci>? char1 char2)

(ascii-ci<=? char1 char2)

(ascii-ci>=? char1 char2)

These procedures test whether the codepoint of char1 is equal to, less than, greater than, less than or equal to, or greater than or equal to the codepoint of char2.

The comparison is case-insensitive. Specifically, ASCII upper-case letters are converted to their lower-case equivalents before the codepoints are compared. Mapping upper-case to lower-case matches the standard Unicode case-folding algorithm. The direction of folding is important when comparing a letter and a non-letter to find out which is less than the other. These procedures do not apply any case-folding to non-ASCII characters.

Note that char1 and char2 do not need to be of the same type. It is permitted for one of them to be a character object and the other to be an integer.

For case-sensitive comparison, the standard character comparison procedures char=? etc. as well as the standard number and fixnum comparison procedures =, fx= etc. work fine for ASCII; hence this SRFI does not provide case-sensitive equivalents.
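
To see why the direction of folding matters, consider the following R7RS sketch. The names fold-to-lower, my-ascii-ci=? and my-ascii-ci<? are hypothetical and only approximate the behavior described above.

```scheme
(import (scheme base))

;; Hypothetical helper: reduce a character object or integer to a
;; codepoint, mapping ASCII upper-case letters to lower-case.
;; Everything else, including non-ASCII, is left alone.
(define (fold-to-lower x)
  (let ((cc (if (char? x) (char->integer x) x)))
    (if (<= #x41 cc #x5a) (+ cc #x20) cc)))

(define (my-ascii-ci=? a b) (= (fold-to-lower a) (fold-to-lower b)))
(define (my-ascii-ci<? a b) (< (fold-to-lower a) (fold-to-lower b)))

(my-ascii-ci=? #\A #x61)  ; => #t  (character object vs. integer)
(my-ascii-ci<? #\Z #\_)   ; => #f  (Z folds to z, #x7a, above _ at #x5f)
(my-ascii-ci<? #\_ #\Z)   ; => #t
```

Had the letters been folded to upper-case instead, Z (#x5a) would sort below the underscore and the last two results would be swapped.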

Case-insensitive string comparison procedures

(ascii-string-ci=? string1 string2)

(ascii-string-ci<? string1 string2)

(ascii-string-ci>? string1 string2)

(ascii-string-ci<=? string1 string2)

(ascii-string-ci>=? string1 string2)

These procedures test whether string1 is equal to, less than, greater than, less than or equal to, or greater than or equal to string2.

Each pair of corresponding characters (those at the same index in string1 and string2) is compared as with ascii-ci=?, ascii-ci<?, etc. Comparison stops when either string ends, or when an unequal pair of characters is found. If the two strings are of different lengths, and their characters are equal all the way up to the length of the shorter string, then the shorter string is considered less than the longer one. A zero-length string is considered less than a non-zero-length string. Two zero-length strings are considered equal.

For case-sensitive comparison, the standard string=? etc. work fine for ASCII; hence this SRFI does not provide case-sensitive equivalents.
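
The comparison rules above could be sketched in R7RS Scheme with a single three-way comparison from which the individual predicates are derived. The three-way procedure and all names here are hypothetical; this SRFI itself specifies only the five predicates.

```scheme
(import (scheme base))

;; Hypothetical helper: codepoint of C with ASCII upper-case letters
;; folded to lower-case.
(define (fold-char c)
  (let ((cc (char->integer c)))
    (if (<= #x41 cc #x5a) (+ cc #x20) cc)))

;; Hypothetical three-way comparison: returns -1, 0 or 1.
(define (ascii-string-ci-compare s1 s2)
  (let ((n1 (string-length s1))
        (n2 (string-length s2)))
    (let loop ((i 0))
      (cond ((and (= i n1) (= i n2)) 0)  ; both ended: equal
            ((= i n1) -1)                ; s1 is a proper prefix of s2
            ((= i n2) 1)                 ; s2 is a proper prefix of s1
            (else
             (let ((a (fold-char (string-ref s1 i)))
                   (b (fold-char (string-ref s2 i))))
               (cond ((< a b) -1)
                     ((> a b) 1)
                     (else (loop (+ i 1))))))))))

(define (my-ascii-string-ci=? s1 s2) (= (ascii-string-ci-compare s1 s2) 0))
(define (my-ascii-string-ci<? s1 s2) (< (ascii-string-ci-compare s1 s2) 0))

(my-ascii-string-ci=? "Hello" "hELLO")  ; => #t
(my-ascii-string-ci<? "abc" "ABCD")     ; => #t  (shorter string is less)
(my-ascii-string-ci<? "" "a")           ; => #t
(my-ascii-string-ci=? "" "")            ; => #t
```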

Case conversion procedures

(ascii-upcase char)

If char represents an ASCII character in the lower-case class, returns the same letter from the upper-case class. Else returns char unchanged.
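
In terms of the character table, upper-casing amounts to subtracting #x20 from codes in the lower-case range. A minimal R7RS sketch follows; my-ascii-upcase is a hypothetical stand-in, and returning an integer for an integer argument (and a character object for a character argument) is an assumption of this sketch.

```scheme
(import (scheme base))

;; Hypothetical stand-in for ascii-upcase. Assumption: the result has
;; the same type (character object or integer) as the argument.
(define (my-ascii-upcase x)
  (if (char? x)
      (integer->char (my-ascii-upcase (char->integer x)))
      (if (<= #x61 x #x7a) (- x #x20) x)))

(my-ascii-upcase #\a)   ; => #\A
(my-ascii-upcase #x61)  ; => 65
(my-ascii-upcase #\1)   ; => #\1  (not a letter, unchanged)
(my-ascii-upcase #xe9)  ; => 233  (not ASCII, unchanged)
```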