Title

Immutable Texts

Author

William D Clinger

Status

This SRFI is currently in final status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-135@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Received: 2016-06-06
60-day deadline: 2016-08-05
Draft #1 published: 2016-06-06
Draft #2 published: 2016-06-11
Draft #3 published: 2016-06-17
Draft #4 published: 2016-07-09
Finalized: 2016-09-06
Revised to fix errata:
- 2024-09-02 (Revised to fix missing argument in examples of textual-fold and textual-fold-right.)

Abstract

In Scheme, strings are a mutable data type. Although it "is an error" (R5RS and R7RS) to use string-set! on literal strings or on strings returned by symbol->string, and any attempt to do so "should raise an exception" (R6RS), all other strings are mutable.

Although many mutable strings are never actually mutated, the mere possibility of mutation complicates specifications of libraries that use strings, encourages precautionary copying of strings, and precludes structure sharing that could otherwise be used to make procedures such as substring and string-append faster and more space-efficient.

This SRFI specifies a new data type of immutable texts. It comes with efficient and portable sample implementations that guarantee O(1) indexing for both sequential and random access, even in systems whose string-ref procedure takes linear time.

The operations of this new data type include analogues for all of the non-mutating operations on strings specified by the R7RS and most of those specified by SRFI 130, but the immutability of texts and uniformity of character-based indexing simplify the specification of those operations while avoiding several inefficiencies associated with the mutability of Scheme's strings.

Issues

None.

Procedure Index

Here is a list of the procedures provided by this SRFI:

Predicates

text?                 textual?
textual-null? 
textual-every         textual-any

Constructors

make-text             text
text-tabulate
text-unfold           text-unfold-right

Conversion

textual->text
textual->string       textual->vector         textual->list
string->text          vector->text            list->text
reverse-list->text
textual->utf8         textual->utf16be
textual->utf16        textual->utf16le
utf8->text            utf16be->text
utf16->text           utf16le->text

Selection

text-length           textual-length
text-ref              textual-ref
subtext               subtextual
textual-copy
textual-take          textual-take-right
textual-drop          textual-drop-right
textual-pad           textual-pad-right 
textual-trim          textual-trim-right      textual-trim-both

Replacement

textual-replace

Comparison

textual=?             textual-ci=?
textual<?             textual-ci<?
textual>?             textual-ci>?
textual<=?            textual-ci<=?
textual>=?            textual-ci>=?

Prefixes & suffixes

textual-prefix-length textual-suffix-length
textual-prefix?       textual-suffix?

Searching

textual-index         textual-index-right
textual-skip          textual-skip-right
textual-contains      textual-contains-right

Case conversion

textual-upcase        textual-downcase
textual-foldcase      textual-titlecase

Concatenation

textual-append        textual-concatenate     textual-concatenate-reverse
textual-join

Fold & map & friends

textual-fold          textual-fold-right
textual-map           textual-for-each
textual-map-index     textual-for-each-index
textual-count
textual-filter        textual-remove

Replication & splitting

textual-replicate     textual-split

Rationale

The R6RS Rationale identified problems created by the mutability of strings, and several more problems were mentioned by SRFI 130 or came up during its discussion period:

Mutability complicates the specification of higher-order procedures that operate on strings.
Mutability inhibits several compiler optimizations, including common subexpression elimination.
Mutability complicates reasoning about programs that use strings.
Mutations invalidate the string cursors of SRFI 130.
Using a SRFI 130 string cursor that has been invalidated by mutation is an error, but that error is likely to go undetected, making programs harder to test and to debug.
Mutations can be expensive if strings are represented as encapsulated UTF-8 or UTF-16.
Although representations based on UTF-32 provide fast referencing as well as fast mutation, they occupy more space than representations based on UTF-8 or UTF-16.
Mutations preclude sharing of substructure that could save space while making substring and string-append run faster.

Recognizing the first three of these problems, while acknowledging that removing mutable strings from the language would cause "significant compatibility problems for existing code" [R6RS-Rationale], the R6RS standard banished string-set! and string-fill! to a separate (rnrs mutable-strings) library in hope of discouraging and/or deprecating mutation of strings.

The R7RS restored string-set! and string-fill! to the (scheme base) library and added a new mutator, string-copy!. Waiting for some revised standard to make strings immutable is not viable.

We can, however, add a new data type of immutable texts capable of replacing Scheme's string data type for all applications that do not require mutation. Immutable texts do away with the problems listed above while offering these advantages over mutable strings:

space efficiency approaching that of UTF-8 or UTF-16
faster sequential access (if strings use UTF-8 or UTF-16)
faster random access (if strings use UTF-8 or UTF-16)
fast extraction of subtexts
faster concatenation of texts

SRFI 130 aims for the second of those advantages, but cannot achieve that advantage in any portable implementation of its procedures. Furthermore SRFI 130 introduces a new data type (cursors) that is hard to use correctly (because its error situations are likely to go undetected, partly because many of its operations accept both cursors and character indexes, which are allowed but not required to be the same things).

This SRFI offers all five advantages, at the expense of introducing a new data type (immutable texts) that can be implemented portably and efficiently and is easy to use correctly (partly because most of its error situations are detected by its sample implementations, and partly because its character indexes are the same as those used by Scheme's mutable strings).

Specification

Basic concepts

Name of library

The procedures specified by this SRFI are exported by the (srfi 135) library. In R6RS systems that do not yet support R7RS library names, the name of this library is (srfi :135).

It is recommended, but not required, that this library also be made available under the alternative name (srfi 135 texts). That alternative library should export exactly the same bindings as the (srfi 135) library, so libraries and programs can import both libraries without name conflicts.

Conceptual model

Immutable texts are like strings except they can't be mutated.

Immutability makes it easier to use space-efficient representations such as UTF-8 and UTF-16 without incurring the cost of scanning from the beginning when character indexes are used (as with string-ref).

When mutation is not needed, immutable texts are likely to be more efficient than strings with respect to space or time. In some implementations, immutable texts may be more efficient than strings with respect to both space and time.

Subtypes

This SRFI defines two new types:

text is a type consisting of the immutable texts for which text? returns true.
textual is a union type consisting of the texts and strings for which textual? returns true.

The subtypes of the new textual type include the new text type and Scheme's traditional string type, which consists of the values for which string? returns true. The string type includes both mutable strings and the (conceptually) immutable strings that are the values of string literals and calls to symbol->string.

Implementations of this SRFI are free to extend the textual type by adding new subtypes, provided all procedures whose names begin with textual- are extended to accept values of those new subtypes. Implementations of this SRFI should not extend the text type unless its extended values are immutable, are accepted as texts by all procedures of this SRFI (including the text? predicate), and achieve the performance required by this SRFI with respect to both time and space.

External representation

This SRFI does not require any particular external representation for immutable texts, but recommends immutable texts have almost the same external representation as strings, substituting Unicode's left-pointing and right-pointing double angle quotation marks (« and », code points #xab and #xbb) for the double quotes that delimit strings, and allowing those double angle quotation marks to be escaped within the external representations of both texts and strings. That external representation is used by this SRFI's examples.

When feasible, implementations of this SRFI should also consider:

extending the equal? procedure to regard two immutable texts t1 and t2 as equal if and only if (textual=? t1 t2), while regarding an immutable text as unequal to anything that isn't an immutable text.
extending the display procedure to accept immutable texts, treating them the same as strings;
extending the write procedure to generate the external syntax recommended for immutable texts;
extending the read procedure to accept the external syntax recommended for immutable texts;
extending interpreters and compilers to accept quoted literals expressed using the external syntax recommended for immutable texts; R7RS section 4.1.2 mandates this extension if read is extended to accept the external syntax for texts.

Note: Those extensions cannot be implemented portably, so portable code should not rely on them.

Textual input and output ports

Textual input and output ports analogous to string input and output ports would be nice, but they too cannot be implemented portably. Leaving them for another SRFI allows all of this SRFI to be implemented portably with reasonable efficiency.

Shared storage

All strings and other mutable objects returned by the procedures specified within this SRFI are newly allocated and may be mutated with abandon.

No externally visible string ever shares storage with any text. All strings and other mutable objects passed to the procedures specified within this SRFI may be mutated without affecting the behavior of any text.

The immutability of texts allows sharing of substructure, so subtext, textual-append, and similar operations can be faster and more space-efficient than Scheme's substring and string-append operations.

Although distinct texts may share storage internally, this is undetectable because texts are immutable and the procedures that operate on texts do not directly expose any of their internal components.

Implementations that share storage between texts must satisfy the following requirement: There is some reasonably small fixed bound on the ratio of storage used by the shared representation divided by the storage that would be used by an unshared representation.

Example: For the sample implementations with their default configurations, the worst case arises with UTF-8, when a 1-character ASCII text retains up to 127 characters of a text that is no longer reachable, and all 127 of those retained characters lie outside Unicode's Basic Multilingual Plane (BMP). Making reasonable assumptions about the representations of records, vectors, bytevectors, and strings on a 64-bit machine, that shared text would occupy no more than about 16 times the space occupied by an unshared representation. If the retained characters were in the BMP, the shared text would occupy no more than about 8 times the space occupied by an unshared representation. If the retained characters were ASCII, the shared text would occupy no more than about 4 times the space occupied by an unshared representation. The sample implementations can be configured to reduce those worst-case bounds, most obviously by reducing the maximum number of characters that can be shared with a very short text.

Naming conventions

The procedures of this SRFI follow a consistent naming scheme, and are consistent with the conventions developed in SRFI 1 and used in SRFI 13 and SRFI 130. Indeed, most of the names specified here were derived from SRFI 130's names by replacing string with text or textual. As in SRFI 130, procedures that have left/right directional variants use no suffix to specify left-to-right operation, -right to specify right-to-left operation, and -both to specify both.

Note, however, that textual-index, textual-index-right, textual-skip, and textual-skip-right, return #f when no match is found. In SRFI 130, their analogues always return cursors.

The order of common arguments is consistent across the different procedures.

For convenience, most procedures that accept a text as argument will also accept a string. When given a string, those procedures behave as though the string is first converted to a text, so passing a text is likely to be more efficient than passing a string.

Performance requirements

A few procedures are required to execute in O(1) time: text?, textual?, text-length, and text-ref.

If the first two arguments passed to textual-contains and textual-contains-right are texts, then those procedures must run in O(m n) time, where m and n are the lengths of the two subtexts specified by their arguments. If either of the first two arguments is a string, there is no such requirement.

The other procedures specified by this SRFI should run in amortized linear time, not counting time spent in procedures and predicates that were passed as arguments. That is not an absolute requirement, but the sample implementations are designed to deliver that level of performance for most procedures provided none of their textual arguments are strings. When strings are passed as arguments, the running time is unlikely to be linear unless string-ref runs in constant time, and that is not required by any of the Scheme standards.

Indeed, this SRFI was designed to make efficient text processing easier in systems whose string-ref procedure does not run in constant time. For efficiency, portable code should use strings only for fairly short sequences of characters. Representations with guaranteed efficiency (such as the immutable texts of this SRFI) should be used for longer texts.

Note: A procedure that runs in O(1) time does not necessarily take the same time for all inputs. Furthermore O(1) = O(1000), so procedures that run in O(1) time can still be quite slow. The text-ref procedure, for example, may have worst cases for which it is hundreds of times slower than text?. Even the average case for text-ref is likely to be several times as slow as the worst case for text?.

Unicode

During the early development of Unicode, its designers believed a 16-bit character set would suffice, which is why Java's char type has only 16 bits. When Unicode expanded to 1114112 code points, 16 bits were no longer enough to encode all Unicode characters.

The Unicode standard defines three encoding forms for arbitrary sequences of Unicode characters:

UTF-32: is a fixed-width encoding in which every character is represented by a straightforward 32-bit representation of its code point.
UTF-16: is a variable-width encoding in which the most common characters are represented by 16-bit representations of their code points, but characters outside the Basic Multilingual Plane (BMP) are represented by a surrogate pair consisting of two consecutive 16-bit code units.
UTF-8: is a variable-width encoding in which ASCII characters are represented by 8-bit representations of their code points, but other characters are encoded by a sequence of two, three, or four 8-bit code units.

UTF-32 is a convenient internal representation and is used as such by several string libraries for C, C++, and Python, but it is the least compact of the three representations and is seldom used in files. UTF-16 is convenient for applications that use only the BMP, and supports fast sequential processing of arbitrary Unicode; variants of UTF-16 are used by Windows for files and by Java and C# as an internal representation. UTF-8 is upwardly compatible with the ASCII encoding and supports fast sequential processing of arbitrary Unicode; it is widely used for files on non-Windows machines and is also used by some C libraries.

The Scheme programming language does not expose the internal representation of strings. Some implementations of Scheme use UTF-32 or a similar encoding, which makes string-length, string-ref, and string-set! run in O(1) time. Some implementations of Scheme use UTF-16, which saves space at the expense of making string-ref take time proportional to the length of a string. Some implementations of Scheme use UTF-8, which saves even more space for ASCII strings while making string-ref run in linear time.

Although Scheme's string data type allows portable code to use strings independently of their internal representation, the variation in performance between implementations has created a problem for programs that use long strings. In some systems, long strings are inefficient with respect to space; in other systems, long strings are inefficient with respect to time.

The portable solution to this dilemma is to use Scheme's mutable strings only for buffers and other relatively short sequences of characters, while using the immutable texts defined by this SRFI for long sequences of characters.

Note: SRFI 130 suggests an alternative solution: Portable code should process strings sequentially using cursors instead of indexes, and should avoid mutation of strings by using vectors of characters instead, while hoping all major implementations of Scheme will soon convert their strings to use compact internal representations such as UTF-8 or UTF-16. That hope is unlikely to be realized, because a lot of legacy code assumes string-ref runs in O(1) time, as recommended by the R6RS, and mutable strings represented in UTF-32 or similar are more efficient than vectors of characters with respect to both time and space. At present, several implementations of Scheme support Unicode while providing string-ref and string-set! procedures that run in O(1) time; making those operations run asymptotically slower would displease some users of those systems.

Notation

In the following procedure specifications:

A text argument is an immutable text.
A textual argument is an immutable text or a string.
A char argument is a character.
An idx argument is an exact non-negative integer specifying a valid character index into a text or string. The valid character indexes of a text or string textual of length n are the exact integers idx satisfying 0 <= idx < n.
A k argument or result is a position: an exact non-negative integer that is either a valid character index for one of the textual arguments or is the length of a textual argument.
start and end arguments are positions specifying a half-open interval of indexes for a subtext or substring. When omitted, start defaults to 0 and end to the length of the corresponding textual argument. It is an error unless 0 <= start <= end <= (textual-length textual); the sample implementations detect that error and raise an exception.
A len or nchars argument is an exact non-negative integer specifying some number of characters, usually the length of a text or string.
A pred argument is a unary character predicate, taking a character as its one argument and returning a value that will be interpreted as true or false. Unless noted otherwise, as with textual-every and textual-any, all predicates passed to procedures specified in this SRFI may be called in any order and any number of times. It is an error if pred has side effects or does not behave functionally (returning the same result whenever it is called with the same character); the sample implementations do not detect those errors.
An obj argument may be any value at all.

It is an error to pass values that violate the specification above.

Arguments given in square brackets are optional. Unless otherwise noted in the text describing the procedure, any prefix of these optional arguments may be supplied, from zero arguments to the full list. When a procedure returns multiple values, this is shown by listing the return values in square brackets, as well. So, for example, the procedure with signature

halts? f [x init-store] → [boolean integer]

would take one (f), two (f, x) or three (f, x, init-store) input arguments, and return two values, a boolean and an integer.

An argument followed by "..." means zero or more elements. So the procedure with the signature

sum-squares x ...  → number

takes zero or more arguments (x ...), while the procedure with signature

spell-check doc dict₁ dict₂ ... → string-list

takes two required arguments (doc and dict₁) and zero or more optional arguments (dict₂ ...).

If a procedure's return value is said to be "unspecified," the procedure returns a single result whose value is unconstrained and might even vary from call to call.

Procedures

Predicates

text? obj → boolean

Is obj an immutable text? In particular, (text? obj) returns false if (string? obj) returns true, which implies string? returns false if text? returns true. Must execute in O(1) time.

textual? obj → boolean

Returns true if and only if obj is an immutable text or a string. Must execute in O(1) time.

textual-null? textual → boolean

Is textual the empty text or the empty string? Must execute in O(1) time.

textual-every pred textual [start end] → value

textual-any pred textual [start end] → value

Checks to see if every/any character in textual satisfies pred, proceeding from left (index start) to right (index end). textual-every These procedures are short-circuiting: if pred returns false, textual-every does not call pred on subsequent characters; if pred returns true, textual-any does not call pred on subsequent characters; Both procedures are "witness-generating":

If textual-every is given an empty interval (with start = end), it returns #t.
If textual-every returns true for a non-empty interval (with start < end), the returned true value is the one returned by the final call to the predicate on (text-ref (textual-copy text) (- end 1)).
If textual-any returns true, the returned true value is the one returned by the predicate.

Note: The names of these procedures do not end with a question mark. This indicates a general value is returned instead of a simple boolean (#t or #f).

Constructors

make-text len char → text

Returns a text of the given length filled with the given character.

text char ... → text

Returns a text consisting of the given characters.

text-tabulate proc len → text

Proc is a procedure that accepts an exact integer as its argument and returns a character. Constructs a text of size len by calling proc on each value from 0 (inclusive) to len (exclusive) to produce the corresponding element of the text. The order in which proc is called on those indexes is not specified.

Rationale: Although text-unfold is more general, text-tabulate is likely to run faster for the common special case it implements.

text-unfold stop? mapper successor seed [base make-final] → text

This is a fundamental constructor for texts.

successor is used to generate a series of "seed" values from the initial seed:
seed, (successor seed), (successor² seed), (successor³ seed), ...
stop? tells us when to stop — when it returns true when applied to one of these seed values.
mapper maps each seed value to the corresponding character(s) in the result text, which are assembled into that text in left-to-right order. It is an error for mapper to return anything other than a character, string, or text.
base is the optional initial/leftmost portion of the constructed text, which defaults to the empty text (text). It is an error if base is anything other than a character, string, or text.
make-final is applied to the terminal seed value (on which stop? returns true) to produce the final/rightmost portion of the constructed text. It defaults to (lambda (x) (text)). It is an error for make-final to return anything other than a character, string, or text.

text-unfold is a fairly powerful text constructor. You can use it to convert a list to a text, read a port into a text, reverse a text, copy a text, and so forth. Examples:

(port->text p) = (text-unfold eof-object?
                           values
                           (lambda (x) (read-char p))
                           (read-char p))

(list->text lis) = (text-unfold null? car cdr lis)

(text-tabulate f size) = (text-unfold (lambda (i) (= i size)) f add1 0)

To map f over a list lis, producing a text:

(text-unfold null? (compose f car) cdr lis)

Interested functional programmers may enjoy noting that textual-fold-right and text-unfold are in some sense inverses. That is, given operations knull?, kar, kdr, kons, and knil satisfying

(kons (kar x) (kdr x)) = x  and  (knull? knil) = #t

then

(textual-fold-right kons knil (text-unfold knull? kar kdr x)) = x

and

(text-unfold knull? kar kdr (textual-fold-right kons knil text)) = text.

This combinator pattern is sometimes called an "anamorphism."

Note: Implementations should not allow the size of texts created by text-unfold to be limited by limits on stack size.

text-unfold-right stop? mapper successor seed [base make-final] → text

This is a fundamental constructor for texts. It is the same as text-unfold except the results of mapper are assembled into the text in right-to-left order, base is the optional rightmost portion of the constructed text, and make-final produces the leftmost portion of the constructed text.

(text-unfold-right (lambda (n) (< n (char->integer #\A)))
                   (lambda (n) (char-downcase (integer->char n)))
                   (lambda (n) (- n 1))
                   (char->integer #\Z)
                   #\space
                   (lambda (n) " The English alphabet: "))
    => « The English alphabet: abcdefghijklmnopqrstuvwxyz »

Conversion

textual->text textual → text

When given a text, textual->text just returns that text. When given a string, textual->text returns the result of calling string->text on that string. Signals an error when its argument is neither string nor text.

textual->string textual [start end] → string

textual->vector textual [start end] → char-vector

textual->list textual [start end] → char-list

textual->string, textual->vector, and textual->list return a newly allocated (unless empty) mutable string, vector, or list of the characters that make up the given subtext or substring.

string->text string [start end] → text

vector->text char-vector [start end] → text

list->text char-list [start end] → text

These procedures return a text containing the characters of the given substring, subvector, or sublist. The behavior of the text will not be affected by subsequent mutation of the given string, vector, or list.

reverse-list->text char-list → text

An efficient implementation of (compose list->text reverse):

(reverse-list->text '(#\a #\B #\c)) → «cBa»

This is a common idiom in the epilogue of text-processing loops that accumulate their result using a list in reverse order. (See also textual-concatenate-reverse for the "chunked" variant.)

textual->utf8 textual [start end] → bytevector

textual->utf16 textual [start end] → bytevector

textual->utf16be textual [start end] → bytevector

textual->utf16le textual [start end] → bytevector

These procedures return a newly allocated (unless empty) bytevector containing a UTF-8 or UTF-16 encoding of the given subtext or substring.

The bytevectors returned by textual->utf8, textual->utf16be, and textual->utf16le do not contain a byte-order mark (BOM). textual->utf16be returns a big-endian encoding, while textual->utf16le returns a little-endian encoding.

The bytevectors returned by textual->utf16 begin with a BOM that declares an implementation-dependent endianness, and the bytevector elements following that BOM encode the given subtext or substring using that endianness.

Rationale: These procedures are consistent with the Unicode standard. Unicode suggests UTF-16 should default to big-endian, but Microsoft prefers little-endian.

utf8->text bytevector [start end] → text

utf16->text bytevector [start end] → text

utf16be->text bytevector [start end] → text

utf16le->text bytevector [start end] → text

These procedures interpret their bytevector argument as a UTF-8 or UTF-16 encoding of a sequence of characters, and return a text containing that sequence.

The bytevector subrange given to utf16->text may begin with a byte order mark (BOM); if so, that BOM determines whether the rest of the subrange is to be interpreted as big-endian or little-endian; in either case, the BOM will not become a character in the returned text. If the subrange does not begin with a BOM, it is decoded using the same implementation-dependent endianness used by textual->utf16.

The utf16be->text and utf16le->text procedures interpret their inputs as big-endian or little-endian, respectively. If a BOM is present, it is treated as a normal character and will become part of the result.

It is an error if the bytevector subrange given to utf8->text contains invalid UTF-8 byte sequences. For the other three procedures, it is an error if start or end are odd, or if the bytevector subrange contains invalid UTF-16 byte sequences.

Selection

text-length text → len

Returns the number of characters within the given text. Must execute in O(1) time.

text-ref text idx → char

Returns character text[idx], using 0-origin indexing. Must execute in O(1) time.

textual-length textual → len

textual-ref textual idx → char

textual-length returns the number of characters in textual, and textual-ref returns the character at character index idx, using 0-origin indexing. These procedures are the generalizations of text-length and text-ref to accept strings as well as texts. If textual is a text, they must execute in O(1) time, but there is no such requirement if textual is a string.

Rationale: These procedures may be more convenient than the text-only versions, but compilers may generate faster code for calls to the text-only versions.

subtext text start end → text

subtextual textual start end → text

These procedures return a text containing the characters of text or textual beginning with index start (inclusive) and ending with index end (exclusive).

If textual is a string, then that string does not share any storage with the result, so subsequent mutation of that string will not affect the text returned by subtextual. When the first argument is a text, as is required by subtext, implementations are encouraged to return a result that shares storage with that text, to whatever extent sharing is possible while maintaining some small fixed bound on the ratio of storage used by the shared representation divided by the storage that would be used by an unshared representation. In particular, these procedures should just return their first argument when that argument is a text, start is 0, and end is the length of that text.

textual-copy textual [start end] → text

Returns a text containing the characters of textual beginning with index start (inclusive) and ending with index end (exclusive).

Unlike subtext and subtextual, the result of textual-copy never shares substructures that would retain characters or sequences of characters that are substructures of its first argument or previously allocated objects.

If textual-copy returns an empty text, that empty text may be eq? or eqv? to the text returned by (text). If the text returned by textual-copy is non-empty, then it is not eqv? to any previously extant object.

textual-take textual nchars → text

textual-drop textual nchars → text

textual-take-right textual nchars → text

textual-drop-right textual nchars → text

textual-take returns a text containing the first nchars of textual; textual-drop returns a text containing all but the first nchars of textual. textual-take-right returns a text containing the last nchars of textual; textual-drop-right returns a text containing all but the last nchars of textual.

If textual is a string, then that string does not share any storage with the result, so subsequent mutation of that string will not affect the text returned by these procedures. If textual is a text, implementations are encouraged to return a result that shares storage with that text (which is easily accomplished by using subtext to create the result).

(textual-take "Pete Szilagyi" 6) => «Pete S»
(textual-drop "Pete Szilagyi" 6) => «zilagyi»

(textual-take-right "Beta rules" 5) => «rules»
(textual-drop-right "Beta rules" 5) => «Beta »

It is an error to take or drop more characters than are in the text:

(textual-take "foo" 37) => error

textual-pad textual len [char start end] → text

textual-pad-right textual len [char start end] → text

Returns a text of length len comprised of the characters drawn from the given subrange of textual, padded on the left (right) by as many occurrences of the character char as needed. If textual has more than len chars, it is truncated on the left (right) to length len. char defaults to #\space.

(textual-pad     "325" 5) => «  325»
(textual-pad   "71325" 5) => «71325»
(textual-pad "8871325" 5) => «71325»

textual-trim textual [pred start end] → text

textual-trim-right textual [pred start end] → text

textual-trim-both textual [pred start end] → text

Returns a text obtained from the given subrange of textual by skipping over all characters on the left / on the right / on both sides that satisfy the second argument pred: pred defaults to char-whitespace?.

(textual-trim-both "  The outlook wasn't brilliant,  \n\r")
    => «The outlook wasn't brilliant,»

Replacement

textual-replace textual1 textual2 start1 end1 [start2 end2] → text

Returns

(textual-append (subtextual textual1 0 start1)
                (subtextual textual2 start2 end2)
                (subtextual textual1 end1 (textual-length textual1)))

That is, the segment of characters in textual1 from start1 to end1 is replaced by the segment of characters in textual2 from start2 to end2. If start1=end1, this simply splices the characters drawn from textual2 into textual1 at that position.

Examples:

(textual-replace "The TCL programmer endured daily ridicule."
                 "another miserable Perl drone" 4 7 8 22)
    => «The miserable Perl programmer endured daily ridicule.»

(textual-replace "It's easy to code it up in Scheme." "lots of fun" 5 9)
    => «It's lots of fun to code it up in Scheme.»

(define (textual-insert s i t) (textual-replace s t i i))

(textual-insert "It's easy to code it up in Scheme." 5 "really ")
    => «It's really easy to code it up in Scheme.»

(define (textual-set s i c) (textual-replace s (text c) i (+ i 1)))

(textual-set "Text-ref runs in O(n) time." 19 #\1)
    => «Text-ref runs in O(1) time.»

Comparison

textual=? textual1 textual2 textual3 ... → boolean

Returns #t if all the texts have the same length and contain exactly the same characters in the same positions; otherwise returns #f.

textual<? textual1 textual2 textual3 ... → boolean

textual>? textual1 textual2 textual3 ... → boolean

textual<=? textual1 textual2 textual3 ... → boolean

textual>=? textual1 textual2 textual3 ... → boolean

These procedures return #t if their arguments are (respectively): monotonically increasing, monotonically decreasing, monotonically non-decreasing, or monotonically non-increasing.

These comparison predicates are required to be transitive.

These procedures compare texts in an implementation-defined way. One approach is to make them the lexicographic extensions to texts of the corresponding orderings on characters. In that case, text<? would be the lexicographic ordering on texts induced by the ordering char<? on characters, and if two texts differ in length but are the same up to the length of the shorter text, the shorter text would be considered to be lexicographically less than the longer string. However, implementations are also allowed to use more sophisticated locale-specific orderings.

In all cases, a pair of texts must satisfy exactly one of textual<?, textual=?, and textual>?, must satisfy textual<=? if and only if they do not satisfy textual>?, and must satisfy textual>=? if and only if they do not satisfy textual<?.

Note: Implementations are encouraged to use the same orderings for texts as are used by the corresponding comparisons on strings, but are allowed to use different orderings.

Rationale: The only portable way to ensure these comparison predicates use the same orderings used by the corresponding comparisons on strings is to convert all texts to strings, which would be unacceptably inefficient.

textual-ci=? textual1 textual2 textual3 ... → boolean

Returns #t if, after calling textual-foldcase on each of the arguments, all of the case-folded texts would have the same length and contain the same characters in the same positions; otherwise returns #f.

textual-ci<? textual1 textual2 textual3 ... → boolean

textual-ci>? textual1 textual2 textual3 ... → boolean

textual-ci<=? textual1 textual2 textual3 ... → boolean

textual-ci>=? textual1 textual2 textual3 ... → boolean

These procedures behave as though they had called textual-foldcase on their arguments before applying the corresponding procedures without "-ci".

Prefixes & suffixes

textual-prefix-length textual1 textual2 [start1 end1 start2 end2] → integer
textual-suffix-length textual1 textual2 [start1 end1 start2 end2] → integer: Return the length of the longest common prefix/suffix of textual1 and textual2. For prefixes, this is equivalent to their "mismatch index" (relative to the start indexes).
The optional start/end indexes restrict the comparison to the indicated subtexts of textual1 and textual2.
textual-prefix? textual1 textual2 [start1 end1 start2 end2] → boolean
textual-suffix? textual1 textual2 [start1 end1 start2 end2] → boolean: Is textual1 a prefix/suffix of textual2?
The optional start/end indexes restrict the comparison to the indicated subtexts of textual1 and textual2.

Searching

textual-index textual pred [start end] → idx-or-false

textual-index-right textual pred [start end] → idx-or-false

textual-skip textual pred [start end] → idx-or-false

textual-skip-right textual pred [start end] → idx-or-false

textual-index searches through the given subtext or substring from the left, returning the index of the leftmost character satisfying the predicate pred. textual-index-right searches from the right, returning the index of the rightmost character satisfying the predicate pred. If no match is found, these procedures return #f.

Rationale: The SRFI 130 analogues of these procedures return cursors, even when no match is found, and SRFI 130's string-index-right returns the successor of the cursor for the first character that satisfies the predicate. As there are no cursors in this SRFI, it seems best to follow the more intuitive and long-standing precedent set by SRFI 13.

The start and end arguments specify the beginning and end of the search; the valid indexes relevant to the search include start but exclude end. Beware of "fencepost" errors: when searching right-to-left, the first index considered is (- end 1), whereas when searching left-to-right, the first index considered is start. That is, the start/end indexes describe the same half-open interval [start,end) in these procedures that they do in all other procedures specified by this SRFI.

The skip functions are similar, but use the complement of the criterion: they search for the first char that doesn't satisfy pred. To skip over initial whitespace, for example, say

(subtextual text
            (or (textual-skip text char-whitespace?)
                (textual-length text))
            (textual-length text))

These functions can be trivially composed with textual-take and textual-drop to produce take-while, drop-while, span, and break procedures without loss of efficiency.

textual-contains textual1 textual2 [start1 end1 start2 end2] → idx-or-false

textual-contains-right textual1 textual2 [start1 end1 start2 end2] → idx-or-false

Does the subtext of textual1 specified by start1 and end1 contain the sequence of characters given by the subtext of textual2 specified by start2 and end2?

Returns #f if there is no match. If start2 = end2, textual-contains returns start1 but textual-contains-right returns end1. Otherwise returns the index in textual1 for the first character of the first/last match; that index lies within the half-open interval [start1,end1), and the match lies entirely within the [start1,end1) range of textual1.

(textual-contains "eek -- what a geek." "ee" 12 18) ; Searches "a geek"
    => 15

Note: The names of these procedures do not end with a question mark. This indicates a useful value is returned when there is a match.

Case conversion

textual-upcase textual → text
textual-downcase textual → text
textual-foldcase textual → text
textual-titlecase textual → text: These procedures return the text obtained by applying Unicode's full uppercasing, lowercasing, case-folding, or title-casing algorithms to their argument. In some cases, the length of the result may be different from the length of the argument. Note that language-sensitive mappings and foldings are not used.

Concatenation

textual-append textual ... → text

Returns a text whose sequence of characters is the concatenation of the sequences of characters in the given arguments.

textual-concatenate textual-list → text

Concatenates the elements of textual-list together into a single text.

If any elements of textual-list are strings, then those strings do not share any storage with the result, so subsequent mutation of those string will not affect the text returned by this procedure. Implementations are encouraged to return a result that shares storage with some of the texts in the list if that sharing would be space-efficient.

Rationale: Some implementations of Scheme limit the number of arguments that may be passed to an n-ary procedure, so the (apply textual-append textual-list) idiom, which is otherwise equivalent to using this procedure, is not as portable.

textual-concatenate-reverse textual-list [final-textual end] → text

With no optional arguments, calling this procedure is equivalent to

(textual-concatenate (reverse textual-list))

If the optional argument final-textual is specified, it is effectively consed onto the beginning of textual-list before performing the list-reverse and textual-concatenate operations.

If the optional argument end is given, only the characters up to but not including end in final-textual are added to the result, thus producing

(textual-concatenate 
  (reverse (cons (subtext final-textual 0 end)
                 textual-list)))

For example:

(textual-concatenate-reverse '(" must be" "Hello, I") " going.XXXX" 7)
  => «Hello, I must be going.»

Rationale: This procedure is useful when constructing procedures that accumulate character data into lists of textual buffers, and wish to convert the accumulated data into a single text when done. The optional end argument accommodates that use case when final-textual is a mutable string, and is allowed (for uniformity) when final-textual is an immutable text.

textual-join textual-list [delimiter grammar] → text

This procedure is a simple unparser; it pastes texts together using the delimiter text.

textual-list is a list of texts and/or strings. delimiter is a text or a string. The grammar argument is a symbol that determines how the delimiter is used, and defaults to 'infix. It is an error for grammar to be any symbol other than these four:

'infix means an infix or separator grammar: insert the delimiter between list elements. An empty list will produce an empty text.
'strict-infix means the same as 'infix if the textual-list is non-empty, but will signal an error if given an empty list. (This avoids an ambiguity shown in the examples below.)
'suffix means a suffix or terminator grammar: insert the delimiter after every list element.
'prefix means a prefix grammar: insert the delimiter before every list element.

The delimiter is the text used to delimit elements; it defaults to a single space " ".

(textual-join '("foo" "bar" "baz"))
         => «foo bar baz»
(textual-join '("foo" "bar" "baz") "")
         => «foobarbaz»
(textual-join '("foo" "bar" "baz") «:»)
         => «foo:bar:baz»
(textual-join '("foo" "bar" "baz") ":" 'suffix)
         => «foo:bar:baz:»

;; Infix grammar is ambiguous wrt empty list vs. empty text:
(textual-join '()   ":") => «»
(textual-join '("") ":") => «»

;; Suffix and prefix grammars are not:
(textual-join '()   ":" 'suffix)) => «»
(textual-join '("") ":" 'suffix)) => «:»

Fold & map & friends

textual-fold kons knil textual [start end] → value

textual-fold-right kons knil textual [start end] → value

These are the fundamental iterators for texts.

The textual-fold procedure maps the kons procedure across the given text or string from left to right:

(... (kons textual[2] (kons textual[1] (kons textual[0] knil))))

In other words, textual-fold obeys the (tail) recursion

  (textual-fold kons knil textual start end)
= (textual-fold kons (kons textual[start] knil) textual start+1 end)

The textual-fold-right procedure maps kons across the given text or string from right to left:

(kons textual[0]
      (... (kons textual[end-3]
                 (kons textual[end-2]
                       (kons textual[end-1]
                             knil)))))

obeying the (tail) recursion

  (textual-fold-right kons knil textual start end)
= (textual-fold-right kons (kons textual[end-1] knil) textual start end-1)

Examples:

;;; Convert a text or string to a list of chars.
(textual-fold-right cons '() textual)

;;; Count the number of lower-case characters in a text or string.
(textual-fold (lambda (c count)
                (if (char-lower-case? c)
                    (+ count 1)
                    count))
              0
              textual)

The textual-fold-right combinator is sometimes called a "catamorphism."

textual-map proc textual1 textual2 ... → text

It is an error if proc does not accept as many arguments as the number of textual arguments passed to textual-map, does not accept characters as arguments, or returns a value that is not a character, string, or text.

The textual-map procedure applies proc element-wise to the characters of the textual arguments, converts each value returned by proc to a text, and returns the concatenation of those texts. If more than one textual argument is given and not all have the same length, then textual-map terminates when the shortest textual argument runs out. The dynamic order in which proc is called on the characters of the textual arguments is unspecified, as is the dynamic order in which the coercions are performed. If any strings returned by proc are mutated after they have been returned and before the call to textual-map has returned, then textual-map returns a text with unspecified contents; the textual-map procedure itself does not mutate those strings.

Example:

(textual-map (lambda (c0 c1 c2)
               (case c0
                ((#\1) c1)
                ((#\2) (string c2))
                ((#\-) (text #\- c1))))
             (string->text "1222-1111-2222")
             (string->text "Hi There!")
             (string->text "Dear John"))
     => «Hear-here!»

textual-for-each proc textual1 textual2 ... → unspecified

It is an error if proc does not accept as many arguments as the number of textual arguments passed to textual-map or does not accept characters as arguments.

The textual-for-each procedure applies proc element-wise to the characters of the textual arguments, going from left to right. If more than one textual argument is given and not all have the same length, then textual-for-each terminates when the shortest textual argument runs out.

textual-map-index proc textual [start end] → text

Calls proc on each valid index of the specified subtext or substring, converts the results of those calls into texts, and returns the concatenation of those texts. It is an error for proc to return anything other than a character, string, or text. The dynamic order in which proc is called on the indexes is unspecified, as is the dynamic order in which the coercions are performed. If any strings returned by proc are mutated after they have been returned and before the call to textual-map-index has returned, then textual-map-index returns a text with unspecified contents; the textual-map-index procedure itself does not mutate those strings.

textual-for-each-index proc textual [start end] → unspecified

Calls proc on each valid index of the specified subtext or substring, in increasing order, discarding the results of those calls. This is simply a safe and correct way to loop over a subtext or substring.

Example:

(let ((txt (string->text "abcde"))
      (v '()))
  (textual-for-each-index
    (lambda (cur) (set! v (cons (char->integer (text-ref txt cur)) v)))
    txt)
  v) => (101 100 99 98 97)

textual-count textual pred [start end] → integer

Returns a count of the number of characters in the specified subtext of textual that satisfy the given predicate.

textual-filter pred textual [start end] → text

textual-remove pred textual [start end] → text

Filter the given subtext of textual, retaining only those characters that satisfy / do not satisfy pred.

Replication & splitting

textual-replicate textual from to [start end] → text

This is an "extended subtext" procedure that implements replicated copying of a subtext or substring.

textual is a text or string; start and end are optional arguments that specify a subtext of textual, defaulting to 0 and the length of textual. This subtext is conceptually replicated both up and down the index space, in both the positive and negative directions. For example, if textual is "abcdefg", start is 3, and end is6, then we have the conceptual bidirectionally-infinite text

    ...  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d ...
        -9 -8 -7 -6 -5 -4 -3 -2 -1  0 +1 +2 +3 +4 +5 +6 +7 +8 +9

textual-replicate returns the subtext of this text beginning at index from, and ending at to. It is an error if from is greater than to.

You can use textual-replicate to perform a variety of tasks:

To rotate a text left: (textual-replicate "abcdef" 2 8) => «cdefab»
To rotate a text right: (textual-replicate "abcdef" -2 4) => «efabcd»
To replicate a text: (textual-replicate "abc" 0 7) => «abcabca»

Note that

The from/to arguments give a half-open range containing the characters from index from up to, but not including, index to.
The from/to indexes are not expressed in the index space of textual. They refer instead to the replicated index space of the subtext defined by textual, start, and end.

It is an error if start=end, unless from=to, which is allowed as a special case.

textual-split textual delimiter [grammar limit start end] → list

Returns a list of texts representing the words contained in the subtext of textual from start (inclusive) to end (exclusive). The delimiter is a text or string to be used as the word separator. This will often be a single character, but multiple characters are allowed for use cases such as splitting on "\r\n". The returned list will have one more item than the number of non-overlapping occurrences of the delimiter in the text. If delimiter is an empty text, then the returned list contains a list of texts, each of which contains a single character.

The grammar is a symbol with the same meaning as in the textual-join procedure. If it is infix, which is the default, processing is done as described above, except an empty textual produces the empty list; if grammar is strict-infix, then an empty textual signals an error. The values prefix and suffix cause a leading/trailing empty text in the result to be suppressed.

If limit is a non-negative exact integer, at most that many splits occur, and the remainder of textual is returned as the final element of the list (so the result will have at most limit+1 elements). If limit is not specified or is #f, then as many splits as possible are made. It is an error if limit is any other value.

To split on a regular expression re, use SRFI 115's regexp-split procedure:

(map string->text (regexp-split re (textual->string txt)))

Rationale: Although it would be more efficient to have a version of regexp-split that operates on texts directly, the scope of this SRFI is limited to specifying operations on texts analogous to those specified for strings by R7RS and SRFI 130.

Sample implementation

This SRFI comes with sample implementations organized as a representation-independent library that imports one of three kernel libraries:

kernel16 uses an internal representation based on UTF-16, which performs well when strings can represent any Unicode text and non-ASCII characters are common.
kernel8 uses an internal representation based on UTF-8, which performs well when strings can represent any Unicode text but most texts consist of ASCII characters.
kernel0 uses an internal representation based on Scheme strings, which performs well if strings are acceptably space-efficient and the string-ref procedure runs in constant time. It also performs well in interpreted systems even when string-ref takes linear time, because the built-in string-ref is likely to run faster on short strings than any UTF-8 or UTF-16 scanner that could be written in Scheme.

All three kernels implement a text-ref procedure that runs in O(1) time. All three kernels use shared substructures to improve both space efficiency and running time.

The sample implementations come with a black-box test program derived from a black-box test program written for SRFI 130.

There is also a program that compares the performance of strings and texts on a number of micro-benchmarks. These benchmarks are hardly typical, but they provide a rational basis for discussing performance tradeoffs between immutable texts, mutable strings with SRFI 130 cursors and operations, and the standard R7RS operations on strings.

Acknowledgements

For three decades, I have been hoping the Scheme standards would either make strings immutable or add a new data type of immutable texts; with Unicode, that hope became more urgent. During that time, I have discussed this with far more people than I can now remember. Most of those I do remember are among those acknowledged below by John Cowan or Olin Shivers, but I am pleased to add Lars T Hansen, Chris Hanson, Felix Klock, and Jonathan Rees to the list of those whose ideas (and counter-arguments!) have contributed to this SRFI.

John Cowan, the author of SRFI 130, deserves special thanks for blessing my desire to use SRFI 130 as the starting point for this SRFI, for designing the spans API whose implementations tested the key ideas of this SRFI's sample implementations, for chairing Working Group 2, and for a lot more I won't mention here.

To acknowledge all those who contributed to SRFI 130 and to its predecessor SRFI 13, written by Olin Shivers, I hereby reproduce John Cowan's acknowledgements from SRFI 130:

Thanks to the members of the SRFI 130 mailing list who made this SRFI what it now is, including Per Bothner, Arthur Gleckler, Shiro Kawai, Jim Rees, and especially Alex Shinn, whose idea it was to make cursors and indexes disjoint, and who provided the foof implementation. The following acknowledgements by Olin Shivers are taken from SRFI 13:

The design of this library benefited greatly from the feedback provided during the SRFI discussion phase. Among those contributing thoughtful commentary and suggestions, both on the mailing list and by private discussion, were Paolo Amoroso, Lars Arvestad, Alan Bawden, Jim Bender, Dan Bornstein, Per Bothner, Will Clinger, Brian Denheyer, Mikael Djurfeldt, Kent Dybvig, Sergei Egorov, Marc Feeley, Matthias Felleisen, Will Fitzgerald, Matthew Flatt, Arthur A. Gleckler, Ben Goetter, Sven Hartrumpf, Erik Hilsdale, Richard Kelsey, Oleg Kiselyov, Bengt Kleberg, Donovan Kolbly, Bruce Korb, Shriram Krishnamurthi, Bruce Lewis, Tom Lord, Brad Lucier, Dave Mason, David Rush, Klaus Schilling, Jonathan Sobel, Mike Sperber, Mikael Staldal, Vladimir Tsyshevsky, Donald Welsh, and Mike Wilson. I am grateful to them for their assistance.

I am also grateful to the authors, implementors and documentors of all the systems mentioned in the introduction. Aubrey Jaffer and Kent Pitman should be noted for their work in producing Web-accessible versions of the R5RS and Common Lisp spec, which was a tremendous aid.

This is not to imply that these individuals necessarily endorse the final results, of course.

During this document's long development period, great patience was exhibited by Mike Sperber, who is the editor for the SRFI, and by Hillary Sullivan, who is not.

As Olin said, we should not assume any of those individuals endorse this SRFI.

References & links

[CommonLisp]: Common Lisp: the Language.
Guy L. Steele Jr. (editor).
Digital Press, Maynard, Mass., second edition 1990.
Available at http://www.elwood.com/alu/table/references.htm#cltl2.
The Common Lisp "HyperSpec," produced by Kent Pitman, is essentially the ANSI spec for Common Lisp: http://www.lispworks.com/documentation/HyperSpec/Front/index.htm.
[MIT-Scheme]: http://www.swiss.ai.mit.edu/projects/scheme/
[R5RS]: Revised⁵ report on the algorithmic language Scheme.
R. Kelsey, W. Clinger, J. Rees (editors).
Higher-Order and Symbolic Computation, Vol. 11, No. 1, September, 1998.
and ACM SIGPLAN Notices, Vol. 33, No. 9, October, 1998.
Available at http://www.schemers.org/Documents/Standards/.
[R6RS]: Revised⁶ report on the algorithmic language Scheme.
M. Sperber, R. K. Dybvig, M. Flatt, A. van Straaten (editors).
Available at http://r6rs.org.
[R6RSlibraries]: Revised⁶ report on the algorithmic language Scheme — Standard Libraries.
M. Sperber, R. K. Dybvig, M. Flatt, A. van Straaten (editors).
Available at http://r6rs.org.
[R6RS-Rationale]: Revised⁶ report on the algorithmic language Scheme — Rationale.
M. Sperber, R. K. Dybvig, M. Flatt, A. van Straaten (editors).
Available at http://r6rs.org.
[R7RS]: Revised⁷ report on the algorithmic language Scheme.
A. Shinn, J. Cowan, A. Gleckler (editors).
Available at http://r7rs.org.
[SRFI]: The SRFI web site.
http://srfi.schemers.org/
[SRFI-13]: O. Shivers.
SRFI-13: String libraries.
http://srfi.schemers.org/srfi-13/
[SRFI-130]: J. Cowan.
SRFI-130: Cursor-based string library.
http://srfi.schemers.org/srfi-130/
[DesignNotes]: W. D. Clinger.
Immutable texts.
(This reference consists of rough design notes for the sample implementations. This reference should be removed before the SRFI is finalized.)
[Unicode]: The Unicode Consortium. Unicode. http://unicode.org/

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Editor: Arthur A. Gleckler

Title

Author

Status

Table of contents

Basic concepts