This SRFI is currently in final status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-135@nospamsrfi.schemers.org
. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.
textual-fold
and textual-fold-right
.)
In Scheme, strings are a mutable data type.
Although it "is an error"
(R5RS
and
R7RS)
to use string-set!
on literal strings or on strings returned by symbol->string
,
and any attempt to do so "should raise an exception"
(R6RS),
all other strings are mutable.
Although many mutable strings are never actually mutated, the mere
possibility of mutation complicates specifications of libraries that
use strings, encourages precautionary copying of strings, and precludes
structure sharing that could otherwise be used to make procedures such
as substring
and string-append
faster and
more space-efficient.
This
SRFI
specifies a new data type of immutable texts.
It comes with efficient and portable sample implementations
that guarantee O(1) indexing
for both sequential and random access, even in systems whose
string-ref
procedure takes linear time.
The operations of this new data type include analogues for all of the non-mutating operations on strings specified by the R7RS and most of those specified by SRFI 130, but the immutability of texts and uniformity of character-based indexing simplify the specification of those operations while avoiding several inefficiencies associated with the mutability of Scheme's strings.
None.
Here is a list of the procedures provided by this SRFI:
text? textual? textual-null? textual-every textual-any
make-text text text-tabulate text-unfold text-unfold-right
textual->text textual->string textual->vector textual->list string->text vector->text list->text reverse-list->text textual->utf8 textual->utf16be textual->utf16 textual->utf16le utf8->text utf16be->text utf16->text utf16le->text
text-length textual-length text-ref textual-ref subtext subtextual textual-copy textual-take textual-take-right textual-drop textual-drop-right textual-pad textual-pad-right textual-trim textual-trim-right textual-trim-both
textual-replace
textual=? textual-ci=? textual<? textual-ci<? textual>? textual-ci>? textual<=? textual-ci<=? textual>=? textual-ci>=?
textual-prefix-length textual-suffix-length textual-prefix? textual-suffix?
textual-index textual-index-right textual-skip textual-skip-right textual-contains textual-contains-right
textual-upcase textual-downcase textual-foldcase textual-titlecase
textual-append textual-concatenate textual-concatenate-reverse textual-join
textual-fold textual-fold-right textual-map textual-for-each textual-map-index textual-for-each-index textual-count textual-filter textual-remove
textual-replicate textual-split
The R6RS Rationale identified problems created by the mutability of strings, and several more problems were mentioned by SRFI 130 or came up during its discussion period:
substring
and string-append
run faster.
Recognizing the first three of these problems, while acknowledging
that removing mutable strings from the language would cause
"significant compatibility problems for existing code"
[R6RS-Rationale],
the R6RS standard banished
string-set!
and string-fill!
to a separate (rnrs mutable-strings)
library
in hope of discouraging and/or deprecating mutation of strings.
The R7RS restored string-set!
and string-fill!
to the
(scheme base)
library and added a new mutator,
string-copy!
.
Waiting for some revised standard
to make strings immutable is not viable.
We can, however, add a new data type of immutable texts capable of replacing Scheme's string data type for all applications that do not require mutation. Immutable texts do away with the problems listed above while offering these advantages over mutable strings:
SRFI 130 aims for the second of those advantages, but cannot achieve that advantage in any portable implementation of its procedures. Furthermore SRFI 130 introduces a new data type (cursors) that is hard to use correctly (because its error situations are likely to go undetected, partly because many of its operations accept both cursors and character indexes, which are allowed but not required to be the same things).
This SRFI offers all five advantages, at the expense of introducing a new data type (immutable texts) that can be implemented portably and efficiently and is easy to use correctly (partly because most of its error situations are detected by its sample implementations, and partly because its character indexes are the same as those used by Scheme's mutable strings).
See also the discussion of Unicode.
This SRFI is based upon SRFI 130, copying much of its structure and wording, which should make it easier to compare this SRFI against SRFI 130 and to convert programs using SRFI 130 to use immutable texts instead.
The procedures specified by this SRFI are exported by the
(srfi 135)
library.
In R6RS systems that do not yet support R7RS library names,
the name of this library is (srfi :135)
.
It is recommended, but not required, that this library also
be made available under the alternative name
(srfi 135 texts)
. That alternative library
should export exactly the same bindings as the
(srfi 135)
library, so libraries and programs
can import both libraries without name conflicts.
Immutable texts are like strings except they can't be mutated.
Immutability makes it easier to use space-efficient
representations such as UTF-8 and UTF-16 without incurring the
cost of scanning from the beginning when character indexes are
used (as with string-ref
).
When mutation is not needed, immutable texts are likely to be more efficient than strings with respect to space or time. In some implementations, immutable texts may be more efficient than strings with respect to both space and time.
This SRFI defines two new types:
text?
returns true.
textual?
returns true.
The subtypes of the new textual type include
the new text type and
Scheme's traditional string type, which consists
of the values for which string?
returns true.
The string type includes both mutable strings and the
(conceptually) immutable strings that are the values of string
literals and calls to symbol->string
.
Implementations of this SRFI are free to extend the textual
type by adding new subtypes, provided all procedures whose names
begin with textual-
are extended to accept values of
those new subtypes.
Implementations of this SRFI should not extend the text
type unless its extended values are immutable, are accepted as
texts by all procedures of this SRFI (including the text?
predicate), and achieve the
performance required by this SRFI
with respect to both time and space.
This SRFI does not require any particular external representation
for immutable texts, but recommends immutable texts have almost
the same external representation as strings, substituting Unicode's
left-pointing and right-pointing double angle quotation marks
(« and », code points #xab
and #xbb
)
for the double quotes that delimit strings,
and allowing those double angle quotation marks to be escaped
within the external representations of both texts and strings.
That external representation is used by this SRFI's examples.
When feasible, implementations of this SRFI should also consider:
equal?
procedure to regard two
immutable texts t1 and t2 as equal
if and only if
(textual=? t1 t2)
,
while regarding an immutable text as unequal to anything
that isn't an immutable text.display
procedure to accept
immutable texts, treating them the same as strings;write
procedure to generate the
external syntax recommended for immutable texts;read
procedure to accept the
external syntax recommended for immutable texts;read
is extended to accept
the external syntax for texts.Note: Those extensions cannot be implemented portably, so portable code should not rely on them.
Textual input and output ports analogous to string input and output ports would be nice, but they too cannot be implemented portably. Leaving them for another SRFI allows all of this SRFI to be implemented portably with reasonable efficiency.
All strings and other mutable objects returned by the procedures specified within this SRFI are newly allocated and may be mutated with abandon.
No externally visible string ever shares storage with any text. All strings and other mutable objects passed to the procedures specified within this SRFI may be mutated without affecting the behavior of any text.
The immutability of texts allows sharing of substructure, so
subtext
, textual-append
, and similar
operations can be faster and more space-efficient than Scheme's
substring
and string-append
operations.
Although distinct texts may share storage internally, this is undetectable because texts are immutable and the procedures that operate on texts do not directly expose any of their internal components.
Implementations that share storage between texts must satisfy the following requirement: There is some reasonably small fixed bound on the ratio of storage used by the shared representation divided by the storage that would be used by an unshared representation.
Example: For the sample implementations with their default configurations, the worst case arises with UTF-8, when a 1-character ASCII text retains up to 127 characters of a text that is no longer reachable, and all 127 of those retained characters lie outside Unicode's Basic Multilingual Plane (BMP). Making reasonable assumptions about the representations of records, vectors, bytevectors, and strings on a 64-bit machine, that shared text would occupy no more than about 16 times the space occupied by an unshared representation. If the retained characters were in the BMP, the shared text would occupy no more than about 8 times the space occupied by an unshared representation. If the retained characters were ASCII, the shared text would occupy no more than about 4 times the space occupied by an unshared representation. The sample implementations can be configured to reduce those worst-case bounds, most obviously by reducing the maximum number of characters that can be shared with a very short text.
The procedures of this SRFI follow
a consistent naming scheme, and are consistent with the conventions
developed in SRFI 1 and used in SRFI 13 and SRFI 130.
Indeed, most of the names specified here were derived from SRFI 130's
names by replacing string
with text
or
textual
.
As in SRFI 130,
procedures that have left/right directional variants
use no suffix to specify left-to-right operation,
-right
to specify
right-to-left operation, and -both
to specify both.
Note, however, that textual-index
,
textual-index-right
,
textual-skip
, and
textual-skip-right
,
return #f
when no match is found.
In SRFI 130, their analogues always return cursors.
The order of common arguments is consistent across the different procedures.
For convenience, most procedures that accept a text as argument will also accept a string. When given a string, those procedures behave as though the string is first converted to a text, so passing a text is likely to be more efficient than passing a string.
A few procedures are required to execute in O(1) time:
text?
, textual?
,
text-length
, and text-ref
.
If the first two arguments passed to textual-contains
and
textual-contains-right
are texts, then those procedures
must run in O(m n) time, where m
and n are the lengths of the two subtexts specified by
their arguments.
If either of the first two arguments is a string, there is no such
requirement.
The other procedures specified by this SRFI should run in
amortized linear time, not counting time spent in procedures and
predicates that were passed as arguments.
That is not an absolute requirement, but the sample implementations
are designed to deliver that level of performance for most procedures
provided none of their textual arguments are strings.
When strings are passed as arguments, the running time is unlikely
to be linear unless string-ref
runs in constant time,
and that is not required by any of the Scheme standards.
Indeed, this SRFI was designed to make efficient text processing
easier in systems whose string-ref
procedure does not
run in constant time. For efficiency, portable code should use
strings only for fairly short sequences of characters.
Representations with guaranteed efficiency (such as the immutable
texts of this SRFI) should be used for longer texts.
Note:
A procedure that runs in O(1) time does not necessarily take the
same time for all inputs.
Furthermore O(1) = O(1000), so procedures that run in O(1) time
can still be quite slow.
The text-ref
procedure, for example, may have worst
cases for which it is hundreds of times slower than text?
.
Even the average case for text-ref
is likely to be
several times as slow as the worst case for text?
.
During the early development of Unicode, its designers believed
a 16-bit character set would suffice,
which is why Java's char
type has only 16 bits.
When Unicode expanded to 1114112 code points, 16 bits were
no longer enough to encode all Unicode characters.
The Unicode standard defines three encoding forms for arbitrary sequences of Unicode characters:
UTF-32 is a convenient internal representation and is used as such by several string libraries for C, C++, and Python, but it is the least compact of the three representations and is seldom used in files. UTF-16 is convenient for applications that use only the BMP, and supports fast sequential processing of arbitrary Unicode; variants of UTF-16 are used by Windows for files and by Java and C# as an internal representation. UTF-8 is upwardly compatible with the ASCII encoding and supports fast sequential processing of arbitrary Unicode; it is widely used for files on non-Windows machines and is also used by some C libraries.
The Scheme programming language does not expose the internal
representation of strings.
Some implementations of Scheme use UTF-32 or a similar encoding,
which makes string-length
, string-ref
,
and string-set!
run in O(1) time.
Some implementations of Scheme use UTF-16, which saves space
at the expense of making string-ref
take
time proportional to the length of a string.
Some implementations of Scheme use UTF-8, which saves even more space
for ASCII strings while making string-ref
run in
linear time.
Although Scheme's string data type allows portable code to use strings independently of their internal representation, the variation in performance between implementations has created a problem for programs that use long strings. In some systems, long strings are inefficient with respect to space; in other systems, long strings are inefficient with respect to time.
The portable solution to this dilemma is to use Scheme's mutable strings only for buffers and other relatively short sequences of characters, while using the immutable texts defined by this SRFI for long sequences of characters.
Note:
SRFI 130 suggests an alternative solution: Portable code should
process strings sequentially using cursors instead of indexes,
and should avoid mutation of strings by using vectors of characters
instead,
while hoping all major implementations of Scheme will soon convert
their strings to use compact internal representations such as UTF-8
or UTF-16.
That hope is unlikely to be realized, because a lot of legacy code
assumes string-ref
runs in O(1) time, as recommended
by the R6RS, and mutable strings represented in UTF-32 or similar
are more efficient than vectors of characters with respect to both
time and space.
At present, several implementations of Scheme support Unicode while
providing string-ref
and string-set!
procedures
that run in O(1) time; making those operations run asymptotically
slower would displease some users of those systems.
In the following procedure specifications:
(textual-length textual)
;
the sample implementations detect that error and raise an exception.
textual-every
and
textual-any
,
all predicates passed to procedures specified in this SRFI may be
called in any order and any number of times.
It is an error if pred has side effects or
does not behave functionally (returning the same result whenever
it is called with the same character);
the sample implementations do not detect those errors.
It is an error to pass values that violate the specification above.
Arguments given in square brackets are optional. Unless otherwise noted in the text describing the procedure, any prefix of these optional arguments may be supplied, from zero arguments to the full list. When a procedure returns multiple values, this is shown by listing the return values in square brackets, as well. So, for example, the procedure with signature
halts? f [x init-store] → [boolean integer]
would take one (f), two (f, x) or three (f, x, init-store) input arguments, and return two values, a boolean and an integer.
An argument followed by "...
" means zero or more elements.
So the procedure with the signature
sum-squares x ... → numbertakes zero or more arguments (x ...), while the procedure with signature
spell-check doc dict1 dict2 ... → string-list
takes two required arguments (doc and dict1) and zero or more optional arguments (dict2 ...).
If a procedure's return value is said to be "unspecified," the procedure returns a single result whose value is unconstrained and might even vary from call to call.
text?
obj → boolean
(text? obj)
returns false if
(string? obj)
returns true,
which implies string?
returns false
if text?
returns true.
Must execute in O(1) time.
textual?
obj → boolean
textual-null?
textual → boolean
textual-every
pred textual [start end] → value
textual-any
pred textual [start end] → value
Checks to see if every/any character in textual
satisfies pred,
proceeding from left (index start)
to right (index end).
textual-every
These procedures are short-circuiting:
if pred returns false, textual-every
does not call pred on subsequent characters;
if pred returns true, textual-any
does not call pred on subsequent characters;
Both procedures are "witness-generating":
textual-every
is given an empty interval
(with start = end),
it returns #t
.textual-every
returns true for a non-empty
interval (with start < end),
the returned true value is the one returned by the final call to the
predicate on
(text-ref (textual-copy text) (- end 1))
.textual-any
returns true,
the returned true value is the one returned by the predicate.
Note:
The names of these procedures do not end with a question mark.
This indicates a general value is returned instead of a simple boolean
(#t
or #f
).
make-text
len char → text
text
char ... → text
text-tabulate
proc len → text
Rationale:
Although text-unfold
is more general,
text-tabulate
is likely to run faster
for the common special case it implements.
text-unfold
stop? mapper successor seed [base make-final] → text
(text)
.
It is an error if base is anything other than a character,
string, or text.(lambda (x) (text))
.
It is an error for make-final to return anything other
than a character, string, or text.
text-unfold
is a fairly powerful text constructor.
You can use it to
convert a list to a text, read a port into a text, reverse a text,
copy a text, and so forth. Examples:
(port->text p) = (text-unfold eof-object? values (lambda (x) (read-char p)) (read-char p)) (list->text lis) = (text-unfold null? car cdr lis) (text-tabulate f size) = (text-unfold (lambda (i) (= i size)) f add1 0)
To map f over a list lis, producing a text:
(text-unfold null? (compose f car) cdr lis)
Interested functional programmers may enjoy noting that
textual-fold-right
and text-unfold
are in some sense inverses.
That is, given operations
knull?, kar, kdr, kons,
and knil satisfying
(kons (kar x) (kdr x)) = x and (knull? knil) = #t
then
(textual-fold-right kons knil (text-unfold knull? kar kdr x)) = xand
(text-unfold knull? kar kdr (textual-fold-right kons knil text)) = text.
This combinator pattern is sometimes called an "anamorphism."
Note: Implementations should not allow the size of texts created
by text-unfold
to be limited by limits on stack size.
text-unfold-right
stop? mapper successor seed [base make-final] → text
text-unfold
except the results of mapper are assembled into the
text in right-to-left order,
base is the optional rightmost portion
of the constructed text, and make-final
produces the leftmost portion of the constructed text.
(text-unfold-right (lambda (n) (< n (char->integer #\A))) (lambda (n) (char-downcase (integer->char n))) (lambda (n) (- n 1)) (char->integer #\Z) #\space (lambda (n) " The English alphabet: ")) => « The English alphabet: abcdefghijklmnopqrstuvwxyz »
textual->text
textual → text
textual->text
just returns that text.
When given a string, textual->text
returns the result
of calling string->text
on that string.
Signals an error when its argument is neither string nor text.
textual->string
textual [start end] → string
textual->vector
textual [start end] → char-vector
textual->list
textual [start end] → char-list
textual->string
,
textual->vector
,
and textual->list
return a newly allocated (unless empty) mutable string, vector, or list
of the characters that make up the given subtext or substring.
string->text
string [start end] → text
vector->text
char-vector [start end] → text
list->text
char-list [start end] → text
reverse-list->text
char-list → text
(compose list->text reverse)
:
(reverse-list->text '(#\a #\B #\c)) → «cBa»This is a common idiom in the epilogue of text-processing loops that accumulate their result using a list in reverse order. (See also
textual-concatenate-reverse
for the "chunked" variant.)
textual->utf8
textual [start end] → bytevector
textual->utf16
textual [start end] → bytevector
textual->utf16be
textual [start end] → bytevector
textual->utf16le
textual [start end] → bytevector
The bytevectors returned by textual->utf8
,
textual->utf16be
, and textual->utf16le
do not contain a byte-order mark (BOM).
textual->utf16be
returns a big-endian encoding,
while textual->utf16le
returns a little-endian
encoding.
The bytevectors returned by textual->utf16
begin with a BOM that declares an implementation-dependent
endianness, and the bytevector elements following that BOM
encode the given subtext or substring using that endianness.
Rationale: These procedures are consistent with the Unicode standard. Unicode suggests UTF-16 should default to big-endian, but Microsoft prefers little-endian.
utf8->text
bytevector [start end] → text
utf16->text
bytevector [start end] → text
utf16be->text
bytevector [start end] → text
utf16le->text
bytevector [start end] → text
The bytevector subrange given to utf16->text
may begin with a byte order mark (BOM); if so, that BOM
determines whether the rest of the subrange is to be
interpreted as big-endian or little-endian; in either case,
the BOM will not become a character in the returned text.
If the subrange does not begin with a BOM, it is decoded using
the same implementation-dependent endianness used by
textual->utf16
.
The utf16be->text
and utf16le->text
procedures interpret their inputs as big-endian or little-endian,
respectively. If a BOM is present, it is treated as a normal
character and will become part of the result.
It is an error if the bytevector subrange given to
utf8->text
contains invalid UTF-8 byte sequences.
For the other three procedures, it is an error if start
or end are odd, or if the bytevector subrange contains
invalid UTF-16 byte sequences.
text-length
text → len
text-ref
text idx → char
textual-length
textual → len
textual-ref
textual idx → char
textual-length
returns the number of characters in
textual, and
textual-ref
returns the character at character index
idx, using 0-origin indexing.
These procedures are the generalizations of
text-length
and text-ref
to accept strings as well as texts.
If textual is a text, they must execute in O(1) time,
but there is no such requirement if textual is a string.
Rationale: These procedures may be more convenient than the text-only versions, but compilers may generate faster code for calls to the text-only versions.
subtext
text start end → text
subtextual
textual start end → text
If textual is a string, then that string does not share any
storage with the result, so subsequent mutation of that string
will not affect the text returned by subtextual
.
When the first argument is a text, as is required by subtext
,
implementations are encouraged to return a result that shares storage with
that text,
to whatever extent sharing is possible while maintaining some
small fixed bound on the ratio of storage used by the shared
representation divided by the storage that would be used by
an unshared representation.
In particular, these procedures should just return their first
argument when that argument is a text, start is 0, and
end is the length of that text.
textual-copy
textual [start end] → text
Unlike subtext
and subtextual
,
the result of textual-copy
never shares substructures
that would retain characters or sequences of characters that are
substructures of its first argument or previously allocated objects.
If textual-copy
returns an empty text, that empty
text may be eq?
or eqv?
to the text
returned by (text)
.
If the text returned by textual-copy
is non-empty,
then it is not eqv?
to any previously extant object.
textual-take
textual nchars → text
textual-drop
textual nchars → text
textual-take-right
textual nchars → text
textual-drop-right
textual nchars → text
textual-take
returns a text containing the first
nchars of textual;
textual-drop
returns a text containing all but the
first nchars of textual.
textual-take-right
returns a text containing the
last nchars of textual;
textual-drop-right
returns a text containing all
but the last nchars of textual.
If textual is a string, then that string does not share any
storage with the result, so subsequent mutation of that string
will not affect the text returned by these procedures.
If textual is a text, implementations are
encouraged to return a result that shares storage with that text
(which is easily accomplished by using subtext
to
create the result).
(textual-take "Pete Szilagyi" 6) => «Pete S» (textual-drop "Pete Szilagyi" 6) => «zilagyi» (textual-take-right "Beta rules" 5) => «rules» (textual-drop-right "Beta rules" 5) => «Beta »
It is an error to take or drop more characters than are in the text:
(textual-take "foo" 37) => error
textual-pad
textual len [char start end] → text
textual-pad-right
textual len [char start end] → text
#\space
.
If textual is a string, then that string does not share any storage with the result, so subsequent mutation of that string will not affect the text returned by these procedures. If textual is a text, implementations are encouraged to return a result that shares storage with that text whenever sharing would be space-efficient.
(textual-pad "325" 5) => « 325» (textual-pad "71325" 5) => «71325» (textual-pad "8871325" 5) => «71325»
textual-trim
textual [pred start end] → text
textual-trim-right
textual [pred start end] → text
textual-trim-both
textual [pred start end] → text
char-whitespace?
.
If textual is a string, then that string does not share any storage with the result, so subsequent mutation of that string will not affect the text returned by these procedures. If textual is a text, implementations are encouraged to return a result that shares storage with that text whenever sharing would be space-efficient.
(textual-trim-both " The outlook wasn't brilliant, \n\r") => «The outlook wasn't brilliant,»
textual-replace
textual1 textual2 start1 end1 [start2 end2] → text
(textual-append (subtextual textual1 0 start1) (subtextual textual2 start2 end2) (subtextual textual1 end1 (textual-length textual1)))
That is, the segment of characters in textual1 from start1 to end1 is replaced by the segment of characters in textual2 from start2 to end2. If start1=end1, this simply splices the characters drawn from textual2 into textual1 at that position.
Examples:
(textual-replace "The TCL programmer endured daily ridicule." "another miserable Perl drone" 4 7 8 22) => «The miserable Perl programmer endured daily ridicule.» (textual-replace "It's easy to code it up in Scheme." "lots of fun" 5 9) => «It's lots of fun to code it up in Scheme.» (define (textual-insert s i t) (textual-replace s t i i)) (textual-insert "It's easy to code it up in Scheme." 5 "really ") => «It's really easy to code it up in Scheme.» (define (textual-set s i c) (textual-replace s (text c) i (+ i 1))) (textual-set "Text-ref runs in O(n) time." 19 #\1) => «Text-ref runs in O(1) time.»
textual=?
textual1 textual2 textual3 ... → boolean
#t
if all the texts have the same length
and contain exactly the same characters in the same positions;
otherwise returns #f
.
textual<?
textual1 textual2 textual3 ... → boolean
textual>?
textual1 textual2 textual3 ... → boolean
textual<=?
textual1 textual2 textual3 ... → boolean
textual>=?
textual1 textual2 textual3 ... → boolean
#t
if their arguments
are (respectively): monotonically increasing, monotonically decreasing,
monotonically non-decreasing, or monotonically non-increasing.
These comparison predicates are required to be transitive.
These procedures compare texts in an implementation-defined way.
One approach is to make them the lexicographic extensions to texts
of the corresponding orderings on characters. In that case,
text<?
would be the lexicographic ordering on
texts induced by the ordering char<?
on characters,
and if two texts differ in length but are the same up to the length
of the shorter text, the shorter text would be considered to be
lexicographically less than the longer string.
However, implementations are also allowed to use more sophisticated
locale-specific orderings.
In all cases, a pair of texts must satisfy exactly one of
textual<?
, textual=?
, and
textual>?
,
must satisfy textual<=?
if and only if
they do not satisfy textual>?
, and
must satisfy textual>=?
if and only if
they do not satisfy textual<?
.
Note: Implementations are encouraged to use the same orderings for texts as are used by the corresponding comparisons on strings, but are allowed to use different orderings.
Rationale: The only portable way to ensure these comparison predicates use the same orderings used by the corresponding comparisons on strings is to convert all texts to strings, which would be unacceptably inefficient.
textual-ci=?
textual1 textual2 textual3 ... → boolean
#t
if,
after calling textual-foldcase
on each of the arguments,
all of the case-folded texts would have the same length
and contain the same characters in the same positions;
otherwise returns #f
.
textual-ci<?
textual1 textual2 textual3 ... → boolean
textual-ci>?
textual1 textual2 textual3 ... → boolean
textual-ci<=?
textual1 textual2 textual3 ... → boolean
textual-ci>=?
textual1 textual2 textual3 ... → boolean
textual-foldcase
on their arguments
before applying the corresponding procedures without "-ci
".
textual-prefix-length
textual1 textual2 [start1 end1 start2 end2] → integer
textual-suffix-length
textual1 textual2 [start1 end1 start2 end2] → integer
The optional start/end indexes restrict the comparison to the indicated subtexts of textual1 and textual2.
textual-prefix?
textual1 textual2 [start1 end1 start2 end2] → boolean
textual-suffix?
textual1 textual2 [start1 end1 start2 end2] → boolean
The optional start/end indexes restrict the comparison to the indicated subtexts of textual1 and textual2.
textual-index
textual pred [start end] → idx-or-false
textual-index-right
textual pred [start end] → idx-or-false
textual-skip
textual pred [start end] → idx-or-false
textual-skip-right
textual pred [start end] → idx-or-false
textual-index
searches through the given subtext or substring
from the left, returning the index of the leftmost character
satisfying the predicate pred.
textual-index-right
searches from the
right, returning the index of the rightmost character
satisfying the predicate pred.
If no match is found, these procedures return #f
.
Rationale:
The SRFI 130 analogues of these procedures return cursors,
even when no match is found, and
SRFI 130's string-index-right
returns the successor
of the cursor for the first character that satisfies the predicate.
As there are no cursors in this SRFI, it seems best to follow the
more intuitive and long-standing precedent set by SRFI 13.
The start and end arguments specify the
beginning and end of the search; the valid indexes relevant to
the search include start but exclude end.
Beware of "fencepost" errors: when searching right-to-left,
the first index considered is
(- end 1)
,
whereas when searching left-to-right, the first index considered is
start.
That is, the start/end indexes describe the same half-open interval
[start,end) in these procedures that they do
in all other procedures specified by this SRFI.
The skip functions are similar, but use the complement of the criterion: they search for the first char that doesn't satisfy pred. To skip over initial whitespace, for example, say
(subtextual text (or (textual-skip text char-whitespace?) (textual-length text)) (textual-length text))
These functions can be trivially composed with textual-take
and
textual-drop
to produce take-while, drop-while, span, and break
procedures without loss of efficiency.
textual-contains
textual1 textual2 [start1 end1 start2 end2] → idx-or-false
textual-contains-right
textual1 textual2 [start1 end1 start2 end2] → idx-or-false
Returns #f
if there is no match.
If start2 = end2,
textual-contains
returns start1 but
textual-contains-right
returns end1.
Otherwise returns the index in textual1
for the first character of the first/last match;
that index lies within the half-open interval
[start1,end1),
and the match lies entirely within the
[start1,end1) range of textual1.
(textual-contains "eek -- what a geek." "ee" 12 18) ; Searches "a geek" => 15
Note: The names of these procedures do not end with a question mark. This indicates a useful value is returned when there is a match.
textual-upcase
textual → text
textual-downcase
textual → text
textual-foldcase
textual → text
textual-titlecase
textual → text
textual-append
textual ... → text
textual-concatenate
textual-list → text
textual-list
together
into a single text.
If any elements of textual-list are strings, then those strings do not share any storage with the result, so subsequent mutation of those string will not affect the text returned by this procedure. Implementations are encouraged to return a result that shares storage with some of the texts in the list if that sharing would be space-efficient.
Rationale:
Some implementations of Scheme
limit the number of arguments that may be passed to an n-ary procedure,
so the (apply textual-append textual-list)
idiom,
which is otherwise equivalent to using this procedure, is not as
portable.
textual-concatenate-reverse
textual-list [final-textual end] → text
(textual-concatenate (reverse textual-list))
If the optional argument final-textual is specified,
it is effectively consed
onto the beginning of textual-list
before performing the list-reverse
and
textual-concatenate
operations.
If the optional argument end is given, only the characters up to but not including end in final-textual are added to the result, thus producing
(textual-concatenate (reverse (cons (subtext final-textual 0 end) textual-list)))For example:
(textual-concatenate-reverse '(" must be" "Hello, I") " going.XXXX" 7) => «Hello, I must be going.»
Rationale: This procedure is useful when constructing procedures that accumulate character data into lists of textual buffers, and wish to convert the accumulated data into a single text when done. The optional end argument accommodates that use case when final-textual is a mutable string, and is allowed (for uniformity) when final-textual is an immutable text.
textual-join
textual-list [delimiter grammar] → text
textual-list is a list of texts and/or strings.
delimiter is a text or a string.
The grammar argument is a symbol that determines
how the delimiter is
used, and defaults to 'infix
.
It is an error for grammar to be any symbol other
than these four:
'infix
means an infix or separator grammar:
insert the delimiter
between list elements. An empty list will produce an empty text.
'strict-infix
means the same as 'infix
if the textual-list is non-empty,
but will signal an error if given an empty list.
(This avoids an ambiguity shown in the examples below.)
'suffix
means a suffix or terminator grammar:
insert the delimiter
after every list element.
'prefix
means a prefix grammar: insert the delimiter
before every list element.
The delimiter is the text used to delimit elements; it defaults to a single space " ".
(textual-join '("foo" "bar" "baz")) => «foo bar baz» (textual-join '("foo" "bar" "baz") "") => «foobarbaz» (textual-join '("foo" "bar" "baz") «:») => «foo:bar:baz» (textual-join '("foo" "bar" "baz") ":" 'suffix) => «foo:bar:baz:» ;; Infix grammar is ambiguous wrt empty list vs. empty text: (textual-join '() ":") => «» (textual-join '("") ":") => «» ;; Suffix and prefix grammars are not: (textual-join '() ":" 'suffix)) => «» (textual-join '("") ":" 'suffix)) => «:»
textual-fold
kons knil textual [start end] → value
textual-fold-right
kons knil textual [start end] → value
The textual-fold
procedure maps the kons procedure
across the given text or string from left to right:
(... (kons textual[2] (kons textual[1] (kons textual[0] knil))))
In other words, textual-fold
obeys the (tail) recursion
(textual-fold kons knil textual start end) = (textual-fold kons (kons textual[start] knil) textual start+1 end)
The textual-fold-right
procedure maps kons across the
given text or string from right to left:
(kons textual[0] (... (kons textual[end-3] (kons textual[end-2] (kons textual[end-1] knil)))))
obeying the (tail) recursion
(textual-fold-right kons knil textual start end) = (textual-fold-right kons (kons textual[end-1] knil) textual start end-1)
Examples:
;;; Convert a text or string to a list of chars. (textual-fold-right cons '() textual) ;;; Count the number of lower-case characters in a text or string. (textual-fold (lambda (c count) (if (char-lower-case? c) (+ count 1) count)) 0 textual)
The textual-fold-right
combinator is sometimes called a "catamorphism."
textual-map
proc textual1 textual2 ... → text
textual-map
,
does not accept characters as arguments,
or returns a value that is not a character, string, or text.
The textual-map
procedure applies proc element-wise
to the characters of the textual arguments, converts each value
returned by proc to a text, and returns the concatenation of
those texts.
If more than one textual argument is given and not all have
the same length, then textual-map
terminates when the shortest
textual argument runs out.
The dynamic order in which proc is called on the characters
of the textual arguments is unspecified, as is the dynamic
order in which the coercions are performed. If any strings returned
by proc are mutated after they have been returned and before
the call to textual-map
has returned, then
textual-map
returns a text with unspecified contents; the
textual-map
procedure itself does not mutate those strings.
Example:
(textual-map (lambda (c0 c1 c2) (case c0 ((#\1) c1) ((#\2) (string c2)) ((#\-) (text #\- c1)))) (string->text "1222-1111-2222") (string->text "Hi There!") (string->text "Dear John")) => «Hear-here!»
textual-for-each
proc textual1 textual2 ... → unspecified
textual-map
or does not accept characters as arguments.
The textual-for-each
procedure applies proc element-wise
to the characters of the textual arguments, going from left
to right.
If more than one textual argument is given and not all have
the same length, then textual-for-each
terminates when the
shortest textual argument runs out.
textual-map-index
proc textual [start end] → text
textual-map-index
has returned, then
textual-map-index
returns a text with unspecified contents; the
textual-map-index
procedure itself does not mutate those strings.
textual-for-each-index
proc textual [start end] → unspecified
Example:
(let ((txt (string->text "abcde")) (v '())) (textual-for-each-index (lambda (cur) (set! v (cons (char->integer (text-ref txt cur)) v))) txt) v) => (101 100 99 98 97)
textual-count
textual pred [start end] → integer
textual-filter
pred textual [start end] → text
textual-remove
pred textual [start end] → text
If textual is a string, then that string does not share any storage with the result, so subsequent mutation of that string will not affect the text returned by these procedures. If textual is a text, implementations are encouraged to return a result that shares storage with that text whenever sharing would be space-efficient.
textual-replicate
textual from to [start end] → text
textual is a text or string;
start and end are optional arguments that specify
a subtext of textual,
defaulting to 0 and the length of textual.
This subtext is conceptually replicated both up and down the index space,
in both the positive and negative directions.
For example, if textual is "abcdefg"
,
start is 3,
and end is6,
then we have the conceptual bidirectionally-infinite text
... d e f d e f d e f d e f d e f d e f d ... -9 -8 -7 -6 -5 -4 -3 -2 -1 0 +1 +2 +3 +4 +5 +6 +7 +8 +9
textual-replicate
returns the subtext of this text
beginning at index from,
and ending at to.
It is an error if from is greater than to.
You can use textual-replicate
to perform a variety of tasks:
(textual-replicate "abcdef" 2 8)
=> «cdefab»
(textual-replicate "abcdef" -2 4)
=> «efabcd»
(textual-replicate "abc" 0 7)
=> «abcabca»
Note that
It is an error if start=end, unless from=to, which is allowed as a special case.
textual-split
textual delimiter [grammar limit start end] → list
"\r\n"
.
The returned list will have one more item than the number of
non-overlapping occurrences of the delimiter
in the text.
If delimiter is an empty text, then the returned list
contains a list of texts, each of which contains a single character.
The grammar is a symbol with the same meaning as
in the textual-join
procedure.
If it is infix
, which is the default,
processing is done as described above, except
an empty textual produces the empty list;
if grammar is strict-infix
,
then an empty textual signals an error.
The values prefix
and suffix
cause a leading/trailing empty text in the result to be suppressed.
If limit is a non-negative exact integer, at most that
many splits occur, and the remainder of textual
is returned as the final element of the list
(so the result will have at most limit+1 elements).
If limit is not specified or is #f
, then
as many splits as possible are made.
It is an error if limit is any other value.
To split on a regular expression re,
use SRFI 115's regexp-split
procedure:
(map string->text (regexp-split re (textual->string txt)))
Rationale:
Although it would be more efficient to have a version of
regexp-split
that operates on texts directly,
the scope of this SRFI is limited to specifying operations
on texts analogous to those specified for strings by R7RS
and SRFI 130.
This SRFI comes with sample implementations organized as a representation-independent library that imports one of three kernel libraries:
kernel16
uses an internal representation based on UTF-16,
which performs well when strings can represent any Unicode
text and non-ASCII characters are common.
kernel8
uses an internal representation based on UTF-8,
which performs well when strings can represent any Unicode
text but most texts consist of ASCII characters.
kernel0
uses an internal representation based on Scheme
strings, which performs well if strings are acceptably
space-efficient and the string-ref
procedure
runs in constant time. It also performs well in interpreted
systems even when string-ref
takes linear time,
because the built-in string-ref
is likely to run
faster on short strings than any UTF-8 or UTF-16 scanner that
could be written in Scheme.
All three kernels implement a text-ref
procedure
that runs in O(1) time.
All three kernels use shared substructures to improve both
space efficiency and running time.
The sample implementations come with a black-box test program derived from a black-box test program written for SRFI 130.
There is also a program that compares the performance of strings and texts on a number of micro-benchmarks. These benchmarks are hardly typical, but they provide a rational basis for discussing performance tradeoffs between immutable texts, mutable strings with SRFI 130 cursors and operations, and the standard R7RS operations on strings.
For three decades, I have been hoping the Scheme standards would either make strings immutable or add a new data type of immutable texts; with Unicode, that hope became more urgent. During that time, I have discussed this with far more people than I can now remember. Most of those I do remember are among those acknowledged below by John Cowan or Olin Shivers, but I am pleased to add Lars T Hansen, Chris Hanson, Felix Klock, and Jonathan Rees to the list of those whose ideas (and counter-arguments!) have contributed to this SRFI.
John Cowan, the author of SRFI 130, deserves special thanks for blessing my desire to use SRFI 130 as the starting point for this SRFI, for designing the spans API whose implementations tested the key ideas of this SRFI's sample implementations, for chairing Working Group 2, and for a lot more I won't mention here.
To acknowledge all those who contributed to SRFI 130 and to its predecessor SRFI 13, written by Olin Shivers, I hereby reproduce John Cowan's acknowledgements from SRFI 130:
Thanks to the members of the SRFI 130 mailing list who made this SRFI what it now is, including Per Bothner, Arthur Gleckler, Shiro Kawai, Jim Rees, and especially Alex Shinn, whose idea it was to make cursors and indexes disjoint, and who provided the foof implementation. The following acknowledgements by Olin Shivers are taken from SRFI 13:
The design of this library benefited greatly from the feedback provided during the SRFI discussion phase. Among those contributing thoughtful commentary and suggestions, both on the mailing list and by private discussion, were Paolo Amoroso, Lars Arvestad, Alan Bawden, Jim Bender, Dan Bornstein, Per Bothner, Will Clinger, Brian Denheyer, Mikael Djurfeldt, Kent Dybvig, Sergei Egorov, Marc Feeley, Matthias Felleisen, Will Fitzgerald, Matthew Flatt, Arthur A. Gleckler, Ben Goetter, Sven Hartrumpf, Erik Hilsdale, Richard Kelsey, Oleg Kiselyov, Bengt Kleberg, Donovan Kolbly, Bruce Korb, Shriram Krishnamurthi, Bruce Lewis, Tom Lord, Brad Lucier, Dave Mason, David Rush, Klaus Schilling, Jonathan Sobel, Mike Sperber, Mikael Staldal, Vladimir Tsyshevsky, Donald Welsh, and Mike Wilson. I am grateful to them for their assistance.
I am also grateful to the authors, implementors and documentors of all the systems mentioned in the introduction. Aubrey Jaffer and Kent Pitman should be noted for their work in producing Web-accessible versions of the R5RS and Common Lisp spec, which was a tremendous aid.
This is not to imply that these individuals necessarily endorse the final results, of course.
During this document's long development period, great patience was exhibited by Mike Sperber, who is the editor for the SRFI, and by Hillary Sullivan, who is not.
As Olin said, we should not assume any of those individuals endorse this SRFI.
The Common Lisp "HyperSpec," produced by Kent Pitman, is essentially the ANSI spec for Common Lisp: http://www.lispworks.com/documentation/HyperSpec/Front/index.htm.
Copyright (C) William D Clinger (2016). All Rights Reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.