Title

Immutable Strings

Author

Per Bothner

Status

This SRFI is currently in final status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-140@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Received: 2016-07-11
60-day deadline: 2016-09-09
Draft #1 published: 2016-07-11
Draft #2 published: 2016-07-31
Draft #3 published: 2016-08-15
Draft #4 published: 2017-02-01
Draft #5 published: 2017-04-16
Draft #6 published: 2017-04-25
Draft #7 published: 2017-05-09
Draft #8 published: 2017-05-19
Finalized: 2017-05-24
Revised to fix errata:
- 2018-11-28 (Fixed four typos ("(stri 140").)
- 2019-04-06 (Fixed return type of string-pad, included read-line and read-string among istring-returning procedures from R7RS, and fixed typo in Libraries section.)
- 2020-03-04 (Add note about accumulation of strings in string-unfold-right.)

Abstract

This attempts to solve the same issues with R7RS strings raised by SRFI-135, but with better integration with the Scheme language.

We propose to retain the name string as the type of sequences of Unicode characters (scalar values). There are two standard subtypes of string:

Immutable strings, also called istrings, cannot be modified after they have been created. Calling string-set! on an istring throws an error. On the other hand, the core operations string-ref and string-length are guaranteed to be O(1).
Mutable strings can be modified in-place using string-set! and other operations. However, string-ref, string-set!, or string-length have no performance guarantees. On many implementation they may take time proportional to the length of the string.

An implementation may support other kinds of strings. For example on the Java platform it may be reasonable to consider any instance of java.lang.CharSequence to be a string.

The main part of the proposal specifies the default bindings of various procedure names, as might be pre-defined in a REPL. Specifically, some procedures that traditionally return mutable strings are changed to return istrings. We later discuss compatibility and other library issues.

This combines SRFI-13, SRFI-135, and SRFI-118.

Rationale

This attempts to solve the same issues with R7RS strings raised by SRFI-135: non-guaranteed O(1) indexing, limited sharing, in general the limited usefulness of mutable strings, especially fixed-size ones. Our solution is in concept the same, but the goal is better integration with the Scheme language. Specifically, SRFI-135 proposes new names for types and procedures in a way that is not integrated with Scheme tradition and existing code. It also adds a slew of new names for people to learn. While the relationship between old procedures and SRFI-135 procedures is natural and straightforward, it still adds a conceptual hurdle and wart to the Scheme language, complicating teaching and migration. This specification uses traditional RnRS and SRFI-13 names, but changes procedures to return istrings, rather than mutable strings. This may break some existing programs, but it seems worth it:

Programs that break seem to be rare. They are also easy to fix: You can always use string-copy to convert an istring to a mutable string.
Any breakage is obvious, sigalling an error, rather than subtle or producing an incorrect result.
Programs that depend on string-set! usually are old and don't support Unicode well, so could do with a review.
Preserving traditional procedure names means that existing code, documentation, and teaching materials are valid and automatically ported to using efficient istrings. In contrast, porting code to use SRFI-135 texts would be a major pain.

Specification

Note: This specification provides very similar functionality as SRFI-135. However, what SRFI-135 calls textual, we call plain string. What SRFI-135 calls text, we call either istring (in type specifiers) or plain string (in most procedure names). Specifically, SRFI-135 procedures that start with textual- or text-, in this specification start with plain string-. Any argument or result type marked as textual is changed to string, while the type text is changed to istring.

For example:

string-map proc string₁ string₂ ... → istring

One exception: make-string returns a mutable string, not an istring.

Notation

In the following procedure specifications:

An istring argument or result is an immutable string.
An mstring argument or result is a mutable string.
A string argument or result is any string (immutable or mutable).

Some procedures are specified as executing in guaranteed O(1) time. Unless specified, the other procedures may execute in linear time, using time proportional to the length of the involved string(s).

String literals

String literals have type istring. They have the same syntax as in R7RS (small). In an implementation that supports SRFI-109 string quasi-literals also evaluate to istrings.

Predicates

(string? obj) → boolean: Is obj a string? Must return true if istring? returns true. Must execute in O(1) time.

(istring? obj) → boolean: Is obj an immutable string, with guaranteed O(1) performance for string-ref and string-length? Must execute in O(1) time.

(string-null? string) → boolean: Is string the empty string? Same result as (= (string-length string) 0) but must execute in O(1) time.

(string-every pred string [start end]) → value

(string-any pred string [start end]) → value

Checks to see if every/any character in string satisfies pred, proceeding from left (index start) to right (index end). These procedures are short-circuiting: if pred returns false, string-every does not call pred on subsequent characters; if pred returns true, string-any does not call pred on subsequent characters. Both procedures are "witness-generating":

If string-every is given an empty interval (with start = end), it returns #t.
If string-every returns true for a non-empty interval (with start < end), the returned true value is the one returned by the final call to the predicate on (string-ref string (- end 1)).
If string-any returns true, the returned true value is the one returned by the predicate.

Note: The names of these procedures do not end with a question mark. This indicates a general value is returned instead of a simple boolean (#t or #f).

Conversions

(string->vector string [start end]) → char-vector

(string->list string [start end]) → char-list

string->vector, and string->list return a newly allocated (unless empty) vector, or list of the characters that make up the given substring.

(vector->string char-vector [start end]) → istring

(list->string char-list [start end]) → istring

These procedures return an istring containing the characters of the given subvector or sublist. The behavior of the result will not be affected by subsequent mutation of the vector or list.

(reverse-list->string char-list) → istring

An efficient implementation of (compose list->string reverse):

(reverse-list->string '(#\a #\B #\c)) → "cBa"

This is a common idiom in the epilogue of string-processing loops that accumulate their result using a list in reverse order. (See also string-concatenate-reverse for the "chunked" variant.)

(string->utf8 string [start end]) → bytevector

(string->utf16 string [start end]) → bytevector

(string->utf16be string [start end]) → bytevector

(string->utf16le string [start end]) → bytevector

These procedures return a newly allocated (unless empty) bytevector containing a UTF-8 or UTF-16 encoding of the given substring.

The bytevectors returned by string->utf8, string->utf16be, and string->utf16le do not contain a byte-order mark (BOM). string->utf16be returns a big-endian encoding, while string->utf16le returns a little-endian encoding.

The bytevectors returned by string->utf16 begin with a BOM that declares an implementation-dependent endianness. The latter should match the big-endian or little-endian identifier returned by the R7RS features procedure. The bytevector elements following that BOM encode the given substring using that endianness.

Rationale: These procedures are consistent with the Unicode standard. Unicode suggests UTF-16 should default to big-endian, but Microsoft prefers little-endian.

(utf8->string bytevector [start end]) → istring

(utf16->string bytevector [start end]) → istring

(utf16be->string bytevector [start end]) → istring

(utf16le->string bytevector [start end]) → istring

These procedures interpret their bytevector argument as a UTF-8 or UTF-16 encoding of a sequence of characters, and return a string containing that sequence.

The bytevector subrange given to utf16->string may begin with a byte order mark (BOM); if so, that BOM determines whether the rest of the subrange is to be interpreted as big-endian or little-endian; in either case, the BOM will not become a character in the returned string. If the subrange does not begin with a BOM, it is decoded using the same implementation-dependent endianness used by string->utf16.

The utf16be->string and utf16le->string procedures interpret their inputs as big-endian or little-endian, respectively. If a BOM is present, it is treated as a normal character and will become part of the result.

It is an error if the bytevector subrange given to utf8->string contains invalid UTF-8 byte sequences. For the other three procedures, it is an error if (- end start) is odd, or if the bytevector subrange contains invalid UTF-16 byte sequences.

Immutable string constructors

(string char ...) → istring: Returns a string consisting of the given characters.

(string-tabulate proc len) → istring

Proc is a procedure that accepts an exact integer as its argument and returns a character. Constructs a string of size len by calling proc on each value from 0 (inclusive) to len (exclusive) to produce the corresponding element of the string. The order in which proc is called on those indexes is not specified.

Rationale: Although string-unfold is more general, string-tabulate is likely to run faster for the common special case it implements.

(string-unfold stop? mapper successor seed [base make-final]) → istring

(string-unfold-right stop? mapper successor seed [base make-final]) → istring

This is a fundamental constructor for strings.

successor is used to generate a series of "seed" values from the initial seed:
seed, (successor seed), (successor² seed), (successor³ seed), ...
stop? tells us when to stop — when it returns true when applied to one of these seed values.
mapper maps each seed value to the corresponding character(s) in the result string, which are assembled into that string in left-to-right order. It is an error for mapper to return anything other than a character or string.
base is the optional initial/leftmost portion of the constructed string, which defaults to the empty string "". It is an error if base is anything other than a character or string.
make-final is applied to the terminal seed value (on which stop? returns true) to produce the final/rightmost portion of the constructed string. It defaults to (lambda (x) (string)). It is an error for make-final to return anything other than a character or string.

string-unfold-right is the same as string-unfold except the results of mapper are assembled into the string in right-to-left order, base is the optional rightmost portion of the constructed string, and make-final produces the leftmost portion of the constructed string. If mapper returns a string, the string is prepended to the constructed string (without reversal). For example:

(string-unfold-right null?
                     (lambda (x) (string  #\[ (car x) #\]))
                     cdr
                     '(#\a #\b #\c))
   = "[c][b][a]"

string-unfold is a fairly powerful string constructor. You can use it to convert a list to a string, read a port into a string, reverse a string, copy a string, and so forth. Examples:

(port->string p) = (string-unfold eof-object?
                           values
                           (lambda (x) (read-char p))
                           (read-char p))

(list->string lis) = (string-unfold null? car cdr lis)

(string-tabulate f size) = (string-unfold (lambda (i) (= i size)) f add1 0)

To map f over a list lis, producing a string:

(string-unfold null? (compose f car) cdr lis)

Interested functional programmers may enjoy noting that string-fold-right and string-unfold are in some sense inverses. That is, given operations knull?, kar, kdr, kons, and knil satisfying

(kons (kar x) (kdr x)) = x  and  (knull? knil) = #t

then

(string-fold-right kons knil (string-unfold knull? kar kdr x)) = x

and

(string-unfold knull? kar kdr (string-fold-right kons knil string)) = string.

This combinator pattern is sometimes called an "anamorphism."

Note: Implementations should not allow the size of strings created by string-unfold to be limited by limits on stack size.

For functionality similar to make-string, but returning an immutable string, you can use string-repeat.

Selection

(string-length string) → len

(string-ref string idx) → char

As in R7RS. However, if the string is an istring, both must execute in constant time; otherwise they may execute in time proportional to the length of string.

(substring string start end) → istring

This procedure returns a istring containing the characters of string starting with index start (inclusive) and ending with index end (exclusive).

If string is a mutable string, then that string does not share any storage with the result, so subsequent mutation of that string will not affect the result returned by substring. When the first argument is an istring, implementations are encouraged to return a result that shares storage with that istring, to whatever extent sharing is possible while maintaining some small fixed bound on the ratio of storage used by the shared representation divided by the storage that would be used by an unshared representation. In particular, these procedures should just return their first argument when that argument is an istring, start is 0, and end is the length of that string.

For the functionality of substring with guaranteed no sharing use xsubstring for an immutable result, or string-copy for a mutable result.

(string-take string nchars) → istring

(string-drop string nchars) → istring

(string-take-right string nchars) → istring

(string-drop-right string nchars) → istring

string-take returns an immutable string containing the first nchars of string; string-drop returns a string containing all but the first nchars of string. string-take-right returns a string containing the last nchars of string; string-drop-right returns a string containing all but the last nchars of string.

Subsequent mutation of the argument string will not affect the istring returned by these procedures. If string is an istring, implementations are encouraged to return a result that shares storage with that string (which is easily accomplished by using substring to create the result).

(string-take "Pete Szilagyi" 6) => "Pete S"
(string-drop "Pete Szilagyi" 6) => "zilagyi"

(string-take-right "Beta rules" 5) => "rules"
(string-drop-right "Beta rules" 5) => "Beta "

It is an error to take or drop more characters than are in the string:

(string-take "foo" 37) => error

(string-pad string len [char start end]) → istring

(string-pad-right string len [char start end]) → istring

Returns an istring of length len comprised of the characters drawn from the given subrange of string. The result is padded on the left (right) by as many occurrences of the character char (which defaults to #\space) as needed. If string has more than len chars, it is truncated on the left (right) to length len.

(string-pad     "325" 5) => "  325"
(string-pad   "71325" 5) => "71325"
(string-pad "8871325" 5) => "71325"

(string-trim string [pred start end]) → istring

(string-trim-right string [pred start end]) → istring

(string-trim-both string [pred start end]) → istring

Returns a string obtained from the given subrange of string by skipping over all characters on the left / on the right / on both sides that satisfy the second argument pred: pred defaults to char-whitespace?.

(string-trim-both "  The outlook wasn't brilliant,  \n\r")
    => "The outlook wasn't brilliant,"

Note: It would be more natural to rename string-trim to string-trim-left, and possibly rename string-trim-both to string-trim, but SRFI-13 uses the specified names.

Replacement

(string-replace string₁ string₂ start₁ end₁ [start₂ end₂]) → istring

Returns

(string-append (substring string₁ 0 start₁)
               (substring string₂ start₂ end₂)
               (substring string₁ end₁ (string-length string₁)))

That is, the segment of characters in string₁ from start₁ to end₁ is replaced by the segment of characters in string₂ from start₂ to end₂. If start₁=end₁, this simply splices the characters drawn from string₂ into string₁ at that position.

Examples:

(string-replace "The TCL programmer endured daily ridicule."
                 "another miserable perl drone" 4 7 8 22)
    => "The miserable perl programmer endured daily ridicule."

(string-replace "It's easy to code it up in Scheme." "lots of fun" 5 9)
    => "It's lots of fun to code it up in Scheme."

(define (string-insert s i t) (string-replace s t i i))

(string-insert "It's easy to code it up in Scheme." 5 "really ")
    => "It's really easy to code it up in Scheme."

(define (string-set s i c) (string-replace s (string c) i (+ i 1)))

(string-set "String-ref runs in O(n) time." 19 #\1)
    => "String-ref runs in O(1) time."

Comparison

(string=? string₁ string₂ string₃ ...) → boolean
(string<? string₁ string₂ string₃ ...) → boolean
(string>? string₁ string₂ string₃ ...) → boolean
(string<=? string₁ string₂ string₃ ...) → boolean
(string>=? string₁ string₂ string₃ ...) → boolean
(string-ci=? string₁ string₂ string₃ ...) → boolean
(string-ci<? string₁ string₂ string₃ ...) → boolean
(string-ci>? string₁ string₂ string₃ ...) → boolean
(string-ci<=? string₁ string₂ string₃ ..). → boolean
(string-ci>=? string₁ string₂ string₃ ...) → boolean: As in R7RS.

Prefixes & suffixes

(string-prefix-length string₁ string₂ [start₁ end₁ start₂ end₂]) → integer
(string-suffix-length string₁ string₂ [start₁ end₁ start₂ end₂]) → integer: Return the length of the longest common prefix/suffix of string₁ and string₂. For prefixes, this is equivalent to their "mismatch index" (relative to the start indexes).
The optional start/end indexes restrict the comparison to the indicated substrings of string₁ and string₂.
(string-prefix? string₁ string₂ [start₁ end₁ start₂ end₂]) → boolean
(string-suffix? string₁ string₂ [start₁ end₁ start₂ end₂]) → boolean: Is string₁ a prefix/suffix of string₂?
The optional start/end indexes restrict the comparison to the indicated substrings of string₁ and string₂.

Searching

(string-index string pred [start end]) → idx-or-false

(string-index-right string pred [start end]) → idx-or-false

(string-skip string pred [start end]) → idx-or-false

(string-skip-right string pred [start end]) → idx-or-false

string-index searches through the given substring from the left, returning the index of the leftmost character satisfying the predicate pred. string-index-right searches from the right, returning the index of the rightmost character satisfying the predicate pred. If no match is found, these procedures return #f.

Rationale: The SRFI 130 analogues of these procedures return cursors, even when no match is found, and SRFI 130's string-index-right returns the successor of the cursor for the first character that satisfies the predicate. As there are no cursors in this SRFI, it seems best to follow the more intuitive and long-standing precedent set by SRFI 13.

The start and end arguments specify the beginning and end of the search; the valid indexes relevant to the search include start but exclude end. Beware of "fencepost" errors: when searching right-to-left, the first index considered is (- end 1), whereas when searching left-to-right, the first index considered is start. That is, the start/end indexes describe the same half-open interval [start,end) in these procedures that they do in all other procedures specified by this SRFI.

The skip functions are similar, but use the complement of the criterion: they search for the first char that doesn't satisfy pred. To skip over initial whitespace, for example, say

(substring string
            (or (string-skip string char-whitespace?)
                (string-length string))
            (string-length string))

These functions can be trivially composed with string-take and string-drop to produce take-while, drop-while, span, and break procedures without loss of efficiency.

Note: SRFI-13 and Guile generalize pred to char_pred, which can be a predicate, a character, or a character set. This is an optional extension.

(string-contains string₁ string₂ [start₁ end₁ start₂ end₂]) → idx-or-false

(string-contains-right string₁ string₂ [start₁ end₁ start₂ end₂]) → idx-or-false

Does the substring of string₁ specified by start₁ and end₁ contain the sequence of characters given by the substring of string₂ specified by start₂ and end₂?

Returns #f if there is no match. If start₂ = end₂, string-contains returns start₁ but string-contains-right returns end₁. Otherwise returns the index in string₁ for the first character of the first/last match; that index lies within the half-open interval [start₁,end₁), and the match lies entirely within the [start₁,end₁) range of string₁.

(string-contains "eek -- what a geek." "ee" 12 18) ; Searches "a geek"
    => 15

Note: The names of these procedures do not end with a question mark. This indicates a useful value is returned when there is a match.

Case conversion

(string-upcase string) → istring
(string-downcase string) → istring
(string-foldcase string) → istring
(string-titlecase string) → istring: These procedures return the string obtained by applying Unicode's full uppercasing, lowercasing, case-folding, or title-casing algorithms to their argument. In some cases, the length of the result may be different from the length of the argument. If the result is equal to the argument in the sense of string=?, and the argument is immutable, then that argument may be returned. Note that language-sensitive mappings and foldings are not used.
The results are the same as the R7RS procedures, but as immutable strings.

Concatenation

(string-append string ...) → istring

Returns a string whose sequence of characters is the concatenation of the sequences of characters in the given arguments.

(string-concatenate string-list) → istring

Concatenates the elements of string-list together into a single string.

If any elements of string-list are mutable strings, then those strings do not share any storage with the result, so subsequent mutation of those string will not affect the string returned by this procedure. Implementations are encouraged to return a result that shares storage with some of the strings in the list if that sharing would be space-efficient.

Rationale: Some implementations of Scheme limit the number of arguments that may be passed to an n-ary procedure, so the (apply string-append string-list) idiom, which is otherwise equivalent to using this procedure, is not as portable.

(string-concatenate-reverse string-list [final-string [end]]) → istring

With no optional arguments, calling this procedure is equivalent to

(string-concatenate (reverse string-list))

If the optional argument final-string is specified, it is effectively consed onto the beginning of string-list before performing the list-reverse and string-concatenate operations.

If the optional argument end is given, only the characters up to but not including end in final-string are added to the result, thus producing

(string-concatenate 
  (reverse (cons (substring final-string 0 end)
                 string-list)))

For example:

(string-concatenate-reverse '(" must be" "Hello, I") " going.XXXX" 7)
  => "Hello, I must be going."

Rationale: This procedure is useful when constructing procedures that accumulate character data into lists of string buffers, and wish to convert the accumulated data into a single string when done. The optional end argument accommodates that use case when final-string is a mutable string, and is allowed (for uniformity) when final-string is an immutable string.

(string-join string-list [delimiter [grammar]]) → istring

This procedure is a simple unparser; it pastes strings together using the delimiter string.

The string-list is a list of strings. The delimiter is the string used to delimit elements; it defaults to a single space " ". The grammar argument is a symbol that determines how the delimiter is used, and defaults to 'infix. It is an error for grammar to be any symbol other than these four:

'infix means an infix or separator grammar: insert the delimiter between list elements. An empty list will produce an empty string.
'strict-infix means the same as 'infix if the string-list is non-empty, but will signal an error if given an empty list. (This avoids an ambiguity shown in the examples below.)
'suffix means a suffix or terminator grammar: insert the delimiter after every list element.
'prefix means a prefix grammar: insert the delimiter before every list element.

(string-join '("foo" "bar" "baz"))
         => "foo bar baz"
(string-join '("foo" "bar" "baz") "")
         => "foobarbaz"
(string-join '("foo" "bar" "baz") ":")
         => "foo:bar:baz"
(string-join '("foo" "bar" "baz") ":" 'suffix)
         => "foo:bar:baz:"

;; Infix grammar is ambiguous wrt empty list vs. empty string:
(string-join '()   ":") => ""
(string-join '("") ":") => ""

;; Suffix and prefix grammars are not:
(string-join '()   ":" 'suffix)) => ""
(string-join '("") ":" 'suffix)) => ":"

Fold & map & friends

(string-fold kons knil string [start end]) → value

(string-fold-right kons knil string [start end]) → value

These are the fundamental iterators for strings.

The string-fold procedure maps the kons procedure across the given string from left to right:

(... (kons string[₂] (kons string[1] (kons string[0] knil))))

In other words, string-fold obeys the (tail) recursion

  (string-fold kons knil string start end)
= (string-fold kons (kons string[start] knil) start+1 end)

The string-fold-right procedure maps kons across the given string or string from right to left:

(kons string[0]
      (... (kons string[end-3]
                 (kons string[end-₂]
                       (kons string[end-1]
                             knil)))))

obeying the (tail) recursion

  (string-fold-right kons knil string start end)
= (string-fold-right kons (kons string[end-1] knil) start end-1)

Examples:

;;; Convert a string or string to a list of chars.
(string-fold-right cons '() string)

;;; Count the number of lower-case characters in a string or string.
(string-fold (lambda (c count)
                (if (char-lower-case? c)
                    (+ count 1)
                    count))
              0
              string)

The string-fold-right combinator is sometimes called a "catamorphism."

(string-map proc string₁ string₂ ...) → istring: As in R7RS, except the result is an immutable string. As an extension, the result from proc may be a string (not just a character).
(string-for-each proc string₁ string₂ ...) → unspecified: As in R7RS.

(string-map-index proc string [start end]) → istring

Calls proc on each valid index of the specified substring, converts the results of those calls into strings, and returns the concatenation of those strings. It is an error for proc to return anything other than a character or string. The dynamic order in which proc is called on the indexes is unspecified, as is the dynamic order in which the coercions are performed. If any strings returned by proc are mutated after they have been returned and before the call to string-map-index has returned, then string-map-index returns a string with unspecified contents; the string-map-index procedure itself does not mutate those strings.

(string-for-each-index proc string [start end]) → unspecified

Calls proc on each valid index of the specified substring, in increasing order, discarding the results of those calls. This is simply a safe and correct way to loop over a substring.

Example:

(let ((txt (string->string "abcde"))
      (v '()))
  (string-for-each-index
    (lambda (cur) (set! v (cons (char->integer (string-ref txt cur)) v)))
    txt)
  v) => (101 100 99 98 97)

(string-count string pred [start end]) → integer: Returns a count of the number of characters in the specified substring of string that satisfy the given predicate.

(string-filter pred string [start end]) → istring
(string-remove pred string [start end]) → istring: Filter the given substring of string, retaining only those characters that satisfy / do not satisfy pred.
If string is a mutable string, then that string does not share any storage with the result, so subsequent mutation of that string will not affect the string returned by these procedures. If string is an immutable string, implementations are encouraged to return a result that shares storage with that string whenever sharing would be space-efficient.

Replication & splitting

(string-repeat string-or-character len) → istring

Create a string by repeating the first argument len times. If the first argument is a character, it is as if it were wrapped with the string constructor. We can define string-repeat in terms of the more general xsubstring procedure:

(define (string-repeat S N)
   (let ((T (if (char? S) (string S) S)))
     (xsubstring T 0 (* N (string-length T))))

(xsubstring string [from to [start end]]) → istring

This is an extended substring procedure that implements replicated copying of a substring.

string is a string; start and end are optional arguments that specify a substring of string, defaulting to 0 and the length of string. This substring is conceptually replicated both up and down the index space, in both the positive and negative directions. For example, if string is "abcdefg", start is 3, and end is 6, then we have the conceptual bidirectionally-infinite string

    ...  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d ...
        -9 -8 -7 -6 -5 -4 -3 -2 -1  0 +1 +2 +3 +4 +5 +6 +7 +8 +9

xsubstring returns the substring of this string beginning at index from, and ending at to. It is an error if from is greater than to.

If from and to are missing they default to 0 and from+(end-start), respectively. This variant is a generalization of using substring, but unlike substring never shares substructures that would retain characters or sequences of characters that are substructures of its first argument or previously allocated objects. (Hence it is equivalent to SRFI-135's string-copy.)

You can use xsubstring to perform a variety of tasks:

To rotate a string left: (xsubstring "abcdef" 2 8) => "cdefab"
To rotate a string right: (xsubstring "abcdef" -2 4) => "efabcd"
To replicate a string: (xsubstring "abc" 0 7) => "abcabca"

Note that

The from/to arguments give a half-open range containing the characters from index from up to, but not including, index to.
The from/to indexes are not expressed in the index space of string. They refer instead to the replicated index space of the substring defined by string, start, and end.

It is an error if start=end, unless from=to, which is allowed as a special case.

Rationale: SRFI-135 names the corresponding procedure textual-replicate. The name string-replicate might be better in the abstract, but following SRFI-13's xsubstring name seems preferable.

(string-split string delimiter [grammar limit start end]) → list

Returns a list of strings representing the words contained in the substring of string from start (inclusive) to end (exclusive). The delimiter is a string to be used as the word separator. This will often be a single character, but multiple characters are allowed for use cases such as splitting on "\r\n". The returned list will have one more item than the number of non-overlapping occurrences of the delimiter in the string. If delimiter is an empty string, then the returned list contains a list of strings, each of which contains a single character.

The grammar is a symbol with the same meaning as in the string-join procedure. If it is infix, which is the default, processing is done as described above, except an empty string produces the empty list; if grammar is strict-infix, then an empty string signals an error. The values prefix and suffix cause a leading/trailing empty string in the result to be suppressed.

If limit is a non-negative exact integer, at most that many splits occur, and the remainder of string is returned as the final element of the list (so the result will have at most limit+1 elements). If limit is not specified or is #f, then as many splits as possible are made. It is an error if limit is any other value.

To split on a regular expression, you can use SRFI 115's regexp-split procedure.

Mutable string constructors

(make-string [k [char]]) → mstring

Return a new allocated mutable string of length k, where k defaults to 0. If char is given, then all the characters of the string are initialized to char, otherwise the contents of the string are unspecified. The 1-argument version is deprecated as poor style, except when k is 0.

To return an immutable string that repeats k times a character char use string-repeat.

This is as R7RS, except the result is variable-size and we allow leaving out k when it is zero.

Rationale: The most common pattern for mutable strings (at least in a language like Java) to allocate an empty string and incrementally append to it. It seems natural to initialize the string with (make-string), rather than (make-string 0).

(string-copy string [start [end]]) → mstring: As in R7RS.

Procedures for mutating a string

(string-set! mstring k char) → unspecified
(string-fill! mstring fill [start [end]]) → unspecified
(string-copy! to at from [start [end]]) → unspecified: As in R7RS.

(string-append! mstring value ...) → unspecified
(string-replace! mstring dst dst-start dst-end src [src-start [src-end]] → unspecified: As in SRFI-118.

Other string-returning procedures

The R7RS procedures read-line, read-string, symbol->string, number->string, and get-output-string all return istring.

R7RS specifies that it is an error to mutate the results of command-line, get-environment-variable, or get-environment-variables. These should return istrings.

Libraries

There is no reason - except backward compatibility - for a procedure such as string-upcase to return a mutable string. However, we need a mechanism to use the "old" mutable-string-returning string-upcase rather than the istring-returning version.

The following libraries are defined:

(srfi 140 base)

Same as R7RS's (scheme base) except that string procedure are as in this specification.

(srfi 140 char)

Same as R7RS's (scheme char) except that string procedures are as in this specification.

(srfi 140 mstrings)

Provides definition for all the procedures that this specification specifies to return an istring, but which in R7RS or SRFI-13 return mutable strings, such as string-append. This library provides versions that return mutable strings.

It could be implemented like this:

(define-library (srfi 140 mstrings)
  (import (rename (srfi 140 istrings) (string-upcase istring-upcase)))
  (export string-upcase etc ...)
  (begin
    (define (string-upcase str)
      (string-copy (istring-upcase str)))
    etc ...))

(srfi 140 istrings)

This library exports the same names as (string 140 mstrings), but the functions return an istring.

(srfi 135)

(srfi 135 texts)

The SRFI-135 library, if provided, is naturally implemented by importing (string 140 istrings) and other libraries, and then using export with rename.

Default environment

An implementation should base its default environment on the bindings of (srfi 140 istrings). However, if there is a command-line switch or other way to select "R7RS standards mode", then that switch should cause it to use the (srfi 140 mstrings) bindings.

Implementation

Since this specification changes core parts of Scheme, a portable library implementation is not possible.

Kawa version 3.0 will include a complete implementation of this specification. At time of writing, Kawa 3.0 isn't released or quite finished yet, but the Kawa git repository includes the functionality of this specification.

Implementing istring - general notes

SRFI-135 provides 3 alternative implementations for texts, which can be used to implement istrings:

kernel16 uses an internal representation based on UTF-16, which performs well when strings can represent any Unicode text and non-ASCII characters are common. The data structure used is a vector of bytevectors, where each bytevector (except the last) stores 128 code points. (Kawa uses a simpler data structure, described below.)
kernel8 uses an internal representation based on UTF-8, which performs well when most (but not all) strings consist of ASCII characters.
kernel0 uses an internal representation based on Scheme strings, which performs well if strings are acceptably space-efficient and the string-ref procedure runs in constant time. It also performs well in interpreted systems even when string-ref takes linear time, because the built-in string-ref is likely to run faster on short strings than any UTF-8 or UTF-16 scanner that could be written in Scheme.

Kawa's implementation of istring

The Kawa implementation of istring uses one data array and one offset array. The data array is a native Java String, which internally is an array of 16-bit chars. (JDK 9 uses an array of 8-bit bytes when all the characters are Latin-1.) The offset array stores, for every 16th character, the start index within the data array of that character. Thus to find the position of character 37, we get element 2 in the index array, which gives us the position of character 32, and then we scan linearly 5 characters forwards. If there are no surrogate characters in the range 32 to 48 (the normal case) then the difference between the offsets is 16, so no scan is needed (just add 5 to the offset for character 32).

This data structure is simple, fast, and relatively compact. The overhead of the index array is 4 bytes for every 16 characters (32 bytes or more), thus at most 12.5% overhead for long strings. In the common case of only BMP characters, no index array is needed.

The implementation of istrings uses the IString class. The following is simplified - see gnu/lists/IString.java for the complete class.

public class IString implements CharSequence {
    String str; // Contains actual char (code unit) values
    int cplength; // number of codepoints

    /* Index of every 16-th character.
     * Null if all characters are in the basic plane. */
    int[] offsets;

    /** used for substring etc */
    public int offsetByCodePoints(int i) {
        if (offsets == null)
            return i;
        int pos = offsets[i>>4];
        // Optimize if all characters between (i & 15)
        // and (i & 15) + 16 are all in the basic plane:
        if (pos + 16 == offsets[(i>>4) + 1])
            return pos + (i & 15);
        // scan linearly from pos, at most 15 characters forward:
        return str.offsetByCodePoints(str, pos, i & 15);
    }

    /** used for string-ref */
    public int indexByCodePoints(int index) {
        return str.codePointAt(offsetByCodePoints(index));
    }

    /* used for string-length */
    public int lengthByCodePoints() { return cplength; }

    /** To implement CharSequence */
    public char charAt(int i) { return str.charAt(i); }
    public String toString() { return str; }
    public int length() { return str.length(); }
}

The actual implementation supports shared substrings.

(Detail: Instead of using a String for the data array, it would be slightly more efficient to use a native Java char array (one object fewer and one indirection less). However, using a String has the advantage that the toString method is trivial.)

Linear-time procedures and non-istrings

Many implementations (including Kawa) support mutable strings or kinds of strings that do not have constant-time indexing. Many of the procedures in this specification perform a linear scan of a substring. It is important that these procedures not be implemented using string-ref (unless string-ref is O(1) even for non-istrings), since that could lead to quadratic run-time.

In Kawa, any object that implements the (standard Java) interface java.lang.CharSequence is a string. This includes the standard Java classes java.lang.String (immutable strings) and java.lang.StringBuilder. These are implemented using arrays of char values, which are 16-bit code units. Indexing in terms of char units (using the charAt method) is efficient (constant-time), as is determining the number of chars. (Since CharSequence is an interface, rather than a concrete implementation, we cannot be guaranteed of this, but it is true of all standard or Kawa classes that implement CharSequence.)

To implement a procedure like string-count we first map start and end to offsets in the CharSequence. (This is constant-time if the string is an istring, or if start/end have default values; otherwise linear-time.) Then we call charAt with increasing indexes. If a surrogate pair is seen, then we combine them, before calling the pred. This is quite efficient regardless of whether or not the string is an istring.

The Kawa implementation uses helper macros string-for-each-forwards (string-for-each with start/end indexes), and string-for-each-backwards. These make many of the specified procedures trivial.

Mutable strings

This specification requires that mutable strings have adjustable size. That means there needs to be some indirection between string object and the buffer containing the actual characters, so the latter can be re-allocated when needed. Using a gap-buffer is a reasonable approach, but not required.

Kawa uses the gnu.lists.FString class for mutable strings.

Testsuite

There is a testsuite, derived from William Clinger's SRFI-135 testsuite. See strings-test.scm. (It is also available in the Kawa git repository.)

Acknowledgements

Thank you to Olin Shivers, author of SRFI-13 String Libraries; and William D Clinger, author of SRFI-135 Immutable Texts.

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Author: Per Bothner

Editor: Arthur A. Gleckler