This SRFI is currently in final status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-140@nospamsrfi.schemers.org
. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.
string-pad
, included read-line
and read-string
among istring-returning procedures from R7RS, and fixed typo in Libraries section.)string-unfold-right
.)This attempts to solve the same issues with R7RS strings raised by SRFI-135, but with better integration with the Scheme language.
We propose to retain the name string as the type of sequences of Unicode characters (scalar values). There are two standard subtypes of string:
string-set!
on an istring throws an error.
On the other hand, the core operations string-ref
and
string-length
are guaranteed to be O(1).
in-placeusing
string-set!
and other operations.
However, string-ref
, string-set!
,
or string-length
have no performance guarantees.
On many implementation they may take time proportional to the
length of the string.
java.lang.CharSequence
to be a string.
The main part of the proposal specifies the default bindings of various procedure names, as might be pre-defined in a REPL. Specifically, some procedures that traditionally return mutable strings are changed to return istrings. We later discuss compatibility and other library issues.
This combines SRFI-13, SRFI-135, and SRFI-118.
This attempts to solve the same issues with R7RS strings raised by SRFI-135: non-guaranteed O(1) indexing, limited sharing, in general the limited usefulness of mutable strings, especially fixed-size ones. Our solution is in concept the same, but the goal is better integration with the Scheme language. Specifically, SRFI-135 proposes new names for types and procedures in a way that is not integrated with Scheme tradition and existing code. It also adds a slew of new names for people to learn. While the relationship between old procedures and SRFI-135 procedures is natural and straightforward, it still adds a conceptual hurdle and wart to the Scheme language, complicating teaching and migration. This specification uses traditional RnRS and SRFI-13 names, but changes procedures to return istrings, rather than mutable strings. This may break some existing programs, but it seems worth it:
string-copy
to convert an istring
to a mutable string.string-set!
usually are old
and don't support Unicode well, so could do with a review.
portedto using efficient istrings. In contrast, porting code to use SRFI-135 texts would be a major pain.
Note: This specification provides very similar functionality
as SRFI-135.
However, what SRFI-135 calls textual
, we call plain string
.
What SRFI-135 calls text
, we call either istring
(in type specifiers) or plain string
(in most procedure names).
Specifically,
SRFI-135 procedures that start with textual-
or text-
, in this specification start with
plain string-
.
Any argument or result type marked as textual
is changed to
string
, while the type text
is changed to istring
.
For example:
string-map
proc string1 string2 ... → istring
One exception: make-string
returns a
mutable string, not an istring.
In the following procedure specifications:
Some procedures are specified as executing in guaranteed O(1) time. Unless specified, the other procedures may execute in linear time, using time proportional to the length of the involved string(s).
(string?
obj)
→ booleanistring?
returns true.
Must execute in O(1) time.
(istring?
obj)
→ boolean
string-ref
and string-length?
Must execute in O(1) time.
(string-null?
string)
→ boolean
(= (string-length string) 0)
but
must execute in O(1) time.
(string-every
pred string [start end])
→ value
(string-any
pred string [start end])
→ value
Checks to see if every/any character in string
satisfies pred,
proceeding from left (index start)
to right (index end).
These procedures are short-circuiting:
if pred returns false, string-every
does not call pred on subsequent characters;
if pred returns true, string-any
does not call pred on subsequent characters.
Both procedures are "witness-generating":
string-every
is given an empty interval
(with start = end),
it returns #t
.string-every
returns true for a non-empty
interval (with start < end),
the returned true value is the one returned by the final call to the
predicate on
(string-ref string (- end 1))
.string-any
returns true,
the returned true value is the one returned by the predicate.
Note:
The names of these procedures do not end with a question mark.
This indicates a general value is returned instead of a simple boolean
(#t
or #f
).
(string->vector
string [start end])
→ char-vector
(string->list
string [start end])
→ char-list
string->vector
,
and string->list
return a newly allocated (unless empty) vector, or list
of the characters that make up the given substring.
(vector->string
char-vector [start end])
→ istring
(list->string
char-list [start end])
→ istring
(reverse-list->string
char-list)
→ istring
(compose list->string reverse)
:
(reverse-list->string '(#\a #\B #\c)) → "cBa"This is a common idiom in the epilogue of string-processing loops that accumulate their result using a list in reverse order. (See also
string-concatenate-reverse
for the "chunked" variant.)
(string->utf8
string [start end])
→ bytevector
(string->utf16
string [start end])
→ bytevector
(string->utf16be
string [start end])
→ bytevector
(string->utf16le
string [start end])
→ bytevector
The bytevectors returned by string->utf8
,
string->utf16be
, and string->utf16le
do not contain a byte-order mark (BOM).
string->utf16be
returns a big-endian encoding,
while string->utf16le
returns a little-endian
encoding.
The bytevectors returned by string->utf16
begin with a BOM that declares an implementation-dependent
endianness.
The latter should match the big-endian
or little-endian
identifier returned by the R7RS features
procedure.
The bytevector elements following that BOM
encode the given substring using that endianness.
Rationale: These procedures are consistent with the Unicode standard. Unicode suggests UTF-16 should default to big-endian, but Microsoft prefers little-endian.
(utf8->string
bytevector [start end])
→ istring
(utf16->string
bytevector [start end])
→ istring
(utf16be->string
bytevector [start end])
→ istring
(utf16le->string
bytevector [start end])
→ istring
The bytevector subrange given to utf16->string
may begin with a byte order mark (BOM); if so, that BOM
determines whether the rest of the subrange is to be
interpreted as big-endian or little-endian; in either case,
the BOM will not become a character in the returned string.
If the subrange does not begin with a BOM, it is decoded using
the same implementation-dependent endianness used by
string->utf16
.
The utf16be->string
and utf16le->string
procedures interpret their inputs as big-endian or little-endian,
respectively. If a BOM is present, it is treated as a normal
character and will become part of the result.
It is an error if the bytevector subrange given to
utf8->string
contains invalid UTF-8 byte sequences.
For the other three procedures, it is an error if
(- end start)
is odd,
or if the bytevector subrange contains invalid UTF-16 byte sequences.
(string
char ...)
→ istring
(string-tabulate
proc len)
→ istring
Rationale:
Although string-unfold
is more general,
string-tabulate
is likely to run faster
for the common special case it implements.
(string-unfold
stop? mapper successor seed [base make-final])
→ istring
(string-unfold-right
stop? mapper successor seed [base make-final])
→ istring
""
.
It is an error if base is anything other than a character or string.(lambda (x) (string))
.
It is an error for make-final to return anything other
than a character or string.string-unfold-right
is the same as string-unfold
except the results of mapper are assembled into the string
in right-to-left order,
base is the optional rightmost portion of the constructed string,
and make-final produces the leftmost portion of the
constructed string. If mapper returns a string, the string
is prepended to the constructed string (without reversal). For example:
(string-unfold-right null? (lambda (x) (string #\[ (car x) #\])) cdr '(#\a #\b #\c)) = "[c][b][a]"
string-unfold
is a fairly powerful string constructor.
You can use it to
convert a list to a string, read a port into a string, reverse a string,
copy a string, and so forth. Examples:
(port->string p) = (string-unfold eof-object? values (lambda (x) (read-char p)) (read-char p)) (list->string lis) = (string-unfold null? car cdr lis) (string-tabulate f size) = (string-unfold (lambda (i) (= i size)) f add1 0)
To map f over a list lis, producing a string:
(string-unfold null? (compose f car) cdr lis)
Interested functional programmers may enjoy noting that
string-fold-right
and string-unfold
are in some sense inverses.
That is, given operations
knull?, kar, kdr, kons,
and knil satisfying
(kons (kar x) (kdr x)) = x and (knull? knil) = #t
then
(string-fold-right kons knil (string-unfold knull? kar kdr x)) = xand
(string-unfold knull? kar kdr (string-fold-right kons knil string)) = string.
This combinator pattern is sometimes called an "anamorphism."
Note: Implementations should not allow the size of strings created
by string-unfold
to be limited by limits on stack size.
For functionality similar to make-string
,
but returning an immutable string,
you can use string-repeat
.
(string-length
string)
→ len
(string-ref
string idx)
→ char
(substring
string start end)
→ istring
If string is a mutable string, then that string does not
share any
storage with the result, so subsequent mutation of that string
will not affect the result returned by substring
.
When the first argument is an istring,
implementations are encouraged to return a result that shares storage with
that istring,
to whatever extent sharing is possible while maintaining some
small fixed bound on the ratio of storage used by the shared
representation divided by the storage that would be used by
an unshared representation.
In particular, these procedures should just return their first
argument when that argument is an istring, start is 0, and
end is the length of that string.
For the functionality of substring
with guaranteed no sharing
use xsubstring
for an immutable result,
or string-copy
for a mutable result.
(string-take
string nchars)
→ istring
(string-drop
string nchars)
→ istring
(string-take-right
string nchars)
→ istring
(string-drop-right
string nchars)
→ istring
string-take
returns an immutable string containing the first
nchars of string;
string-drop
returns a string containing all but the
first nchars of string.
string-take-right
returns a string containing the
last nchars of string;
string-drop-right
returns a string containing all
but the last nchars of string.
Subsequent mutation of the argument string
will not affect the istring returned by these procedures.
If string is an istring, implementations are
encouraged to return a result that shares storage with that string
(which is easily accomplished by using substring
to
create the result).
(string-take "Pete Szilagyi" 6) => "Pete S" (string-drop "Pete Szilagyi" 6) => "zilagyi" (string-take-right "Beta rules" 5) => "rules" (string-drop-right "Beta rules" 5) => "Beta "
It is an error to take or drop more characters than are in the string:
(string-take "foo" 37) => error
(string-pad
string len [char start end])
→ istring
(string-pad-right
string len [char start end])
→ istring
#\space
) as needed.
If string has more
than len chars, it is truncated on the left (right)
to length len.
(string-pad "325" 5) => " 325" (string-pad "71325" 5) => "71325" (string-pad "8871325" 5) => "71325"
(string-trim
string [pred start end])
→ istring
(string-trim-right
string [pred start end])
→ istring
(string-trim-both
string [pred start end])
→ istring
char-whitespace?
.
(string-trim-both " The outlook wasn't brilliant, \n\r") => "The outlook wasn't brilliant,"
Note: It would be more natural to rename string-trim
to string-trim-left
,
and possibly rename string-trim-both
to string-trim
,
but SRFI-13 uses the specified names.
(string-replace
string1 string2 start1 end1 [start2 end2])
→ istring
(string-append (substring string1 0 start1) (substring string2 start2 end2) (substring string1 end1 (string-length string1)))
That is, the segment of characters in string1 from start1 to end1 is replaced by the segment of characters in string2 from start2 to end2. If start1=end1, this simply splices the characters drawn from string2 into string1 at that position.
Examples:
(string-replace "The TCL programmer endured daily ridicule." "another miserable perl drone" 4 7 8 22) => "The miserable perl programmer endured daily ridicule." (string-replace "It's easy to code it up in Scheme." "lots of fun" 5 9) => "It's lots of fun to code it up in Scheme." (define (string-insert s i t) (string-replace s t i i)) (string-insert "It's easy to code it up in Scheme." 5 "really ") => "It's really easy to code it up in Scheme." (define (string-set s i c) (string-replace s (string c) i (+ i 1))) (string-set "String-ref runs in O(n) time." 19 #\1) => "String-ref runs in O(1) time."
(string=?
string1 string2 string3 ...)
→ boolean
(string<?
string1 string2 string3 ...)
→ boolean
(string>?
string1 string2 string3 ...)
→ boolean
(string<=?
string1 string2 string3 ...)
→ boolean
(string>=?
string1 string2 string3 ...)
→ boolean
(string-ci=?
string1 string2 string3 ...)
→ boolean
(string-ci<?
string1 string2 string3 ...)
→ boolean
(string-ci>?
string1 string2 string3 ...)
→ boolean
(string-ci<=?
string1 string2 string3 ..)
. → boolean
(string-ci>=?
string1 string2 string3 ...)
→ boolean
(string-prefix-length
string1 string2 [start1 end1 start2 end2])
→ integer
(string-suffix-length
string1 string2 [start1 end1 start2 end2])
→ integer
The optional start/end indexes restrict the comparison to the indicated substrings of string1 and string2.
(string-prefix?
string1 string2 [start1 end1 start2 end2])
→ boolean
(string-suffix?
string1 string2 [start1 end1 start2 end2])
→ boolean
The optional start/end indexes restrict the comparison to the indicated substrings of string1 and string2.
(string-index
string pred [start end])
→ idx-or-false
(string-index-right
string pred [start end])
→ idx-or-false
(string-skip
string pred [start end])
→ idx-or-false
(string-skip-right
string pred [start end])
→ idx-or-false
string-index
searches through the given substring
from the left, returning the index of the leftmost character
satisfying the predicate pred.
string-index-right
searches from the
right, returning the index of the rightmost character
satisfying the predicate pred.
If no match is found, these procedures return #f
.
Rationale:
The SRFI 130 analogues of these procedures return cursors,
even when no match is found, and
SRFI 130's string-index-right
returns the successor
of the cursor for the first character that satisfies the predicate.
As there are no cursors in this SRFI, it seems best to follow the
more intuitive and long-standing precedent set by SRFI 13.
The start and end arguments specify the
beginning and end of the search; the valid indexes relevant to
the search include start but exclude end.
Beware of "fencepost" errors: when searching right-to-left,
the first index considered is
(- end 1)
,
whereas when searching left-to-right, the first index considered is
start.
That is, the start/end indexes describe the same half-open interval
[start,end) in these procedures that they do
in all other procedures specified by this SRFI.
The skip functions are similar, but use the complement of the criterion: they search for the first char that doesn't satisfy pred. To skip over initial whitespace, for example, say
(substring string (or (string-skip string char-whitespace?) (string-length string)) (string-length string))
These functions can be trivially composed with string-take
and
string-drop
to produce take-while, drop-while, span, and break
procedures without loss of efficiency.
Note: SRFI-13 and Guile generalize pred to char_pred, which can be a predicate, a character, or a character set. This is an optional extension.
(string-contains
string1 string2 [start1 end1 start2 end2])
→ idx-or-false
(string-contains-right
string1 string2 [start1 end1 start2 end2])
→ idx-or-false
Returns #f
if there is no match.
If start2 = end2,
string-contains
returns start1 but
string-contains-right
returns end1.
Otherwise returns the index in string1
for the first character of the first/last match;
that index lies within the half-open interval
[start1,end1),
and the match lies entirely within the
[start1,end1) range of string1.
(string-contains "eek -- what a geek." "ee" 12 18) ; Searches "a geek" => 15
Note: The names of these procedures do not end with a question mark. This indicates a useful value is returned when there is a match.
(string-upcase
string)
→ istring
(string-downcase
string)
→ istring
(string-foldcase
string)
→ istring
(string-titlecase
string)
→ istring
string=?
,
and the argument is immutable, then that argument may be returned.
Note that language-sensitive mappings and foldings are not used.
The results are the same as the R7RS procedures, but as immutable strings.
(string-append
string ...)
→ istring
(string-concatenate
string-list)
→ istring
string-list
together
into a single string.
If any elements of string-list are mutable strings, then those strings do not share any storage with the result, so subsequent mutation of those string will not affect the string returned by this procedure. Implementations are encouraged to return a result that shares storage with some of the strings in the list if that sharing would be space-efficient.
Rationale:
Some implementations of Scheme
limit the number of arguments that may be passed to an n-ary procedure,
so the (apply string-append string-list)
idiom,
which is otherwise equivalent to using this procedure, is not as
portable.
(string-concatenate-reverse
string-list [final-string [end]])
→ istring
(string-concatenate (reverse string-list))
If the optional argument final-string is specified,
it is effectively consed
onto the beginning of string-list
before performing the list-reverse
and
string-concatenate
operations.
If the optional argument end is given, only the characters up to but not including end in final-string are added to the result, thus producing
(string-concatenate (reverse (cons (substring final-string 0 end) string-list)))For example:
(string-concatenate-reverse '(" must be" "Hello, I") " going.XXXX" 7) => "Hello, I must be going."
Rationale: This procedure is useful when constructing procedures that accumulate character data into lists of string buffers, and wish to convert the accumulated data into a single string when done. The optional end argument accommodates that use case when final-string is a mutable string, and is allowed (for uniformity) when final-string is an immutable string.
(string-join
string-list [delimiter [grammar]])
→ istring
The string-list is a list of strings.
The delimiter is the string used to delimit elements; it defaults to
a single space " ".
The grammar argument is a symbol that determines
how the delimiter is
used, and defaults to 'infix
.
It is an error for grammar to be any symbol other
than these four:
'infix
means an infix or separator grammar:
insert the delimiter
between list elements. An empty list will produce an empty string.
'strict-infix
means the same as 'infix
if the string-list is non-empty,
but will signal an error if given an empty list.
(This avoids an ambiguity shown in the examples below.)
'suffix
means a suffix or terminator grammar:
insert the delimiter
after every list element.
'prefix
means a prefix grammar: insert the delimiter
before every list element.
(string-join '("foo" "bar" "baz")) => "foo bar baz" (string-join '("foo" "bar" "baz") "") => "foobarbaz" (string-join '("foo" "bar" "baz") ":") => "foo:bar:baz" (string-join '("foo" "bar" "baz") ":" 'suffix) => "foo:bar:baz:" ;; Infix grammar is ambiguous wrt empty list vs. empty string: (string-join '() ":") => "" (string-join '("") ":") => "" ;; Suffix and prefix grammars are not: (string-join '() ":" 'suffix)) => "" (string-join '("") ":" 'suffix)) => ":"
(string-fold
kons knil string [start end])
→ value
(string-fold-right
kons knil string [start end])
→ value
The string-fold
procedure maps the kons procedure
across the given string from left to right:
(... (kons string[2] (kons string[1] (kons string[0] knil))))
In other words, string-fold
obeys the (tail) recursion
(string-fold kons knil string start end) = (string-fold kons (kons string[start] knil) start+1 end)
The string-fold-right
procedure maps kons across the
given string or string from right to left:
(kons string[0] (... (kons string[end-3] (kons string[end-2] (kons string[end-1] knil)))))
obeying the (tail) recursion
(string-fold-right kons knil string start end) = (string-fold-right kons (kons string[end-1] knil) start end-1)
Examples:
;;; Convert a string or string to a list of chars. (string-fold-right cons '() string) ;;; Count the number of lower-case characters in a string or string. (string-fold (lambda (c count) (if (char-lower-case? c) (+ count 1) count)) 0 string)
The string-fold-right
combinator is sometimes called a "catamorphism."
(string-map
proc string1 string2 ...)
→ istring
(string-for-each
proc string1 string2 ...)
→ unspecified
(string-map-index
proc string [start end])
→ istring
string-map-index
has returned, then
string-map-index
returns a string with unspecified contents; the
string-map-index
procedure itself does not mutate those strings.
(string-for-each-index
proc string [start end])
→ unspecified
Example:
(let ((txt (string->string "abcde")) (v '())) (string-for-each-index (lambda (cur) (set! v (cons (char->integer (string-ref txt cur)) v))) txt) v) => (101 100 99 98 97)
(string-count
string pred [start end])
→ integer
(string-filter
pred string [start end])
→ istring
(string-remove
pred string [start end])
→ istring
If string is a mutable string, then that string does not share any storage with the result, so subsequent mutation of that string will not affect the string returned by these procedures. If string is an immutable string, implementations are encouraged to return a result that shares storage with that string whenever sharing would be space-efficient.
(string-repeat
string-or-character len)
→ istring
string
constructor.
We can define string-repeat
in terms of the
more general xsubstring
procedure:
(define (string-repeat S N) (let ((T (if (char? S) (string S) S))) (xsubstring T 0 (* N (string-length T))))
(xsubstring
string [from to [start end]])
→ istring
extended substringprocedure that implements replicated copying of a substring.
string is a string;
start and end are optional arguments that specify
a substring of string,
defaulting to 0 and the length of string.
This substring is conceptually replicated both up and down the index space,
in both the positive and negative directions.
For example, if string is "abcdefg"
,
start is 3,
and end is 6,
then we have the conceptual bidirectionally-infinite string
... d e f d e f d e f d e f d e f d e f d ... -9 -8 -7 -6 -5 -4 -3 -2 -1 0 +1 +2 +3 +4 +5 +6 +7 +8 +9
xsubstring
returns the substring of this string
beginning at index from,
and ending at to.
It is an error if from is greater than to.
If from and to are missing they default to 0
and from+(end-start)
,
respectively.
This variant is a generalization of using substring
,
but unlike substring
never shares substructures that would
retain characters or sequences of characters that are substructures of
its first argument or previously allocated objects.
(Hence it is equivalent to SRFI-135's string-copy
.)
You can use xsubstring
to perform a variety of tasks:
(xsubstring "abcdef" 2 8)
=> "cdefab"
(xsubstring "abcdef" -2 4)
=> "efabcd"
(xsubstring "abc" 0 7)
=> "abcabca"
Note that
It is an error if start=end, unless from=to, which is allowed as a special case.
Rationale: SRFI-135 names the corresponding procedure textual-replicate
.
The name string-replicate
might be better in the abstract,
but following SRFI-13's xsubstring
name seems preferable.
(string-split
string delimiter [grammar limit start end])
→ list
"\r\n"
.
The returned list will have one more item than the number of
non-overlapping occurrences of the delimiter in the string.
If delimiter is an empty string, then the returned list
contains a list of strings, each of which contains a single character.
The grammar is a symbol with the same meaning as
in the string-join
procedure.
If it is infix
, which is the default,
processing is done as described above, except
an empty string produces the empty list;
if grammar is strict-infix
,
then an empty string signals an error.
The values prefix
and suffix
cause a leading/trailing empty string in the result to be suppressed.
If limit is a non-negative exact integer, at most that
many splits occur, and the remainder of string
is returned as the final element of the list
(so the result will have at most limit+1 elements).
If limit is not specified or is #f
, then
as many splits as possible are made.
It is an error if limit is any other value.
To split on a regular expression,
you can use SRFI 115's regexp-split
procedure.
(make-string
[k [char]])
→ mstring
To return an immutable string that repeats k times
a character char use string-repeat
.
This is as R7RS, except the result is variable-size and we allow leaving out k when it is zero.
Rationale: The most common pattern for mutable strings (at least
in a language like Java) to allocate an empty string and incrementally append to it.
It seems natural to initialize the string with (make-string)
,
rather than (make-string 0)
.
(string-copy
string [start [end]])
→ mstring
(string-set!
mstring k char)
→ unspecified
(string-fill!
mstring fill [start [end]])
→ unspecified
(string-copy!
to at from [start [end]])
→ unspecified
(string-append!
mstring value ...)
→ unspecified
(string-replace!
mstring dst dst-start dst-end src [src-start [src-end]] → unspecified
read-line
, read-string
, symbol->string
,
number->string
, and get-output-string
all return istring.
R7RS specifies that it is an error to mutate the results of command-line
,
get-environment-variable
, or get-environment-variables
.
These should return istrings.
string-upcase
to return a mutable string.
However, we need a mechanism to use the "old" mutable-string-returning
string-upcase
rather than the istring-returning version.
The following libraries are defined:
(srfi 140 base)
(scheme base)
except that string
procedure are as in this specification.
(srfi 140 char)
(scheme char)
except that string
procedures are as in this specification.
(srfi 140 mstrings)
string-append
.
This library provides versions that return mutable strings.
It could be implemented like this:
(define-library (srfi 140 mstrings) (import (rename (srfi 140 istrings) (string-upcase istring-upcase))) (export string-upcase etc ...) (begin (define (string-upcase str) (string-copy (istring-upcase str))) etc ...))
(srfi 140 istrings)
(string 140 mstrings)
,
but the functions return an istring.
(srfi 135)
(srfi 135 texts)
(string 140 istrings)
and other libraries,
and then using export
with rename
.
An implementation should base its default environment on the bindings
of (srfi 140 istrings)
. However, if there is a command-line switch
or other way to select "R7RS standards mode", then that switch should cause it to use
the (srfi 140 mstrings)
bindings.
Since this specification changes core parts of Scheme, a portable library implementation is not possible.
Kawa version 3.0 will include a complete implementation of this specification. At time of writing, Kawa 3.0 isn't released or quite finished yet, but the Kawa git repository includes the functionality of this specification.
SRFI-135 provides 3 alternative implementations for texts
,
which can be used to implement istrings:
kernel16
uses an internal representation based on UTF-16,
which performs well when strings can represent any Unicode
text and non-ASCII characters are common.
The data structure used is a vector of bytevectors,
where each bytevector (except the last) stores 128 code points.
(Kawa uses a simpler data structure, described below.)
kernel8
uses an internal representation based on UTF-8,
which performs well when most (but not all) strings consist of
ASCII characters.
kernel0
uses an internal representation based on Scheme
strings, which performs well if strings are acceptably
space-efficient and the string-ref
procedure
runs in constant time. It also performs well in interpreted
systems even when string-ref
takes linear time,
because the built-in string-ref
is likely to run
faster on short strings than any UTF-8 or UTF-16 scanner that
could be written in Scheme.
The Kawa implementation of istring uses one data array and one offset array.
The data array is a native Java String
, which internally
is an array of 16-bit chars.
(JDK 9 uses an array of 8-bit bytes when all the characters are Latin-1.)
The offset array stores, for every 16th character,
the start index within the data array of that character.
Thus to find the position of character 37, we get element 2 in the index array,
which gives us the position of character 32,
and then we scan linearly 5 characters forwards.
If there are no surrogate characters in the range 32 to 48 (the normal case)
then the difference between the offsets is 16, so no scan is needed
(just add 5 to the offset for character 32).
This data structure is simple, fast, and relatively compact. The overhead of the index array is 4 bytes for every 16 characters (32 bytes or more), thus at most 12.5% overhead for long strings. In the common case of only BMP characters, no index array is needed.
The implementation of istrings uses the IString
class.
The following is simplified - see gnu/lists/IString.java
for the complete class.
public class IString implements CharSequence { String str; // Contains actual char (code unit) values int cplength; // number of codepoints /* Index of every 16-th character. * Null if all characters are in the basic plane. */ int[] offsets; /** used for substring etc */ public int offsetByCodePoints(int i) { if (offsets == null) return i; int pos = offsets[i>>4]; // Optimize if all characters between (i & 15) // and (i & 15) + 16 are all in the basic plane: if (pos + 16 == offsets[(i>>4) + 1]) return pos + (i & 15); // scan linearly from pos, at most 15 characters forward: return str.offsetByCodePoints(str, pos, i & 15); } /** used for string-ref */ public int indexByCodePoints(int index) { return str.codePointAt(offsetByCodePoints(index)); } /* used for string-length */ public int lengthByCodePoints() { return cplength; } /** To implement CharSequence */ public char charAt(int i) { return str.charAt(i); } public String toString() { return str; } public int length() { return str.length(); } }
The actual implementation supports shared substrings.
(Detail: Instead of using a String
for the data array,
it would be slightly more efficient to use a native Java char array
(one object fewer and one indirection less). However,
using a String
has the advantage that the toString
method is trivial.)
Many implementations (including Kawa) support mutable strings
or kinds of strings that do not have constant-time indexing.
Many of the procedures in this specification perform a linear
scan of a substring. It is important that these procedures
not be implemented using string-ref
(unless string-ref
is O(1)
even for non-istrings), since that could lead to quadratic run-time.
In Kawa, any object that implements the (standard Java) interface
java.lang.CharSequence
is a string
.
This includes the standard Java classes java.lang.String
(immutable strings) and java.lang.StringBuilder
.
These are implemented using arrays of char
values,
which are 16-bit code units.
Indexing in terms of char
units (using the charAt
method) is efficient (constant-time), as is determining the number of char
s.
(Since CharSequence
is an interface, rather than a concrete
implementation, we cannot be guaranteed of this, but it is true of all
standard or Kawa classes that implement CharSequence
.)
To implement a procedure like string-count
we first map start and end to offsets
in the CharSequence
.
(This is constant-time if the string is an istring, or
if start/end have default values;
otherwise linear-time.)
Then we call charAt
with increasing indexes.
If a surrogate pair is seen, then we combine them, before
calling the pred.
This is quite efficient regardless of whether or not the
string is an istring.
The Kawa implementation uses helper macros
string-for-each-forwards
(string-for-each
with start/end indexes),
and string-for-each-backwards
.
These make many of the specified procedures trivial.
This specification requires that mutable strings have adjustable size. That means there needs to be some indirection between string object and the buffer containing the actual characters, so the latter can be re-allocated when needed. Using a gap-buffer is a reasonable approach, but not required.
Kawa uses the gnu.lists.FString
class for mutable strings.
strings-test.scm
.
(It is also available in the
Kawa git repository.)
Thank you to Olin Shivers, author of SRFI-13 String Libraries; and William D Clinger, author of SRFI-135 Immutable Texts.
Copyright (C) Per Bothner 2017
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.