Title

Sets, bags, integer sets

Author

John Cowan

This SRFI is currently in ``draft'' status. To see an explanation of each status that a SRFI can hold, see here. To provide input on this SRFI, please mail to <srfi minus 113 at srfi dot schemers dot org>. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.

Received: 2013/05/31
Draft: 2013/06/01-2013/08/01
Revised: 2013/06/09
Revised: 2013/07/03
Revised: 2013/12/04
Revised: 2014/03/31

Abstract

Sets and bags (also known as multisets) are unordered collections that can contain any Scheme object. Sets enforce the constraint that no two elements can be the same in the sense of the set's associated equality predicate; bags do not. Integer sets are a variant of sets that can only contain non-negative exact integers that are less than a maximum value specified when the integer set is created, and that always use numerical equality as their equality predicate.

Issues

16. What provisions should we have for controlling which element of a set is preserved when a new element equal to an existing element is added?

Rationale

Sets are a standard part of the libraries of many high-level programming languages, including Smalltalk, Java, and C++. Racket provides general sets similar to those of this proposal, though with fewer procedures, and there is a Chicken egg called sets (unfortunately undocumented), which provides a minimal set of procedures.

Bags are useful for counting anything from a fixed set of possibilities, e.g. the number of each type of error in a log file or the number of uses of each word in a lexicon drawn from a body of documents. Although other data structures can serve the same purpose, using bags clearly expresses the programmer's intent and can be optimized.

Although sets subsume integer sets, providing them separately allows for increased implementation efficiency. There are two Chicken eggs that provide integer sets: iset, which is conceptually similar to the integer sets described here, but using bignums rather than bytevectors; and cis, which uses lists of integer intervals.

Insofar as possible, the names in this SRFI are harmonized with the names used for ordered collections (lists, strings, vectors, and bytevectors) in Scheme. However, size is used instead of length to express the number of elements in the collection, because length implies order.

Although it's possible to use the general sets of this SRFI to contain characters, the use of SRFI 14 is recommended instead. The names in this SRFI are harmonized with SRFI 14, except that SRFI 14 does not contain analogues of the set>>?, set<=?, set>=?, set-remove, or set-partition procedures.

Sets do not have a lexical syntax representation. It's possible to use SRFI 108 quasi-literal constructors to create them in code, but this SRFI does not standardize how that is done.

The interface to general sets and bags depends on SRFI 114 comparators, despite that SRFI having a higher number than this one (for hysterical raisins). Comparators conveniently package the equality predicate of the set with the hash function or comparison procedure needed to implement the set efficiently.

During the discussion of this SRFI, a new expression type enumeration-case similar to case was proposed. It would dispatch based on a member of an enumeration set, with enumeration-specific error checking? Specifically, an error would be signaled if the cases are not exhaustive, or if any clause datum is not a member of the enumeration set. Unfortunately, due to phasing problems, enumeration-case would only work with enumerations defined using define-enum, and it would be awkward to implement it with syntax-rules only. Therefore, it was not included.

Specification

Sets, bags, and integer sets are mutually disjoint, and disjoint from other types of Scheme objects.

It is an error for any procedure defined in this SRFI to be invoked on sets or bags with different comparators.

Linear update

The procedures of this SRFI, by default, are "pure functional" — they do not alter their parameters. However, this SRFI defines a set of "linear-update" procedures, all of whose names end in !. They have hybrid pure-functional/side-effecting semantics: they are allowed, but not required, to side-effect one of their parameters in order to construct their result. An implementation may legally implement these procedures as pure, side-effect-free functions, or it may implement them using side effects, depending upon the details of what is the most efficient or simple to implement in terms of the underlying representation.

It is an error to rely upon these procedures working by side effect. For example, this is not guaranteed to work:

        (let* ((set1 (set 'a 'b 'c))      ; set1 = {a,b,c}.
               (set2 (set-adjoin! 'd)))   ; Add d to {a,b,c}.
          set1) ; Could be either {a,b,c} or {a,b,c,d}.

However, this is well-defined:

        (let ((set1 (set 'a 'b 'c)))
          (set-adjoin! set1 'd)) ; Add d to {a,b,c}.

So clients of these procedures write in a functional style, but must additionally be sure that, when the procedure is called, there are no other live pointers to the potentially-modified character set (hence the term "linear update").

There are two benefits to this convention:

Implementations are free to provide the most efficient possible implementation, either functional or side-effecting.
Programmers may nonetheless continue to assume that sets are purely functional data structures: they may be reliably shared without needing to be copied, uniquified, and so forth.

In practice, these procedures are most useful for efficiently constructing sets in a side-effecting manner, in some limited local context, before passing the character set outside the local construction scope to be used in a functional manner.

Scheme provides no assistance in checking the linearity of the potentially side-effected parameters passed to these functions — there's no linear type checker or run-time mechanism for detecting violations.

Note that if the implementation is entirely pure functional, it is allowed to return existing sets rather than newly allocated ones, even where this SRFI explicitly says otherwise.

Index

Constructors: set, set-adjoin, set-delete, set-unfold
Predicates: set?, set-contains?, set-empty?, set-disjoint?
Accessor: set-value
Linear update procedures: set-adjoin!, set-delete!, set-intern!
The whole set: set-size, set-find, set-count, hash-table-any?, hash-table-every?
Mapping and folding: set-map, set-for-each, set-fold, hash-table-filter, hash-table-filter!, hash-table-remove, hash-table-partition
Copying and conversion: set-copy, set-empty-copy, set->list, list->set, list->set!
Subsets: hash-table=?, hash-table<?, hash-table>?, hash-table<=?, hash-table>=?
Set operations: hash-table-union!, hash-table-intersection!, hash-table-difference!
Additional bag procedures: bag-element-count, bag-for-each-unique, bag-fold-unique, bag-increment!, bag-decrement!, bag->set, set->bag, set->bag!
Additional integer set procedures: integer-set, make-universal-integer-set, integer-set-complement, integer-set-complement!, integer-set-min, integer-set-max

Set procedures

Constructors

(set comparator element ... )

Returns a newly allocated empty set. The comparator argument is a SRFI 114 comparator. The elements are used to initialize the set.

(set-adjoin set element ... )

Returns a newly allocated set that uses the same comparator as set and contains all the values of set, and in addition each element unless it is already the same (in the sense of set's equality predicate) as an existing member. It is an error to add an element to set that does not return #t when passed to the type test procedure of the comparator with which set was defined.

(set-delete set element ... )

Returns a newly allocated set containing all the values of set except for the elements. Any element that is not a member of the set is ignored.

(set-unfold comparator continue? mapper successor seed)

Create a newly allocated set as if by set using comparator. If the result of applying the predicate continue? to seed is #f, return the set. Otherwise, apply the procedure mapper to seed. The value that mapper returns is added to the set. Then get a new seed by applying the procedure successor to seed, and repeat this algorithm.

Predicates

(set? obj)

Returns #t if obj is a set, and #f otherwise.

(set-member? set element)

Returns #t if element is a member of set and #f otherwise.

(set-empty? set)

Returns #t if set has no elements and #f otherwise.

(set-disjoint? set₁ set₂)

Returns #t if set₁ and set₂ have no elements in common and #f otherwise.

Accessor

(set-value set element)

Returns the element of set that is equal, in the sense of set's equality predicate, to element. If element is not a member of set, it is returned.

Linear update procedures

(set-adjoin! set element)

Returns a set that uses the same comparator as set and contains all the values of set, and in addition each element unless it is already a member. It is an error to add an element to set that does not return #t when passed to the type test procedure of the comparator with which set was defined.

(set-delete! set element)

Returns a set containing all the values of set except for the elements. Any element that is not equal (in the sense ofset's equality predicate) to some member of the set is ignored.

(set-intern! set element)

Returns two values. If there is an element of set that is equal, in the sense of set's equality predicate, to element, then that element is returned as the first value and set is returned as the second value. If there is no such element, then the first value is element and the second value is the result of (set-add! set element).

(set-search! set element failure success)

The set is searched for element. If it is not found, then the failure procedure is tail-called with two continuation arguments, insert and ignore, and is expected to tail-call one of them. If element is found, then the success procedure is tail-called with the matching element of set and two continuations, update and remove, and is expected to tail-call one of them.

The effects of the continuations are as follows (where obj is any Scheme object):

Invoking (insert obj) causes element to be inserted into set.
Invoking (ignore obj) causes set to remain unchanged.
Invoking (update new-element obj) causes new-element to be inserted into set.
Invoking (remove obj) causes the matching element of set to be removed from it.

In all cases, two values are returned: the possibly updated set and obj.

The whole set

(set-size set)

Returns the number of elements in set as an exact integer.

(set-count predicate set)

Returns the number of elements of set that satisfy predicate as an exact integer.

(set-find predicate set failure)

Returns an arbitrarily chosen element of set that satisfies predicate, or the result of invoking failure with no arguments if there is none.

(set-any? predicate set)

Returns #t if any element of set satisfies predicate, or #f otherwise. Note that this differs from the SRFI 1 analogue because it does not return a useful value.

(set-every? predicate set)

Returns #t if every element of set satisfies predicate, or #f otherwise. Note that this differs from the SRFI 1 analogue because it does not return a useful value.

Mapping and folding

(set-map comparator proc set)

Applies proc to each element of set in arbitrary order and returns a newly allocated set, created as if by set, which contains the results of the applications. For example:

        (set-map string-ci=? symbol->string (set eq? 'foo 'bar 'baz))
            => (set string-ci=? "foo" "bar" "baz")

(set-for-each proc set)

Applies proc to set in arbitrary order, discarding the returned values. Returns an unspecified result.

(set-fold proc nil set)

Invokes proc on each member of set in arbitrary order, passing the result of the previous invocation as a second argument. For the first invocation, nil is used as the second argument. Returns the result of the last invocation.

(set-filter predicate set)

Returns a newly allocated set, created as if by set-empty-copy, containing just the elements of set that satisfy predicate.

(set-filter! predicate set)

A linear update procedure that returns a set containing just the elements of set that satisfy predicate.

(set-remove predicate set)

Returns a newly allocated set, created as if by set-empty-copy, containing just the elements of set that do not satisfy predicate.

(set-partition predicate set)

Returns two values: a newly allocated set, created as if by set-empty-copy, containing just the elements of set that satisfy predicate, and another newly allocated set, created as if by set-empty-copy, containing just the elements of set that do not satisfy predicate.

Copying and conversion

(set-copy set)

Returns a newly allocated set containing the elements of set, and using the same comparator.

(set-empty-copy set)

Returns a newly allocated set containing no elements, and using the same comparator as set.

(set->list set)

Returns a newly allocated list containing the members of set in unspecified order. However, repeated calls to this procedure will return a list in the same order until the set is mutated.

(list->set comparator list)

Returns a newly allocated set, created as if by set using comparator, that contains the elements of list.

(list->set! comparator list)

Returns a set using comparator that contains the elements of list.

Subsets

(set=? set ...)

Returns #t if each set contains the same elements.

(set<? set ...)

Returns #t if each set other than the last is a proper subset of the following set, and #f otherwise.

(set>? set ...)

Returns #t if each set other than the last is a proper superset of the following set, and #f otherwise.

(set<=? set ...)

Returns #t if each set other than the last is a subset of the following set, and #f otherwise.

(set>=? set ...)

Returns #t if each set other than the last is a superset of the following set, and #f otherwise.

Set operations

(set-union set₁ set₂ ...)

(set-intersection set₁ set₂ ...)

(set-difference set₁ set₂ ...)

(set-xor set₁ set₂)

Returns a newly allocated set that is the union, intersection, asymmetric difference, or symmetric difference of the sets. Asymmetric difference is extended to more than two sets by taking the difference between the first set and the union of the others. Symmetric difference is not extended beyond two sets. Elements in the result set are drawn from the first set in which they appear.

(set-union! set₁ set₂ ...)

(set-intersection! set₁ set₂ ...)

(set-difference! set₁ set₂ ...)

(set-xor! set₁ set₂)

Linear update procedures returning a set that is the union, intersection, asymmetric difference, or symmetric difference of the sets. Asymmetric difference is extended to more than two sets by taking the difference between the first set and the union of the others. Symmetric difference is not extended beyond two sets. Elements in the result set are drawn from the first set in which they appear.

Bag procedures

Bags are like sets, but can contain the same object more than once. However, if two elements that are the same in the sense of the equality predicate, but not in the sense of eqv?, are both included, it is not guaranteed that they will remain distinct when retrieved from the bag. It is an error for a single procedure to be invoked on bags with different comparators.

The procedures for creating and manipulating bags are the same as those for sets, except that set is replaced by bag in their names, and that adding an element to a bag is effective even if the bag already contains the element. If two elements in a bag are the same in the sense of the equality predicate, the implementation may in fact store just one of them. There are no procedures named bag-xor or bag-xor!.

Additional bag procedures

(bag-element-count bag element)

Returns an exact integer representing the number of times that element appears in bag.

(bag-for-each-unique proc bag)

Applies proc to each unique element of bag in arbitrary order, passing the element and the number of times it occurs in bag, and discarding the returned values. Returns an unspecified result.

(bag-fold-unique proc nil bag)

Invokes proc on each unique element of bag in arbitrary order, passing the number of occurrences as a second argument and the result of the previous invocation as a third argument. For the first invocation, nil is used as the third argument. Returns the result of the last invocation.

(bag-increment! bag element count)

(bag-decrement! bag element count)

Linear update procedures that return a bag with the same elements as bag, but with the element count of element in bag increased or decreased by the exact integer count (but not less than zero).

(bag->set bag)

(set->bag set)

The bag->set procedure returns a newly allocated set containing the unique elements of bag. The set->bag procedure returns a newly allocated bag containing the elements of set. The comparator of the result is the same as the comparator of the argument.

Integer set procedures

The elements of an integer set are non-negative exact integers less than the set's limit, which is specified when it is created. Except as noted below, the procedures for creating and manipulating integer sets are the same as those for sets, except that set is replaced by integer-set in their names.

Wherever a newly allocated integer set is returned, it has the same limit as the source sets. It is an error for a single procedure to be invoked on integer sets with different limits.

There are no integer-set-value or integer-set-intern! procedures.

Additional integer set procedures

(integer-set limit element ...)

Returns a newly allocated integer set. The possible elements of the set are the exact integers from 0 to limit - 1, where limit is an exact non-negative integer. The set contains the elements.

(universal-integer-set limit)

Returns a newly allocated integer set. The possible elements of the set are the exact integers from 0 to limit - 1, where limit is an exact non-negative integer. The set contains all possible elements.

(range->-integer-set limit low high)

(integer-set-adjoin-range integer-set low high)

Returns a newly allocated integer set with the same limit as integer-set and that contains all its values, and in addition each integer from low (inclusive) to high (exclusive) unless it is already equal to an existing member.

(integer-set-adjoin-range! integer-set low high)

A linear update procedure that returns an integer set with the same limit as integer-set and that contains all its values, and in addition each integer from low (inclusive) to high (exclusive) unless it is already equal to an existing member.

(integer-set-complement integer-set)

Returns a newly allocated integer set that is the complement of integer-set.

(integer-set-complement! integer-set)

A linear update procedure that returns a set that is the complement of integer-set.

(integer-set-min integer-set)

(integer-set-max integer-set)

Returns the smallest or largest integer in integer-set, or #f if there is none.

(integer-set-delete-min! integer-set)

(integer-set-delete-max! integer-set)

Linear update procedures that return two values: the smallest or largest integer in integer-set or #f if there is none, and a set which contains all the elements of set except the integer being returned.

Implementation

The implementation places the identifiers defined above into the sets library.

Sets and bags are implemented as a thin veneer over hashtables, and integer sets over bytevectors. For clarity, the integer set implementation stores just one value into each bytevector element rather than the eight that would be possible.

The sample implementation contains the following files:

sets-impl.scm – implementation of general sets
bags-impl.scm – implementation of bags
isets-impl.scm – implementation of integer sets
count.scm – provides macros for conveniently looping over bytevectors
srfi-4-shim.scm – minimal implementation of R7RS bytevectors on top of SRFI 4's u8vectors
sets.sld – an R7RS library named (sets)
sets.scm – a Chicken library
sets-test.scm – test suite

The test suite will work with the Chicken test egg, which is provided on Chibi as the (chibi test) library.

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Editor: Mike Sperber