Title

Sets, bags, integer sets, enumeration sets

Author

John Cowan

This SRFI is currently in ``draft'' status. To see an explanation of each status that a SRFI can hold, see here. To provide input on this SRFI, please mail to <srfi minus 113 at srfi dot schemers dot org>. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.

Abstract

Sets and bags (also known as multisets) are unordered mutable collections that can contain any Scheme object. Sets enforce the constraint that no two elements can be the same in the sense of the set's associated equivalence procedure; bags do not. Integer sets are a variant of sets that can only contain non-negative exact integers that are less than a maximum value specified when the integer set is created, and that always treat numerical equality as their equivalence procedure. Enumeration sets are another variant that can only contain symbols chosen from a set of symbols represented by an enumeration type; the equivalence procedure for them is eq?.

Issues

  1. R6RS provides define-enumeration to help set up enum-types. Is this worth having? Possible syntax is:

    (define-enumeration <type-name> (<symbol> ...) <constructor>)

    This would bind <type-name> to the enum-type, and constructor to a curried version of make-enum-set that already knows what type to use.

  2. Should there be a mechanism to convert between integer sets and integers as bitvectors, as defined in SRFI 33, SRFI 60, and R6RS?

  3. (resolved)

  4. How about set-intern!, which is like set-value but adds the element to the set if it's not already there?

  5. Should there be a lexical syntax for sets? If so, how do we incorporate the equivalence procedure, which has no printed representation? Perhaps we only need to support the big five, or perhaps not even eq? as it is implementation-dependent.

  6. (resolved)

  7. (resolved)

  8. Should we switch to unique enum objects rather than symbols?

  9. (resolved)

  10. Should set operations include an operation that adds an element to a set, returning a newly allocated set, but which does not require wrapping the new element in a set that is going to be thrown away immediately? In other words, should there be a functional analog to set-add!?

  11. There are two constructors, make-set and set (and likewise for the other types). This follows the general pattern of (make-){list,string,vector} from R7RS-small. However, the make- constructors construct an object of specified size with either undefined contents or a repeated fill element; neither of these concepts makes any sense for sets, and only minimal sense for bags.

    So it would seem reasonable to only have set as the constructor. Nevertheless, Schemers expect to find a make- constructor, and so I have provided one that always constructs an empty set, though set can do that just as well. Should we keep the status quo, eliminate make-set, eliminate set, or cause it to be an alias for set?

Rationale

Sets are a standard part of the libraries of many high-level programming languages, including Smalltalk, Java, and C++. Racket provides general sets similar to those of this proposal, though with fewer procedures, and there is a Chicken egg called sets (unfortunately undocumented), which provides a minimal set of procedures.

Bags are useful for counting anything from a fixed set of possibilities, e.g. the number of each type of error in a log file or the number of uses of each word in a lexicon drawn from a body of documents. Although other data structures can serve the same purpose, using bags clearly expresses the programmer's intent and can be optimized.

Although sets subsume both integer sets and enumeration sets, providing them separately allows for increased implementation efficiency. There are two Chicken eggs that provide integer sets: iset, which is conceptually similar to the integer sets described here, but using bignums rather than bytevectors; and cis, which uses lists of integer intervals.

The design of enumeration sets is founded on R6RS enumerations, but with the addition of reified enumeration types along the lines suggested by R6RS Formal Comment #262. The prefix enum is used in all cases instead of using both enum and enumeration as R6RS does. Approximate mappings are shown below.

Insofar as possible, the names in this SRFI are harmonized with the names used for ordered collections (lists, strings, vectors, and bytevectors) in Scheme. However, size is used instead of length to express the number of elements in the collection, because length implies order.

Specification

The procedures in this SRFI are in the (scheme sets) and (srfi xxx) libraries (or (srfi :xxx) on R6RS). However, the sample implementation currently places them in the (sets) library instead.

Sets, bags, integer sets, enumeration sets, and enumeration types are mutually disjoint, and disjoint from other types of Scheme objects.

Set procedures

It is an error for a single procedure to be invoked on sets with different equivalence procedures.

Constructors

(make-set equivalence)

Returns a newly allocated empty set. equivalence is the equivalence procedure for the set. If it is coarser than equal (in the sense of R7RS-small 6.1), the implementation may signal an error; however, string-ci=? must be supported.

(set equivalence element ...)

Returns a newly allocated set with equivalence procedure equivalence and containing the elements.

(set-copy set)

Returns a newly allocated set containing the elements of set, with the same equivalence procedure.

(set-empty-copy set)

Returns a newly allocated set containing no elements with the same equivalence procedure as set.

Predicates

(set? obj)

Returns #t if obj is a set, and #f otherwise.

(set-member? set element)

Returns #t if element is a member of set and #f otherwise.

Mutators

(set-add! set element)

Adds element to set unless it is already a member. Returns an unspecified value.

(set-delete! set element)

Deletes element from set if it is a member. Returns #t if the element was a member, #f if not.

Mapping and folding

(set-map equivalence proc set)

Applies proc to each element of set in arbitrary order and returns a newly allocated set with the equivalence predicate equivalence which contains the results of the applications. For example:

(set-map string-ci=? symbol->string (set eq? 'foo 'bar 'baz)) => (set string-ci=? "foo" "bar" "baz")

(set-for-each proc set)

Applies proc to set in arbitrary order, discarding the returned values. Returns unspecified results.

(set-fold proc nil set)

Invokes proc on each member of set in arbitrary order, passing the result of the previous invocation as a second argument. For the first invocation, nil is used as the second argument. Returns the result of the last invocation.

(set-unfold equivalence continue? mapper successor seed)

Create a new set as if by make-set. If the result of applying the predicate continue? to seed is #f, return the set. Otherwise, apply the procedure mapper to seed. The value that mapper returns is added to the set. Then get a new seed by applying the procedure successor to seed, and repeat this algorithm.

Conversion

(set->list set)

Returns a newly allocated list containing the members of set in unspecified order. However, repeated calls to this procedure will return a list in the same order until the set is mutated.

(list->set equivalence list)

Returns a newly allocated set with equivalence predicate equivalence containing the elements of list.

Subsets

(set=? set ...)

Returns #t if each set contains the same elements.

(set<? set ...)

Returns #t if each set other than the last is a proper subset of the following set, and #f otherwise.

(set>? set ...)

Returns #t if each set other than the last is a proper superset of the following set, and #f otherwise.

(set<=? set ...)

Returns #t if each set other than the last is a subset of the following set, and #f otherwise.

(set>=? set ...)

Returns #t if each set other than the last is a superset of the following set, and #f otherwise.

The whole set

(set-size set)

Returns the number of elements in set.

(set-filter predicate set)

Returns a newly allocated set with the same equivalence predicate as set containing just the elements of set that satisfy predicate.

(set-remove predicate set)

Returns a newly allocated set with the same equivalence predicate as set containing just the elements of set that do not satisfy predicate.

(set-partition predicate set)

Returns two values, a newly allocated set with the same equivalence predicate as set containing just the elements of set that satisfy predicate, and another newly allocated set with the same equivalence predicate as set containing just the elements of set that do not satisfy predicate.

(set-count predicate set)

Returns the number of elements of set that satisfy predicate as an exact integer.

(set-find predicate set failure)

Returns an arbitrarily chosen element of set that satisfies predicate, or the result of invoking failure with no arguments if there is none.

(set-any? predicate set)

Returns #t if any element of set satisfies predicate, or #f otherwise. Note that this differs from the SRFI 1 analogue because it does not return a useful value.

(set-every? predicate set)

Returns #t if every element of set satisfies predicate, or #f otherwise. Note that this differs from the SRFI 1 analogue because it does not return a useful value.

Set operations

(set-union set1 set2 ...)

(set-intersection set1 set2 ...)

(set-difference set1 set2 ...)

(set-xor set1 set2)

Returns a newly allocated set that is the union, intersection, asymmetric difference, or symmetric difference of the sets. Asymmetric difference is extended to more than two sets by taking the difference between the first set and the union of the others. Symmetric difference is not extended beyond two sets. Elements in the result set are drawn from the first set in which they appear.

(set-union! set1 set2 ...)

(set-intersection! set1 set2 ...)

(set-difference! set1 set2 ...)

(set-xor! set1 set2)

The same as set-union, set-intersection, set-difference, and set-xor respectively, but may mutate the set1 argument.

Set value

(set-value set element)

Returns the element of set that is equal, in the sense of the equivalence predicate, to element. If element is not a member of the set, it is returned.

Bag procedures

Bags are like sets, but can contain the same object more than once. However, if two elements that are equal in the sense of the equivalence procedure, but not in the sense of eqv?, are both included, it is not guaranteed that they will remain distinct when retrieved from the bag. It is an error for a single procedure to be invoked on bags with different equivalence procedures.

The procedures for creating and manipulating bags are the same as those for sets, except that set is replaced by bag in their names, and that adding an element to a bag is effective even if the bag already contains the element. There are no procedures bag-xor, bag-xor!, or bag-value.

(bag-element-count bag element)

Returns an exact integer representing the number of times that element appears in bag.

(bag-for-each-unique proc bag)

Applies proc to each unique element of bag in arbitrary order, passing the element and the number of times it occurs in bag, and discarding the returned values. Returns unspecified results.

(bag-fold-unique proc nil bag)

Invokes proc on each unique element of bag in arbitrary order, passing the number of occurrences as a second argument and the result of the previous invocation as a third argument. For the first invocation, nil is used as the third argument. Returns the result of the last invocation.

(bag-increment! bag element count)

(bag-decrement! bag element count)

Increases or decreases the element count of element in bag by the exact integer count.

(bag->set bag)

(set->bag set)

The bag->set procedure returns a newly allocated set containing the unique elements of bag. The set->bag procedure returns a newly allocated bag containing the elements of set.

Integer set procedures

The elements of an integer set are non-negative exact integers less than the set's limit, which is specified when it is created. Except as noted below, the procedures for creating and manipulating integer sets are the same as those for sets, except that set is replaced by integer-set in their names, and references to equivalence predicates are replaced by limits, as the equivalence function is always =. Wherever a newly allocated integer set is returned, it has the same limit as the source sets. It is an error for a single procedure to be invoked on integer sets with different limits.

The procedure integer-set-value is just the identity function, so it is not provided.

(make-integer-set limit)

Returns a newly allocated integer set. The possible elements of the set are the exact integers from 0 to limit - 1, where limit is an exact non-negative integer. The set is empty.

(make-universal-integer-set limit)

Returns a newly allocated integer set. The possible elements of the set are the exact integers from 0 to limit - 1, where limit is an exact non-negative integer. The set contains all possible elements.

(integer-set-complement integer-set)

Returns a newly allocated integer set that is the complement of integer-set.

(integer-set-complement! integer-set)

Mutates integer-set to its complement.

(integer-set-min integer-set)

(integer-set-max integer-set)

Returns the smallest or largest integer in integer-set, or #f if there is none.

(integer-set-delete-min! integer-set)

(integer-set-delete-max! integer-set)

Returns and deletes the smallest or largest integer in integer-set, or #f if there is none.

Enumeration sets

Except as noted below, the procedures for creating and manipulating enumeration sets are the same as those for sets, except that set is replaced by enum-set in their names. Wherever a newly allocated enumeration set is returned, it has the same enumeration type as the source sets. It is an error for a single procedure to be invoked on enumeration sets with different enumeration types.

The procedure enum-set-value is just the identity function, so it is not provided.

Enum-type procedures

(make-enum-type enum-list)

Returns a newly allocated enumeration type suitable for constructing enumeration sets whose members are symbols. The elements of enum-list are either symbols or else pairs where the car is a symbol and the cdr is an exact integer called the value of the symbol. If the element is a symbol, the value is greater by 1 than the value of the previous symbol, or 0 if there was no previous symbol. Values are normally distinct, but make-enum-type intentionally does not enforce this, for compatibility with C/C++ external routines. The symbols are said to be in the enumeration type. In R6RS the function of this procedure is provided as part of make-enumeration, but the values (called indexes) are assigned from 0 on up.

(enum-type? obj)

Returns #t if obj is an enumeration type, and #f otherwise.

(enum-type->alist enum-type)

Return a newly allocated alist mapping the symbols in enum-type to their values.

(enum-type-symbol-value enum-type symbol)

Return the value of symbol in enum-type, or #f if it is not in enum-type. A loose R6RS analogue is the procedure returned by enum-type-indexer when applied to an enum-set belonging to enum-type.

(enum-type-symbol enum-type value)

Returns the first symbol whose value in enum-type is value, or #f if there is none.

(enum-value=? enum-type symbol ...)

(enum-value<? enum-type symbol ...)

(enum-value>? enum-type symbol ...)

(enum-value<=? enum-type symbol ...)

(enum-value>=? enum-type symbol ...)

Returns #t if the values of the symbols are equal (which implies that the symbols themselves are eq?), monotonically increasing, monotonically decreasing, monotonically nondecreasing, and monoonically nonincreasing respectively, and #f otherwise.

Enum-set procedures

(make-enum-set enum-type)

Returns a newly allocated enumeration set. The possible elements of the set are the symbols in enum-type. The set is empty. The approximate R6RS equivalents are enum-set-constructor and make-enumeration.

(make-universal-enum-set enum-type)

Returns a newly allocated enumeration set. The possible elements of the set are the symbols in enum-type. The set contains all possible elements. The approximate R6RS equivalent is enum-set-universe.

(enum-set enum-type element ...)

Returns a newly allocated enumeration set. The possible elements of the set are the symbols in enum-type. The set is initialized to contain the elements. There is no R6RS equivalent.

(list->enum-set enum-type list)

Returns a newly allocated enumeration set. The possible elements of the set are the symbols in enum-type. The set is initialized to contain the elements of list. There is no R6RS equivalent.

(enum-set-complement enum-set)

Returns a newly allocated enumeration set that is the complement of enum-set. This procedure is also in R6RS.

(enum-set-complement! enum-set)

Mutates enum-set to its complement. This procedure is also in R6RS.

(enum-set-min integer-set)

(enum-set-max integer-set)

Returns the smallest or largest symbol in symbol index order in enum-set, or #f if there is none.

(enum-set-delete-min! enum-set)

(enum-set-delete-max! enum-set)

Returns and deletes the smallest or largest symbol in symbol index order in enum-set, or #f if there is none.

(enum-set-projection enum-set enum-type)

Returns a newly allocated enumeration set of type enum-type. Its elements are the symbols belonging to enum-set, ignoring any symbols which are not in enum-type. This procedure is also in R6RS, but uses a second enum-set in place of enum-type.

Implementation

Sets and bags are implemented as a thin veneer over hashtables, and integer sets over bytevectors. For clarity, the integer set implementation stores just one value into each bytevector element rather than the eight that would be possible. In turn, enumeration types are implemented over a hashtable from symbols to indexes, and enumeration sets over integer sets.

A sample implementation is here. The files sets-impl.scm, bags-impl.scm, isets-impl.scm, and enums-impl.scm are the basic code libraries for sets, bags, integer sets, and enumeration sets respectively. Each is divided into a section that makes use of the underlying collection and a section that doesn't. The file count.scm provides macros for conveniently looping over the bytevectors that underlie integer sets and enumeration sets. The file srfi-4-shim.scm is a minimal implementation of R7RS bytevectors on top of SRFI 4's u8vectors, just sufficient for this implementation, and may be useful for R5RS systems that do not provide R7RS bytevectors.

Two library definition files are provided: sets.sld provides an R7RS library named (sets) that exports the public procedures of this SRFI (the extension .sld is used by Chibi for library definition files). sets.scm is equivalent, but uses Chicken module syntax.

The file sets-test.scm is a test suite for this library. It uses Chicken's test egg, which is also provided on Chibi as an R7RS library named (chibi test).

Copyright

Copyright (C) John Cowan 2013. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Editor: Mike Sperber