Title

Sets, bags, integer sets, enumeration sets

Author

John Cowan

This SRFI is currently in ``draft'' status. To see an explanation of each status that a SRFI can hold, see here. To provide input on this SRFI, please mail to <srfi minus 113 at srfi dot schemers dot org>. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.

Abstract

Sets and bags (also known as multisets) are unordered mutable collections that can contain any Scheme object. Sets enforce the constraint that no two elements can be the same in the sense of the set's associated equality predicate; bags do not. Integer sets are a variant of sets that can only contain non-negative exact integers that are less than a maximum value specified when the integer set is created, and that always treat numerical equality as their equality predicate. Likewise, character sets are sets that can only contain characters; their equality predicate is char=?. Finally, enumeration sets are another variant that can only contain symbols chosen from a set of symbols represented by an enumeration type; the equality predicate for them is eq?.

Issues

  1. (resolved)

  2. (resolved)

  3. (resolved)

  4. (resolved)

  5. (resolved)

  6. (resolved)

  7. (resolved)

  8. Should we switch to unique enum objects rather than symbols?

  9. (resolved)

  10. (resolved)

  11. (resolved)

  12. Should we stick to just comparators, or also allow equality predicates?

  13. Should we switch to six libraries (sets, bags, integer sets, character sets, enum sets, and set-bag conversion), or stick with a single library? (This is not about dividing the SRFI itself.)

  14. Should we add a cursor API similar to SRFI 14's?

  15. Should we add comparators for the various types?

Rationale

Sets are a standard part of the libraries of many high-level programming languages, including Smalltalk, Java, and C++. Racket provides general sets similar to those of this proposal, though with fewer procedures, and there is a Chicken egg called sets (unfortunately undocumented), which provides a minimal set of procedures.

Bags are useful for counting anything from a fixed set of possibilities, e.g. the number of each type of error in a log file or the number of uses of each word in a lexicon drawn from a body of documents. Although other data structures can serve the same purpose, using bags clearly expresses the programmer's intent and can be optimized.

Although general sets subsume all other types of sets, providing a few types separately allows for increased implementation efficiency. In particular, sets whose elements are restricted to non-negative exact integers within a small range are provided. There are two Chicken eggs that provide integer sets: iset, which is conceptually similar to the integer sets described here, but using bignums rather than bytevectors; and cis, which uses lists of integer intervals.

Character sets are already specified by SRFI 14, The interface is not as complete as this SRFI, so an overlay of procedures is provided here.

The design of enumeration sets is founded on R6RS enumerations, but with the addition of reified enumeration types along the lines suggested by R6RS Formal Comment #262. The prefix enum is used in all cases instead of using both enum and enumeration as R6RS does. Approximate mappings are shown below.

Insofar as possible, the names in this SRFI are harmonized with the names used for ordered collections (lists, strings, vectors, and bytevectors) in Scheme. However, size is used instead of length to express the number of elements in the collection, because length implies order.

Sets do not have a lexical syntax representation. It's possible to use SRFI 108 quasi-literal constructors to create them in code, but this SRFI does not standardize how that is done.

Specification

The procedures in this SRFI are in the (srfi 113) library (or (srfi :113) on R6RS). However, the sample implementation currently places them in the (sets) library instead.

Sets, bags, integer sets, enumeration sets, and enumeration types are mutually disjoint, and disjoint from other types of Scheme objects. Character sets belong to the type defined by SRFI 14, but are disjoint from all other types.

It is an error for any procedure defined in this SRFI to be invoked on sets or bags with different comparators.

Procedures whose names end with ! are linear update procedures in the sense of SRFI 1. This means that an implementation may legally implement these procedures as pure, side-effect-free functions, or it may implement them using side effects, depending upon the details of what is the most efficient or simple to implement in terms of the underlying representation. Furthermore, if the implementations of all linear update procedures on a given type of set are pure, then the procedures which are said to return new sets are not required to actually allocate them: they may return already existing objects. See SRFI 1 for the rationale.

Set procedures

Constructors

(set comparator element ...)

Returns a new set. The comparator argument is either a SRFI 114 comparator or else a procedure, in which case it is the equality predicate for the set. Any comparator with an equality predicate and a corresponding hash function may be used. Implementations must accept the equality predicates eq?, eqv?, equal, string=?, and string-ci=?, and may accept others. The elements are used to initialize the set.

(set-add set element ...)

Returns a new set containing the elements of set, and in addition the specified elements. The new set uses the same comparator or equality predicate as set. It is an error to pass an element to this procedure that does not return #t when passed to the type test procedure of the comparator with which set was defined.

(set-delete set element ...)

Returns a new set containing the the elements of set with the possible exception of the elements, if they are present in set. The new set uses the same comparator or equality predicate.

(set-copy set)

Returns a new set containing the elements of set, and using the same comparator or equality predicate.

(set-empty-copy set)

Returns a new set containing no elements, and using the same comparator or equality predicate as set.

Predicates

(set? obj)

Returns #t if obj is a set, and #f otherwise.

(set-contains? set element)

Returns #t if element is a member of set and #f otherwise.

Accessor

(set-value set element)

Returns the element of set that is equal, in the sense of set's equality predicate, to element. If element is not a member of set, it is returned.

Mutators

(set-add! set element ...)

Returns a set containing the elements of set, and in addition the specified elements. The set uses the same comparator or equality predicate as set. It is an error to pass an element to this procedure that does not return #t when passed to the type test procedure of the comparator with which set was defined.

(set-delete! set element ...)

Returns a set containing the the elements of set with the possible exception of the elements, if they are present in set. The new set uses the same comparator or equality predicate..

(set-intern! set element)

Returns two values. If there is an element of set that is equal, in the sense of set's equality predicate, to element, then that element is returned as the first value and set is returned as the second value. If there is no such element, then the first value is element and the second value is the result of (set-add! set element).

Mapping and folding

(set-map comparator proc set)

Applies proc to each element of set in arbitrary order and returns a new set, created as if by set, which contains the results of the applications. For example:

(set-map string-ci=? symbol->string (set eq? 'foo 'bar 'baz))
    => (set string-ci=? "foo" "bar" "baz")

(set-for-each proc set)

Applies proc to set in arbitrary order, discarding the returned values. Returns an unspecified result.

(set-fold proc nil set)

Invokes proc on each member of set in arbitrary order, passing the result of the previous invocation as a second argument. For the first invocation, nil is used as the second argument. Returns the result of the last invocation.

(set-unfold comparator continue? mapper successor seed)

Create a new set as if by set. If the result of applying the predicate continue? to seed is #f, return the set. Otherwise, apply the procedure mapper to seed. The value that mapper returns is added to the set. Then get a new seed by applying the procedure successor to seed, and repeat this algorithm.

Conversion

(set->list set)

Returns a newly allocated list containing the members of set in unspecified order. However, repeated calls to this procedure will return a list in the same order until the set is mutated.

(list->set comparator list)

Returns a new set, created as if by set, that contains the elements of list.

(list->set! set list)

Returns set with the elements of list added to it.

Subsets

Note that these functions do not obey the trichotomy law, and thus do not specify a total order on sets.

(set=? set ...)

Returns #t if each set contains the same elements.

(set<? set ...)

Returns #t if each set other than the last is a proper subset of the following set, and #f otherwise.

(set>? set ...)

Returns #t if each set other than the last is a proper superset of the following set, and #f otherwise.

(set<=? set ...)

Returns #t if each set other than the last is a subset of the following set, and #f otherwise.

(set>=? set ...)

Returns #t if each set other than the last is a superset of the following set, and #f otherwise.

The whole set

(set-size set)

Returns the number of elements in set as an exact integer.

(set-filter predicate set)

Returns a new set, created as if by set-empty-copy, containing just the elements of set that satisfy predicate.

(set-filter! predicate set)

Returns a set containing just the elements of set that satisfy predicate.

(set-remove predicate set)

Returns a new set, created as if by set-empty-copy, containing just the elements of set that do not satisfy predicate.

(set-partition predicate set)

Returns two values: a new set, created as if by set-empty-copy, containing just the elements of set that satisfy predicate, and another new set, created as if by set-empty-copy, containing just the elements of set that do not satisfy predicate.

(set-count predicate set)

Returns the number of elements of set that satisfy predicate as an exact integer.

(set-find predicate set failure)

Returns an arbitrarily chosen element of set that satisfies predicate, or the result of invoking failure with no arguments if there is none.

(set-any? predicate set)

Returns #t if any element of set satisfies predicate, or #f otherwise. Note that this differs from the SRFI 1 analogue because it does not return a useful value.

(set-every? predicate set)

Returns #t if every element of set satisfies predicate, or #f otherwise. Note that this differs from the SRFI 1 analogue because it does not return a useful value.

Set operations

(set-union set1 set2 ...)

(set-intersection set1 set2 ...)

(set-difference set1 set2 ...)

(set-xor set1 set2)

Returns a new set that is the union, intersection, asymmetric difference, or symmetric difference of the sets. Asymmetric difference is extended to more than two sets by taking the difference between the first set and the union of the others. Symmetric difference is not extended beyond two sets. Elements in the result set are drawn from the first set in which they appear.

(set-union! set1 set2 ...)

(set-intersection! set1 set2 ...)

(set-difference! set1 set2 ...)

(set-xor! set1 set2)

The same as set-union, set-intersection, set-difference, and set-xor respectively, except that they may mutate the set1 argument.

Bag procedures

Bags are like sets, but can contain the same object more than once. The procedures for creating and manipulating bags are the same as those for sets, except that set is replaced by bag in their names. If two elements in a bag are equal in the sense of the equality predicate, the implementation may in fact store just one of them.

There are no procedures bag-xor or bag-xor!.

Additional procedures

(bag-element-count bag element)

Returns an exact integer representing the number of times that element appears in bag.

(bag-for-each-unique proc bag)

Applies proc to each unique element of bag in arbitrary order, passing the element and the number of times it occurs in bag, and discarding the returned values. Returns an unspecified result.

(bag-fold-unique proc nil bag)

Invokes proc on each unique element of bag in arbitrary order, passing the number of occurrences as a second argument and the result of the previous invocation as a third argument. For the first invocation, nil is used as the third argument. Returns the result of the last invocation.

(bag-increment! bag element n)

(bag-decrement! bag element n)

Returns a bag containing the elements of bag, but with n more or fewer copies of element.

(bag->set bag)

(set->bag set)

The bag->set procedure returns a new set containing the unique elements of bag. The set->bag procedure returns a new bag containing the elements of set. The comparator or equality predicate of the result is the same as the comparator or equality predicate of the argument.

Character set procedures

The elements of a character set are characters. Except as noted below, the procedures for creating and manipulating character sets are the same as those for sets, except that set is replaced by char-set in their names, and there are no comparator arguments, as the equality predicate is always char=?.

There is no char-set-intern! procedure.

The character sets of this SRFI belong to the same type as the character sets of SRFI 14, at least in implementations that support both SRFIs.

Additional procedures

(char-set element ...)

Returns a character set which contains the elements.

(make-universal-char-set)

Returns a character set containing all possible elements.

(char-set-complement char-set)

Returns a new character set that is the complement of char-set.

(char-set-complement! char-set)

Returns a character set that is the complement of char-set, possibly mutating char-set in the process.

(char-set-min char-set)

(char-set-max char-set)

Returns the smallest or largest character in char-set, or #f if there is none.

Integer set procedures

The elements of an integer set are non-negative exact integers less than the set's limit, which is specified when it is created. Except as noted below, the procedures for creating and manipulating integer sets are the same as those for sets, except that set is replaced by integer-set in their names, and references to comparators are replaced by limits, as the equality predicate is always =. Wherever an integer set is returned, it has the same limit as the source sets. It is an error for a single procedure to be invoked on integer sets with different limits.

There is no integer-set-intern! procedure.

Additional procedures

(integer-set limit element ...)

Returns a new integer set. The possible elements of the set are the exact integers from 0 to limit - 1, where limit is an exact non-negative integer. The set contains the elements.

(make-universal-integer-set limit)

Returns a new integer set. The possible elements of the set are the exact integers from 0 to limit - 1, where limit is an exact non-negative integer. The set contains all possible elements.

(integer-set-complement integer-set)

Returns a new integer set that is the complement of integer-set.

(integer-set-complement! integer-set)

Returns a character set that is the complement of char-set, possibly mutating char-set in the process.

(integer-set-min integer-set)

(integer-set-max integer-set)

Returns the smallest or largest integer in integer-set, or #f if there is none.

(integer-set->integer integer-set)

(integer->integer-set integer limit)

The integer-set->integer procedure returns a non-negative exact integer which encodes integer-set. When considered as a binary number, the ones bit (least significant bit) of the integer encodes the presence of 0 in the set, the twos bit of the integer encodes the presence of 1, and so on up to the most significant bit, which encodes the presence of limit - 1. The integer->integer-set returns an integer set whose limit is limit after performing the reverse transformation on integer (which must not be negative).

Enumeration sets

Except as noted below, the procedures for creating and manipulating enumeration sets are the same as those for sets, except that set is replaced by enum-set in their names. Wherever a new enumeration set is returned, it has the same enumeration type as the source sets. It is an error for a single procedure to be invoked on enumeration sets with different enumeration types.

Syntax

(define-enumeration-type <type-name> (<constructor> <symbol> ...) <predicate>)

Creates an enumeration type whose members are <symbol>s. The identifier <type-name> is bound to the enumeration type. The identifier <constructor> is bound to a procedure similar to enum-set, except that it does not accept an enum-type argument. Finally, the identifier <predicate> is bound to a predicate which is true if its argument is an enumeration set whose type is the declared type, and false otherwise. The syntax is reminiscent of define-record-type.

There is no enum-set-intern! procedure.

Enum-type procedures

(make-enum-type enum-list)

Returns an enumeration type suitable for constructing enumeration sets whose members are symbols. The elements of enum-list are either symbols or else pairs where the car is a symbol and the cdr is an exact integer called the value of the symbol. If the element is a symbol, the value is greater by 1 than the value of the previous element, or 0 if there is no previous element. Values are normally distinct, but make-enum-type intentionally does not enforce this, for compatibility with C/C++ external routines. The symbols are said to be in the enumeration type. In R6RS the function of this procedure is provided as part of make-enumeration, but the values (called indexes) are assigned from 0 on up.

(enum-type? obj)

Returns #t if obj is an enumeration type, and #f otherwise.

(enum-type->alist enum-type)

Return a newly allocated alist mapping the symbols in enum-type to their values.

(enum-type-symbol-value enum-type symbol)

Return the value of symbol in enum-type, or #f if it is not in enum-type. A loose R6RS analogue is the procedure returned by enum-type-indexer when applied to an enum-set belonging to enum-type.

(enum-type-symbol enum-type value)

Returns the first symbol whose value in enum-type is value, or #f if there is none.

(enum-value=? enum-type symbol ...)

(enum-value<? enum-type symbol ...)

(enum-value>? enum-type symbol ...)

(enum-value<=? enum-type symbol ...)

(enum-value>=? enum-type symbol ...)

Returns #t if the values of the symbols are equal (which implies that the symbols themselves are eq?), monotonically increasing, monotonically decreasing, monotonically nondecreasing, and monotonically nonincreasing respectively, and #f otherwise.

Enum-set procedures

(enum-set enum-type symbol ... )

Returns a new enumeration set. The possible elements of the set are the symbols in enum-type. The set is initialized to contain the symbols. The approximate R6RS equivalents are enum-set-constructor and make-enumeration.

(make-universal-enum-set enum-type)

Returns a new enumeration set. The possible elements of the set are the symbols in enum-type. The set contains all possible elements. The approximate R6RS equivalent is enum-set-universe.

(list->enum-set enum-type list)

Returns a new enumeration set. The possible elements of the set are the symbols in enum-type. The set is initialized to contain the elements of list. There is no R6RS equivalent.

(enum-set-complement enum-set)

Returns a new enumeration set that is the complement of enum-set. This procedure is also in R6RS.

(enum-set-complement! enum-set)

Returns an enumeration set that is the complement of enum-set, possibly mutating enum-set in the process. This procedure is also in R6RS, where it is required to mutate its argument.

(enum-set-min integer-set)

(enum-set-max integer-set)

Returns the smallest or largest symbol in symbol index order in enum-set, or #f if there is none.

(enum-set-projection enum-set enum-type)

Returns a new enumeration set of type enum-type. Its elements are the symbols belonging to enum-set, ignoring any symbols which are not in enum-type. This procedure is also in R6RS, but uses a second enum-set in place of enum-type.

Implementation

Sets and bags are implemented as a thin veneer over hashtables, and integer sets over bytevectors. Character sets just are SRFI 14 character sets. For clarity, the integer set implementation stores just one value into each bytevector element rather than the eight that would be possible. In turn, enumeration types are implemented over a hashtable from symbols to indexes, and enumeration sets over integer sets.

The sample implementation contains the following files:

Each *-impl file is divided into a section that makes use of the underlying collection and a section that doesn't.

The file sets-test.scm is a test suite for this library. It uses Chicken's test egg, which is also provided on Chibi as an R7RS library named (chibi test).

Copyright

Copyright (C) John Cowan 2013. All Rights Reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Editor: Mike Sperber