[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: handling duplicate elements





This is very timely, because I am currently working on a new release of
the SRFI, this time with a completely new implementation for sets and
bags.  The integer-set implementation will follow in a later release.

Great! I've been rewriting my immutable data structures, but have been holding off on proposing a SRFI until this one is finalized. I'd like the interface to the immutable structures to be as close to identical to SRFI 113 as practicable.
 

Eqv? would be the relevant predicate, given that when you put something
into a standard Scheme data structure such as a pair or vector, what you
get out is guaranteed to be eqv? to what went in, per R5RS/R7RS 3.4.

Right.

I have to admit that I've always had a hard time grokking the intended distinction between eq? and eqv?. I understand that equal? is a deep O(n) comparison, eq? is a pointer comparison (plus identity of O(1)-sized types), and eqv? lies somewhere in between. At the risk of getting sidetracked, do you happen to have a reference to an explanation of what eqv? means? The R7RS document is very precise but IMO not very instructive on this point.
 
I considered this approach, but I decided that it was overkill.  Other
than set-adjoin(!), the only procedures that need such selection are the
set operations set-union(!) and set-intersection(!), and since they are
commutative, a simple rule of "first argument beats second argument"
will suffice.

The problem is that it isn't just adjoin, union, and intersection. Duplicates may arise in the constructor, e.g.

(set default-comparator x x x) .

While a programmer is unlikely to do something that's so obviously pointless, the same situation could easily arise with

(set default-comparator a b c)

where a, b, and c came from client code and happen to be =?.

The list passed to list->set and list->set! may have duplicates, so they are implicated in this too.

And, set-map may create duplicates if the mapping procedure is not injective. I think this is the subtlest of these cases and the most likely to catch programmers by surprise. For example, if you map string-upcase on a set of strings, you may wind up with fewer strings than you started with. So whatever we decide, I think the documentation for set-map should remind programmers about non-injective functions.

This is one of the reasons I had an imap/monotone procedure in my draft immutable library; monotone procedures are a common special case of injective function. In hindsight, though, I think it should be imap/nondecreasing instead of imap/monotone. When you map a nondecreasing function on the values of a binary search tree, you can reuse the shape of the tree, which makes imap/nondecreasing potentially much faster than a more general imap that needs to check for duplicates and build a new tree structure from scratch.
 

So I am suggesting that set-adjoin(!) never replaces existing elements,
and adding a new operation set-replace(!) with three arguments: the set,
the element, and a default.  If the set contains the element, it is
deleted and the element is adjoined (this is a no-op if they are eqv?);
if not, the default is inserted.

So this suggestion is that set-adjoin and set-replace use "second argument beats first" and all other procedures use "first argument beats second." This behavior would accommodate my intended use case, of implementing an association-map on top of a set. So I could live with it.

But I'm not 100% happy with it, because it conflates two orthogonal issues: the differences between set-union and set-adjoin are that, in set-adjoin the right argument(s) are elements instead of sets, _and_ that set-adjoin uses "second argument beats first" while set-union uses "first argument beats first." It would be cleaner to separate these orthogonal concerns. But unfortuantely so far the only way I've thought of doing that are optional arguments or first-class selectors, both of which are admittedly rather heavyweight.

Kevin Wortman