[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Interface view of dictionaries

This page is part of the web mail archives of SRFI 44 from before July 7th, 2015. The new archives for SRFI 44 contain all messages, not just those from before July 7th, 2015.

bear wrote:
>>> (*-get-n dict num)
>>> (*-get-n-deleting! dict num)

> scgmille@xxxxxxxxxxxxxxxxxx wrote:
>> Trivially defined with enumeration.

> And with the same problem.  On a lot of dictionaries, if I go in
> knowing I want the first N elements, I can get them at one swell foop
> and not waste time traversing the structure again and again for each
> one.

Enumeration isn't quite that bad. The smart way to do it is to traverse
the collection only once, with a filter function that halts the
enumeration as soon as it gets what it needs. Pseudocode:

    (cfold f (cons n '()) dict)
    where n is the number to fetch and f is

    (lambda (nr datum)
      (let [(n (car nr)) (result (cdr nr))]
        (cond [(= n 0)
               (<halt enumeration and return (reverse l)>)]
              [(<match filter>)
               (cons (1- n) (cons datum result))]
              [else nr])))

This will return the elements in O(N) time, which isn't horrible.
However, it's also far from the best possible way to do it. For a
typical tree-based dictionary, you can do it in O((lg N) + K) time:
Finding the first element takes O(lg N), and traversing the structure to
get all K elements takes O(K) time.

There's a big difference between those performance characteristics.
Unless K is very large (close to N), it's effectively O(lg N) for the
"native" approach vs O(N) for the filtering enumeration.

That's how it works for element fetching, anyway. For consecutive
deletions, it's a bit more complicated. Unless the enumeration is
stable, consecutive deletions are O(N-squared), because you need to
enumerate the whole thing once for each iteration. A sophisticated
native function may still be able to do it in O(lg N) time.

This is why I complained that the SRFI didn't take performance into
consideration. Sure, it's a bad idea to attempt construction-time
optimizations too early in the design process. However, these are not
construction-time optimization. They're significant algorithm choices,
and they make a huge difference. In the worst case, it degrades
performance from O(lg N) to O(N-squared).

That kind of major algorithmic degradation *is* an appropriate
consideration for a design doc.

Tangent: In the example above, I used the traditional argument
convention for higher-order functions: (proc f seed collections ...).
One of the thing that annoys me about SRFI-44 is that it abandons good
conventions like these. Collection-fold permits an arbitrary number of
seeds but only one collection. That gives you something useless (if you
want multiple seeds, use a list like I did) while taking away something
useful (folding multiple collections).

Tangent off the tangent: Why the heck did SRFI-1 define the folding
function to accept the seed argument *last*? There's only one seed but
many lists, which naturally suggests a (f seed . data) signature for
folding functions that can work on any number of lists. Instead, it
specifies (f datum0 datum1 ... seed). Why?!

> Failure to provide a name for this function practically demands an
> implementation that sucks, which will then be applied to all
> dictionaries, whether the implementation of this function for that
> dictionary had to suck or not.

OK, I guess it's your turn to be hostile!

> My API allows efficient implementations.  The one proposed in the SRFI
> does not.  Because of that, they are not even remotely similar.  Yes,
> you could use the SRFI API to provide whacking slow versions of most
> of the functionality.  But every coder who has to live with them is
> going to hate them, either because they will waste CPU time that does
> not have to be wasted, or because redefining R5RS functions won't play
> nice with the existing code and module system, or both.

I generally agree, but with one reservation: While there are good
reasons for the PLT module system to work that way, it's still more
cumbersome than it needs to be. I've been discussing this on the PLT
mailing list, and I hope to develop some methods to ease the pain a bit.
I still think gratuitous incompatibilities are a bad idea, but it
shouldn't be so hard to use well-designed extensions to existing
Bradd W. Szonye