cell recycling; reference implementation; param ordering; fresh lists

I have altered the description of LIST-DELETE-NEIGHBOR-DUPS! and LIST-MERGE!
to include this text, which I think will clear up the cons-cell allocation
issue:

    On the other hand, the procedures
      LIST-DELETE-NEIGHBOR-DUPS!
      LIST-MERGE!
    make only a single, iterative, linear-time pass over their argument lists,
    using SET-CDR!s to rearrange the cells of the lists into the final result
    -- they work "in place." Hence, any cons cell appearing in the result must
    have originally appeared in an input. The intent of this
    iterative-algorithm commitment is to allow the programmer to be sure that
    if, for example, LIST-MERGE! is asked to merge two ten-million-element
    lists, the operation will complete without performing an extremely deep
    (possibly twenty-million-deep) recursion.
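
To make that concrete, here is a minimal sketch (mine, for illustration
only -- not the reference implementation) of the kind of iterative,
SET-CDR!-based merge the text describes: one linear pass, with every
cell of the result taken from an input list.

    (define (list-merge! elt< a b)
      (cond ((null? a) b)
            ((null? b) a)
            (else
             ;; Pick the first cell of the result; take from A on
             ;; ties, for stability.
             (let ((head (if (elt< (car b) (car a))
                             (let ((h b)) (set! b (cdr b)) h)
                             (let ((h a)) (set! a (cdr a)) h))))
               ;; PREV is the last cell spliced into the result so far.
               (let lp ((prev head))
                 (cond ((null? a) (set-cdr! prev b))
                       ((null? b) (set-cdr! prev a))
                       ((elt< (car b) (car a))
                        (set-cdr! prev b) (set! b (cdr b)) (lp (cdr prev)))
                       (else
                        (set-cdr! prev a) (set! a (cdr a)) (lp (cdr prev)))))
               head))))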

I have changed the SRFI accordingly. The new draft is at
    http://www.cc.gatech.edu/~shivers/srfi/srfi-32.txt
and will propagate out to the standard url
    http://srfi.schemers.org/srfi-32/srfi-32.html
in the hands of esteemed editor Solsona in the near future.

    From: David Feuer <dfeuer@xxxxxxxxxxxxx>
    I'm still wondering where the reference implementation is.

It is on my disk, not released. I need to clean up one of the routines before
I release it, and it's just not as important to me right now as the design
discussion, so that is what is getting my attention.

    [On the subject of parameter order between the < arg & the data arg:]
    Do you provide functions with the same _names_ as existing
    implementations, or just the same interfaces?

Hmm, there is a small amount of clash, but less than I would have supposed.
Check the other-implementations summary. But even if the *names* don't clash,
the *convention* is completely inverted from current practice, which is asking
for confusion. 
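
For concreteness, the clash is between call shapes like these (the
procedure names here are for illustration only):

    (list-sort < lis)       ; this draft's convention: ordering procedure first
    (sort lis <)            ; common existing practice: data first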

Let's hear from some other voices.

    Another little thought:  You could provide _both_, with the different
    versions marked by asterisks or something.  That could be a bit confusing
    though, and even lead to a bit of bloat.

Ech. No, let's *settle* on a *standard* *convention*, and let implementations
deal with backwards compatibility for existing code if we depart from their
specific API.

    What I was thinking was that in functional-style code, if you sort an
    array that was in fact already sorted, you get a little win by using
    the old array rather than the new one: the new one gets trashed almost
    immediately in the first generation, instead of two identical arrays
    sitting around wasting memory and missing the opportunity for a
    first-generation collection.  But I'm no expert at GC, so this may not
    actually be significant.  Of course you can always check first whether
    the array is sorted, but that would take more time.

We're trying too hard here, being too clever.

    I was thinking of this in the "side-effecting world".  I was guessing
    (without any serious thought on the topic) that it could be faster (at
    least on some implementations) to do a total-copy sort than a
    copy-then-sort.  Don't have a clue if this is really the case.

OK, here's the skinny. If you want to sort a list and produce a
guaranteed-fresh copy, here is how I would choose an underlying algorithm:
    - Plan A -- the best option, I think: Convert the list into a vector,
      sort that with an in-place vector-sorting algorithm, then convert it
      back into a fresh list. This wins for reasons of memory locality --
      chasing pointers through memory is slower than doing address
      arithmetic. The conversion costs you linear time, so it doesn't affect
      your O(n lg n) run time, and you need a linear amount of temp storage,
      but you're committed to allocating a linear amount of memory, anyway.
      Need to be stable? OK, allocate *two* vectors and use my opportunistic
      vector merge sort: linear-time best case, O(n lg n) worst case,
      stable, good memory locality, and you've only changed the constant
      factor on your temp memory requirements from a 1 to a 2.

    - Plan B -- don't want to use vectors / don't have vectors? You can
      roll a version of my opportunistic destructive list merge sort
      algorithm so that it copies the input list on the fly, as it goes,
      and then uses destructive merge ops on the intermediate sorted lists.
      This would give you linear-time best case, O(n lg n) worst case,
      stable, and (here's the part you want) every cons cell allocated
      appears in the result -- no "wasted" consing of temporary lists.
      (A two-pass approximation is sketched just after this list.)

    - Plan C -- really want to be functional, no side effects? OK, use my
      opportunistic list merge sort. As above, but no SET-CDR!s at all...
      hence you end up allocating some intermediate lists that get thrown
      away.
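
As an aside, plan B's allocation guarantee can also be had without
writing the fused copy-as-you-go pass. A two-pass sketch (mine, for
illustration -- not the algorithm described above), assuming a
destructive LIST-SORT! entry point with the ordering procedure first:

    ;; (APPEND LIS '()) copies the spine of LIS into fresh cells; the
    ;; destructive sort then rearranges exactly those cells, so every
    ;; cons cell allocated appears in the result.
    (define (sort-copy! elt< lis)
      (list-sort! elt< (append lis '())))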

Now, wanting to sort a list and produce a guaranteed-fresh copy is probably
not the common case. You can do a fine job of this in three lines of code
using the tools provided by this lib, if you choose plans A or C above. I
don't think it is really worth tying up "API space" supporting it any more
than that.
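
Here, for example, is plan A in roughly those three lines -- a sketch
assuming an in-place VECTOR-SORT! with the ordering procedure first
(the exact name could still change); LIST->VECTOR and VECTOR->LIST are
stock Scheme:

    ;; VECTOR->LIST allocates a fresh spine, so the result is
    ;; guaranteed fresh whether or not the input was already sorted.
    (define (sort->fresh-list elt< lis)
      (let ((v (list->vector lis)))
        (vector-sort! elt< v)           ; in-place, good memory locality
        (vector->list v)))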

OK, thank you for making me walk through that design analysis.

By the way, I keep referring to "my opportunistic merge-sort algorithm." Was
ist das? It's an algorithm I invented about three years ago (which is what
led to the production of this SRFI). It is a surprisingly simple algorithm
with the properties I described above. It's been circulated privately, and
I've submitted for publication a manuscript describing the algorithm, with
proofs establishing the run-time bounds, but it has not yet been published...
so that's why you have almost certainly not heard of it.

It is part of the reference implementation library, though.
    -Olin