Three-way action

This page is part of the web mail archives of SRFI 32 from before July 7th, 2015. The new archives for SRFI 32 contain all messages, not just those from before July 7th, 2015.

```    From: Sven Hartrumpf <Sven.Hartrumpf@xxxxxxxxxxxxxxxx>
Comparison predicates.
I saw the discussion about < and <=. But is not a three-valued comparison
function (symbols less, greater, equal) more efficient for some data types
if many duplicates are present? I often sort lists of some million strings
of average length 20 and many duplicates. I remember that some sort
algorithms will test (< a b) and (< b a) in certain constellations for one
setting of a and b. (Olin, for things like this a look at your reference
implementation would be helpful, although concentrating on the design was
a clever move for the discussion so far.) A three-valued comparison
function also speeds up vector-binary-search. A problem for three-valued
comparison functions is that their implementation can range from efficient
to inefficient:
- you can write an efficient string-compare for strings
- but, for numbers you would need support from the underlying Scheme system
beyond R5RS (?) although an implementation using > and < is not too slow.

OK, here's what I know about sorts that use three-way order predicates.

I only know of one sorting algorithm that can exploit this: it is a variant
of quicksort which I learned from Jon Bentley and was tagged to him and Doug
McIlroy. It is described in the comments at the end of the quicksort code
in my reference implementation; I append it for interested readers.

People with your needs deserve support from this SRFI. Now, we have two
basic approaches to putting a three-way-comparison sort into the SRFI:
- I could add it in the "general" sorting lib, perhaps
(vector-sort3! v compare [start end])
and so forth (possibly for stable, non-destructive, and list variants).

- I could just add it to the quicksort module
(quick-sort3! v compare [start end])

I do *not* think this function fits into the "general" category. For one,
I don't know of any *stable* variants, or list variants. I only know of
one, non-stable algorithm that works in-place on a vector. Period.

(I don't mean I don't know how to sort a list or stably sort a vector
using a three-way comparison function. After all, you could just use
your three-way comparison function in a "dumb" two-way < mode in any
of the standard algorithms. I mean I don't know of a way to do it that
*exploits* the extra discrimination provided by the three-way comparison.)

So unless there is a sorting honcho out there that can tell me three-way
comparison sorts come in a variety of functionalities (stable, in-place,
list, vector), it seems best to claim it's an *algorithm*, not a *general
operation* and file it that way: three-way in-place vector quick sort.
OK? If that's what you need, you pull it out of the quick-sort module.

Final remark: I think the comparison function f should return an integer:
(f x y) < 0		x < y
(f x y) = 0		x = y
(f x y) > 0		x > y
I prefer this to having it return a symbol, e.g. {'less, 'equal, 'greater}.
Why? First, the integer is frequently the natural result of the actual
comparison operation, e.g. consider the trivial comparison function -.
In fact, keeping in mind that - is the "model" for this comparison function
is a nice, simple, easy-to-remember way to decide "polarity," that is,
if a negative number means x < y or y < x.

Second, integers can be tested by the underlying hardware against zero
quickly. Using symbols is more expensive.

(quick-sort3! v compare [start end]) -> unspecified
to the vector quick-sort module. COMPARE returns an integer.
I'm not even going to add a non-in-place version.

The current quick-sort code from the ref implementation follows; see the
comments at the end. They will have to be turned into code.

That's my story on three-way comparison sorting. Comments?
-Olin

-------------------------------------------------------------------------------
;;; The SRFI-32 sort package -- quick sort			-*- Scheme -*-
;;; Copyright (c) 1998 by Olin Shivers.
;;; This code is open-source; see the end of the file for porting and
;;; Olin Shivers 10/98.

;;; Exports:
;;; (quick-sort  v < [start end]) -> vector
;;; (quick-sort! v < [start end]) -> unspecific

;;; This quicksort is at least somewhat non-naive -- it uses the median of
;;; three elements as the partition pivot, so pathological n^2 run time is
;;; much rarer (but not eliminated completely). If you really wanted to get
;;; fancy, you could use a random number generator to choose pivots. The key
;;; to this trick is that you only need to pick one random number for each
;;; *level* of recursion -- i.e. you only need (lg n) random numbers.  See the
;;; end of the file for a further trick, which I learned from Jon Bentley,
;;; for exploiting ordering procedures that discriminate 3 ways (<, =, >)
;;; to partition each subvector into 3 regions.

(define (quick-sort! v < . maybe-start+end)
(let-vector-start+end (start end) quick-sort! v maybe-start+end
(%quick-sort! v < start end)))

(define (quick-sort v < . maybe-start+end)
(let-vector-start+end (start end) quick-sort v maybe-start+end
(let ((ans (vector-copy v start end)))
(%quick-sort! ans < 0 (- end start))
ans)))

;;; %QUICK-SORT! is not exported.
;;; Preconditions:
;;;   V vector
;;;   START END fixnums
;;;   0 <= START, END <= (vector-length V)
;;; If these preconditions are ensured by the cover functions, you
;;; can safely change this code to use unsafe fixnum arithmetic and vector
;;; indexing ops, for *huge* speedup.
;;;
;;; We bail out to insertion sort for small ranges; feel free to tune the
;;; crossover -- it's just a random guess. If you don't have the insertion
;;; sort routine, just kill that branch of the IF and change the recursion
;;; test to (< 1 (- r l)) -- the code is set up to work that way.

(define (%quick-sort! v elt< start end)
(let recur ((l start) (r end))	; Sort the range [l,r).
(if (< 5 (- r l))

;; Choose the median of V[l], V[r], and V[middle] for the pivot.
(let* ((median (lambda (v1 v2 v3)
(if (elt< v1 v2) (values v1 v2) (values v2 v1))
(if (elt< big v3) big
(if (elt< little v3) v3 little)))))
(pivot (median (vector-ref v l)
(vector-ref v (quotient (+ l r) 2))
(vector-ref v (- r 1)))))

(let loop ((i l) (j (- r 1)))
(let ((i (let scan ((i i)) (if (elt< (vector-ref v i) pivot)
(scan (+ i 1))
i)))
(j (let scan ((j j)) (if (elt< pivot (vector-ref v j))
(scan (- j 1))
j))))
(if (< i j)
(let ((tmp (vector-ref v j)))
(vector-set! v j (vector-ref v i))	; Swap V[I]
(vector-set! v i tmp)		;  and V[J].
(loop (+ i 1) (- j 1)))

(begin (recur l i) (recur (+ j 1) r))))))

;; Small segment => punt to insert sort.
;; Use the dangerous subprimitive.
;; NOTE: It can happen that (< r l), which means an empty range.
;; If %INSERT-SORT! didn't tolerate such a degenerate range, we'd
;; have to check for this case.
(%insert-sort! v elt< l r)
)))

;;; Note: If you're ambitious, you might consider a variant of this quicksort
;;; routine. If you have a comparison routine that returns *three*
;;; indicators -- <, =, or > -- then the partition code can partition the
;;; vector into a left part that is <, a middle region that is =, and a right
;;; part that is > the pivot. Here's how it is done:
;;;   The partition loop divides the range being partitioned into five
;;;   subranges:
;;;       =======<<<<<<<<<?????????>>>>>>>=======
;;;   where = marks a value that is = to the pivot, < marks a value that is
;;;   less than the pivot, ? marks a value that hasn't been scanned, and
;;;   > marks a value that is greater than the pivot. Let's consider the
;;;   rightward scan. If it checks a ? value that is <, it keeps scanning.
;;;   If the ? value is >, we stop the scan -- we are ready to start the
;;;   leftward scan and then do a swap. But if the rightward scan checks a
;;;   ? value that is =, we swap it *down* to the end of the initial chunk
;;;   of ====='s -- we exchange it with the leftmost < value -- and then
;;;   continue our rightward scan. The leftwards scan works in a similar
;;;   fashion, scanning past > elements, stopping on a < element, and swapping
;;;   up = elements. When we are done, we have a picture like this
;;;       ========<<<<<<<<<<<<>>>>>>>>>>=========
;;;   Then swap the = elements up into the middle of the vector to get
;;;   this:
;;;       <<<<<<<<<<<<=================>>>>>>>>>>
;;;   Then recurse on the <'s and >'s. Working out all the tricky little
;;;   boundary cases I leave an exercise to the interested reader.
;;;     -Olin

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;; This code is
;;;     Copyright (c) 1998 by Olin Shivers.
;;; The terms are: You may do as you please with this code, as long as
;;; you do not delete this notice or hold me responsible for any outcome
;;; related to its use.
;;;
;;; Blah blah blah.

;;; Code tuning & porting
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;; This is very portable code. It's R4RS with the following exceptions:
;;; - VECTOR-COPY
;;; - The scsh LET-VECTOR-START+END macro for parsing and defaulting optional
;;;   START/END arguments.
;;; - The R5RS multiple-value VALUES procedure and the simple RECEIVE
;;;   multiple value-binding macro.
;;; - The quicksort recursion bottoms out in a call to an insertion sort
;;;   routine, %INSERT-SORT!. But you could even punt this and go with pure
;;;   recursion in a pinch.
;;;
;;; This code is *tightly* bummed as far as I can go in portable Scheme.
;;;
;;; The internal primitive %QUICK-SORT! that does the real work can be
;;; converted to use unsafe vector-indexing and fixnum-specific arithmetic ops
;;; *if* you alter the two small cover functions to enforce the invariants.
;;; This should provide *big* speedups. In fact, all the code bumming I've
;;; done pretty much disappears in the noise unless you have a good compiler
;;; and also can dump the vector-index checks and generic arithmetic -- so
;;; I've really just set things up for you to exploit.
;;;
;;; The optional-arg parsing, defaulting, and error checking is done with a
;;; portable R4RS macro. But if your Scheme has a faster mechanism (e.g.,
;;; Chez), you should definitely port over to it. Note that argument defaulting
;;; and error-checking are interleaved -- you don't have to error-check
;;; defaulted START/END args to see if they are fixnums that are legal vector
;;; indices for the corresponding vector, etc.

```