223: Generalized binary search procedures

by Daphne Preston-Kendal

Status

This SRFI is currently in final status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-223@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Abstract

Generalized procedures for binary search of vector-like data structures are provided which can be applied to any sequence type, including ones defined by the user, together with applications of these procedures for Scheme’s built-in vectors.

Rationale

While SRFIs 43 and 133 provide a vector-binary-search procedure, in neither SRFI is the option provided to control whether the index of the leftmost or rightmost matching element of the vector is returned. They also use a C-style cmp procedure which is compatible neither with the less-than comparison predicate used by SRFI 132 for sorting lists and vectors (such as might be used to make one ready for binary search), nor with SRFI 128 comparator ordering predicates. Finally, vector-binary-search, as the name implies, can only be used on Scheme vectors. Use on other sequence types, such as the homogeneous numeric vectors of SRFI 160, is not foreseen.

As binary search is notoriously tricky to implement correctly (especially in light of its apparent simplicity), a correct implementation which is generalizable to any sequence type (including one that a programmer might define themselves) is a useful building block for fast search applications.

Programmers should be aware that it rarely makes sense to use binary search on any sequence type which does not provide O(1) (or ‘effectively O(1)’) access to its members. Scheme lists are not suitable for binary searching, as list-ref is an O(n) operation.

Specification

In the following procedure specifications, the following variable names are used as procedure arguments having a particular meaning:

Generalized procedures

The following procedures must work when used, with suitable ref, lo, and hi arguments, on sequence types which define negative indexes. (This does not refer to the potential use of negative indexes to refer to items in a sequence counting from the last item rather than the first, but rather to sequences where negative indexes refer to unique positions in the sequence.)

(bisect-left a val ref less? lo hi)idx
Searches the sequence a for the value val, returning the smallest index idx into a such that all elements with indexes ≥ idx are all ≥ val. In other words, returns the leftmost index in a of val or anything which is less than it. Returns lo if val is less than everything in a.
(bisect-right a val ref less? lo hi)idx
Searches the sequence a for the value val, returning the largest index idx into a such that all elements with indexes < idx are all < val. In other words, returns the leftmost index in a which is to the right of (i.e. after) val or anything less than it. Returns hi if val is greater than everything in a.

Convenience procedure

Defining both *-bisect-left and *-bisect-right procedures for a given type, and handling the optionality of lo and hi correctly, can make code very repetitive. To ease this repetition, the following convenience procedure is provided.

(bisection ref) 1 (left-proc a val less? lo hi)
2 (right-proc a val less? lo hi)
(bisection ref lo-hi-proc) 1 (left-proc a val less?)
(left-proc a val less? lo hi)

2 (right-proc a val less?)
(right-proc a val less? lo hi)
Returns two values, procedures which call bisect-left and bisect-right respectively with the given ref procedure, comparable to vector-bisect-left and vector-bisect-right for vectors. If the lo-hi-proc argument is given, it should be a procedure which takes one argument, a sequence a, and returns two values, the default lo and hi values for that sequence (usually 0 and the length of the sequence). If no lo-hi-proc procedure is given, the lo and hi arguments to the returned procedures are mandatory.

Vector procedures

(vector-bisect-left a val less?)idx
(vector-bisect-left a val less? lo hi)idx
Searches the vector a using bisect-left.
(vector-bisect-right a val less?)idx
(vector-bisect-right a val less? lo hi)idx
Searches the vector a using bisect-right.

Library names

All procedures provided by this SRFI are in the library called (srfi 223), mutatis mutandis.

Implementations and future RnRS specifications should, if they adopt this SRFI, specify analogous *-bisect-left and *-bisect-right procedures for all sequence types they provide which offer O(1) or ‘effectively O(1)’ access time, and in which sorted data may reasonably be expected to appear, but they should do under another library name. The library (srfi 223) must contain only the procedures directly specified here. They may also choose to divide the core, generalized procedures of this library (bisect-left, bisect-right, and bisection) into one library (such as (xyz bisect)) and make each application of those procedures to specific sequence types their own sublibrary (such as (xyz bisect vector), (xyz bisect flexvector), etc.)

Example usage

Defining a new bisection

Bisect procedures which operate on Scheme bytevectors can be defined concisely with the following code.

(define-values (bytevector-bisect-left bytevector-bisect-right)
  (bisection bytevector-u8-ref
             (lambda (bv) (values 0 (bytevector-length bv)))))

Bisecting a vector

(vector-bisect-left #(1 2 2 3 5) 2 <) 1
(vector-bisect-right #(1 2 2 3 5) 2 <) 3
(vector-bisect-left #(1 2 2 3 5) 4 <) 4
(vector-bisect-left #(1 2 2 3 5) 4 < 0 1) 1

Or with the example bytevector bisection defined above, one could change the procedures here to their corresponding bytevector versions, and the vectors also to bytevectors, and get the same results.

Use with SRFI 128 comparators

The following code shows how to define a procedure vector-bisect-index which returns the smallest index of val in the vector a, or #f if val is not actually in a, according to the given SRFI 128 comparator cmpr.

(define (vector-bisect-index a val cmpr)
  (let ((idx (vector-bisect-left a val (comparator-ordering-predicate cmpr))))
    (and (< idx (vector-length a))
         (=? cmpr val (vector-ref a idx))
         idx)))

Implementation

The sample implementation in the Git repository for this SRFI should be correct and work on all compliant R7RS small implementations, although programmers working on Schemes with no automatic promotion of the results of + to bignums and very large sequences (typically more than a gibi-entry or so in size, depending on fixnum range) should be aware of the possibility of integer overflow. See Joshua Bloch, ‘Extra, Extra — Read All About It: Nearly All Binary Searches and Mergesorts are Broken’. Supporting Scheme systems without automatic bignum promotion is not a goal of the sample implementation.

Acknowledgements

Thanks to: John Cowan, for suggesting a more rigorous specification of the less? parameter, using language from SRFI 128, which in turn inspired an altogether clearer description of what the programmer’s responsibilities are in terms of providing correct values to the procedures in this library; Marc Nieper-Wißkirchen, for suggesting using SRFI 128 comparators instead of bare comparison predicates, and Daniel Itaborai, for requesting higher-level procedures, which suggestions were not adopted but together inspired the above example of implementing vector-bisect-index; Arthur Gleckler, for editing this SRFI and many others, and suggesting a cleaner implementation of vector-bisect-index; and especially to Sundarshan S. Chawathe, who by attempting to create an independent implementation of this library uncovered ambiguities and problems in the specifications of bisect-left and bisect-right in earlier drafts of this SRFI.

The sample implementation is essentially a translation of the Python standard library implementation of binary search in bisect.py into Scheme.

Any remaining errors in specification or sample implementation are the author’s and the author’s alone.

© 2021 Daphne Preston-Kendal (text and sample implementation in Scheme).
© 1992–2019 Python Software Foundation (guideline algorithm implementation in Python).

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Link to the applicable version of the licence for the original Python implementation, inasmuch as it is applicable.


Editor: Arthur A. Gleckler