241: Match — Simple Pattern-Matching Syntax to Express Catamorphisms on Scheme Data

by Marc Nieper-Wißkirchen

Status

This SRFI is currently in final status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-241@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Received: 2022-11-09
Draft #1 published: 2022-11-10
Draft #2 published: 2022-12-10
Finalized: 2023-03-23
Revised to fix errata:
- 2023-12-21 (Added missing vector pattern.)
- 2024-09-13 (Refined wording to clarify.)

Rationale
- Examples
- Relation to the (withdrawn) SRFI 204
Specification
Implementation
Acknowledgements

Abstract

This SRFI describes a simple pattern matcher based on one originally devised by Kent Dybvig, Dan Friedman, and Eric Hilsdale, which has a catamorphism feature to perform recursion automatically.

Rationale

Pattern matching to destructure expressions is a common task in data-oriented programming. Consequently, a number of pattern-matching libraries for Scheme have found more widespread use. Among them are the three mentioned in the (withdrawn) SRFI 200, namely Racket's pattern matcher, the pattern matcher of Andrew Wright and Robert Cartwright with extensions by Alex Shinn, and a pattern matcher that has been described by R. Kent Dybvig for the use with his Chez Scheme implementation.

The last-mentioned matcher, which is appealing also from a theoretical point of view, is finally the basis of the matcher described in this SRFI. Contrary to the other two matchers, it has a simple pattern language. Its strength lies in its catamorphism feature to perform recursion automatically. It is called match.

While one use of pattern matching is to bind variables, it can also be used to recursively destructure a value in the sense of R⁶RS's or SRFI 1's fold-right. In high-brow category language, such procedures destructuring inductive values are called catamorphisms.

In some sense, match syntactically expresses the general catamorphism on Scheme datums. For example, the aforementioned fold-right procedure (restricted to a single list argument) can be implemented as follows:

    (define (fold-right kons knil lis)
      (match lis
        [(,x . ,[x*]) (kons x x*)]
        [() knil]))

The syntax employed by the pattern matcher by Dybvig, Friedman, and Hilsdale, on which the one described here is based, is also the basis of the syntax of The Nanopass Framework.

Examples

A match expression evaluates an input expression producing an input value. This value is then matched against the patterns of a match expression much as the input of a case expression is matched against datums.

The general form of a match expression is

(match ⟨expr⟩ ⟨clause⟩ …)

where ⟨expr⟩ is the input expression to match and each ⟨clause⟩ has one of the following two forms:

[⟨pattern⟩ (guard ⟨guard expression⟩ …) ⟨body⟩]

[⟨pattern⟩ ⟨body⟩]

As with case, the input expression is evaluated to produce the input value and the first clause that the input value matches, if any, is selected. The body of the selected clause is evaluated, and the values of the last expression in the body are returned. An input value matches a clause if it fits the clause's pattern and the ⟨guard expressions⟩, if any, evaluate to a true value. Patterns may contain symbolic constants, which must match exactly, and pattern variables, which match any input. Pattern variables are prefixed by commas; symbolic constants are not.

Note that in R⁶RS matching brackets may be used in place of matching parentheses and vice versa, mostly for readability. We use brackets in this specification to document a recommended style. Users of Scheme systems not supporting brackets (and users who don't like brackets) can replace them with parentheses throughout.

    (match '(a 17 37)
      [(a ,x) 1]
      [(b ,x ,y) 2]
      [(a ,x ,y) 3])    ⟹ 3

The first clause fails to match because there are three items in the input list, and the pattern has only two. The second clause fails because b does not match a.

In the output expressions, the values of the pattern variables are bound to the corresponding pieces of input.

    (match '(a 17 37)
      [(a ,x) (- x)]
      [(b ,x ,y) (+ x y)]
      [(a ,x ,y) (* x y)])   ⟹ 629

When followed by an ellipsis (...), a pattern variable represents a sequence of input values.

    (match '(a 17 37) [(a ,x* ...) x*])   ⟹ (17 37)

Ellipses can follow a structured pattern containing one or more pattern variables.

    (match '(begin (1 5) (2 6) (3 7) (4 8))
      [(begin (,x* ,y*) ...) (append x* y*)])   ⟹ (1 2 3 4 5 6 7 8)

Ellipses can be nested, producing sequences of sequences of values.

    (match '((a b c d) (e f g) (h i) (j))
      [((,x* ,y** ...) ...) (list x* y**)])   ⟹ ((a e h j) ((b c d) (f g) (i) ()))

Recursion is frequently required while processing an input value with match. Here, a procedure returning the length of a list is defined.

    (letrec
        ([len
          (lambda (lst)
            (match lst
              [() 0]
              [(,x ,x* ...) (+ 1 (len x*))]))])
      (len '(a b c d)))                           ⟹ 4

A simpler version of the above uses the catamorphism feature of match. If a pattern variable is written as ,⁠[var], match recurs on the matching subpart of the input before evaluating the output expressions of the clause.

    (let ([len
           (lambda (lst)
             (match lst
               [() 0]
               [(,x . ,[y]) (+ 1 y)]))])
      (len '(a b c d)))                      ⟹ 4

In some cases, match will need to return multiple values. The catamorphism syntax can be used to receive multiple values. When making implicit recursive calls using the catamorphism syntax, zero or more variables between the parentheses can be included, each representing one of the expected return values.

    (let ([split
           (lambda (lis)
             (match lis
               [() (values '() '())]
               [(,x) (values `(,x) '())]
               [(,x ,y . ,[odds evens])
                (values `(,x . ,odds)
                        `(,y . ,evens))]))])
      (split '(a b c d e f)))                  ⟹ (a c e) (b d f)

This example is equivalent to the following one:

    (let ([split
           (lambda (lis)
             (match lis
               [() (values '() '())]
               [(,x) (values `(,x) '())]
               [(,x ,y . ,[split -> odds evens])
                (values `(,x . ,odds)
                        `(,y . ,evens))]))])
      (split '(a b c d e f)))                  ⟹ (a c e) (b d f)

In this case, the operator to which recursive calls are being made is explicitly named in the catamorphism subpattern, which is ,[split -> odds even] in this case.

In general, ,[⟨variable⟩ …] recurs to the top of the current match while ,[⟨cata operator⟩ -> ⟨variable⟩ …] recurs to (the result of evaluating) ⟨cata operator⟩. ⟨Cata operator⟩ must evaluate to a procedure that accepts one argument, the matched value, and returns as many values as there are identifiers following the ->.

Explicitly naming a catamorphism operator is useful when a parser has to defer parsing of a sub-input to a different parser or when special parsing parameters have to be active when parsing a sub-input.

Here is an example that illustrates the use of guards and how to achieve the effect of a catch-all clause.

    (let ([simple-eval
           (lambda (x)
             (match x
               [,i (guard (integer? i)) i]
               [(+ ,[x*] ...) (apply + x*)]
               [(* ,[x*] ...) (apply * x*)]
               [(- ,[x] ,[y]) (- x y)]
               [(/ ,[x] ,[y]) (/ x y)]
               [,x (assertion-violation 'simple-eval "invalid expression" x))))])
      (simple-eval '(+ (- 0 1) (+ 2 3))))                                           ⟹ 4

The match form extends quasiquote in the ⟨bodies⟩ to an ellipsis-aware version that allows ellipses to be used in place of unquote-splicing, which often leads to more readable code. Consider the following transformer of let forms in a Scheme-like language:

    (define translate
      (lambda (x)
        (match x
          [(let ([,var* ,expr*] ...) ,body ,body* ...)
           `((lambda ,var* ,body ,@body*) ,@expr*)]
          [,x (assertion-violation 'translate "invalid expression" x)])))

This can be rewritten as follows:

    (define translate
      (lambda (x)
      (match x
        [(let ((,var* ,expr*) ...) ,body ,body* ...)
         `((lambda ,var* ,body ,body* ...) ,expr* ...)]
        [,x (assertion-violation 'translate "invalid expression" x)])))

The better readability is, in particular, illustrated by the following procedure with nested ellipses. It converts unnamed let expressions into direct lambda applications, where the let has been generalized to allow an implicit begin in each right-hand-side expression.

    (define (f x)
      (match x
        [(let ([,x ,e1 ...] ...) ,b1 ,b2 ...)
         `((lambda (,x ...) ,b1 ,b2 ...)
           (begin ,e1 ...) ...)]))

The basic usage of (the ellipsis-aware) quasiquote is the same as in R⁶RS:

    `(list ,(+ 1 2) 4)   ⟹ (list 3 4)

Lists can still be spliced into sequences.

    `(a ,(+ 1 2) ,@(map abs '(4 -5 6)) b)   ⟹ (a 3 4 5 6 b)

The extension to quasiquote allows ellipses to be used in place of unquote-splicing (,@) to piece together the output form.

    `(a ,(+ 1 2) ,(map abs '(4 -5 6)) ... b)   ⟹ (a 3 4 5 6 b)

Within each subform followed by an ellipsis, each comma-prefixed item must be a list and all such items within the same subform must have the same length.

    `((,'(1 2 3) . ,'(a b c)) ...)   ⟹ ((1 . a) (2 . b) (3 . c))

A subform followed by an ellipsis may be contained within a larger subform that is also followed by an ellipsis. In this case, each comma-prefixed item must be a list of lists, each such item must have the same length, and the corresponding sublists of each such item must have the same lengths. This requirement generalizes to comma-prefixed items nested within more than two levels of ellipsis-followed subforms.

    `(((a ,'((x 1) (x 2) (x 3))) ...) ...)   ⟹ (((a x) (a 1)) ((a x) (a 2)) ((a x) (a 3)))

In the output, a subform may be followed directly by two or more ellipses; the requirements are the same as for nested ellipses, but the result is flattened rather than nested.

    `((a ,'((x 1) (x 2) (x 3))) ... ...)   ⟹ ((a x) (a 1) (a x) (a 2) (a x) (a 3))

Ellipses can also follow subforms containing items prefixed by unquote-splicing (,@).

    `((a ,@'((x 1) (x 2) (x 3))) ...)   ⟹ ((a x 1) (a x 2) (a x 3))

A subform of the form (... ⟨form⟩) is identical to ⟨form⟩, except that ellipses in the subform have no special meaning.

    `(... (,'(1 2 3) ...))   ⟹ ((1 2 3) ...)

Substitutions are made only for unquoted components appearing at the same nesting level as the outermost quasiquote.

    `(a `(b ,(list 1 2) ... ,(foo ,(list 1 3) ... d) e) f)
       ⟹ (a `(b ,(list 1 2) ... ,(foo 1 3 d) e) f)

Relation to the (withdrawn) SRFI 204

Both the (withdrawn) SRFI 204 and this SRFI describe a pattern matcher and suggest that the respective one becomes included in Scheme implementations and a future Scheme standard. Yet they are not in competition with each other. Quite the contrary, it makes a lot of sense that a Scheme system supports both matchers as they apply to different areas.

While a number of basic tasks can be solved equally well with the matcher described in SRFI 204 and the matcher described here, some problems are better solved with one or the other. The matcher of SRFI 204 mirrors most of Scheme's binding constructs. For these binding constructs, the catamorphism feature in this SRFI does not make sense, so this SRFI's matcher lacks it. On the other hand, the matcher described here distinguishes itself due to its catamorphism feature but, therefore, cannot support binding constructs.

SRFI 204 has a complex pattern language and allows one to write complex matchers with just a single pattern. If this is not needed, the pattern language of this SRFI excels due to its simplicity and exceptionally clear semantics.

For historical reasons, both the matcher described here and in SRFI 204 are called match. This specification does not change the established name. Fortunately, when reading code, there is virtually no danger of confusion. Pattern variables in the matcher described here are all unquoted and there is no quasiquote, while in patterns of SRFI 204's match, unquoted datums usually only appear in quasiquotations.

If more than one matcher is needed in a program, one can easily use the rename facility of the Scheme implementation's module system.

Remark: The author of this SRFI suggests that any reviving attempt of SRFI 204 should make the resulting pattern matcher extensible so that there are only few primitives with clear semantics allowing the rest of the specification to be defined in terms of them.

Even better than reviving SRFI 204 would be specifying a facility to define type-safe matchers as in the Nanopass framework.

Specification

Pattern language

The match form uses a pattern language akin to syntax-rules and syntax-case, which is described here.

A ⟨pattern⟩ has one of the following forms:

(⟨pattern⟩ ⟨ellipsis⟩ . ⟨pattern⟩)

(⟨pattern⟩ . ⟨pattern⟩)

()

#(⟨pattern⟩ …)

#(⟨pattern⟩ … ⟨pattern⟩ ⟨ellipsis⟩ ⟨pattern⟩ …)

,_

,⟨pattern variable⟩

⟨cata pattern⟩

⟨symbol⟩

⟨constant⟩

where a ⟨cata pattern⟩ has one of the following forms:

,[⟨cata operator⟩ -> ⟨cata variable⟩ …]

,[⟨cata variable⟩ …]

and where the ⟨ellipsis⟩ is the ellipsis ... (in the sense of free-identifier=?), ⟨cata operator⟩ is an expression, and ⟨pattern variable⟩ and ⟨cata variable⟩ are identifiers.

In each ⟨pattern⟩, the ⟨pattern variables⟩ and ⟨cata variables⟩ must be pairwise disjoint (in the sense of bound-identifier=?). ⟨Pattern variables⟩ must not be ..., _, or unquote (in the sense of free-identifier=?).

Patterns match Scheme values by the following rules:

A pattern of the form (⟨pattern₁⟩ ⟨ellipsis⟩ . ⟨pattern₂⟩) matches a dotted list if ⟨pattern₁⟩ matches each proper list element and ⟨pattern₂⟩ matches the tail of the dotted list.

A pattern of the form (⟨pattern₁⟩ . ⟨pattern₂⟩) matches a pair if ⟨pattern₁⟩ matches the car and ⟨pattern₂⟩ matches the cdr of the pair.

A pattern of the form () matches the empty list.

A pattern of the form #(⟨pattern⟩ …) matches a vector of as many elements as there are ⟨patterns⟩ if each element matches the corresponding ⟨pattern⟩.

A pattern of the form #(⟨pattern₁⟩ … ⟨pattern_k⟩ ⟨pattern_e⟩ ⟨ellipsis⟩ ⟨pattern_m+1⟩ … ⟨pattern_n⟩) matches a vector of n or more elements whose first k elements match ⟨pattern₁⟩ through ⟨pattern_k⟩, whose next m−k elements each match ⟨pattern_e⟩, and whose remaining n−m elements match ⟨pattern_m+1⟩ through ⟨pattern_n⟩.

A pattern of the form ,_ matches an arbitrary Scheme value.

A pattern of the form ,⟨pattern variable⟩ matches an arbitrary Scheme value.

A ⟨cata pattern⟩ matches an arbitrary Scheme value.

A pattern of the form ⟨symbol⟩ matches a symbol with the same name as ⟨symbol⟩.

A pattern of the form ⟨constant⟩ match a value that is equal (in the sense of equal?) to the literal datum ⟨constant⟩.

Match

The following syntax, together with the auxiliary syntax unquote, unquote-splicing, ..., _, guard, and ->, is exported by the libraries (srfi :241 match) and (srfi :241). The auxiliary syntax unquote, unquote-splicing, ..., and _ is identical to the auxiliary syntax exported by (rnrs base), and guard is identical to the syntax exported by (rnrs exceptions).

(match ⟨input expression⟩ ⟨clause₁⟩ …)

Syntax: ⟨input expression⟩ can be any expression. The ⟨clauses⟩ take one of two forms, either

(⟨pattern⟩ ⟨body⟩)

(⟨pattern⟩ (guard ⟨guard expression⟩ …) ⟨body⟩)

where the ⟨guard expressions⟩ can be any expressions.

Semantics: A match expression is evaluated as follows:

First, the ⟨input expression⟩ is evaluated to yield an input value.
Then, the first ⟨clause⟩ for which the input value matches its ⟨pattern⟩ is selected.
The environment in which the match expression is evaluated is extended by binding the pattern variables of the ⟨pattern⟩ to fresh locations, and the values that were matched against these pattern variables are stored in those locations.
If ⟨guard expressions⟩ are present in the selected clause, they are evaluated in left-to-right order in the extended environment until one evaluates to #f. In this case, the rest of the ⟨clause⟩ is skipped and pattern matching proceeds with the next ⟨clause⟩.
If no ⟨guard expression⟩ evaluates to #f, or if there is no ⟨guard expresssion⟩ present, in unspecified order, the ⟨cata operators⟩ of the ⟨cata clauses⟩ in the ⟨pattern⟩ of the selected ⟨clause⟩ are evaluated in no specific order in the extended environment to yield cata procedures, which are then invoked on the value that was matched against the ⟨cata pattern⟩. The environment is extended a second time by binding the ⟨cata variables⟩ of these ⟨cata patterns⟩ to fresh locations, and the values resulting from invoking the cata procedures are stored in left-to-right order in the corresponding locations.
Finally, the twice-extended environment is extended a third time by binding quasiquote to ellipsis-aware quasiquotation syntax (see below), the ⟨body⟩ of the selected ⟨clause⟩ is evaluated in this thrice-extended environment, and the resulting values from its last expression are returned.

If a ⟨cata pattern⟩ clause is of the second form, the missing ⟨cata operator⟩ defaults to an expression that evaluates to a procedure that, when called with one argument, evaluates the match expression with the input value replaced by the argument value in tail position.

If no pattern of any clause is matched, an exception of type &assertion-violation is raised.

If the match expression is in tail context, the ⟨bodies⟩ are ⟨tail bodies⟩.

Note: A pattern variable in the sense of this SRFI is an ordinary variable, not a pattern variable in the sense of the syntax-case system of R⁶RS.

Remark: The pattern language makes use of unquote without a corresponding quasiquote. It has been argued that this leads to an “unbalanced” quasiquotation syntax (there is an unquote without a surrounding quasiquote) and it has been suggested to add the “implicit” quasiquote explicitly so that the first example

    (define (fold-right kons knil lis)
      (match lis
        [(,x . ,[x*]) (kons x x*)]
        [() knil]))

would become:

    (define (fold-right kons knil lis)
      (match lis
        [`(,x . ,[x*]) (kons x x*)]
        [`() knil]))

We don't subscribe to the view that there is an “implicit” quasiquote in the match syntax described in this SRFI. In fact, quasiquote is not auxiliary syntax for match, and if there were an implicit quasiquote, a nested one should cancel an unquote, which it doesn't. The same fallacious argument was made in SRFI 200. Moreover, following the unquote is not an expression, but either a ⟨pattern variable⟩ or a ⟨cata pattern⟩.

Ellipsis-aware quasiquotation

The following syntax together with the auxiliary syntax unquote, unquote-splicing, and ... is exported by the library (srfi :241 match quasiquote). The auxiliary syntax is identical to the auxiliary syntax exported by (rnrs base).

(quasiquote ⟨ellipsis-aware qq template⟩)

If no instances of the ellipsis ... appear within the ⟨ellipsis-aware qq template⟩ at the same nesting level as the outermost quasiquote, the result of evaluating the ellipsis-aware quasiquote is the same as if the quasiquote syntax of R⁶RS were used. A subtemplate of the form (⟨ellipsis⟩ ⟨qq template⟩) is identical to ⟨qq template⟩, except that ellipses within ⟨qq template⟩ have no special meaning. If a subtemplate is followed by one or more instances of the ⟨ellipsis⟩, the expressions following a comma have to evaluate into nested lists of (at least) the same nesting depths. They are replaced in the output by all of the elements they match in the input, distributed as indicated. It is an error if the output cannot be built up as specified. An (unquote-splicing ⟨expression⟩ …) form is equivalent to (unquote ⟨expression⟩ …) ⟨ellipsis⟩, where this ⟨ellipsis⟩ retains its special meaning.

The auxiliary syntax quasiquote within an ⟨ellipsis-aware qq template⟩ is the ellipsis-aware quasiquote.

Implementation

A portable implementation for R⁶RS systems is in this SRFI's repository.

A portable implementation for R⁷RS systems that support syntax-case is possible as well. For an R⁷RS implementation, the library names should be derived from the given (SRFI 97-conformant) R⁶RS library names as usual.

Acknowledgements

The pattern matcher defined here has originally been described by R. Kent Dybvig. The original program was originally designed and implemented by Dan Friedman. It was later redesigned and reimplemented by Erik Hilsdale. The sample implementation of this SRFI shares no code with Dybvig's, Friedman's and Hilsdale's version.

The example section shamelessly reuses text from R. Kent Dybvig's original description.

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Editor: Arthur A. Gleckler