257: Simple extendable pattern matcher with backtracking

by Sergei Egorov

Status

This SRFI is currently in draft status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-257@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Table of contents

Abstract

Pattern matching extends Scheme's repertoire of conditional constructs, allowing decomposition of compound data structures and binding their parts to variables. This SRFI proposes one such construct, match, which provides all common-denominator functionality described in SRFI-200, adding on top of it support for non-linear patterns and backtracking. Also, the proposed construct is modular, and allows for convenient extension via the define-match-pattern mechanism.

Issues

Currently, none.

Rationale

This SRFI proposes a construct with functionality similar to the one of “Wright-style” pattern matcher (see SRFI-204), with the following differences: it supports backtracking, and it is designed in a modular way, allowing user extensions not only in the form of new types of patterns, but also new pattern matching languages, such as the Dybvig-Friedman-Hilsdale matcher (SRFI-241).

This proposal is influenced by the design of the Gerbil matcher (see here). It features patterns resembling corresponding constructors, e.g. (vector (cons x y)), unlike most other designs, where patterns resemble printed representation of data, e.g. #((x . y)), or mix both forms to support pattern combinators such as and. However, while in most other designs the set of constructor-like pattern forms is fixed, and their names are hardwired in the design, the proposed construct makes constructor-like pattern forms “first-class” components that can be created/imported/exported independently; they are just macros, implementing an internal protocol.

The downside of this approach is that one can no longer name pattern matchers after regular constructors as they cannot share the same namespace. To stay within the bounds of what can be implemented via regular syntax-rules, the choice was made to use tilde-prefixed names, so the above pattern becomes (~vector (~cons x y)). This convention is used for all forms of patterns except for quote and quasiquote, which have convenient reader abbreviations and thus are built into the matcher.

Traditional patterns, resembling printed representations, are supported “under quasiquote”, as proposed in (withdrawn) SRFI-200. New types of patterns as well as new forms of pattern languages can be built via define-match-pattern form, which provides a syntax-rules -like mechanism for pattern rewriting.

Examples

SRFI-204 Examples

The examples below are taken from SRFI-204 and adapted to demonstrate the same functionality using the proposed pattern syntax in its constructor-like and quasiquoted forms:

Literal patterns:

(let ([ls (list 'a "b" #f 2 '() #\c '#(1))])
  (list (match ls [(~list 'a "b" #f 2 '() #\c #(1)) 'ok])
        (match ls [`(a "b" #f 2 () #\c #(1)) 'ok])))
⟹ (ok ok)

Pattern variables:

(match (list 1 2 3) [(~list a b c) b]) ⟹ 2
(match (list 1 2 3) [(~list _ b _) b]) ⟹ 2
(match (list 1 2 3) [`(a ,b c) b] [_ 'fail]) ⟹ fail
(match (list 1 2 3) [`(1 ,b ,_) b] [_ 'fail]) ⟹ 2

Nonlinear patterns:

(match (list 'A 'B 'A) [(~list a b a) a] [_ 'fail]) ⟹ A
(match (list 'A 'B 'A) [`(,a b ,a) a] [_ 'fail]) ⟹ fail
(match (list 'A 'B 'A) [`(,a B ,a) a] [_ 'fail]) ⟹ A
(match (list 'A 'B 'A) [`(,a ,b ,a) a] [_ 'fail]) ⟹ A

In this proposal, quasiquoted and regular patterns can be arbitrarily combined:

(let ([x '(1 2 3 4)])
  (list (match x [(~cons a (~append b (~list c))) (list a b c)])
        (match x [(~cons a `(,@b ,@(~list c))) (list a b c)])
        (match x [(~cons a `(,@b ,c)) (list a b c)])
        (match x [`(,a ,@b ,c) (list a b c)])))
⟹ ((1 (2 3) 4) (1 (2 3) 4) (1 (2 3) 4) (1 (2 3) 4))

Ellipsis (~etc) and tail patterns:

(match (list 1 2) [(~list* 1 2 (~etc 3)) #t]) ⟹ #t
(match (list 1 2 3) [(~list* 1 2 (~etc 3)) #t]) ⟹ #t
(match (list 1 2 3 3 3) [(~list* 1 2 (~etc 3)) #t]) ⟹ #t

(match '((a time) (stitch saves) (in nine)) [(~etc (~list x y)) (list x y)])
⟹ ((a stitch in) (time saves nine))

(match '((a b) (c d) (e f)) [(~etc (~list x y)) (list x y)])
⟹ ((a c e) (b d f))

(define (transpose x)
  (match x [(~etc (~cons a (~etc b))) (cons a (transpose b))] [_ '()]))

(transpose '((1 2 3) (4 5 6))) ⟹ ((1 4) (2 5) (3 6))

(define (palindrome? str)
  (let loop ([chars (filter char-alphabetic?
                            (string->list (string-foldcase str)))])
    (match chars
      ['() #t]
      [(~list a) #t]
      [(~cons a (~append (~etc b) (~list a))) (loop b)]
      [_ #f])))

(palindrome? "Able was I, ere I saw Elba.") ⟹ #t
(palindrome? "Napoleon") ⟹ #f

(define (first-column x)
  (match x [(~etc (~cons a (~etc _))) a]))

(first-column '((1 2 3) (4 5 6) (7 8 9))) ⟹ (1 4 7)

This proposal disagrees with “Wright-style” matcher (WSM) on quasiquote unquote-splicing patterns; in this proposal, they are treated in accordance with the regular quasiquote rules, rather than being turned into ... repeats:

(match (list 1 2) [`(1 2 ,@3) #t] [_ #f]) ⟹ #f ; WSM returns #t
(match '(1 2 . 3) [`(1 2 ,@3) #t] [_ #f]) ⟹ #t ; WSM returns #f
(match (list 1 2 3 3 3) [`(1 2 ,@3) #t] [_ #f]) ⟹ #f ; WSM returns #t

WSM-compatible behavior can be imitated by explicit insertion of ~etc:

(match (list 1 2) [`(1 2 ,@(~etc 3)) #t] [_ #f]) ⟹ #t
(match '(1 2 . 3) [`(1 2 ,@(~etc 3)) #t] [_ #f]) ⟹ #f
(match (list 1 2 3 3 3) [`(1 2 ,@(~etc 3)) #t] [_ #f]) ⟹ #t

Unlike WSM, this proposal supports non-linearity in and out of “ellipsis” patterns:

(match '((1 2 3 4) ((1) (2) (3) (4)) (1 2 3 4))
  [(~list a* (~etc (~list a*)) a*) a*])
⟹ (1 2 3 4)

Some WSM postfix operators do not have built-in equivalents, but can be easily defined. An analog of WSM **1 can be defined as follows:

(define-match-pattern ~etc+ ()
  [(~etc+ p) (~pair? (~etc p))])

(match (list 1 2) [(~list* a b (~etc+ c)) c] [_ #f]) ⟹ #f
(match (list 1 2 3) [(~list* a b (~etc+ c)) c] [_ #f]) ⟹ (3)

An analog of WSM =.. k can be defined as follows:

(define-match-pattern ~etc= ()
  [(~etc= k p)
   (~and (~list? (~prop length => k))
         (~etc p))])

(match '((a b) (c d) (e f))
  [(~etc= 3 (~list x y)) (list x y)]
  [_ 'fail])
⟹ ((a c e) (b d f))

(match '((a b) (c d) (e f) (g h))
  [(~etc= 3 (~list x y)) (list x y)]
  [_ 'fail])
⟹ fail

An analog of WSM *.. k j can be defined as follows:

(define-match-pattern ~etc** ()
  [(~etc** k j p)
   (~and (~list? (~prop length => (~and (~test >= (k)) (~test <= (j)))))
         (~etc p))])

(match '((a b) (c d) (e f))
  [(~etc** 2 4 (~list x y)) (list x y)]
  [_ 'fail])
⟹ ((a c e) (b d f))

(match '((a b) (c d) (e f) (g h))
  [(~etc** 2 4 (~list x y)) (list x y)]
  [_ 'fail])
⟹ ((a c e g) (b d f h))

(match '((a b) (c d) (e f) (g h) (i j))
  [(~etc** 2 4 (~list x y)) (list x y)]
  [_ 'fail])
⟹ fail

Tail patterns will match dotted pairs, and ellipsis patterns won't:

(define (keys x)
  (match x [(~etc (~cons a (~etc _))) a] [_ 'fail]))

(keys '((a 1) (b 2) (c 3))) ⟹ (a b c)
(keys '((a . 1) (b . 2) (c . 3))) ⟹ fail

(define (keys x)
  (match x [(~etc (~cons a _)) a] [_ 'fail]))

(keys '((a 1) (b 2) (c 3))) ⟹ (a b c)
(keys '((a . 1) (b . 2) (c . 3))) ⟹ (a b c)

Logical patterns:

(match 1 [(~and) #t]) ⟹ #t
(match 1 [(~and x) x]) ⟹ 1
(match 1 [(~and x 1) x]) ⟹ 1

(match #f [(~and) #t] [_ #f]) ⟹ #t
(match #f [(~and x) (=> fail) (if x #t (fail))] [_ #f]) ⟹ #f

(match 1 [(~or) #t] [_ #f]) ⟹ #f
(match 1 [(~or x) x]) ⟹ 1
(match 1 [(~or x 2) x]) ⟹ 1

(define (last-matches-one-of-first-three x)
  (match x [`(,a ,a) #t]
           [`(,a ,b ,@c ,(~or a b)) #t]
           [`(,a ,b ,c ,@d ,c) #t]
           [_ #f]))

(last-matches-one-of-first-three '(1 2 3 4 5 1)) ⟹ #t
(last-matches-one-of-first-three '(1 2 3 4 5 2)) ⟹ #t
(last-matches-one-of-first-three '(1 2 3 4 5 3)) ⟹ #t
(last-matches-one-of-first-three '(1 2 3 4 5 6)) ⟹ #f

(define (last-matches-one-of-first-three2 x)
  (match x
    [`(,a ,a) #t]
    [`(,a ,b ,@c ,d) (=> fail)
     (if (or (equal? d a) (equal? d b)) #t (fail))]
    [`(,a ,b ,c ,@d ,e) (equal? c e)]
    [_ #f]))

(last-matches-one-of-first-three2 '(1 2 3 4 5 1)) ⟹ #t
(last-matches-one-of-first-three2 '(1 2 3 4 5 2)) ⟹ #t
(last-matches-one-of-first-three2 '(1 2 3 4 5 3)) ⟹ #t
(last-matches-one-of-first-three2 '(1 2 3 4 5 6)) ⟹ #f

This proposal explicitly assigns #f to undefined ~or variables:

(match '(0 1 2 3 4 5 6 7)
  [(~etc (~or 2 6 rest)) rest])
⟹ (0 1 #f 3 4 5 #f 7)

Single-argument ~not patterns are supported. It is an error for a pattern to refer to the same variable inside and outside of a ~not pattern.

(match 1 [(~and x (~not #f)) x] [_ 'fail]) ⟹ 1
(match #f [(~and x (~not #f)) x] [_ 'fail]) ⟹ fail
(match 1 [(~not 2) #t]) ⟹ #t

Predicates and fields are supported via ~? and ~= respectively. It is an error to refer to pattern variables in non-subpattern parts of these patterns.

(match 1 [(~? odd? x) x]) ⟹ 1
(match '(a) [(~= car x) x]) ⟹ a

Tests that use more than one pattern variable should be implemented in the body of the corresponding rule:

(define (fibby? x)
  (match x
    [(~list* a b c rest)
     (if (= (+ a b) c) (fibby? (cons b (cons c rest))) #f)]
    [(~list a b) #t]
    [(~list a) #t]
    ['() #t]
    [_ #f]))

(fibby? '(4 7 11 18 29 47)) ⟹ #t

Advanced Examples

This section contains examples specific to new functionality proposed in this SRFI.

Backtracking

Some patterns marked as “iterative” in this proposal may match the input object in multiple ways, or “solutions”, backtracking to the next solution if the current one causes downstream patterns to fail (structurally or because of a disagreement in values of shared pattern variables). This process may be time-consuming, but is associated only with iterative patterns, causing no overhead if they are not used.

Iteration strategies vary between pattern types. Regular versions of append matchers are greedy, giving preference to long leftmost segments:

(define (pr* p . x*)
  (for-each (lambda (x) (display x p)) x*))

(let ([p (open-output-string)])
  (match "abc"
    [(~string-append a (~string b) c)
     (=> next) (pr* p "1:" a "+" b "+" c ";") (next)]
    [(~string-append a c)
     (=> next) (pr* p "2:" a "+" c ";") (next)]
    [x (get-output-string p)]))
⟹ "1:ab+c+;2:abc+;"

Non-greedy appends give preference to short leftmost segments:

(let ([p (open-output-string)])
  (match "abc"
    [(~string-append/ng a (~string b) c)
     (=> next) (pr* p "1:" a "+" b "+" c ";") (next)]
    [(~string-append/ng a c)
     (=> next) (pr* p "2:" a "+" c ";") (next)]
    [x (get-output-string p)]))
⟹ "1:+a+bc;2:+abc;"

Backtracking can be triggered explicitly from the body; calling “back” guard formal retries starting from the current pattern's most recent backtracking point or next rule if none left:

(let ([p (open-output-string)])
  (match "abc"
    [(~string-append a (~string b) c)
     (=> next back) (pr* p "1:" a "+" b "+" c ";") (back)]
    [(~string-append a c)
     (=> next back) (pr* p "2:" a "+" c ";") (back)]
    [x (get-output-string p)]))
⟹ "1:ab+c+;1:a+b+c;1:+a+bc;2:abc+;2:ab+c;2:a+bc;2:+abc;"

Custom match forms

It is relatively easy to define custom match forms with a different pattern grammar using a combination of “converter” patterns defined via define-match-pattern and regular syntax-rule -based macro for the form itself. This process is facilitated by a collection of special-purpose patterns, as shown below.

Defining an analog of Dybvig-Friedman-Hilsdale matcher with support for catamorphism feature:

(define-match-pattern ~cmp->p (<...> <_> unquote ->)
  [(_ l ,(x)) (~prop l => x)]
  [(_ l ,(f -> x ...)) (~prop f => x ...)]
  [(_ l ,(x ...)) (~prop l => x ...)]
  [(_ l ,<_>) _]
  [(_ l ,x) x]
  [(_ l ()) '()]
  [(_ l (x <...>)) (~etc (~cmp->p l x))] ; optimization
  [(_ l (x <...> . y)) (~append/t y (~etc (~cmp->p l x)) (~cmp->p l y))]
  [(_ l (x . y)) (~cons (~cmp->p l x) (~cmp->p l y))]
  [(_ l #(x ...)) (~list->vector (~cmp->p l (x ...)))]
  [(_ l other) 'other])

(define-syntax cm-match
  (syntax-rules (guard)
    [(_ () l x () (c ...))
     (match x c ... [_ (error "cm-match error" x)])]
    [(_ () l x ([cmp (guard . e*) . b] cmc ...) (c ...))
     (cm-match () l x (cmc ...)
       (c ... [(~replace-specials <...> <_> (~cmp->p l cmp))
                 (=> n) (if (and . e*) (let () . b) (n))]))]
    [(_ () l x ([cmp . b] cmc ...) (c ...))
     (cm-match () l x (cmc ...)
       (c ... [(~replace-specials <...> <_> (~cmp->p l cmp))
                 (let () . b)]))]
    [(_ x cmc ...)
     (let l ([v x]) (cm-match () l v (cmc ...) ()))]))

Note the use of ~replace-specials pattern that substitutes <...> <_> for special ... _ identifiers to allow them to have their regular meaning inside define-match-pattern. It also uses the ~append/t pattern, a non-iterative list append that takes the length of the tail segment from its first argument.

Defined in this manner, cm-match can be used as D-F-H match w.r.t. matching functionality (no enhanced right-hand-side quasiquote):

(let ([simple-eval
       (lambda (x)
         (cm-match x
           [,i (guard (integer? i)) i]
           [(+ ,[x*] ...) (apply + x*)]
           [(* ,[x*] ...) (apply * x*)]
           [(- ,[x] ,[y]) (- x y)]
           [(/ ,[x] ,[y]) (/ x y)]
           [,x (error "invalid expression" x)]))])
    (simple-eval '(+ (- 0 1) (+ 2 3))))
⟹
4

(let ([split
       (lambda (lis)
        (cm-match lis
          [() (values '() '())]
          [(,x) (values `(,x) '())]
          [(,x ,y . ,[odds evens])
           (values `(,x . ,odds)
                   `(,y . ,evens))]))])
  (split '(a b c d e f)))
⟹
(a c e)
(b d f)

Defining a syntax-rules -like matcher with explicit list of literal identifiers:

(define-match-pattern ~srp->p (<...> <_>)
  [(_ l* ()) '()]
  [(_ l* (x <...>)) (~etc (~srp->p l* x))] ; optimization
  [(_ l* (x <...> . y)) (~append/t y (~etc (~srp->p l* x)) (~srp->p l* y))]
  [(_ l* (x . y)) (~cons (~srp->p l* x) (~srp->p l* y))]
  [(_ l* #(x ...)) (~list->vector (~srp->p l* (x ...)))]
  [(_ l* <_>) _]
  [(_ l* a) (~if-id-member a l* 'a a)])

(define-syntax sr-match
  (syntax-rules (=>)
    [(_ () l* v () (c ...))
     (match v c ... [_ (error "sr-match error" v)])]
    [(_ () l* v ([srp . b] src ...) (c ...))
     (sr-match () l* v (src ...) (c ...
       [(~replace-specials <...> <_> (~srp->p l* srp)) . b]))]
    [(_ x (l ...) src ...)
     (let ([v x]) (sr-match () (l ...) v (src ...) ()))]))

Note the use of ~if-id-member pattern to check if an atom is a literal.

(sr-match '(begin (a 5) (b 6) (c 7) (d 8))
  (begin)
  [(begin (x* y*) ...) (list x* y*)])
⟹ ((a b c d) (5 6 7 8))

(sr-match '((a b c d) (e f g) (h i) (j))
  ()
  [((x* y** ...) ...) (list x* y**)])
⟹ ((a e h j) ((b c d) (f g) (i) ()))

Specification

The grammar rules for the match form are:

   ⟨match expression ⟩ ⟶
          (matchexpression ⟩ ⟨match rule ⟩*)

   ⟨match rule ⟩ ⟶
          (match pattern ⟩ (=>next id ⟩ ⟨back id ⟩)body ⟩ )
      |   (match pattern ⟩ (=>next id ⟩)body ⟩ )
      |   (match pattern ⟩ ⟨body ⟩ )

   ⟨match pattern ⟩ ⟶
          ⟨underscore ⟩
      |   ⟨match pattern identifier ⟩
      |   ⟨atomic datum ⟩
      |   (quotedatum ⟩)
      |   (quasiquotematch quasipattern ⟩)
      |   (match pattern name ⟩ ⟨match pattern argument ⟩*)

   ⟨match quasipattern ⟩ ⟶
          ()
      |   ⟨atomic datum ⟩
      |   ⟨match quasipattern symbol ⟩
      |   (unquotematch pattern ⟩)
      |   ((unquote-splicingmatch pattern ⟩) .match quasipattern ⟩)
      |   (match quasipattern ⟩ .match quasipattern ⟩)
      |   #(match quasipattern ⟩*)

   ⟨atomic datum ⟩ ⟶
      |   ⟨boolean ⟩
      |   ⟨number ⟩
      |   ⟨character ⟩
      |   ⟨string ⟩
      |   ⟨bytevector ⟩

   ⟨match pattern name ⟩ ⟶
          ⟨any identifier except quote and quasiquote

   ⟨match pattern argument ⟩ ⟶
          ⟨match pattern ⟩
      |   ⟨expression ⟩

   ⟨match pattern identifier ⟩ ⟶
          ⟨any identifier except _ and ...

   ⟨next id ⟩ ⟶
          ⟨identifier ⟩

   ⟨back id ⟩ ⟶
          ⟨identifier ⟩

   ⟨match quasipattern symbol ⟩ ⟶
          ⟨any identifier except unquote and unquote-splicing ⟩

   ⟨underscore ⟩ ⟶
          ⟨the identifier _ ⟩

Note: _ ... and => in this specification are auxiliary syntaxes identical to the like-named syntaxes exported by the standard libraries.

The match form behaves as follows. First, its ⟨expression ⟩ argument is evaluated. It should return a single value, which is subsequently tested against match rules in the order they are written until a matching rule is found. If no matching rules are found, match returns an undefined value.

A rule matches if the application of its pattern to the input object is successful and there is no “guard clause” (a list of the => keyword followed by one or two identifiers). If a guard clause is present, it binds the identifiers to procedures allowing advanced control over the matching process. The first procedure, bound within the body to ⟨next id ⟩, allows one to skip the current rule as if its pattern match was unsuccessful. In two-identifier guard clauses, the second procedure, bound within the body to ⟨back id ⟩, allows one to tell the matcher that the current pattern should consider another way to match, or, if no such way exists, to skip the pattern as if its attempts to match were unsuccessful. Both procedures accept no arguments and can be called from any tail position within the body. If these procedures are called in a different way, the behavior is undefined.

Patterns may contain pattern variables, which, if the application of a pattern succeeds, become bound in the corresponding body to parts of the input object. The values of these variables may be used to reject the match via the guard clause mechanism described above.

Any pattern containing pattern variables may use the same variable in more than one place within the pattern. In such cases all the corresponding sub-objects, assigned to the repeated variable, should “agree”, i.e. to be the same according to the equal? predicate.

Some patterns, marked as (iterative)  in the pattern entry's header line, may match the input object in multiple ways. The matcher tries these different matches in the order specified in the pattern description. The two common orders are (iterative, greedy) , maximizing the number of input elements matched by sub-patterns on the left, and (iterative, non-greedy) , maximizing the number of input elements matched by sub-patterns on the right. When iterative patterns are nested, the matcher acts as if the outer pattern has full control over the decomposition of the input object, allowing its immediate sub-patterns to work with the parts in the order the outer pattern supplies them.

Basic patterns

atomic datum ⟩   match pattern
Literal patterns, ⟨boolean ⟩, …, ⟨bytevector ⟩, match equal objects (in terms of Scheme equal? predicate).

_   match pattern
The ⟨underscore ⟩ pattern matches anything. It is not a variable, so it produces no bindings visible in the body of the rule.

match pattern identifier ⟩   match pattern
The ⟨match pattern identifier ⟩ pattern acts like a pattern variable. A pattern variable will match any value and can be repeated. If repeated, all subsequent like-named variables will only match the value that is equal? to the value matched by the first. This value is made available inside ⟨body ⟩ as if a local identifier with the same name is bound to this value via the let form surrounding the body. It is an error to refer to pattern identifiers in expressions within the same pattern, or to use them as ⟨match pattern name ⟩s.

(quotedatum ⟩)  match pattern
The quote pattern matches ⟨datum ⟩. Again, the Scheme equal? predicate is used for comparison.

(quasiquotematch quasipattern ⟩)  match pattern
The quasiquote pattern can be understood in terms of regular patterns. It is translated into combinations of basic patterns and data patterns (see below). The translation function T[⟨match quasipattern ⟩] ⟹ ⟨match pattern ⟩ resembles simplified translation of the regular quasiquote form (shown in abbreviated form, first matching rule is chosen):

T[()]  ⟹  (quote ())
T[⟨atomic datum ⟩]  ⟹  atomic datum ⟩
T[(unquotemp ⟩)]  ⟹  mp ⟩
T[((unquote-splicingmp ⟩))]  ⟹  mp ⟩
T[((unquote-splicingmp ⟩) .mqp ⟩)]  ⟹  (~appendmp ⟩  T[⟨mqp ⟩])
T[(mqp1.mqp2)]  ⟹  (~cons T[⟨mqp1⟩]  T[⟨mqp2⟩])
T[#(mqp ⟩*)]  ⟹  (~vector T[⟨mqp1⟩]*)

The Value pattern

(~valueexpression ⟩)  match pattern
The ~value pattern matches if the input object is equal to the result of ⟨expression ⟩, calculated during the match. The Scheme equal? predicate is used for comparison. This is a convenient way to match against non-pattern variables bound outside the match form.

List patterns

List patterns match various types of pairs/lists, disassembling them into sub-expressions and applying to them sub-patterns. They are named to resemble the corresponding Scheme constructors, and can be combined in a similar manner. Their names are not built into the matcher, but are defined via define-syntax or its equivalent, so they need to be imported if used explicitly.

(~consmpa⟩ ⟨mpd)   match pattern
The ~cons pattern matches a pair whose car  matches ⟨mpa⟩ and whose cdr  matches ⟨mpd⟩.

(~listmp ⟩ …)   match pattern
The ~list pattern matches a proper list, with as many elements as there are sub-patterns ⟨mp ⟩ … The elements should match the corresponding patterns.

(~appendmp ⟩ …)   match pattern (iterative, greedy)
The ~append pattern matches a (possibly improper) list, breaking it into sub-list segments that can match the corresponding patterns ⟨mp ⟩ … If the list is improper, the last pattern is applied to its improper tail segment.
    Note:  This is an example of a iterative pattern that may match the target list in many different ways; all “solutions” are equally good, as long as the segments can be appended via append to make a list equal to the original.
    The solutions are tried starting with ones maximizing the length of the leftmost segment, then the segment after it, etc.

(~append/ngmp ⟩ …)   match pattern (iterative, non-greedy)
The ~append/ng pattern is a “non-greedy” variant of the ~append pattern. The solutions are tried starting with the ones maximizing the length of the rightmost segment, then the segment before it, etc.

(~append/tdatum ⟩ ⟨mp1⟩ ⟨mp2)   match pattern
The ~append/t pattern matches a (possibly improper) list, breaking it into two sub-list segments to be matched against ⟨mp1⟩ and ⟨mp2⟩ sub-patterns.
    The number of pairs in the “spine” of the second segment is assumed to be equal to the number of “spine” pairs in the ⟨datum ⟩ argument. The pattern fails if the input list is too short to be broken in this way.
    Note:  This is a non-iterative pattern, providing a degree of efficiency where the length of the tail pattern is known at macroexpand time.

(~etcmp ⟩)   match pattern
The ~etc pattern matches a proper list such that every one of its elements matches ⟨mp ⟩. If this sub-pattern contains pattern vars, they will be bound to lists of values obtained via individual element matches in the order of their appearance in the input list. This pattern can be compared to the postfix ellipsis (...) notation popularized by syntax-rules.
    Note:  If one or more sub-pattern variables are also used outside ~etc in the same rule-level pattern, the non-linear “agreement” rule is applied to the final list values collected by ~etc pattern on a successful match.

Vector patterns

A limited set of vector patterns is provided. Advanced matching of vectors can be performed either by combining list patterns with ~list->vector converter pattern, or by using core patterns described further below.

(~vectormp ⟩ …)   match pattern
The ~vector pattern matches a vector with as many elements as there are sub-patterns ⟨mp ⟩ … The elements should match the corresponding patterns.

(~vector-appendmp ⟩ …)   match pattern (iterative, greedy)
The ~vector-append pattern matches a vector, breaking it into sub-vector segments that can match the corresponding patterns ⟨mp ⟩ …
    Note:  This is another example of an iterative pattern that may match the target vector in many different ways; the “solutions” are attempted starting with ones maximizing the length of the leftmost segment, then the segment after it, etc.

(~vector-append/ngmp ⟩ …)   match pattern (iterative, non-greedy)
The ~vector-append/ng pattern is a “non-greedy” variant of the ~vector-append pattern. The solutions are attempted starting with the ones maximizing the length of the rightmost segment, then the segment before it, etc.

String patterns

A limited set of string patterns is provided. Advanced matching of strings can be performed either by combining list patterns with the ~list->string converter pattern, or by using core patterns described further below.

(~stringmp ⟩ …)   match pattern
The ~string pattern matches a string with as many characters as there are sub-patterns ⟨mp ⟩ … The characters should match the corresponding patterns.

(~string-appendmp ⟩ …)   match pattern (iterative, greedy)
The ~string-append pattern matches a string, breaking it into sub-string segments that can match the corresponding patterns ⟨mp ⟩ …
    This is an iterative pattern; the “solutions” are attempted starting with ones maximizing the length of the leftmost segment, then the segment after it, etc.

(~string-append/ngmp ⟩ …)   match pattern (iterative, non-greedy)
The ~string-append/ng pattern is a “non-greedy” variant of the ~string-append pattern. The solutions are attempted starting with the ones maximizing the length of the rightmost segment, then the segment before it, etc.

Conversion patterns

These patterns extend the functionality of the matcher, allowing it to deal with numbers and symbols as strings, strings and chars as numbers, etc. They are named as the corresponding constructors, so they match the types of objects on the right-hand side of ->, e.g. ~list->vector matches vectors, but applies its sub-pattern to the result of converting it to the list. They should be read as converters for the patterns : ~list->vector makes a vector pattern from a list pattern, and so on.

(~vector->listmp ⟩)   match pattern
(~string->listmp ⟩)   match pattern
(~list->vectormp ⟩)   match pattern
(~list->stringmp ⟩)   match pattern
(~string->symbolmp ⟩)   match pattern
(~symbol->stringmp ⟩)   match pattern
Convert a pattern built to match objects of the type on the left side of -> to the pattern able to match objects of the type on the right side of ->, applying the “reverse” converter function internally. E.g. if ⟨mp ⟩ can match lists of characters, (~list->stringmp ⟩) will be able to match strings containing the same characters.

(~string->numbermp ⟩)   match pattern
(~string->numbermp ⟩ ⟨rx ⟩)   match pattern
(~number->stringmp ⟩)   match pattern
(~number->stringmp ⟩ ⟨rx ⟩)   match pattern
These patterns convert between number and string patterns, accepting an optional ⟨rx ⟩ argument. This additional argument is an expression that is evaluated when the pattern is applied, and should produce a radix, suitable for Scheme string->number / number->string procedures.

Predicate patterns

The role of predicate patterns is to “guard” their optional sub-patterns, making sure that the input object belongs to a certain Scheme type. Their set is limited, but new ones can be easily defined, as explained in the section on derived patterns below.

(~null?mp ⟩ …)   match pattern
(~pair?mp ⟩ …)   match pattern
(~list?mp ⟩ …)   match pattern
(~boolean?mp ⟩ …)   match pattern
(~number?mp ⟩ …)   match pattern
(~integer?mp ⟩ …)   match pattern
(~vector?mp ⟩ …)   match pattern
(~string?mp ⟩ …)   match pattern
(~symbol?mp ⟩ …)   match pattern
(~char?mp ⟩ …)   match pattern
These patterns match if the input object satisfies the like-named Scheme predicate, and if there are sub-patterns, all sub-patterns match the same input object, as if combined with the ~and pattern combinator (see below).

Logical patterns

These patterns apply their sub-patterns to the same input object, combining the results of individual sub-pattern matches.

(~andmp ⟩ …)   match pattern
This pattern matches if all its sub-patterns match the input object. If sub-patterns contain pattern values, all of them are bound on a successful match. If there are no sub-patterns, the (~and) pattern matches any object.

(~ormp ⟩ …)   match pattern (iterative)
This pattern matches if any of its sub-patterns matches the input object. The sub-patterns are tried in the order they are written, and the process stops on the first sub-pattern that matches successfully (but see the note below). If sub-patterns contain pattern values, all of them are bound on a match, but only the ones belonging to the “successful” sub-pattern will have their values bound to the corresponding sub-values. The remaining variables are bound to #f. If there are no sub-patterns, the (~or) pattern fails on every object.
    Note:  this pattern is iterative in the following sense: if backtracking exhausts all possible match solutions for one sub-pattern, the next one is tried, then the next one, etc.

(~notmp ⟩)   match pattern
This pattern matches if its sub-pattern fails to match the input object. If it contains pattern variables, none of them are bound on a successful match (i.e. when the sub-pattern fails).

Compatibility patterns

These patterns provide functionality popularized by the “Wright-style” family of matchers.

(~list*mp ⟩ … ⟨mpt)   match pattern
The ~list* pattern matches a (possibly improper) list with at least as many elements as there are sub-patterns ⟨mp ⟩ … The elements should match the corresponding patterns; the remaining part of the list should match the last pattern, ⟨mpt⟩.

(~list-no-ordermp ⟩ …)   match pattern
This pattern matches a proper list with as many elements as there are sub-patterns ⟨mp ⟩ … It iterates through all possible permutations of the input list looking for one with elements that match the corresponding sub-patterns.

(~list-no-order*mp ⟩ … ⟨mpt)   match pattern
This pattern matches a (possibly improper) list with at least as many elements as there are sub-patterns ⟨mp ⟩ … The elements, in some permutation of the original order, should match the corresponding patterns; the remaining part of the list should match the last pattern, ⟨mpt⟩.

(~=fn ⟩ ⟨mp ⟩)   match pattern
In this pattern, ⟨fn ⟩ is not a pattern. The pattern binds its input value to a fresh variable ⟨iv ⟩, and then evaluates the (fn ⟩ ⟨iv ⟩) expression; its result is matched against the ⟨mp ⟩ sub-pattern.
    Note:  ⟨fn ⟩ does not have to name a function; it can be of any form. The combined expression is evaluated as is so the matching process may be interrupted in case of errors.

(~?fn ⟩ ⟨mp ⟩ …)   match pattern
In this pattern, ⟨fn ⟩ is not a pattern. The pattern binds its input value to a fresh variable ⟨iv ⟩, and evaluates the (fn ⟩ ⟨iv ⟩) expression. If the expression evaluates to #f, the pattern fails. Otherwise, the input object is matched against ⟨mp ⟩… sub-patterns, combined via ~and (i.e. all of them should match).
    Note:  ⟨fn ⟩ does not have to name a function; it can be of any form. The expression is evaluated as is, so the matching process may be interrupted in case of errors.

Core patterns

These patterns may not be as compact, nor as convenient to use as the ones described above, but they provide functionality needed to build derived patterns via define-match-pattern (see below). In fact, all patterns described above except for ~etc, basic and logical ones are implementable in this way; some of the possible derivations are shown in the Appendix.

(~propfn ⟩ =>mp ⟩ …)   match pattern
(~propfn ⟩ (arg ⟩ …) =>mp ⟩ …)   match pattern
In this pattern, ⟨fn ⟩ and all ⟨arg ⟩s are not patterns. The pattern binds its input value to a fresh variable ⟨iv ⟩, and then evaluates the (fn ⟩ ⟨iv ⟩ ⟨arg ⟩…) expression; its results are matched against the ⟨mp ⟩ … sub-patterns.
    Note:  ⟨fn ⟩ does not have to name a function; it can be of any form. The combined expression is evaluated as is, so the matching process may be interrupted in case of errors.

(~testfn ⟩)   match pattern
(~testfn ⟩ (arg ⟩ …))   match pattern
(~testfn ⟩ =>mp ⟩)   match pattern
(~testfn ⟩ (arg ⟩ …) =>mp ⟩)   match pattern
In this pattern, ⟨fn ⟩ and all ⟨arg ⟩s are not patterns. The pattern binds its input value to a fresh variable ⟨iv ⟩, and evaluates the (fn ⟩ ⟨iv ⟩ ⟨arg ⟩…) expression. If the expression evaluates to #f, the pattern fails. Otherwise, if ⟨mp ⟩ sub-pattern is present, it should also match the result; if it is absent, the match just succeeds.
    Note:  ⟨fn ⟩ does not have to name a function; it can be of any form. The expression is evaluated as is, so the matching process may be interrupted in case of errors.

(~iteratestart ⟩ ⟨head ⟩ ⟨tail ⟩ (var ⟩…)mp ⟩)   match pattern
This pattern combinator builds a backtracking matcher that produces a stream of “solutions” to be matched against ⟨mp ⟩; ⟨start ⟩, ⟨head ⟩, ⟨tail ⟩, and all ⟨var ⟩s are not patterns. Internally, it uses an iterative loop with ⟨var ⟩… serving is a list of state variables (their names should be different, but otherwise unimportant due to hygiene). The rest of the non-pattern arguments are used to form internal expressions and can be either names of procedures or macros. They are used as follows:
    ⟨start ⟩ is invoked with the original value and two continuation procedures: first one (try ) accepts 'seed' values for state variables if start succeeds, the second one (fail ) accepts no values and is called if start fails.
    ⟨head ⟩ accepts current values of state variables and returns a single value to be matched against ⟨mp ⟩ pattern.
    ⟨tail ⟩ accepts the same two continuations as ⟨start ⟩, followed by the current values of state variables. It should either apply its try  continuation to continue with new values of state vars, or apply its fail  one to signal that there are no more “solutions”.
    Note:  both try  and fail  should be applied in tail positions.

Pattern language construction patterns

This SRFI facilitates creation of entirely new pattern-matching forms via derivation. Some of such derived forms may have special needs, met by the pattern combinators below.

(~if-id-memberid ⟩ (literal-id ⟩ …)mpt⟩ ⟨mpf)   match pattern
This pattern selects one of the two sub-pattern arguments based on the first two non-pattern arguments. The ⟨id ⟩ argument is a match pattern identifier, but it is not used as a sub-pattern. The second argument is a list of distinct match pattern identifiers; they are also not used as sub-patterns. The ~if-id-member checks if ⟨id ⟩ is in the list of ⟨literal-id ⟩s, selecting the first sub-pattern for matching if it is there, or the second sub-pattern if it's not. The method of comparison is the same as the one used to compare input identifiers with the list of literal identifiers in syntax-rules.

(~replace-specialsnew ... ⟩ ⟨new _ ⟩ ⟨mp ⟩)   match pattern
This pattern matches ⟨mp ⟩ after replacing all ellipsis (...) identifiers in it with ⟨new ... ⟩, and all underscore (_) identifiers with ⟨new _ ⟩. This is useful if ⟨mp ⟩ incorporates patterns designed to use regular ellipsis and/or underscore as part of its syntax. Replacing them allows pattern rewriting rules (see define-match-pattern below) to employ these identifiers in their special role, controlling the rewriting.

Defining new patterns

New patterns can be defined using a pattern rewriting system, replicating the functionality of syntax-rules, but using patterns instead of regular Scheme expressions or definitions. It hides the details of the internal pattern macroexpansion protocol, creating an illusion that patterns expand into combinations of other patterns directly.

(define-match-patternname ⟩ (literal-id ⟩ …) (imp ⟩ ⟨omp ⟩))   syntax
This form expands into an equivalent of

(define-syntaxname ⟩ (syntax-rules (literal-id ⟩ …)rule ⟩ …))

where ⟨name ⟩ is a new match pattern name, and each ⟨rule ⟩ is a syntax-rules transformation rule, created from a pair of an input pattern ⟨imp ⟩ and the corresponding output pattern ⟨omp ⟩. Please see the description of syntax-rules for the details.

To show how it works in practice, here's a definition of a ~qq pattern, equivalent to the built-in quasiquote pattern; it implements the T[ ] transcription function. More examples are available in the Examples section and in the Appendix below.

(define-match-pattern ~qq (unquote unquote-splicing)
  [(_ ,p)          p]
  [(_ (,@lp))      lp]
  [(_ (,@lp . dp)) (~append lp (~qq dp))]
  [(_ (ap . dp))   (~cons (~qq ap) (~qq dp))]
  [(_ #(p ...))    (~vector (~qq p) ...)]
  [(_ a)           (quote a)])

(match '(1 (2 . 3) #(4))
  [(~qq (,x (,y . ,z) #(,t)))
       `(,x (,y . ,z) #(,t))])
⟹ (1 (2 . 3) #(4))

Elements of templating

Although the bulk of this proposal is dedicated to patterns, there are a few things that can be done to simplify the templating too. There is some symmetry between the pattern language and the regular Scheme expressions, which in many cases allows one to re-create the input object by constructing it back in the same way it was taken apart, e.g.:

(match '(1 (2 . 3) #(4))
  [(~list x (~cons y z) (~vector t))
    (list x  (cons y z)  (vector t))])
⟹ (1 (2 . 3) #(4))

(match '(1 (2 . 3) #(4))
  [`(,x (,y . ,z) #(,t))
   `(,x (,y . ,z) #(,t))])
⟹ (1 (2 . 3) #(4))

Things get more complicated when one has to deal with “ellipsis” expressions. To re-construct them, one has to use map with a lambda expression containing the element construction code. To simplify this process, two extra syntax forms are provided.

(valueexpression ⟩)   procedure or syntax
This form just evaluates ⟨expression ⟩ and returns its value, complementing the ~value pattern. It is also the only form other than quote recognized by name by the etc syntax form:

(etcconstructor ⟩)   syntax
This form may help in shortening the construction code needed to combine lists produced by a successful ~etc pattern. The ⟨constructor ⟩ argument is a limited form of expression: it is either an ⟨atomic datum ⟩, an identifier, a quoted S-expression (a two element list starting with quote), a value expression (see above), or an application or macro use in a form of a list that starts with an identifier, followed by zero or more ⟨constructor ⟩s.
    The etc form is expanded as follows: first, it is scanned for variables. It collects all identifiers outside of quote and value forms that are not used as head identifiers of an application or macro use. Then, a map application is built, with a lambda form with the collected variables as formal parameters and the ⟨constructor ⟩ as its body, followed by the same variables as the lists for map to process. The variables do not have to be taken from patterns, they can come from any place in the current context; all decisions on which variables are unrolled and which are replicated are explicit.
    Uses of the etc form can be nested:

(match '((0) (1 2) (3 4 5) (6 7 8 9))
  [(~etc (~cons x (~etc y*)))
    (etc  (cons x  (etc y*)))])
⟹ ((0) (1 2) (3 4 5) (6 7 8 9))

(match '((0) (1 2) (3 4 5) (6 7 8 9))
  [(~etc (~cons x (~etc y*)))
   (cons (etc x) (etc y*))])
⟹ ((0 1 3 6) () (2) (4 5) (7 8 9))

Appendix

Derived pattern types

This appendix gives possible definitions for some derived pattern types in terms of the basic and core pattern types.

(define-match-pattern ~value ()
  [(_ e)
   (~test equal? (e))])

(define-match-pattern ~cons ()
  [(_ ap dp)
   (~and (~test pair?) (~prop car => ap) (~prop cdr => dp))])

(define-match-pattern ~list ()
  [(~list) '()]
  [(~list p . p*) (~cons p (~list . p*))])

(define-match-pattern ~list* ()
  [(~list* p) p]
  [(~list* p . p*) (~cons p (~list* . p*))])

(define-syntax match:cno-start
  (syntax-rules ()
    [(_ xv try f)
     (if (pair? xv) (try '() xv) (f))]))

(define-syntax match:cno-head
  (syntax-rules ()
    [(_ h t)
     (cons (car t) (append h (cdr t)))]))

(define-syntax match:cno-tail
  (syntax-rules ()
    [(_ try f h t)
     (if (pair? (cdr t)) (try (cons (car t) h) (cdr t)) (f))]))

(define-match-pattern ~cons-no-order ()
  [(~cons-no-order pe pr)
   (~iterate match:cno-start match:cno-head match:cno-tail (h t)
     (~cons pe pr))])

(define-match-pattern ~list-no-order ()
  [(~list-no-order) '()]
  [(~list-no-order p) (~list p)]
  [(~list-no-order p . p*)
   (~cons-no-order p (~list-no-order . p*))])

(define-match-pattern ~list-no-order* ()
  [(~list-no-order* p) p]
  [(~list-no-order* p . p*)
   (~cons-no-order p (~list-no-order* . p*))])

(define-match-pattern ~vector? ()
  [(_ p ...)
   (~and (~test vector?)  p ...)])

(define-match-pattern ~vector->list ()
  [(_ p)
   (~and (~test list?) (~prop list->vector => p))])

(define-match-pattern ~vector ()
  [(_ p ...)
   (~and (~test vector?) (~prop vector->list => (~list p ...)))])

(define-match-pattern ~number->string ()
  [(_ p)
   (~and (~test string?) (~prop string->number => p))]
  [(_ p rx)
   (~and (~test string?) (~prop string->number (rx) => p))])

(define-match-pattern ~= ()
  [(~= f p) (~prop f => p)])

(define-match-pattern ~? ()
  [(~? f p ...) (~and (~test f) p ...)])

Implementation

Sample syntax-rules-based implementation is available. It runs on many R6RS and R7RS and some R5RS Schemes.

Source for the sample implementation.

Acknowledgements

This proposal is both based on and inspired by the work of the following people: Andrew K. Wright, Robert Cartwright, Alex Shinn, R. Kent Dybvig, Dan Friedman, Erik Hilsdale, Chris Hanson, Gerald Jay Sussman, Marc Nieper-Wißkirchen, Dimitris Vyzovitis, and Panicz Maciej Godek. I am grateful to Andrew Pochinsky and Daphne Preston-Kendal for their feedback.

© 202? Sergei Egorov

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Editor: Arthur A. Gleckler