264: String syntax for Scheme Regular Expressions

by Sergei Egorov

Status

This SRFI is currently in draft status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-264@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Abstract

This SRFI proposes SSRE, an alternative string-based syntax for Scheme Regular Expressions as defined by SRFI 115. String syntax is both compact and familiar to many regexp users; it is translated directly into SRE S-expressions, providing equivalent constructs. While the proposed syntax mostly follows PCRE, it takes into account specifics of Scheme string syntax and limitations of SRE, leaving out constructs that either duplicate functionality provided by Scheme strings or have no SRE equivalents. The repertoire of named sets and boundary conditions can be extended via a parameter mechanism. Extensions to PCRE syntax allow concise expression of operations on named character sets.

Issues

The design of the string-sre-definitions procedure assumes that parameters follow the protocol described in SRFI 39, namely that a parameter procedure can be called with a value argument to set the parameter globally. This behavior is not required by R7RS.

Rationale

This SRFI proposes a “PCRE-style” string syntax for Scheme Regular Expressions (see SRFI 115). Unlike existing parsers with similar goals, the proposed solution features both a formal syntax (described via RNRS-like extended BNF notation) and well-defined semantics (providing only constructs with direct SRE equivalents).

In contrast with PCRE syntax, SSRE syntax does not provide its own separate notation for encoding non-ASCII characters, relying fully on regular Scheme string syntax for this purpose. PCRE constructs for features not covered by the SRE specification are not supported; if an SRE feature can be represented by more than one PCRE construct, SSRE supports only the most frequently used one. The examples below show SSRE notation and its equivalent in SRE form (note that multiple SRE equivalents with the exact same semantics may exist). The string-sre->sre procedure does the transformation.

  "\\(?(\\d{3})\\D{0,3}(\\d{3})\\D{0,3}(\\d{4})"
  ⟹ (: (? #\() ($ (= 3 numeric)) (** 0 3 (~ numeric))
        ($ (= 3 numeric)) (** 0 3 (~ numeric)) ($ (= 4 numeric)))

  "(?<n>A)(?:(?<n>foo)|(?<n>bar))\\k<n>"
  ⟹ (: (-> n #\A) (or (-> n (: #\f #\o #\o)) (-> n (: #\b #\a #\r)))
        (backref n))

  "^[^$|*+?{,}\\d-]+"
  ⟹ (: bos (+ (~ (or #\$ #\| #\* #\+ #\? #\{ #\, #\} numeric #\-))))

  "(?<=(?=.(?<=x)))"
  ⟹ (look-behind (look-ahead (: nonl (look-behind #\x))))

Extensions to PCRE syntax allow concise expression of operations on named character sets. If an open brace is followed by a non-digit, the characters following the open brace up to the matching closing brace form a “character set” notation, consisting of named sets and set operations between them. POSIX-like character classes have one-letter names, so the notation is concise; connecting individual sets with set operations allows one to express SRE expressions that are not expressible in PCRE.

  "a{d}*" ⟹ (: #\a (* numeric))

  "a{d|s}*" ⟹ (: #\a (* (or numeric space)))

  "{~s}*{a}" ⟹ (: (* (~ space)) alpha)

  "{p-[.]}{1,2}" ⟹ (** 1 2 (- punct #\.))

  "{{p-[.]}|a}+" ⟹ (+ (or (- punct #\.) alpha))

  "{p-[.]|a}+" ⟹ (+ (or (- punct #\.) alpha))

Note that since “character set” notation is disjoint from the postfix range repeat operator, they can coexist in the same SSRE string without a problem. In addition to named sets, it can accept “character class” operands; set operators are ~ (complement), | (or), & (and), and - (set difference).
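The disambiguation rule above can be sketched in portable R7RS Scheme. The helper name brace-kind is hypothetical (it is not part of this SRFI), and the sketch ignores the inter-token space that the x option permits after the open brace.

```scheme
(import (scheme base) (scheme write))

;; Hypothetical helper: classify the open brace at index i of an SSRE string.
;; A digit right after the brace means a range repeat; anything else
;; starts "character set" notation.
(define (brace-kind str i)
  (let ((j (+ i 1)))
    (if (and (< j (string-length str))
             (char<=? #\0 (string-ref str j) #\9))
        'repeat
        'set)))

(define ssre "{p-[.]}{1,2}")
(write (brace-kind ssre 0)) (newline)   ; prints set
(write (brace-kind ssre 7)) (newline)   ; prints repeat
```

Because one character of lookahead settles the question, a converter does not need to backtrack when it meets a brace.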

In addition to combining sets of characters, the “character set” notation allows one to refer to and combine named boundary conditions; boundary condition operators are ! (not) and | (or).

  ".{!b}." ⟹ (: nonl nwb nonl)

  ".{b}.*" ⟹ (: nonl (or bow eow) (* nonl))

SRE-specific named boundaries can be written via shortcut names (< and > characters can start and end names respectively), and via their normal SRE names:

  "{<s|<w}." ⟹ (: (or bos bow) nonl)

  "{bos|bow}." ⟹ (: (or bos bow) nonl)

Also, the same notation can represent special named expressions:

  "{<w>}?" ⟹ (? word)

  "{<g>}{2,}" ⟹ (>= 2 grapheme)

SSRE supports the most frequently used regexp options (i, m, s, x, n, u); they can be specified via (?⟨option ⟩… notation at the start of the string or a non-capturing group:

  "(?i)hello" ⟹ (w/nocase (: #\h #\e #\l #\l #\o))

  "(?i:h)ello" ⟹ (: (w/nocase #\h) #\e #\l #\l #\o)

The set of named sets/boundary conditions/expressions can be extended via the string-sre-definitions parameter, e.g.:

  (string-sre-definitions
    (string-sre-bind 'Any 'cset 'any
    (string-sre-bind 'Nd 'cset char-set:Nd
    (string-sre-bind 'vowel 'cset '(or #\a #\e #\i #\o #\u #\y #\w)
    (string-sre-bind 'Vowel 'cset '(or #\A #\E #\I #\O #\U #\Y #\W)
    (string-sre-bind 'EHi 'cset char-set:Egyptian_Hieroglyphs
    (string-sre-bind 't 'cset char-set:title-case
    (string-sre-definitions))))))))

Specification

The specification below assumes the reader's familiarity with both Scheme Regular Expressions (SRFI 115) and PCRE Syntax. Only topics requiring special attention are discussed; all PCRE constructs supported by SSRE, as defined by the formal syntax below, have their obvious SRE equivalents.

(string-sre->sre ⟨string ⟩)   procedure
This procedure converts ⟨string ⟩ in SSRE syntax to the corresponding SRE. If the string is not a valid SSRE, an error that satisfies string-sre-syntax-error? is signaled.

(string-sre->regexp ⟨string ⟩)   procedure
This procedure converts ⟨string ⟩ in SSRE syntax to the corresponding regexp by applying the regexp procedure from SRFI 115 to the conversion result. If the string is not a valid SSRE, an error that satisfies string-sre-syntax-error? is signaled.

(string-sre-syntax-error? ⟨obj ⟩)   procedure
Error type predicate. Returns #t if ⟨obj ⟩ is an object raised by the string-sre->sre procedure. Otherwise, returns #f.

(string-sre-definitions)   procedure
(string-sre-definitions ⟨bindings ⟩)   procedure
This procedure acts as a parameter, providing access to the list of defined entities for the string-sre->sre procedure. The ⟨bindings ⟩ argument is either a result of calling string-sre-definitions with no arguments, or a result of calling one of the two procedures below.

(string-sre-bind ⟨name ⟩ ⟨type ⟩ ⟨sre ⟩ ⟨bindings ⟩)   procedure
(string-sre-unbind ⟨name ⟩ ⟨bindings ⟩)   procedure
The ⟨bindings ⟩ argument is either a result of calling string-sre-definitions with no arguments, or a result of calling one of these two procedures. The ⟨name ⟩ argument is a symbol. The ⟨type ⟩ argument is one of the following symbols: cset (stands for character set), bcnd (stands for boundary condition), or expr (stands for any SRE expression).
The first procedure creates a new ⟨bindings ⟩ object, giving a new definition for ⟨name ⟩, replacing the old definition if any; the second procedure creates a new ⟨bindings ⟩ object that has the same definitions as the original one, except for ⟨name ⟩, if such a definition was present; otherwise, the original bindings object is returned.
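The functional behavior described above can be modeled in portable R7RS Scheme. This is only a sketch under stated assumptions: it represents a bindings object as an association list of (name type sre) entries, which this SRFI does not require, and the names sketch-bind and sketch-unbind are hypothetical stand-ins for the real procedures.

```scheme
(import (scheme base) (scheme write))

;; Hypothetical model: a bindings object is an association list whose
;; entries have the shape (name type sre).
(define (sketch-unbind name bindings)
  (if (not (assq name bindings))
      bindings                          ; no such definition: same object back
      (let loop ((rest bindings))
        (cond ((null? rest) '())
              ((eq? (car (car rest)) name) (loop (cdr rest)))
              (else (cons (car rest) (loop (cdr rest))))))))

(define (sketch-bind name type sre bindings)
  ;; Remove any old definition first so that the new one replaces it.
  (cons (list name type sre) (sketch-unbind name bindings)))

(define b0 '((d cset numeric)))
(define b1 (sketch-bind 'vowel 'cset '(or #\a #\e #\i #\o #\u) b0))
(define b2 (sketch-bind 'd 'cset '(/ #\0 #\7) b1))

(write (length b0)) (newline)                         ; prints 1 (b0 is untouched)
(write (length b2)) (newline)                         ; prints 2 (d was replaced)
(write (eq? (sketch-unbind 'missing b2) b2)) (newline) ; prints #t
```

Since neither operation mutates its argument, an older bindings object can be kept and installed again later.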

Note that the result of the conversion depends only on the input string and the value of the string-sre-definitions parameter. This allows implementations to cache conversion results if the converter is called with the same SSRE string repeatedly.
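One way such a cache might look is sketched below in portable R7RS Scheme. All names are hypothetical: convert stands in for string-sre->sre and current-defs for a call to string-sre-definitions with no arguments. The cache key pairs the string with the definitions object, compared with eq?, which is a conservative assumption about bindings objects.

```scheme
(import (scheme base) (scheme write))

;; convert:      procedure taking an SSRE string (stand-in for string-sre->sre)
;; current-defs: thunk returning the current definitions object
(define (make-caching-converter convert current-defs)
  (let ((cache '()))                    ; entries: (string defs result)
    (lambda (str)
      (let* ((defs (current-defs))
             (hit (let find ((entries cache))
                    (cond ((null? entries) #f)
                          ((and (string=? (car (car entries)) str)
                                (eq? (cadr (car entries)) defs))
                           (car entries))
                          (else (find (cdr entries)))))))
        (if hit
            (list-ref hit 2)
            (let ((result (convert str)))
              (set! cache (cons (list str defs result) cache))
              result))))))

;; Demonstration with a counting stand-in converter.
(define calls 0)
(define defs (list 'placeholder-definitions))
(define cached
  (make-caching-converter
   (lambda (s) (set! calls (+ calls 1)) (list 'converted s))
   (lambda () defs)))

(cached "a{d}*")
(cached "a{d}*")
(write calls) (newline)                 ; prints 1
(set! defs (list 'other-definitions))   ; definitions changed: cache entry is stale
(cached "a{d}*")
(write calls) (newline)                 ; prints 2
```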

String SRE grammar

The grammar rules for the SSRE describe the contents of the SSRE string as a sequence of characters, rather than its surface syntax as described in the R7RS “Formal Syntax” section. In particular, when writing actual string literals, double quotes and backslashes need to be escaped by prepending a backslash; non-printing characters such as newline have to be either included directly or represented via one of Scheme’s string escape mechanisms. Non-ASCII characters are handled in the same way; no special escape mechanism is provided for this purpose by SSRE.
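The difference between the characters of an SSRE and the Scheme literal that denotes them can be checked with standard R7RS procedures alone; the following small sketch uses no SSRE procedures.

```scheme
(import (scheme base) (scheme write))

;; The Scheme literal "\\d{3}" denotes the five-character SSRE \d{3}:
;; the reader turns each \\ into a single backslash.
(define ssre "\\d{3}")
(write (string-length ssre)) (newline)     ; prints 5
(display ssre) (newline)                   ; prints \d{3}

;; A double quote inside the SSRE is written \" in the literal.
(define quoted "\"[^\"]*\"")
(write (string-length quoted)) (newline)   ; prints 7
(display quoted) (newline)                 ; prints "[^"]*"
```

The grammar below describes the characters that display prints here, not the literal text in the source file.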

⟨atmosphere ⟩ ⟶ ⟨whitespace ⟩ | ⟨comment ⟩
⟨whitespace ⟩ ⟶ ⟨space, tab, or newline ⟩
⟨comment ⟩ ⟶ #⟨all subsequent characters up to a newline ⟩
⟨inter-token space ⟩ ⟶ ⟨atmosphere ⟩*

⟨count ⟩ ⟶ ⟨number ⟩
⟨number ⟩ ⟶ ⟨digit ⟩+
⟨digit ⟩ ⟶ 0 | ... | 9

⟨word ⟩ ⟶ ⟨letter ⟩+
⟨letter ⟩ ⟶ a | ... | z | A | ... | Z | _

⟨name ⟩ ⟶ ⟨name constituent ⟩+
⟨name constituent ⟩ ⟶ ⟨letter ⟩ | < | >

⟨xs ⟩ ⟶ ⟨inter-token space if x option is in effect, empty otherwise ⟩

⟨body regexp ⟩ ⟶
  | (?⟨regexp options ⟩ )⟨xs ⟩ ⟨alt regexp ⟩
  | ⟨xs ⟩ ⟨alt regexp ⟩

⟨alt regexp ⟩ ⟶
  | ⟨sequence regexp ⟩ ⟨xs ⟩ |⟨xs ⟩ ⟨alt regexp ⟩
  | ⟨sequence regexp ⟩

⟨sequence regexp ⟩ ⟶
  | ⟨quantified regexp ⟩ ⟨xs ⟩ ⟨sequence regexp ⟩
  | ⟨empty regexp ⟩

⟨empty regexp ⟩ ⟶ ⟨xs ⟩

⟨quantified regexp ⟩ ⟶
  | ⟨primary regexp ⟩ ⟨quantifier ⟩*

⟨quantifier ⟩ ⟶
  | ⟨xs ⟩ *
  | ⟨xs ⟩ +
  | ⟨xs ⟩ ?
  | ⟨xs ⟩ {⟨xs ⟩ ⟨repeat ⟩ ⟨xs ⟩ }

⟨repeat ⟩ ⟶
  | ⟨count ⟩ ⟨xs ⟩ ,⟨xs ⟩ ⟨count ⟩
  | ⟨count ⟩ ⟨xs ⟩ ,
  | ⟨count ⟩

⟨primary regexp ⟩ ⟶
  | ⟨char regexp ⟩
  | ^
  | $
  | .
  | ⟨boundary shortcut ⟩
  | ⟨class shortcut ⟩
  | ⟨class regexp ⟩
  | ⟨set regexp ⟩
  | ⟨escaped punctuation ⟩
  | ⟨regexp shortcut ⟩
  | ⟨capture regexp ⟩
  | ⟨group regexp ⟩
  | ⟨lookahead regexp ⟩
  | ⟨lookbehind regexp ⟩
  | ⟨backref regexp ⟩

⟨char regexp ⟩ ⟶ ⟨x char if x option is in effect, nonx char otherwise ⟩
⟨nonx char ⟩ ⟶ ⟨any char but \ ^ $ . | * + ? [ ] ( ) { } ⟩
⟨x char ⟩ ⟶ ⟨any nonx char but # and whitespace ⟩

⟨boundary shortcut ⟩ ⟶
  | \b | \B | \< | \> | \A | \z

⟨class regexp ⟩ ⟶
  | [^⟨class body ⟩ ]
  | [⟨class body ⟩ ]

⟨class body ⟩ ⟶
  | ]⟨class element ⟩*
  | ⟨class element ⟩*

⟨class element ⟩ ⟶
  | [:⟨class name ⟩ :]
  | ⟨class shortcut ⟩
  | ⟨class char ⟩ -⟨class char ⟩
  | ⟨class char ⟩

⟨class shortcut ⟩ ⟶
  | \d | \D | \s | \S | \w | \W
  | \p⟨letter ⟩ | \p{⟨class name ⟩ }
  | \P⟨letter ⟩ | \P{⟨class name ⟩ }

⟨class char ⟩ ⟶
  | [.⟨any character ⟩ .]
  | ⟨any character but ] and \ ⟩
  | \\ | \^ | \- | \[ | \]

⟨set regexp ⟩ ⟶
  | {⟨xs ⟩ ⟨set alt ⟩ ⟨xs ⟩ }

⟨set alt ⟩ ⟶
  | ⟨set infix op ⟩ ⟨xs ⟩ |⟨xs ⟩ ⟨set infix op ⟩
  | ⟨set infix op ⟩

⟨set infix op ⟩ ⟶
  | ⟨set prefix op ⟩
  | ⟨set infix op ⟩ ⟨xs ⟩ &⟨xs ⟩ ⟨set prefix op ⟩
  | ⟨set infix op ⟩ ⟨xs ⟩ -⟨xs ⟩ ⟨set prefix op ⟩

⟨set prefix op ⟩ ⟶
  | ⟨set primary ⟩
  | ~⟨xs ⟩ ⟨set prefix op ⟩
  | !⟨xs ⟩ ⟨set prefix op ⟩

⟨set primary ⟩ ⟶
  | ⟨set name ⟩
  | ⟨class regexp ⟩
  | ⟨set regexp ⟩

⟨escaped punctuation ⟩ ⟶
  | \\ | \^ | \$ | \. | \| | \* | \+ | \? | \[ | \] | \( | \) | \{ | \} | \# | \⟨whitespace ⟩

⟨regexp shortcut ⟩ ⟶
  | \X | \Z

⟨capture regexp ⟩ ⟶
  | (?<⟨capture label ⟩ >⟨xs ⟩ ⟨alt regexp ⟩ )
  | (⟨xs ⟩ ⟨alt regexp ⟩ )

⟨group regexp ⟩ ⟶
  | (?⟨regexp options ⟩ :⟨xs ⟩ ⟨alt regexp ⟩ )

⟨lookahead regexp ⟩ ⟶
  | (?=⟨xs ⟩ ⟨alt regexp ⟩ )
  | (?!⟨xs ⟩ ⟨alt regexp ⟩ )

⟨lookbehind regexp ⟩ ⟶
  | (?<=⟨xs ⟩ ⟨alt regexp ⟩ )
  | (?<!⟨xs ⟩ ⟨alt regexp ⟩ )

⟨backref regexp ⟩ ⟶
  | \⟨digit ⟩ ⟨digit ⟩
  | \⟨digit ⟩
  | \k<⟨capture label ⟩ >

⟨regexp options ⟩ ⟶
  | ⟨option letter ⟩*
  | ⟨option letter ⟩* -⟨option letter ⟩*

⟨option letter ⟩ ⟶ i | m | s | x | n | u

⟨capture label ⟩ ⟶ ⟨word ⟩
⟨class name ⟩ ⟶ ⟨word ⟩
⟨set name ⟩ ⟶ ⟨name ⟩

Named entities

The table below lists the named character sets, boundary conditions, and expressions available by default, along with their types and SRE equivalents. Note that the same named character sets are used in all of the following: set notation, named classes in character class notation [:name:], and character property notations \p{name} and \P{name}, including their single-letter variants.

Names            Type   SRE
any, _           cset   any
digit, d, n      cset   numeric
lower, l         cset   lower
upper, u         cset   upper
alpha, a         cset   alpha
alnum, an        cset   alnum
xdigit, x        cset   xdigit
cntrl, c         cset   cntrl
punct, p         cset   punct
graph, g         cset   graph
symbol, y        cset   symbol
space, s         cset   space
print, gs        cset   print
blank, h         cset   see note 1
v                cset   see note 2
w                cset   see note 3
bos, <s          bcnd   bos
eos, s>          bcnd   eos
bol, <l          bcnd   bol
eol, l>          bcnd   eol
bow, <w, <       bcnd   bow
eow, w>, >       bcnd   eow
bog, <g          bcnd   bog
eog, g>          bcnd   eog
wb, b            bcnd   see note 4
nwb              bcnd   nwb
<w>              expr   word
<g>, X           expr   grapheme

Note 1: any horizontal space character. In a Unicode context this corresponds to space, tab, and any other character in the Space Separator category (Zs). In an ASCII context this corresponds to space and tab.

Note 2: any vertical space character. In a Unicode context this corresponds to line feed, form feed, carriage return, and any other character in the Line and Paragraph Separator categories (Zl, Zp). In an ASCII context this corresponds to line feed, form feed, and carriage return.

Note 3: any word character. Equivalent to (or alnum #\_).

Note 4: word boundary. Equivalent to (or bow eow).

Character set notation operators

Listed by priority, these operators go as follows: | has the lowest priority, & and - have the same (medium) priority, ! and ~ have the highest priority.

Operator   Type             SRE
~          cset→cset        ~ (complement)
&          cset×cset→cset   & (intersection)
|          cset×cset→cset   | (union)
-          cset×cset→cset   - (set difference)
!          bcnd→bcnd        neg-look-ahead (not)
|          bcnd×bcnd→bcnd   | (or)
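The priorities can be illustrated with a small recursive-descent sketch in portable R7RS Scheme. It is not the sample implementation: parse-set and set-names are hypothetical, only one-letter names and the cset operators ~ & - | are handled, the surrounding braces are omitted, and there is no error handling or x-option whitespace.

```scheme
(import (scheme base) (scheme write))

;; A few one-letter names taken from the named-entities table.
(define set-names
  '((#\a . alpha) (#\d . numeric) (#\x . xdigit) (#\l . lower)))

;; Hypothetical helper: parse the text between the braces of a set regexp.
(define (parse-set str)
  (define pos 0)
  (define (peek)
    (if (< pos (string-length str)) (string-ref str pos) #f))
  (define (next!)
    (let ((c (peek))) (set! pos (+ pos 1)) c))
  ;; Highest priority: prefix ~
  (define (prefix)
    (if (eqv? (peek) #\~)
        (begin (next!) (list '~ (prefix)))
        (let ((c (next!)))
          (cond ((assv c set-names) => cdr)
                (else (string->symbol (string c)))))))
  ;; Medium priority: left-associative & and -
  (define (infix)
    (let loop ((left (prefix)))
      (cond ((eqv? (peek) #\&) (next!) (loop (list 'and left (prefix))))
            ((eqv? (peek) #\-) (next!) (loop (list '- left (prefix))))
            (else left))))
  ;; Lowest priority: |
  (define (alt)
    (let loop ((left (infix)))
      (if (eqv? (peek) #\|)
          (begin (next!) (loop (list 'or left (infix))))
          left)))
  (alt))

(write (parse-set "a|d&~x-l")) (newline)
;; prints (or alpha (- (and numeric (~ xdigit)) lower))
```

Each priority level is one procedure that calls the next tighter level, so ~ binds before & and -, which in turn bind before |.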

Regexp options

Options can be specified either as optional symbol arguments to the string-sre->sre procedure, or in the SSRE string itself via (?⟨option ⟩… notation at the start of the string or a non-capturing group. They have their traditional meanings. Some of them affect the way certain SSRE expressions are translated into SRE, some are purely syntactical (x). The table below lists all of them with short descriptions; in the Translation column, a comma separates the SRE constructs used when the option is on from the alternatives used when the option is off. Initially all options are off except for u.

Option   Translation                      Description
i        (w/nocase …), (w/case …)         case-insensitive matching
m        ^ $ are bol eol, bos eos         multiline mode
s        . is any, nonl                   single-line mode
x        relaxed syntax, regular syntax   free-spacing and comments
n        nothing, ($ …)                   do not capture unnamed groups
u        (w/unicode …), (w/ascii …)       use Unicode matching

Please note that the n option, as expected, does not disable capturing for named capturing groups, and thus is not converted to SRE w/nocapture, which would disable all capturing.

Sample implementation

Implementation note:

Source for the sample implementation (R6RS/R7RS).
Tests (ASCII only).

Acknowledgements

This proposal is based on PCRE syntax and inspired by the pcre->sre converter by Alex Shinn, distributed as a part of the IrRegex package by the same author.

© 2025 Sergei Egorov

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Editor: Arthur A. Gleckler