Title

Scheme Regular Expressions

Author

Alex Shinn

This SRFI is currently in ``draft'' status. To see an explanation of each status that a SRFI can hold, see here. To provide input on this SRFI, please mail to <srfi minus 115 at srfi dot schemers dot org>. See instructions here to subscribe to the list. You can access previous messages via the archive of the mailing list.

Received: 2013/10/08
Draft: 2013/10/12-2013/12/12

Abstract
Issues
Rationale
Types and Naming Conventions
Compatibility Levels and Features
Library Procedures and Syntax
SRE Syntax

Basic Patterns
Repeating patterns
Submatch Patterns
Character Sets
Named Character Sets
Boundary Assertions
Extended Patterns

Implementation
References

Abstract

This SRFI provides a library for matching strings with regular expressions described using the SRE "Scheme Regular Expression" notation first introduced by SCSH, and extended heavily by IrRegex.

Issues

How to integrate with the PCRE regular expression library? The intention is to make this the primitive notation, and for POSIX require a separate wrapper such as (pcre->sre <str>). Alternately we could allow both in the same API, as in IrRegex, though this introduces an ambiguity. Finally, we could make this entirely separate from the PCRE API.

From SCSH's SREs I've left out the dsm notation which doesn't seem as though it need be exposed to the user, the posix-string notation because it's better accomplished with pcre->sre, and uncase whose exact semantics and motivation I never quite understood. I also left out the blank character class since it's a GNU extension without an accepted Unicode definition.

| and & are allowed, but the former must be escaped, which looks fairly ugly. For aesthetics they can also be written or and and, respectively.

I've kept most IrRegex extensions, but made many of the non-POSIX ones optional, designated by the regexp-extended feature, and backref specifically gets its own feature regexp-backrefs. I left out the common utility patterns integer, domain, url, etc., which can easily enough be included in libraries and unquoted into SREs.

The => shorthand for named matches used by IrRegex would perhaps have better been named <-, the more common choice to represent binding in parsers, leaving => open for the send-to-procedure idiom used in cond.

The API uses string indices for start, end and match positions, which is slow for a UTF8 implementation. However, the reference implementation uses string cursors for efficient iteration, minimizing offset conversions, and suffers no penalty if submatch strings are directly extracted instead of bounds.

Unicode properties and grapheme handling have no precedent in SRE implementations, though has much precedent in other regexp libraries. Making Unicode the default feels right, but the vast majority of regexps are likely to want ASCII.

Many Unicode properties as well as Unicode script names that are available in PCRE are not provided as char-sets here.

SREs with embedded SRFI 14 char-sets can't be written and read back in portably. R7RS WG2 is considering external syntax representations, and may include them for SRFI 14 char-sets as well, making this a non-issue. On the other hand SREs with embedded compiled regexps, as allowed in SCSH, are not supported, largely to preserve writeability. Instead you should embedded other SREs.

regexp->sre is frequently requested in IrRegex. It is useful and the only argument against it is that it would require more memory for compiled regexps (linearly more for most implementations), but I'll wait to see if it's requested in the discussion.

There aren't enough examples.

Rationale

Regular expressions are the lingua franca of string matching today, used in everything from editors to search engines to APIs for everything in between. Other tools may be better suited to specific purposes, but it is assumed any modern language provide regular expression support.

The choice to use SREs instead of strings for regular expressions is a natural one, since they are easier to read and extend as well as being both faster and simpler to compile. An efficient reference implementation of this SRFI can be written in 800 lines of code, whereas in IrRegex the full PCRE parser alone requires over 500 lines.

Procedure Index

regexp	rx	char-set->sre	valid-sre?
regexp?	regexp-match	regexp-match?	regexp-search
regexp-fold	regexp-extract	regexp-split	regexp-partition
regexp-replace	regexp-replace-all	rx-match?	rx-match-count
rx-match-submatch	rx-match-submatch-start	rx-match-submatch-end	rx-match->list

Sre Syntax Index

<string>	seq	:	or
\|	w/nocase	w/case	w/ascii
w/unicode	?	*	+
>=	=	**	submatch
$	submatch-named	=>	backref
<char>	(<string>)	/	or
~	-	and	&
any	nonl	ascii	lower-case
lower	upper-case	upper	alphabetic
alpha	numeric	num	alphanumeric
alphanum	alnum	punctuation	punct
symbol	graphic	graph	whitespace
white	space	printing	print
control	cntrl	hex-digit	xdigit
bos	eos	bol	eol
bow	eow	nwb	word
word+	word	bog	eog
grapheme	??	*?	**?
look-ahead	look-behind	neg-look-ahead	neg-look-behind
atomic	if	commit

Types and Naming Conventions

We introduce two new types, regexp and rx-match, which are disjoint from all other types. We also introduce the concept of an "SRE," which is not a disjoint type but is a Scheme object following the specification described below.

SRFI 14 defines the char-set type, which can be used as part of an SRE.

In the prototypes below the following naming conventions imply type restrictions:

char-set: a SRFI 14 character set
cset-sre: an sre which corresponds to matching a single character out of a set of characters
end: an exact, non-negative integer, defaulting to the (string-length str)
finish: a procedure (lambda (i rx-match str obj) ...)
obj: any object
knil: any object
kons: a procedure (lambda (i rx-match str obj) ...)
re: an SRE or pre-compiled regexp object
rx-match: an rx-match object from a successful match
sre: an SRE as described below
start: an exact, non-negative integer, defaulting to 0
str: a string
subst: an sexp describing a substition template
X-or-false: either an object of type X or the false value

Compatibility Levels and Features

We specify a thorough, though not exhaustive, syntax with many extensions popular in modern regular expression libraries such as PCRE. This is because it is assumed in many cases said libraries will be used as the underlying implementation, the features will be desirable, and if left unspecified people will provide their own, often incompatible, extensions.

On the other hand it is acknowledged that not all implementations will be able to support all extensions. Some, like commit, are specific to backtracking, some are difficult to implement for DFA implementations, and some, like backref, are prohibitively expensive for any implementation. Furthermore, even if an implementation has Unicode support, its regexp library may not.

To resolve these differences we divide the syntax into a minimal core which all implementations are required to support, and additional extensions. In R7RS or other implementations which support SRFI 0 cond-expand, the availability can be tested with the following cond-expand features:

regexp-extended - the extended patterns described below are supported
regexp-backrefs - the backref pattern is supported
regexp-unicode - regexp character sets support Unicode

The first two simply refer to support for certain SRE patterns.

regexp-unicode indicates support for Unicode contexts. Toggling between Unicode and ASCII can be done with the w/unicode and w/ascii patterns. In a Unicode context, the named character sets have their full Unicode definition as described below, grapheme boundaries are "extended grapheme clusters," and word boundaries are "default word boundaries" as defined in UAX #29 (Unicode Text Segmentation). Thus Unicode contexts are equivalent to Level 2 support for regular expressions as defined in Unicode TR-18. Implementations which provide this feature may still support non-Unicode characters.

Library Procedures and Syntax

(regexp re) => regexp

Compile a regexp if given an object whose structure matches the SRE syntax. This may be written as a literal or partial literal with quote or quasiquote, or may be generated entirely programmatically. Returns re unmodified if it is already a regexp. Raises an error if re is neither a regexp nor a valid representation of an SRE.

(rx sre ...) => regexp

Macro shorthand for (regexp `(: sre ...)). May be able to perform some or all computation at compile time if sre is not unquoted. Note because of this equivalence with the procedural constructor regexp, the semantics of unquote differs from the original SCSH implementation in that unquoted expressions can expand into any object matching the SRE syntax, rather than a compiled regexp object. Further, unquote and unquote-splicing both expand all matches.

Rationale: Providing a procedural interface provides for greater flexibility, and without loss of potential compile-time optimizations by preserving the syntactic shorthand. The alternative is to rely on eval to dynamically generate regular expressions. However regexps in many cases come from untrusted sources, such as search parameters to a server, or from serialized sources such as config files or command-line arguments. Moreover many applications may want to keep many thousands of regexps in memory at once. Given the relatively heavy cost and insecurity of eval, and the frequency with which SREs are read and written as text, we prefer the procedural interface.

(char-set->sre char-set) => sre

Returns an SRE corresponding to the given SRFI 14 character set. The resulting SRE expands the character set into notation which does not make use of embedded SRFI 14 character sets, and so is suitable for writing portably.

(valid-sre? obj) => boolean

Returns true iff obj can be safely passed to regexp.

(regexp? obj) => boolean

Returns true iff obj is a regexp.

(regexp-match re str [start [end]]) => rx-match-or-false

Returns an rx-match object if re successfully matches the entire string str from start (inclusive) to end (exclusive), or #f is the match fails. The rx-match object will contain information needed to extract any submatches.

(regexp-match? re str [start [end]]) => boolean?

Returns #t if re matches str as in regexp-match, or #f otherwise. May be faster than regexp-match since it doesn't need to return submatch data.

(regexp-search re str [start [end]]) => rx-match-or-false

Returns an rx-match object if re successfully matches a substring of str between start (inclusive) and end (exclusive), or #f is the match fails. The rx-match object will contain information needed to extract any submatches.

(regexp-fold re kons knil str [finish [start [end]]]) => obj

The fundamental regexp matching iterator. Repeatedly searches str for the regexp re so long as a match can be found. On each successful match, applies

   (kons i rx-match str acc)

where i is the index since the last match (beginning with start), rx-match is the resulting match, and acc is the result of the previous kons application, beginning with knil. When no more matches can be found, calls finish with the same argument, except that rx-match is #f.

By default finish just returns acc.

(regexp-extract re str [start [end]]) => list

Extract all non-empty substrings of str which match re between start and end as a list of strings.

   (regexp-extract '(+ numeric) "192.168.0.1")
   => ("192" "168" "0" "1")

(regexp-split re str [start [end]]) => list

Split str into a list of strings separated by matches of re.

   (regexp-split '(+ space) " fee fi  fo\tfum\n")
   => ("fee" "fi" "fo" "fum")

(regexp-partition re str [start [end]]) => list

Partition str into a list of non-empty strings matching re, interspered with the unmatched portions of the string. The first and every odd element is an unmatched substring, which will be the empty string if re matches at the beginning of the string or end of the previous match. The second and every even element will be a substring matching re. If the final match ends at the end of the string, no trailing empty string will be included. Thus, in the degenerate case where str is the empty string, the result is ("").

   (regexp-partition '(+ (or space punct)) "")
   => ("")
   (regexp-partition '(+ (or space punct)) "Hello, world!\n")
   => ("Hello" ", " "world" "!\n")
   (regexp-partition '(+ (or space punct)) "¿Dónde Estás?")
   => ("" "¿" "Dónde" " " "Estás" "?")

(regexp-replace re str subst [start [end]]) => string

Returns a new string replacing the first match of re in str with the subst. subst can be a string, an integer or symbol indicating the contents of a numbered or named submatch of re, 'pre for the substring to the left of the match, or 'post for the substring to the right of the match.

   (regexp-replace '(+ space) "one two three" "_")
   => "one_two three"

(regexp-replace-all re str subst [start [end]]) => string

Equivalent to regexp-replace, but replaces all occurrences of re in str.

   (regexp-replace-all '(+ space) "one two three" "_")
   => "one_two_three"

(rx-match? obj) => boolean

Returns true iff obj is a successful match from regexp-match or regexp-search.

(rx-match-count rx-match) => integer

Returns the number of submatches of rx-match, regardless of whether they matched or not.

(rx-match-submatch rx-match str field) => string-or-false

Returns the substring of str in rx-match corresponding to field, either an integer or a symbol for a named submatch. Index 0 refers to the entire match, index 1 to the first lexicographic submatch, and so on. If an integer outside the range of matches, or a symbol is passed which does not correspond to a named submatch of the pattern, it is an error. If the corresponding submatch did not match, returns false.

(rx-match-submatch-start rx-match str field) => integer-or-false

Returns the start index of str in rx-match corresponding to field, as in rx-match-submatch.

(rx-match-submatch-end rx-match str field) => integer-or-false

Returns the end index of str in rx-match corresponding to field, as in rx-match-submatch.

(rx-match->list rx-match str) => list

Returns a list of all submatches in rx-match as string or false, beginning with the entire match 0.

SRE Syntax

The grammar for SREs is summarized below. Note that an SRE is a first-class object consisting of nested lists of strings, chars, char-sets, symbols and numbers. Where the syntax is described as (foo bar), this can be constructed equivalently as '(foo bar) or (list 'foo 'bar), etc. The following sections explain the semantics in greater detail.

    <sre> ::=
     | <string>                    ; A literal string match.
     | <cset-sre>                  ; A character set match.
     | (* <sre> ...)               ; 0 or more matches.
     | (+ <sre> ...)               ; 1 or more matches.
     | (? <sre> ...)               ; 0 or 1 matches.
     | (= <n> <sre> ...)           ; <n> matches.
     | (>= <n> <sre> ...)          ; <n> or more matches.
     | (** <n> <m> <sre> ...)      ; <n> to <m> matches.

     | (|  <sre> ...)              ; Alternation.
     | (or <sre> ...)

     | (:   <sre> ...)             ; Sequence.
     | (seq <sre> ...)     
     | ($ <sre> ...)               ; Numbered submatch.
     | (submatch <sre> ...)
     | (=> <name> <sre> ...)               ;  Named submatch.  <name> is
     | (submatch-named <name> <sre> ...)   ;  a symbol.

     | (w/case   <sre> ...)        ; Introduce a case-sensitive context.
     | (w/nocase <sre> ...)        ; Introduce a case-insensitive context.

     | (w/unicode   <sre> ...)     ; Introduce a unicode context.
     | (w/ascii <sre> ...)         ; Introduce an ascii context.

     | bos                         ; Beginning/end of string.
     | eos                         ; End of string.

     | bol                         ; Beginning of line.
     | eol                         ; End of line.

     | bog                         ; Beginning of grapheme cluster.
     | eog                         ; End of grapheme cluster.
     | graheme                     ; A single grapheme cluster.

     | bow                         ; Beginning of word.
     | eow                         ; End of word.
     | nwb                         ; A non-word boundary.
     | (word <sre> ...)            ; A sre wrapped in word boundaries.
     | (word+ <cset-sre> ...)      ; A single word restricted to a cset.
     | word                        ; A single word.

     | (?? sre ...)                ; A non-greedy pattern, 0 or 1 match.
     | (*? sre ...)                ; Non-greedy 0 or more matches.
     | (**? m n sre ...)           ; Non-greedy <m> to <n> matches.
     | (look-ahead sre ...)        ; Zero-width look-ahead assertion.
     | (look-behind sre ...)       ; Zero-width look-behind assertion.
     | (neg-look-ahead sre ...)    ; Zero-width negative look-ahead assertion.
     | (neg-look-behind sre ...)   ; Zero-width negative look-behind assertion.
     | (atomic sre ...)            ; Match once and don't backtrack.
     | (if test pass fail)         ; Conditional patterns.
     | commit                      ; Suppress backtracking.

The grammar for cset-sre is as follows.

    <cset-sre> ::=
     | <char>                      ; literal char
     | "<char>"                    ; string of one char
     | <char-set>                  ; embedded SRFI 14 char set
     | (<string>)                  ; literal char set
     | (/ <range-spec> ...)        ; ranges
     | (or <cset-sre> ...)         ; union
     | (and <cset-sre> ...)        ; intersection
     | (- <cset-sre> ...)          ; difference
     | (~ <cset-sre> ...)          ; complement of union
     | (w/case <cset-sre> ...)     ; case and unicode toggling
     | (w/nocase <cset-sre> ...)
     | (w/ascii <cset-sre> ...)
     | (w/unicode <cset-sre> ...)
     | any | nonl | ascii | lower-case | lower
     | upper-case | upper | alphabetic | alpha
     | numeric | num | alphanumeric | alphanum | alnum
     | punctuation | punct | symbol | graphic | graph
     | whitespace | white | space | printing | print
     | control | cntrl | hex-digit | xdigit

    <range-spec> ::= <string> | <char>

Basic Patterns

<string>

A literal string.

   (regexp-search "needle" "hayneedlehay") => #<rx-match>
   (regexp-search "needle" "haynEEdlehay") => #f

(seq sre ...)

(: sre ...)

Sequencing.

   (regexp-search '(: "one" space "two" space "three") "one two three") => #<rx-match>

(or sre ...)

(|\|| sre ...)

Alternation.

   (regexp-search '(or "eeney" "meeney" "miney") "meeney") => #<rx-match>
   (regexp-search '(or "eeney" "meeney" "miney") "moe") => #f

(w/nocase sre ...)

Enclosed sres are case-insensitive.

   (regexp-search '(w/nocase "needle") "haynEEdlehay") => #<rx-match>

(w/case sre ...)

Enclosed sres are case-sensitive. This is the default.

   (regexp-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") => #<rx-match>

(w/ascii sre ...)

Enclosed sres are interpreted in an ASCII context. In practice many regular expressions are used for simple parsing and only ASCII characters are relevant. Switching to ASCII mode can improve performance in some implementations.

   (regexp-search '(w/ascii bos (* letter) eos) "English") => #<rx-match>
   (regexp-search '(w/ascii bos (* letter) eos) "Ελληνική") => #f

(w/unicode sre ...)

Enclosed sres are interpreted in a Unicode context - character sets with both an ASCII and Unicode definition take the latter. Has no effect if the regexp-unicode feature is not provided. This is the default.

   (regexp-search '(w/unicode bos (* letter) eos) "English") => #<rx-match>
   (regexp-search '(w/unicode bos (* letter) eos) "Ελληνική") => #<rx-match>

Repeating patterns

(? sre ...)

An optional pattern - matches 1 or 0 times.

   (regexp-search '(: "match" (? "es") "!") "matches!") => #<rx-match>
   (regexp-search '(: "match" (? "es") "!") "match!") => #<rx-match>
   (regexp-search '(: "match" (? "es") "!") "matche!") => #f

(* sre ...)

Kleene star, matches 0 or more times.

   (regexp-search '(: "<" (* (~ #\>)) ">") "<html>") => #<rx-match>
   (regexp-search '(: "<" (* (~ #\>)) ">") "<>") => #<rx-match>
   (regexp-search '(: "<" (* (~ #\>)) ">") "<html") => #f

(+ sre ...)

1 or more matches. Like * but requires at least a single match.

   (regexp-search '(: "<" (+ (~ #\>)) ">") "<html>") => #<rx-match>
   (regexp-search '(: "<" (+ (~ #\>)) ">") "<a>") => #<rx-match>
   (regexp-search '(: "<" (+ (~ #\>)) ">") "<>") => #f

(>= n sre ...)

More generally, n or more matches.

   (regexp-search '(: "<" (>= 3 (~ #\>)) ">") "<table>") => #<rx-match>
   (regexp-search '(: "<" (>= 3 (~ #\>)) ">") "<pre>") => #<rx-match>
   (regexp-search '(: "<" (>= 3 (~ #\>)) ">") "<tr>") => #f

(= n sre ...)

Exactly n matches.

   (regexp-search '(: "<" (= 4 (~ #\>)) ">") "<html>") => #<rx-match>
   (regexp-search '(: "<" (= 4 (~ #\>)) ">") "<table>") => #f

(** from to sre ...)

The most general form, from n to m matches, inclusive.

   (regexp-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") => #<rx-match>
   (regexp-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") => #f

Submatch Patterns

(submatch sre ...)

($ sre ...)

A numbered submatch. The contents matching the pattern will be available in the resulting rx-match.

(submatch-named name sre ...)

(=> name sre ...)

A named submatch. Behaves just like submatch, but the field may also be referred to by name.

(backref n-or-name)

Optional: Match a previously matched submatch. The feature regexp-backrefs will be provided if this pattern is supported. Backreferences are expensive, and can trivially be shown to be NP-hard, so one should avoid their use even in implementations which support them.

Character Sets

A character set pattern matches a single character.

<char>

A singleton char set.

   (regexp-match '(* #\-) "---") => #<rx-match>
   (regexp-match '(* #\-) "-_-") => #f

"<char>"

A singleton char set written as a string of length one rather than a character. Equivalent to its interpretation as a literal string match, but included to clarify it can be composed in cset-sres.

<char-set>

A SRFI 14 character set, which matches any character in the set. Note that currently there is no portable written representation of SRFI 14 character sets, which means that this pattern is typically generated programmatically, such as with a quasiquoted expression.

   (regexp-partition `(+ ,char-set:vowels) "vowels")
   => ("v" "o" "w" "e" "ls")

Rationale: Many useful character sets are likely to be available as SRFI 14 char-sets, so it is desirable to want to reuse them in regular expressions. Since many Unicode character sets are extremely large, converting back and forth between an internal and external representation can be prohibitively expensive, so the option of direct embedding is necessary. When a readable external representation is needed, char-set->sre can be used.

(<string>)

The set of chars as formed by (string->char-set <string>).

   (regexp-match '(* ("aeiou")) "oui") => #<rx-match>
   (regexp-match '(* ("aeiou")) "ouais") => #f

(/ <range-spec> ...)

Ranged char set. The <range-spec> is a list of strings and characters. These are flattened and grouped into pairs of characters, and all ranges formed by the pairs are included in the char set.

   (regexp-match '(* (/ "AZ09")) "R2D2") => #<rx-match>
   (regexp-match '(* (/ "AZ09")) "C-3PO") => #f

(or <cset-sre> ...)

Char set union.

(~ <cset-sre> ...)

Char set complement (i.e. [^...]).

(- <cset-sre> ...)

Char set difference.

   (regexp-match '(* (- (/ "az") ("aeiou"))) "xyzzy") => #<rx-match>
   (regexp-match '(* (- (/ "az") ("aeiou"))) "vowels") => #f

(and <cset-sre> ...)

(& <cset-sre> ...)

Char set intersection.

   (regexp-match '(* (& (/ "az") (~ ("aeiou")))) "xyzzy") => #<rx-match>
   (regexp-match '(* (& (/ "az") (~ ("aeiou")))) "vowels") => #f

Named Character Sets

any

Match any character.

nonl

Match any character other than #\newline.

ascii

Match any ASCII character [0..127].

lower-case

lower

Matches any character for which char-lower-case? returns true. In a Unicode context this corresponds to the Lowercase (Ll) property. In an ASCII context corresponds to (/ "az").

upper-case

upper

Matches any character for which char-upper-case? returns true. In a Unicode context this corresponds to the Uppercase (Lu) property. In an ASCII context corresponds to (/ "AZ").

alphabetic

alpha

Matches any character for which char-alphabetic? returns true. In a Unicode context this corresponds to the Alphabetic (L) property. In an ASCII context corresponds to (w/nocase (/ "az")).

numeric

num

Matches any character for which char-numeric? returns true. For In a Unicode context this corresponds to the Numeric_Digit (Nd) property. In an ASCII context corresponds to (/ "09").

alphanumeric

alphanum

alnum

Matches any character which is either a letter or number. Equivalent to:

   (or alphabetic numeric)

punctuation

punct

Matches any punctuation character. In a Unicode context this corresponds to the Punctuation property (P). In an ASCII context this corresponds to "!\"#%&'()*,-./:;?@[\]_{}".

symbol

Matches any symbol character. In a Unicode context this corresponds to the Symbol property (Sm, Sc, Sk, or So). In an ASCII this corresponds to "$+<=>^`|~".

graphic

graph

Matches any graphic character. Equivalent to:

   (or alphanumeric punctuation symbol)

whitespace

white

space

Matches any whitespace character. In a Unicode context this corresponds to the Separator property (Zs, Zl or Zp). In an ASCII context this corresponds to space, tab, line feed, form feed, and carriage return.

printing

print

Matches any printing character. Equivalent to:

   (or graphic whitespace)

control

cntrl

Matches any control or other character. In a Unicode context this corresponds to the Other property (Cc, Cf, Co, Cs or Cn). In an ASCII context this corresponds to:

   `(/ ,(integer->char 0) ,(integer-char 31))

hex-digit

xdigit

Matches any valid digit in hexadecimal notation. Alway ASCII-only. Equivalent to:

   (w/ascii (w/nocase (or numeric "abcdef")))

Boundary Assertions

bos

eos

Matches at the beginning/end of string without consuming any characters (a zero-width assertion). If the search was initiated with start/end parameters, these are considered the end points, rather than the full string.

bol

eol

Matches at the beginning/end of a line without consuming any characters (a zero-width assertion). A line is a possibly empty sequence of characters followed by an end of line sequence as understood by the R7RS read-line procedure, specifically any of a linefeed character, carriage return character, or a carriage return followed by a linefeed character. The string is assumed to contain end of line sequences before the start and after the end of the string, even if the search was made on a substring and the actual surrounding characters differ.

bow

eow

Matches at the beginning/end of a word without consuming any characters (a zero-width assertion). In a Unicode context follows the default word boundary specification from TR29. In an ASCII context a word is a sequence of one or more characters that are either alphanumeric or the underscore character. The string is assumed to contain non-word characters immediately before the start and after the end, even if the search was made on a substring and word constituent characters appear immediately before the beginning or after the end.

nwb

Matches a non-word-boundary (i.e. \B in PCRE). Equivalent to (neg-look-ahead (or bow eow)).

(word sre ...)

Anchor a sequence to word boundaries. Equivalent to:

   (: bow sre ... eow)

(word+ cset-sre ...)

Matches a single word composed of characters in the intersection of the given cset-sre and the word constituent characters. Equivalent to:

   (word (+ (and (or alphanumeric "_") (or cset-sre ...))))

word

A shorthand for (word+ any).

bog

eog

Matches at the beginning/end of a single grapheme cluster without consuming any characters (a zero-width assertion). A grapheme cluster is defined in Unicode TR18 as a single non-combining codepoint followed by zero or more combining marks (characters with the Mark property, Mn, Mc, Me). The string is assumed to contain non-combining codepoints immediately before the start and after the end. These always succeed in an ASCII context.

grapheme

Matches a single grapheme cluster (i.e. \X in PCRE). This is what the end-user typically thinks of as a single character, comprised of a base non-combining codepoint followed by zero or more combining marks. In an ASCII context this is equivalent to any.

Assuming char-set:mark contains all characters with the Mark property, then this is equivalent to:

   `(: bog (~ ,char-set:mark) (* ,char-set:mark) eog)

Extended Patterns

The following patterns are only supported if the feature regexp-extended is provided.

(?? sre ...)

Non-greedy pattern, matches 0 or 1 times, preferring the shorter match.

(*? sre ...)

Non-greedy kleene star, matches 0 or more times, preferring the shorter match.

(**? m n sre ...)

Non-greedy kleene star, matches m to n times, preferring the shorter match.

(look-ahead sre ...)

Zero-width look-ahead assertion. Assert the sequence matches from the current position, without advancing the position.

(look-behind sre ...)

Zero-width look-behind assertion. Assert the sequence matches behind the current position, without advancing the position. It is an error if the sequence does not have a fixed length.

(neg-look-ahead sre ...)

Zero-width negative look-ahead assertion.

(neg-look-behind sre ...)

Zero-width negative look-behind assertion.

(atomic sre ...)

For (?>...) independent patterns.

(if test pass [fail])

Conditional patterns.

commit

Don't backtrack beyond this (i.e. cut).

Implementation

A reference implementation in portable R7RS is available at

    https://code.google.com/p/chibi-scheme/source/browse/lib/chibi/regexp.scm

depending only on SRFI 14, SRFI 33 and SRFI 69. This is implemented as a Thompson-style non-bactracking NFA, a discussion of which can be found at Russ Cox's Implementing Regexps. Note the reference implementation may not be up to date with the latest draft prior to finalization.

References

R7RS

      Alex Shinn, John Cowan, Arthur Gleckler, Revised⁷ Report on the Algorithmic Language Scheme
      http://trac.sacrideo.us/wg/raw-attachment/wiki/WikiStart/r7rs.pdf

SCSH

      Olin Shivers, A Scheme Shell
      Massachusetts Institute of Technology Cambridge, MA, USA, 1994
      http://www.scsh.net/docu/scsh-paper/scsh-paper.html

IrRegex

      Alex Shinn, IrRegex - IrRegular Expressions
      http://synthcode.com/scheme/irregex/

TR18

      Mark Davis, Andy Heninger, UTR #18: Unicode Regular Expressions
      http://www.unicode.org/reports/tr18/

UAX29

      Mark Davis, UAX #29: Unicode Text Segmentation
      http://www.unicode.org/reports/tr29/

SRFI 0

      Marc Feeley, Feature-based conditional expansion construct
      http://srfi.schemers.org/srfi-14/srfi-14.html

SRFI 14

      Olin Shivers, Character-set Library
      http://srfi.schemers.org/srfi-14/srfi-14.html

ImplementingRegexps

      Russ Cox, Implementing Regular Expressions
      http://swtch.com/~rsc/regexp/

Tcl

      Russ Cox, Henry Spencer's Tcl Regex Library
      http://compilers.iecc.com/comparch/article/07-10-026

Gauche

      Shiro Kawai, Gauche Scheme - Regular Expressions
      http://practical-scheme.net/gauche/man/gauche-refe_49.html

Perl6

      Damian Conway, Perl6 Exegesis 5 - Regular Expressions
      http://www.perl.com/pub/a/2002/08/22/exegesis5.html

PCRE

      Philip Hazel, PCRE - Perl Compatible Regular Expressions
      http://www.pcre.org/

Copyright

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Editor: Mike Sperber

Table of Contents

Procedure Index

Sre Syntax Index

Copyright