
Re: english names for symbolic SREs



I propose breaking SREs completely free of PCREs: do away with * + ? = ?? *? **? and so on, and have a single way to specify each operation. The short names are meaningless unless you already know
PCREs; to anyone else, '*' means multiplication and '**?' looks like comic-book cussing.

(zero-or-more <sre> ...)       ; 0 or more matches -- or 'zero...' or just keep '*'
(one-or-more <sre> ...)        ; 1 or more matches -- or 'one...' or just keep '+'
(maybe <sre> ...)              ; 0 or 1 matches --- or 'optional' or just keep '?'
(repeat <n> <sre> ...)         ; <n> or more matches
(repeat <m> <n> <sre> ...)     ; <m> to <n> matches
(lazy <n> <sre> ...)           ; <n> or more lazy matches
(lazy <m> <n> <sre> ...)       ; <m> to <n> lazy matches
(non-greedy <n> <sre> ...)     ; <n> or more non-greedy matches
(non-greedy <m> <n> <sre> ...) ; <m> to <n> non-greedy matches
(or <sre> ...)                 ; alternation
(and <sre> ...)                ; sequencing
(submatch <name> <sre> ...)    ; capturing a submatch -- do away with indexed submatches

'repeat', 'lazy', and 'non-greedy' are the general ways to match a variable number of times; for example, (zero-or-more <sre> ...) is the same as (repeat 0 <sre> ...).
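
To make the equivalence concrete, here is a rough sketch in Python rather than Scheme, so that it can be run against an existing regex engine. It is not part of the proposal. Nested tuples stand in for the s-expressions, strings are treated as literals, 'lazy' and 'non-greedy' are assumed to be synonyms, and the helper names to_regex and seq are made up for this illustration.

```python
import re

def seq(parts):
    """Concatenate the translations of several sub-expressions."""
    return "".join(to_regex(p) for p in parts)

def to_regex(sre):
    """Translate a nested-tuple stand-in for the long-name forms into a
    Python `re` pattern string. Strings are matched literally."""
    if isinstance(sre, str):
        return re.escape(sre)
    op, *args = sre
    # The three convenience names are defined purely in terms of 'repeat'.
    if op == "zero-or-more":
        return to_regex(("repeat", 0, *args))
    if op == "one-or-more":
        return to_regex(("repeat", 1, *args))
    if op == "maybe":
        return to_regex(("repeat", 0, 1, *args))
    if op in ("repeat", "lazy", "non-greedy"):
        # One or two leading integers are the bounds; the rest is the body.
        count = 0
        while count < 2 and count < len(args) and isinstance(args[count], int):
            count += 1
        if count == 0:
            raise ValueError("%s needs at least a lower bound" % op)
        low = args[0]
        high = args[1] if count == 2 else ""
        suffix = "" if op == "repeat" else "?"  # assumption: lazy == non-greedy
        return "(?:%s){%s,%s}%s" % (seq(args[count:]), low, high, suffix)
    if op == "and":  # sequencing, as in the list above
        return seq(args)
    if op == "or":   # alternation
        return "(?:" + "|".join(to_regex(a) for a in args) + ")"
    if op == "submatch":  # named submatch only, no indexed form
        name, *body = args
        return "(?P<%s>%s)" % (name, seq(body))
    raise ValueError("unknown operator: %r" % (op,))

print(to_regex(("zero-or-more", "ab")))  # -> (?:ab){0,}
print(to_regex(("repeat", 0, "ab")))     # -> (?:ab){0,}
```

In this sketch only 'repeat' and its lazy variant do any real work, which is the point: the other repetition names are conveniences layered on top of it.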

(char-range <range-spec> ...)    ; ranges
(char-or <cset-sre> ...)         ; union
(char-and <cset-sre> ...)        ; intersection
(char-difference <cset-sre> ...) ; difference
(char-complement <cset-sre> ...) ; complement of union
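
As a rough illustration of what the char-set operations are meant to compute, here is a sketch that models them with Python frozensets. Several things in it are assumptions rather than part of the proposal: a <range-spec> is taken to be a pair of endpoint characters, the universe is limited to ASCII so that the complement is finite, and char-difference is taken to mean the first set minus the union of the rest.

```python
# Assumption: a 128-character ASCII universe, so complements stay finite.
UNIVERSE = frozenset(chr(i) for i in range(128))

def char_range(*pairs):
    """Union of inclusive ranges; each pair is (low_char, high_char)."""
    return frozenset(
        chr(code)
        for low, high in pairs
        for code in range(ord(low), ord(high) + 1)
    )

def char_or(*csets):
    """Union of all the given character sets."""
    return frozenset().union(*csets)

def char_and(*csets):
    """Intersection of all the given character sets."""
    result = UNIVERSE
    for cset in csets:
        result = result & cset
    return result

def char_difference(first, *rest):
    """The first set minus the union of the remaining sets (assumed reading)."""
    return first - char_or(*rest)

def char_complement(*csets):
    """Complement of the union, relative to UNIVERSE."""
    return UNIVERSE - char_or(*csets)

letters = char_range(("a", "z"), ("A", "Z"))
vowels = frozenset("aeiouAEIOU")
consonants = char_difference(letters, vowels)
non_alnum = char_complement(letters, char_range(("0", "9")))
print("b" in consonants, "e" in consonants)  # -> True False
```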

I admit to preferring '*' for zero-or-more, '+' for one-or-more, and '?' for maybe, but I have already been corrupted by PCREs. Either way, I think we should have one or the other: two names for each operation means that much more to remember in order to read an SRE. I know that I was the one who proposed long names for everything in the first place. After thinking about it more, I think that having two names per operation is worse than having only a short name. My real preference, though, is to get rid of the short names and use the long names for everything, or almost everything.

On 11/26/2013 5:01 AM, Alex Shinn wrote:
Traditionally SREs have had the following aliases
allowing the user to choose between brevity and
self-description:

From SCSH:

  | or
  & and
  : seq
 
From IrRegex (in this case introducing a new short form):

  $ submatch
  => submatch-named

For consistency Michael Montague suggested all
SREs have a short and long form.  John Cowan
suggests the following names:

 ? optional
 * zero-or-more
 + one-or-more
 >= at-least
 = exactly
 ** repeated
 ?? non-greedy-optional
 *? non-greedy-zero-or-more
 **? non-greedy-repeated

For the cset-sres we'd also need:

  / char-range (or cset-range?)
  - difference (or diff?)
  ~ complement (or not?)

I would suggest not introducing new short forms
of existing long names.  Comments welcome, but
if there are no objections I'll go with this.

-- 
Alex