
SRFI-115 issues



Alex Shinn scripsit:

> How to integrate with the PCRE regular expression library? The
> intention is to make this the primitive notation, and for POSIX
> require a separate wrapper such as (pcre->sre <str>). Alternately
> we could allow both in the same API, as in IrRegex, though this
> introduces an ambiguity. Finally, we could make this entirely separate
> from the PCRE API.

I think the last option is the best way: keep this separate from
string-based REs, with conversions to and from them handled by some
other library.

> From SCSH's SREs I've left out the dsm notation which doesn't seem
> as though it need be exposed to the user, the posix-string notation
> because it's better accomplished with pcre->sre, and uncase whose
> exact semantics and motivation I never quite understood. I also left
> out the blank character class since it's a GNU extension without an
> accepted Unicode definition.

+1 on all points.

> | and & are allowed, but the former must be escaped, which looks
> fairly ugly. For aesthetics they can also be written or and and,
> respectively.

Stick with just `and` and `or`, I think.

> I've kept most IrRegex extensions, but made many of the non-POSIX
> ones optional, designated by the regexp-extended feature, and backref
> specifically gets its own feature regexp-backrefs.

This troubles me.  It leaves too much up to the implementation, and
gives the user too little flexibility.  These extensions work only if
you have a backtracking NFA, which is inherently less efficient.  In
order to provide both efficiency and power, the implementation would
have to provide both a backtracking NFA (to be used in the general
case) and a DFA or Thompson NFA (to be used if the extensions are not
needed).  This is what Perl does, but Perl is a rag-bag by nature.
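
To make the backreference point concrete, here is a minimal sketch in
Python rather than Scheme, chosen only because its re module is a
backtracking matcher. The function name is made up for illustration and
nothing here is part of the SRFI. The pattern accepts strings of the
form a^n b a^n, which is not a regular language, so a pure DFA cannot
recognize it.

```python
import re

# \1 must repeat exactly the text that group 1 matched.
# Strings of the form a^n b a^n are not a regular language,
# so this needs a backtracking matcher rather than a DFA.
PATTERN = re.compile(r"(a+)b\1")


def same_run_on_both_sides(text: str) -> bool:
    """Return True if text is some run of a's, then b, then the same run."""
    return PATTERN.fullmatch(text) is not None
```

A pattern without such extensions could be handed to a DFA-style engine
instead, which is the split between the two libraries proposed below.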

I'd say: leave these things out of the main library, but add another
library that provides them through the same API.  This way, the user
can load (srfi 115) or (srfi 115 extended) and get the most suitable
engine.  Of course, they can be the same engine if the implementer
doesn't care that much about speed.  If the user needs to load both,
using the R6RS/R7RS prefix feature makes both APIs available.

> I left out the common utility patterns integer, domain, url, etc.,
> which can easily enough be included in libraries and unquoted into
> SREs.

I think the large language should have these, either in this SRFI or in
another SRFI, but in any case in (srfi 115 patterns).

> The => shorthand for named matches used by IrRegex would perhaps have
> better been named <-, the more common choice to represent binding in
> parsers, leaving => open for the send-to-procedure idiom used in cond.

No opinion on this, except that if a change is to be made, this is the
time to make it.

> The API uses string indices for start, end and match positions, which
> is slow for a UTF8 implementation.

That is the Right Thing.
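
For readers wondering where the cost comes from, here is a rough sketch
in Python (not Scheme; the function name is hypothetical) of what a
UTF-8 implementation has to do to turn a character index into a byte
offset. It has to scan from the start of the string, so each lookup is
linear in the index unless the implementation keeps some side table.

```python
def byte_offset(utf8: bytes, char_index: int) -> int:
    """Return the byte offset at which code point number char_index starts.

    char_index == number of code points is allowed and yields len(utf8),
    so the result can serve as an exclusive end position.
    """
    seen = 0
    for offset, byte in enumerate(utf8):
        # Continuation bytes have the form 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:
        return len(utf8)
    raise IndexError(char_index)
```

The API still speaks in character indices, as it should. Only the
implementation pays for the translation.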

> Many Unicode properties as well as Unicode script names that are
> available in PCRE are not provided as char-sets here.

I'm working on a Unicode properties API.

> SREs with embedded SRFI 14 char-sets can't be written and read back
> in portably. R7RS WG2 is considering external syntax representations,
> and may include them for SRFI 14 char-sets as well, making this a
> non-issue.

Not quite a non-issue, because if we have those things it will probably
be as macros, not as lexical representations.  So they will need to be
unquoted and won't work in data files.

> On the other hand SREs with embedded compiled regexps, as allowed in
> SCSH, are not supported, largely to preserve writeability. Instead you
> should embed other SREs.

+1

> regexp->sre is frequently requested in IrRegex. It is useful and the
> only argument against it is that it would require more memory for
> compiled regexps (linearly more for most implementations), but I'll
> wait to see if it's requested in the discussion.

I think it's an important thing to have.  The wording should allow for
either caching within the regexp object or decompilation, and should
warn that caching may produce a space leak.
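
A small sketch of the caching strategy, again in Python rather than
Scheme and with every name invented for illustration: nested tuples
stand in for SREs, and a toy translator handles only three operators.
The compiled object simply holds on to the form it was built from, so
the reverse conversion is trivial, at the price of keeping that form
alive for as long as the compiled regexp is.

```python
import re


def sre_to_pattern(sre):
    """Translate a tiny SRE-like form (strings and tuples) to a Python pattern."""
    if isinstance(sre, str):
        return re.escape(sre)  # a string matches itself literally
    op, *args = sre
    parts = [sre_to_pattern(arg) for arg in args]
    if op == "seq":
        return "".join(parts)
    if op == "or":
        return "(?:" + "|".join(parts) + ")"
    if op == "*":
        return "(?:" + "".join(parts) + ")*"
    raise ValueError(f"unsupported operator: {op!r}")


class Regexp:
    """A compiled regexp that caches the SRE-like form it came from."""

    def __init__(self, sre):
        # Caching choice: the source form stays reachable for as long as
        # the compiled object does. This is the extra memory in question.
        self.source = sre
        self.compiled = re.compile(sre_to_pattern(sre))


def regexp_to_sre(rx):
    """Return the cached source form. No decompilation is needed."""
    return rx.source
```

An implementation that decompiles instead would drop the source field
and rebuild an equivalent form from its internal representation, which
saves the memory but need not return the form the user originally wrote.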

-- 
John Cowan    http://ccil.org/~cowan  cowan@xxxxxxxx
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash,
The day and hour soon are coming / When all the IT folks say "Gosh!"
It isn't from a clever lawsuit / That Windowsland will finally fall,
But thousands writing open source code / Like mice who nibble through a wall.
        --The Linux-nationale by Greg Baker