Re: w/ascii and w/unicode

This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.

On Thu, Oct 17, 2013 at 9:46 PM, Michael Montague <mikemon@xxxxxxxxx> wrote:

The statement: "Switching to ASCII mode can improve performance in some implementations." made me wonder if the primary motivation for w/ascii was to improve performance.

It is one of the motivations. The other is that regexps are often used for simple parsing of ASCII-specific formats, as the sentence immediately preceding that says:

"In practice many regular expressions are used for simple parsing and only ASCII characters are relevant."

If you want to parse, say, URLs, then the part of the pattern corresponding to the domain name will include the "alpha" char-set. This should _not_ match any Unicode letter, but only the ASCII letters.
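
As a rough sketch of the URL case, the host part of a pattern can be wrapped in w/ascii so that alpha and num only cover ASCII characters. The names host-sre and ascii-host? are hypothetical, the pattern is deliberately simplified and is not a complete URL or hostname grammar, and the code assumes an implementation that provides (srfi 115).

```scheme
(import (scheme base) (srfi 115))

;; Hypothetical, simplified host pattern: dot-separated labels made of
;; letters, digits, and hyphens. Not a full URL or hostname grammar.
(define host-sre
  '(w/ascii (: (+ (or alpha num #\-))
               (* (: #\. (+ (or alpha num #\-)))))))

;; #t only when the whole string matches the ASCII-only host pattern.
(define (ascii-host? str)
  (regexp-matches? host-sre str))

(ascii-host? "example.com")    ; matches: ASCII letters only
(ascii-host? "παράδειγμα.com") ; no match: Greek letters are outside ASCII
```

Without the w/ascii wrapper, whether the second call matches would depend on the implementation's Unicode support, which is exactly the ambiguity an ASCII-specific format wants to avoid.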

Alex

On 10/17/2013 1:52 AM, Alex Shinn wrote:

On Thu, Oct 17, 2013 at 12:33 PM, Michael Montague <mikemon@xxxxxxxxx> wrote:

Why are w/ascii and w/unicode necessary? The ascii character set can be used instead.

(regexp-search `(: bos (* ,char-set:ascii) eos) "English") => #<rx-match>
(regexp-search `(: bos (* ,char-set:ascii) eos) "Ελληνική") => #f

You seem to be misunderstanding these operators. They apply to all contained patterns. The examples you are referring to are operating on the "letter" character class. You could, if you wanted, use intersection to restrict individual sets to ASCII-only:

(regexp-search `(: bos (* (& ascii letter)) eos) "English") => #<rx-match>
(regexp-search `(: bos (* (& ascii letter)) eos) "Ελληνική") => #f

(regexp-search `(: bos (* letter) eos) "Ελληνική") => #<rx-match>

However, this needs to be duplicated if there are multiple nested csets, and is in fact impossible if the nested cset is part of an external SRE. For example, you can't apply the intersection to the csets inside rx:plurals here, whereas w/ascii can wrap the whole pattern:

(import (only (mystuff regexp-common) rx:plurals))

(regexp-search `(w/ascii ,rx:plurals) "...")
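
To make the contrast concrete, here is a small sketch in which rx:word is a hypothetical stand-in for an SRE defined in another library (in the role of rx:plurals above), using the alpha char-set. The caller treats it as opaque, so it cannot add an intersection inside it, but it can still restrict it from the outside with w/ascii. The code assumes an implementation that provides (srfi 115).

```scheme
(import (scheme base) (srfi 115))

;; Hypothetical stand-in for an SRE exported by another library;
;; the caller treats it as opaque.
(define rx:word '(: bos (+ alpha) eos))

;; Wrap the opaque SRE so every cset inside it is limited to ASCII.
(define (ascii-word? str)
  (regexp-matches? `(w/ascii ,rx:word) str))

(ascii-word? "English")   ; matches
(ascii-word? "Ελληνική")  ; no match: no ASCII letters in the string
```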

--

Alex