
Re: w/ascii and w/unicode



On Thu, Oct 17, 2013 at 9:46 PM, Michael Montague <mikemon@xxxxxxxxx> wrote:
The statement "Switching to ASCII mode can improve performance in some implementations." made me wonder whether the primary motivation for w/ascii was to improve performance.

It is one of the motivations.  The other is that regexps
are often used for simple parsing of ASCII-specific formats,
as the sentence immediately preceding that says:

  In practice many regular expressions are used for simple
  parsing and only ASCII characters are relevant.

If you want to parse, say, URLs, then the part of the pattern
corresponding to the domain name will include the "alpha"
char-set.  This should _not_ match any Unicode letter, but
only the ASCII letters.
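
As a rough illustration of the URL case, here is a minimal sketch assuming an R7RS implementation that provides (srfi 115). The names rx:host and ascii-host? are hypothetical, and the pattern is a deliberately simplified host-name shape, not a full URL grammar.

```scheme
(import (scheme base) (srfi 115))

;; Simplified, hypothetical host-name pattern: dot-separated labels
;; made of letters, digits and hyphens.  Not a complete URL grammar.
;; w/ascii wraps the whole pattern, so both uses of `alpha` and
;; `numeric` are limited to their ASCII members.
(define rx:host
  (regexp '(w/ascii (: (+ (or alpha numeric "-"))
                       (* "." (+ (or alpha numeric "-")))))))

;; True when the entire string has the host-name shape above.
(define (ascii-host? str)
  (regexp-matches? rx:host str))
```

Under these assumptions, (ascii-host? "example.com") should succeed, while a host written with Greek letters should be rejected even on an implementation with full Unicode support.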

-- 
Alex
 


On 10/17/2013 1:52 AM, Alex Shinn wrote:
On Thu, Oct 17, 2013 at 12:33 PM, Michael Montague <mikemon@xxxxxxxxx> wrote:
Why are w/ascii and w/unicode necessary? The ascii character set can be used instead.

(regexp-search `(: bos (* ,char-set:ascii) eos) "English") => #<rx-match>
(regexp-search `(: bos (* ,char-set:ascii) eos) "Ελληνική") => #f

You seem to be misunderstanding these operators.  They apply
to all contained patterns.  The examples you are referring to
are operating on the "letter" character class.  You could, if you
wanted, use intersection to restrict individual sets to ASCII-only:

(regexp-search `(: bos (* (& ascii letter)) eos) "English") => #<rx-match>
(regexp-search `(: bos (* (& ascii letter)) eos) "Ελληνική") => #f
(regexp-search `(: bos (* letter) eos) "Ελληνική") => #<rx-match>

However, the intersection has to be repeated for every nested
cset, and it is in fact impossible when the nested cset is part
of an external SRE.  For example, intersection cannot replace
w/ascii here:

(import (only (mystuff regexp-common) rx:plurals))
(regexp-search `(w/ascii ,rx:plurals) "...")
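
To make the external-SRE case concrete, here is a minimal sketch assuming an R7RS implementation that provides (srfi 115). rx:word-list is a hypothetical stand-in for a pattern defined in another library, such as rx:plurals above; the helper names are likewise made up for illustration.

```scheme
(import (scheme base) (srfi 115))

;; Hypothetical stand-in for an SRE exported by some other library.
;; It uses the `alpha` char-set in two nested places, and the caller
;; cannot edit it to insert (& ascii ...) around each one.
(define rx:word-list
  '(: (+ alpha) (* ", " (+ alpha))))

;; Wrapping the imported SRE in w/ascii applies to every char-set
;; nested inside it, without changing the definition of rx:word-list.
(define rx:ascii-word-list
  (regexp `(w/ascii ,rx:word-list)))

;; The same SRE with the default interpretation of `alpha`.
(define rx:default-word-list
  (regexp rx:word-list))

;; Whole-string matchers for the two variants.
(define (ascii-word-list? str)
  (regexp-matches? rx:ascii-word-list str))
(define (default-word-list? str)
  (regexp-matches? rx:default-word-list str))
```

With these definitions, ascii-word-list? should accept "cats, dogs" and reject a list of Greek words, whereas default-word-list? is expected to accept both on an implementation with full Unicode support.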

-- 
Alex