Re: w/ascii and w/unicode

This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.

To: Alex Shinn <alexshinn@xxxxxxxxx>

Subject: Re: w/ascii and w/unicode

From: Michael Montague <mikemon@xxxxxxxxx>

Date: Thu, 17 Oct 2013 05:46:39 -0700

Cc: SRFI-115 discussion list <srfi-115@xxxxxxxxxxxxxxxxx>

Delivered-to: srfi-115@xxxxxxxxxxxxxxxxx

In-reply-to: <CAMMPzYNa9O+tVW=BdFVMGL88MHch+1S5=5YTtPujiRRzoUeYiA@mail.gmail.com>

References: <525F5A9C.2040506@gmail.com> <CAMMPzYNa9O+tVW=BdFVMGL88MHch+1S5=5YTtPujiRRzoUeYiA@mail.gmail.com>

User-agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.0.1

The statement "Switching to ASCII mode can improve performance in some implementations" made me wonder whether the primary motivation for w/ascii was to improve performance.

On 10/17/2013 1:52 AM, Alex Shinn wrote:

On Thu, Oct 17, 2013 at 12:33 PM, Michael Montague <mikemon@xxxxxxxxx> wrote:

Why are w/ascii and w/unicode necessary? The ascii character set can be used instead.

(regexp-search `(: bos (* ,char-set:ascii) eos) "English") => #<rx-match>
(regexp-search `(: bos (* ,char-set:ascii) eos) "Ελληνική") => #f

You seem to be misunderstanding these operators. They apply to all contained patterns. The examples you are referring to are operating on the "letter" character class. You could, if you wanted, use intersection to restrict individual sets to ASCII-only:

(regexp-search `(: bos (* (& ascii letter)) eos) "English") => #<rx-match>
(regexp-search `(: bos (* (& ascii letter)) eos) "Ελληνική") => #f

(regexp-search `(: bos (* letter) eos) "Ελληνική") => #<rx-match>

However, this needs to be duplicated multiple times if there are multiple nested csets, and is in fact impossible if the nested cset is part of an external SRE, e.g. you can't do this here:

(import (only (mystuff regexp-common) rx:plurals))

(regexp-search `(w/ascii ,rx:plurals) "...")

--

Alex
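
The point about external SREs can be sketched as a small program. This is an illustration under assumptions, not code from the thread: it assumes an R7RS implementation that provides (srfi 115) with Unicode support (Chibi Scheme is one candidate), rx:word is a hypothetical stand-in for a pattern such as rx:plurals that is defined elsewhere, and it uses alpha, the alphabetic class name in the final SRFI 115, where the messages above write letter.

```scheme
;; Sketch only: assumes (srfi 115) is available and Unicode-aware.
(import (scheme base) (scheme write) (srfi 115))

;; Hypothetical stand-in for an SRE imported from another library.
;; Its character class cannot be edited from the call site.
(define rx:word '(+ alpha))

;; Wrapping the whole SRE in w/ascii applies ASCII mode to every
;; contained pattern, including the nested character class.
(define ascii-word `(w/ascii ,rx:word))

;; Print whether the SRE matches the entire string.
(define (show label sre str)
  (display label)
  (display ": ")
  (write (regexp-matches? sre str))
  (newline))

(show "default, English" rx:word "English")
(show "default, Greek" rx:word "Ελληνική")
(show "w/ascii, English" ascii-word "English")
(show "w/ascii, Greek" ascii-word "Ελληνική")
```

On an implementation with Unicode support, the first three calls should report a match and only the last should fail. The wrapper achieves this without touching rx:word, which is the case the intersection approach cannot cover.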

Follow-Ups:

Re: w/ascii and w/unicode
- From: Alex Shinn

References:

w/ascii and w/unicode
- From: Michael Montague
Re: w/ascii and w/unicode
- From: Alex Shinn