Re: revised w/nocase text, considering titlecase and cased

This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.

To: John Cowan <cowan@xxxxxxxxxxxxxxxx>

Subject: Re: revised w/nocase text, considering titlecase and cased

From: Alex Shinn <alexshinn@xxxxxxxxx>

Date: Tue, 13 May 2014 23:16:16 +0900

Cc: SRFI-115 discussion list <srfi-115@xxxxxxxxxxxxxxxxx>

In-reply-to: <20140512055435.GU17946@mercury.ccil.org>

On Mon, May 12, 2014 at 2:54 PM, John Cowan <cowan@xxxxxxxxxxxxxxxx> wrote:

Alex Shinn scripsit:

> As a special case, the pre-defined named character sets
> upper and lower (and their aliases upper-case and lower-case)
> are defined to match all characters with the cased property (L&).
> Note also all other pre-defined named character sets are
> equivalent to themselves under w/nocase.
>
> Rationale: The differences between the case insensitive
> lower and upper and the cased property are few and unlikely
> to match user intention. Moreover, unlike the algorithmically
> mapped upper and lower char-sets, the cased property is
> readily available in most Unicode implementations.
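For illustration only (Python rather than Scheme, and not part of the SRFI text): a minimal sketch of why a single "cased" test is convenient. It assumes L& can be approximated by the general categories Lu, Ll and Lt; the full Unicode Cased property is slightly broader. The helper name is_cased_letter is hypothetical.

```python
import unicodedata

# Assumption: approximate L& by the general categories it groups.
CASED_CATEGORIES = {"Lu", "Ll", "Lt"}

def is_cased_letter(ch: str) -> bool:
    """Return True if ch is an upper-, lower- or titlecase letter."""
    return unicodedata.category(ch) in CASED_CATEGORIES

# U+01C5 is a titlecase digraph: category Lt, so it is neither Lu nor Ll,
# yet it is covered by the combined cased test.
for ch in ["a", "A", "\u01C5", "1", "\u05D0"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_cased_letter(ch))
```

A titlecase letter such as U+01C5 falls outside both Lu and Ll, which is the kind of case the thread subject refers to.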

Looks good to me.

I think this language should also be added:

Note that placing a sequence consisting of a base character
and combining characters into a character string representing
a character set will not do what the user probably expects;
it will create a character set pattern containing the base
character and the combining character(s) as alternatives.
For the same reason, it is inadvisable to apply Unicode
normalization to such strings.

The description would go after the language for the (<string>) literal char-set sre, which currently says:

The set of chars as formed by (string->char-set <string>).

How about just adding:

Note that string->char-set works on code points, not grapheme
clusters, so any combining characters in <string> will be
treated separately from any preceding base characters.
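For illustration only (Python rather than Scheme): a rough sketch of the code-point behaviour described in that note. The function string_to_char_set below is a hypothetical stand-in that models a char-set as a plain set of code points; it is not the SRFI 14 procedure itself.

```python
import unicodedata

def string_to_char_set(s: str) -> set:
    """Hypothetical stand-in: one set member per code point in s."""
    return set(s)

composed = "\u00E9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# The decomposed form yields two independent members, so a pattern
# built from it would match a bare "e" or a bare combining accent.
print(sorted(f"U+{ord(c):04X}" for c in string_to_char_set(decomposed)))

# Normalization changes which code points are present, and therefore
# changes the resulting set.
nfd_set = string_to_char_set(unicodedata.normalize("NFD", composed))
nfc_set = string_to_char_set(unicodedata.normalize("NFC", decomposed))
print(nfd_set == string_to_char_set(decomposed))
print(nfc_set == string_to_char_set(composed))
```

Under this model the same user-visible character gives a one-member or a two-member set depending on its normalization form.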

The only remaining issue is whether we want to expose the cased and titlecase char-sets.

Alex