Re: Should SRFI-115 character sets match extended grapheme clusters?

This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.

To: John Cowan <cowan@xxxxxxxxxxxxxxxx>

Subject: Re: Should SRFI-115 character sets match extended grapheme clusters?

From: Alex Shinn <alexshinn@xxxxxxxxx>

Date: Mon, 12 May 2014 11:38:35 +0900

Cc: Mark H Weaver <mhw@xxxxxxxxxx>, SRFI-115 discussion list <srfi-115@xxxxxxxxxxxxxxxxx>

On Mon, May 12, 2014 at 6:39 AM, John Cowan <cowan@xxxxxxxxxxxxxxxx> wrote:

Mark H Weaver scripsit:

> Normalization is an important part of the solution, but it alone does
> not solve the problem where no precomposed character exists. Figure 5
> of TR15 gives some examples where NFC produces more than one codepoint
> per character.

Ah, I understand now. The trouble is that normalization of a char-set
pattern causes it to mean something completely different. Thus ("á")
(i.e. ("\xE1;")) matches the character #\xE1;, whereas ("á")
(i.e. ("a\x301;")), although canonically equivalent to it, matches the
disjunction of #\x61; and #\x301;. They will never match the same thing,
which is counterintuitive. Unfortunately, I don't see what can be done
about this other than to issue stern warnings in the documentation.
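
The mismatch described above can be reproduced outside Scheme. The following is a minimal sketch in Python, assuming that a char-set pattern built from a string behaves like the set of that string's code points; the variable names are illustrative only and are not part of SRFI 115.

```python
import unicodedata

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE, one code point
decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT

# The two strings are canonically equivalent: NFC maps both to U+00E1.
assert unicodedata.normalize("NFC", decomposed) == precomposed

# Assumption: a char-set pattern is the set of code points in its string.
set_pre = set(precomposed)   # one member: U+00E1
set_dec = set(decomposed)    # two members: U+0061 and U+0301

# The sets share no member, so the two patterns never match the same character.
assert set_pre.isdisjoint(set_dec)
```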

Alex, do you think you can make a w/norm or norm-char-set SRE pattern
work? It would mean transforming a charset pattern containing "a\x301;"
to one that contains "\xE1;", and also transforming a pattern containing
"f\x301;" (which has no precomposed form) into (seq #\f #\x301;).
In the general case it would produce an alternation of sequences,
and would have to normalize the part of the text being matched as well
(unless it comes in two flavors, one for NFD and the other for NFC).
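
One way to picture the proposed transformation is the sketch below, written in Python rather than as an SRE. It assumes a hypothetical helper, norm_char_set, that splits the set's members into a base character plus its combining marks (a simplification, not full extended grapheme clusters), normalizes each piece, and compiles the result to an alternation of sequences. The text being matched is normalized to the same form first.

```python
import re
import unicodedata

def clusters(s):
    """Split s into base characters, each followed by its combining marks.
    A simplification, not full UAX #29 extended grapheme clusters."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch          # attach the mark to the preceding base
        else:
            out.append(ch)         # start a new cluster
    return out

def norm_char_set(members, form="NFC"):
    """Hypothetical analogue of a norm-char-set pattern: an alternation
    of the normalized sequences found in members."""
    seqs = {unicodedata.normalize(form, c) for c in clusters(members)}
    # Try longer sequences first so a multi-code-point member wins.
    alts = sorted(seqs, key=lambda s: (-len(s), s))
    return re.compile("|".join(re.escape(a) for a in alts))

pat = norm_char_set("a\u0301f\u0301")
text = unicodedata.normalize("NFC", "a\u0301 f\u0301")
matches = pat.findall(text)
```

Here "a\u0301" collapses to the single code point U+00E1, while "f\u0301" has no precomposed form and stays a two-code-point sequence, so the compiled pattern is an alternation of one single character and one sequence.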

Normalization was in the early issues and dismissed because
of lack of implementation support and unclear costs in new
implementations. I think good recommended practice for now
is to just normalize both inputs and patterns separately.
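
As a rough illustration of that practice, here is a minimal Python sketch in which the re and unicodedata modules stand in for an SRE implementation. The helper name search_normalized is an assumption made for this example, and normalizing the pattern's source text is only a crude substitute for normalizing an SRE.

```python
import re
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

def search_normalized(pattern_source, text):
    """Normalize the pattern source and the input independently, then match."""
    return re.search(nfc(pattern_source), nfc(text))

# Pattern written decomposed, input precomposed.
m1 = search_normalized("[a\u0301]", "caf\u00e1")
# Pattern written precomposed, input decomposed.
m2 = search_normalized("[\u00e1]", "cafa\u0301")
# No precomposed form exists for f + acute, so the set still has two members.
m3 = search_normalized("[f\u0301]", "f")
```

After normalization m1 and m2 both find U+00E1 at the same position. The last case shows the limit of this approach: the set built from "f\u0301" still matches a bare "f", which is the situation raised earlier in the thread.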

Alex