Should SRFI-115 character sets match extended grapheme clusters?

This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.

To: SRFI-115 discussion list <srfi-115@xxxxxxxxxxxxxxxxx>
Subject: Should SRFI-115 character sets match extended grapheme clusters?
From: Mark H Weaver <mhw@xxxxxxxxxx>
Date: Sun, 11 May 2014 09:49:37 -0400
Delivered-to: srfi-115@xxxxxxxxxxxxxxxxx

Hello all,

It occurs to me that users of languages that make heavy use of combining
marks will likely find the behavior of "character sets" to be quite
unintuitive if they operate on code points.  For example, they might
reasonably expect ("éè") to match either of two graphemes, and never to
match a bare 'e' or a bare combining mark.  They might also expect
(~ ("aeiou")) to match "é", even when represented as multiple code
points.
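
As a rough illustration of the behavior described above, here is a
minimal sketch using Python's `re` module, which (like most regexp
libraries) matches on code points.  It is only a stand-in for a
code-point-based engine and is not SRFI-115 itself; the variable names
are made up for this example, and the strings are assumed to be in
decomposed (NFD) form.

```python
import re
import unicodedata

# Decomposed forms: 'e' followed by a combining accent (two code points each).
e_acute = unicodedata.normalize("NFD", "\u00e9")  # 'e' + U+0301
e_grave = unicodedata.normalize("NFD", "\u00e8")  # 'e' + U+0300

# Rough analogue of ("éè") when the set is built from code points:
# it actually contains 'e', U+0301 and U+0300.
accented_set = "[" + e_acute + e_grave + "]"

# The set matches a bare 'e' and a bare combining mark ...
bare_e_matches = re.fullmatch(accented_set, "e") is not None
bare_mark_matches = re.fullmatch(accented_set, "\u0301") is not None
# ... but not the full two-code-point grapheme as a single "character".
grapheme_matches = re.fullmatch(accented_set, e_acute) is not None

# Rough analogue of (~ ("aeiou")): the complement never matches the
# decomposed "é" as a whole; a search finds only the combining mark.
non_vowel = "[^aeiou]"
complement_matches_grapheme = re.fullmatch(non_vowel, e_acute) is not None
complement_search_hit = re.search(non_vowel, e_acute).group()

assert bare_e_matches and bare_mark_matches
assert not grapheme_matches
assert not complement_matches_grapheme
assert complement_search_hit == "\u0301"
```

If the same text is in precomposed (NFC) form, "é" is a single code
point and the complement set does match it, so the result depends on
the normalization form of the input.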

I realize that most programming languages (including Scheme) treat code
points as characters, that SRFI-14 character sets are really sets of
code points, and that most regexp libraries probably do the same thing.
However, it also seems to me that these choices are most likely
mistakes, with bad consequences for the usability of regexps in many
natural languages.

Should SRFI-115 try to get this right, or stick to tradition?
Thoughts?

      Mark

Follow-Ups:
- Re: Should SRFI-115 character sets match extended grapheme clusters?
  - From: John Cowan
- Re: Should SRFI-115 character sets match extended grapheme clusters?
  - From: Alex Shinn
