
Re: Should SRFI-115 character sets match extended grapheme clusters?

This page is part of the web mail archives of SRFI 115 from before July 7th, 2015. The new archives for SRFI 115 contain all messages, not just those from before July 7th, 2015.

John Cowan <cowan@xxxxxxxxxxxxxxxx> writes:

> Mark H Weaver scripsit:
>> It occurs to me that users of languages that make heavy use of combining
>> marks will likely find the behavior of "character sets" to be quite
>> unintuitive if they operate on code points.  
> The way around that is normalization of the input, I think.

Normalization is an important part of the solution, but on its own it
does not solve the problem when no precomposed character exists.  Figure 5
of TR15 gives some examples where NFC produces more than one code point
per character.
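
To make the point about NFC concrete, here is a minimal sketch using Python's standard unicodedata module (Python is only a stand-in here; nothing in it is SRFI 115 or R6RS API). The helper name nfc_code_points is made up for illustration. It shows that "d" with dot below and dot above composes only partially, and that "q" with the same marks does not compose at all.

```python
import unicodedata

def nfc_code_points(text):
    """Return the NFC form of text as a list of 'U+XXXX' labels (illustrative helper)."""
    normalized = unicodedata.normalize("NFC", text)
    return [f"U+{ord(ch):04X}" for ch in normalized]

# d + COMBINING DOT BELOW (U+0323) + COMBINING DOT ABOVE (U+0307)
d_dots = "d\u0323\u0307"
# q + COMBINING DOT BELOW (U+0323) + COMBINING DOT ABOVE (U+0307)
q_dots = "q\u0323\u0307"

# d + U+0323 composes to U+1E0D, but U+0307 has nothing to compose with.
print(nfc_code_points(d_dots))  # ['U+1E0D', 'U+0307']
# No precomposed q-with-dot exists, so all three code points remain.
print(nfc_code_points(q_dots))  # ['U+0071', 'U+0323', 'U+0307']
```

So even after NFC, each of these user-perceived characters still spans more than one code point.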

The question then becomes: Do we want ("ḍ̇q̣̇") to mean (or "ḍ̇" "q̣̇") or
should it mean (or "ḍ" "\x0307;" "q" "\x0323;" "\x0307;")?  It's a
question of how the string is split into elements.
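
The two readings above can be sketched as two ways of splitting the same NFC string. This is an illustration in Python, not a proposal for how SRFI 115 should be implemented. The function names are hypothetical, and the grouping rule (attach every code point with a nonzero canonical combining class to the preceding base) is only a rough approximation of extended grapheme clusters as defined in UAX #29.

```python
import unicodedata

def split_code_points(text):
    """Split into individual code points (the second reading)."""
    return list(text)

def split_combining_sequences(text):
    """Split into base + trailing combining marks (approximates the first reading).

    Assumption: a code point with nonzero canonical combining class belongs
    to the preceding element. Real grapheme cluster rules cover more cases.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the current base
        else:
            clusters.append(ch)  # start a new element
    return clusters

s = unicodedata.normalize("NFC", "d\u0323\u0307q\u0323\u0307")

# Five elements: U+1E0D, U+0307, q, U+0323, U+0307
print([f"{ord(c):04X}" for c in split_code_points(s)])
# Two elements: U+1E0D U+0307, and q U+0323 U+0307
print(len(split_combining_sequences(s)))
```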

There's also the question of whether (regexp-extract '(~ ("-")) "q̣̇")
should return ("q̣̇") or ("q" "\x0323;" "\x0307;").
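
For comparison only, here is how a code-point-based engine behaves on the complement case, using Python's re module as an analogy (it is not SRFI 115, and regexp-extract may or may not end up behaving this way). The cluster-aware variant is a hypothetical sketch that groups combining marks with their base before filtering; combining_sequences is a made-up helper and only approximates grapheme clusters.

```python
import re
import unicodedata

subject = "q\u0323\u0307"  # q + COMBINING DOT BELOW + COMBINING DOT ABOVE

# Code-point semantics: each code point that is not "-" matches on its own.
per_code_point = re.findall(r"[^-]", subject)

def combining_sequences(text):
    """Group each base with its trailing combining marks (rough approximation)."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch
        else:
            out.append(ch)
    return out

# Cluster semantics (hypothetical): filter whole elements instead of code points.
per_cluster = [c for c in combining_sequences(subject) if c != "-"]

print(len(per_code_point))  # 3
print(len(per_cluster))     # 1
```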

> I will be proposing a normalization SRFI in future, presumably
> including the R6RS normalization procedures and some version of the
> normalized-comparison procedures that were rejected from R7RS-small.

Sounds useful.