Re: Should SRFI-115 character sets match extended grapheme clusters?
John Cowan <cowan@xxxxxxxxxxxxxxxx> writes:
> Mark H Weaver scripsit:
>> It occurs to me that users of languages that make heavy use of combining
>> marks will likely find the behavior of "character sets" to be quite
>> unintuitive if they operate on code points.
> The way around that is normalization of the input, I think.
Normalization is an important part of the solution, but it alone does
not solve the problem where no precomposed character exists. Figure 5
of TR15 gives some examples where NFC produces more than one codepoint.
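
To illustrate the point, here is a minimal sketch using Python's standard unicodedata module rather than Scheme, simply because it is easy to run. The helper name nfc_codepoints and the two sample sequences are my own choices for illustration; they correspond to the "ḍ̇" and "q̣̇" examples used in this message.

```python
import unicodedata

# Base letter followed by COMBINING DOT BELOW (U+0323)
# and COMBINING DOT ABOVE (U+0307).
d_seq = "d\u0323\u0307"
q_seq = "q\u0323\u0307"

def nfc_codepoints(s):
    """Return the code points of the NFC form of s as 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", s)]

# 'd' + dot below composes to U+1E0D, but no precomposed character
# exists for the full sequence, so the dot above stays separate.
print(nfc_codepoints(d_seq))  # ['U+1E0D', 'U+0307']

# No precomposed form of 'q' with either dot exists, so NFC
# leaves all three code points in place.
print(nfc_codepoints(q_seq))  # ['U+0071', 'U+0323', 'U+0307']
```

Even after NFC, then, each of these user-perceived characters still spans more than one code point.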
The question then becomes: Do we want ("ḍ̇q̣̇") to mean (or "ḍ̇" "q̣̇") or
should it mean (or "ḍ" "\x0307;" "q" "\x0323;" "\x0307;")? It's a
question of how the string is split into elements.
There's also the question of whether (regexp-extract '(~ ("-")) "q̣̇")
should return ("q̣̇") or ("q" "\x0323;" "\x0307;").
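
The two possible answers can be sketched as follows, again in Python for convenience. The helper combining_sequences is hypothetical and deliberately simplified: it attaches every code point with a nonzero canonical combining class to the preceding character, which only approximates the extended grapheme cluster rules of UAX #29. The character class [^-] stands in for '(~ ("-")).

```python
import re
import unicodedata

def combining_sequences(s):
    """Split s into a base character plus any following combining marks.

    Simplified stand-in for grapheme cluster segmentation: it only looks
    at the canonical combining class, not the full UAX #29 rules.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous element
        else:
            clusters.append(ch)  # start a new element
    return clusters

s = "q\u0323\u0307"

# Matching per code point: each combining mark is its own match.
by_code_point = re.findall(r"[^-]", s)

# Matching per cluster: the whole sequence is one element, and it is not "-".
by_cluster = [c for c in combining_sequences(s) if c != "-"]

print(len(by_code_point))  # 3
print(len(by_cluster))     # 1
```

The first result corresponds to ("q" "\x0323;" "\x0307;") and the second to ("q̣̇").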
> I will be proposing a normalization SRFI in future, presumably
> including the R6RS normalization procedures and some version of the
> normalized-comparison procedures that were rejected from R7RS-small.