
Re: the discussion so far



Matthew Flatt <mflatt@xxxxxxxxxxx> writes:

> So, the `char-ci' operations should use the "simple case folding" table
> from CaseFolding.txt, and the `string-ci' operations should use the
> "full case folding" table from CaseFolding.txt. After folding, the
> comparison result is determined character-by-character.

Codepoint-by-codepoint, yes. (I know that is what you meant; I just
wanted to clarify. The terminology is a bit confusing, as
"character" is defined differently in Unicode than it is in this
SRFI.)

> Meanwhile, `string-upcase' and `string-downcase' reflect the same
> improved handling at the string level (compared to the character level)
> by using SpecialCasing.txt in addition to UnicodeData.txt.
>
> Have I got that right?

Yes :-)
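
To illustrate why the string-level operations need the fuller
tables, here is a small sketch in Python rather than Scheme. It
assumes that Python's `str.casefold` applies full case folding and
that `str.upper` applies the full uppercase mapping; none of these
names come from the SRFI.

```python
# Full case folding and full case mapping can change a string's length,
# which a one-character-to-one-character operation cannot express.
sharp_s = "\u00DF"  # LATIN SMALL LETTER SHARP S

folded = sharp_s.casefold()        # full case folding
upcased = "Stra\u00DFe".upper()    # full uppercase mapping

# A string-level caseless comparison built on full case folding.
ci_equal = "Stra\u00DFe".casefold() == "STRASSE".casefold()
```

A character-level operation has to return a single character, so it
cannot produce the two-character result that the sharp s needs.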


There's one last problem with this approach: It leaves out
normalization.

In Unicode, multiple sequences of code points can represent the
same character. For example, the code point sequences (#\x00C4)
and (#\x0041 #\x0308) are canonically equivalent.

00C4  LATIN CAPITAL LETTER A WITH DIAERESIS
0041  LATIN CAPITAL LETTER A
0308  COMBINING DIAERESIS

Normalization maps those sequences to a common form (either the
composed form, NFC, or the decomposed form, NFD) so that comparison
can be done on a codepoint-by-codepoint basis.
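
As a sketch of the effect, again in Python rather than Scheme, the
standard `unicodedata.normalize` function can be applied to the two
sequences from the example. The variable names are made up for
illustration.

```python
import unicodedata

composed = "\u00C4"          # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "\u0041\u0308"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# Without normalization, a codepoint-by-codepoint comparison
# treats the two sequences as different.
raw_equal = composed == decomposed

# After mapping both to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))
```

Either form works for comparison, as long as both strings are mapped
to the same one.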

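Case folding mostly preserves normalization, but not in every
case: the Unicode Standard notes that folding a normalized sequence
of code points can yield an unnormalized one, so normalization
should be applied again after folding.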

So, to make STRING-CI=? or, indeed, STRING=? work, one option
would be for the SRFI to provide STRING-NORMALIZE-* procedures,
and require normalized strings to be passed to the comparison
procedures for them to work correctly.
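
A minimal sketch of such a comparison, in Python rather than Scheme.
The helper name `string_ci_equal` is hypothetical, and the sketch
normalizes inside the comparison instead of requiring the caller to
pass normalized strings. It normalizes once more after folding as a
precaution, following the Unicode Standard's definition of canonical
caseless matching.

```python
import unicodedata


def string_ci_equal(a: str, b: str) -> bool:
    """Hypothetical helper: caseless comparison with normalization."""

    def canonical_fold(s: str) -> str:
        # Decompose, apply full case folding, then decompose again,
        # because folding can leave a string unnormalized.
        decomposed = unicodedata.normalize("NFD", s)
        folded = decomposed.casefold()
        return unicodedata.normalize("NFD", folded)

    # After folding and normalization, compare codepoint-by-codepoint.
    return canonical_fold(a) == canonical_fold(b)
```

With this helper, the composed capital form (U+00C4) and the
decomposed lowercase form (U+0061 U+0308) compare as equal.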

Greetings,
        -- Jorgen

-- 
((email . "forcer@xxxxxxxxx") (www . "http://www.forcix.cx/")
 (gpg   . "1024D/028AF63C")   (irc . "nick forcer on IRCnet"))