Re: the discussion so far
Matthew Flatt <mflatt@xxxxxxxxxxx> writes:
> So, the `char-ci' operations should use the "simple case folding" table
> from CaseFolding.txt, and the `string-ci' operations should use the
> "full case folding" table from CaseFolding.txt. After folding, the
> comparison result is determined character-by-character.
Codepoint-by-codepoint, yes. (That is what you meant; I just
wanted to clarify. The terminology is a bit confusing, as
"character" is defined differently in Unicode than it is in this
SRFI.)
> Meanwhile, `string-upcase' and `string-downcase' reflect the same
> improved handling at the string level (compared to the character level)
> by using SpecialCasing.txt in addition to UnicodeData.txt.
>
> Have I got that right?
Yes :-)
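To illustrate the string-level versus character-level difference, here is a small sketch in Python rather than Scheme, on the assumption that Python's `str.casefold` and `str.upper` apply the full (string-level) mappings. The helper names `full_fold` and `ci_equal` are made up for illustration and are not part of the SRFI.

```python
def full_fold(s: str) -> str:
    """Full case folding; the result may be longer than the input."""
    return s.casefold()


def ci_equal(a: str, b: str) -> bool:
    """Fold both strings, then compare code point by code point."""
    return full_fold(a) == full_fold(b)


# U+00DF LATIN SMALL LETTER SHARP S folds to the two code points "ss",
# which no one-to-one character-level mapping can express.
print(full_fold("Stra\u00dfe"))            # strasse
print(ci_equal("STRASSE", "Stra\u00dfe"))  # True

# Upcasing at the string level likewise expands sharp s.
print("\u00df".upper())                    # SS
```

A character-level operation has to return exactly one character, so it cannot produce the two-character result shown here.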
There's one last problem with this approach: It leaves out
normalization.
In Unicode, there are multiple sequences of code points that
represent the same character. For example, the code point
sequences (#\x00C4) and (#\x0041 #\x0308) are equivalent.
  00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
  0041 LATIN CAPITAL LETTER A
  0308 COMBINING DIAERESIS
Normalization maps those sequences to a common form (either to the
composed or the decomposed form) so that comparison can be done on
a codepoint-by-codepoint basis.
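The equivalence can be checked with a short Python sketch using the standard `unicodedata` module. Python stands in for Scheme here purely for illustration; NFC and NFD are the composed and decomposed normalization forms.

```python
import unicodedata

composed = "\u00c4"          # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "\u0041\u0308"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# Compared code point by code point, the two spellings differ.
print(composed == decomposed)  # False

# After normalizing to a common form, they agree.
nfc = unicodedata.normalize("NFC", decomposed)  # composed form
nfd = unicodedata.normalize("NFD", composed)    # decomposed form
print(nfc == composed)    # True
print(nfd == decomposed)  # True
```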
Luckily, case folding is specified in such a way that a normalized
sequence of code points remains normalized if case-folded.
So, to make STRING-CI=? or, indeed, STRING=? work, one option
would be for the SRFI to provide STRING-NORMALIZE-* procedures,
and require normalized strings to be passed to the comparison
procedures for them to work correctly.
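A rough sketch of that option, again in Python for illustration only. The names `string_normalize_nfd` and `string_ci_eq` are hypothetical stand-ins for the proposed procedures, not anything the SRFI defines, and the choice of the decomposed form is arbitrary.

```python
import unicodedata


def string_normalize_nfd(s: str) -> str:
    """Hypothetical STRING-NORMALIZE-* stand-in: map to decomposed form."""
    return unicodedata.normalize("NFD", s)


def string_ci_eq(a: str, b: str) -> bool:
    """Hypothetical STRING-CI=? stand-in.

    Assumes the caller has already normalized both arguments; it only
    case-folds and compares code point by code point.
    """
    return a.casefold() == b.casefold()


upper_composed = "\u00c4"      # A WITH DIAERESIS, one code point
lower_decomposed = "a\u0308"   # a + COMBINING DIAERESIS, two code points

# Without normalization the comparison fails.
print(string_ci_eq(upper_composed, lower_decomposed))  # False

# With both arguments normalized first, it succeeds.
print(string_ci_eq(string_normalize_nfd(upper_composed),
                   string_normalize_nfd(lower_decomposed)))  # True
```

The burden of normalizing is on the caller in this design; the comparison procedure itself stays a simple fold-and-compare.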
Greetings,
-- Jorgen
--
((email . "forcer@xxxxxxxxx") (www . "http://www.forcix.cx/")
(gpg . "1024D/028AF63C") (irc . "nick forcer on IRCnet"))