
Re: the discussion so far

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



Matthew Flatt <mflatt@xxxxxxxxxxx> writes:

> So, the `char-ci' operations should use the "simple case folding" table
> from CaseFolding.txt, and the `string-ci' operations should use the
> "full case folding" table from CaseFolding.txt. After folding, the
> comparison result is determined character-by-character.

Codepoint-by-codepoint, yes. (That is what you meant; I just
wanted to clarify. The terminology is a bit confusing, as
"character" is defined differently in Unicode than it is in this
SRFI.)

> Meanwhile, `string-upcase' and `string-downcase' reflect the same
> improved handling at the string level (compared to the character level)
> by using SpecialCasing.txt in addition to UnicodeData.txt.
>
> Have I got that right?

Yes :-)
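
To illustrate why the string-level operations need the full tables, here is a small sketch in Python rather than Scheme. It assumes that Python's built-in str.casefold() and str.upper() stand in for full case folding and the SpecialCasing.txt mappings; the name string_ci_equal is hypothetical and is not something this SRFI defines.

```python
# Full case mappings can change the length of a string, which a
# one-character-in, one-character-out operation cannot express.
sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S

folded = sharp_s.casefold()  # full case folding: one code point becomes two
upper = sharp_s.upper()      # full uppercase mapping, also two code points


def string_ci_equal(a, b):
    """Hypothetical stand-in for STRING-CI=?.

    Folds both strings with full case folding, then compares the
    results code point by code point.
    """
    return a.casefold() == b.casefold()


print(len(sharp_s), len(folded), len(upper))
print(string_ci_equal("Stra\u00dfe", "STRASSE"))
```

A per-character fold has to return exactly one character, so it cannot produce the two-code-point result that the string-level fold gives here.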


There's one last problem with this approach: It leaves out
normalization.

In Unicode, there are multiple sequences of code points that
represent the same character. For example, the code point
sequences (#\x00C4) and (#\x0041 #\x0308) are canonically
equivalent.

00C4  LATIN CAPITAL LETTER A WITH DIAERESIS
0041  LATIN CAPITAL LETTER A
0308  COMBINING DIAERESIS

Normalization maps those sequences to a common form (either the
composed form, NFC, or the decomposed form, NFD) so that
comparison can be done on a codepoint-by-codepoint basis.
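
The effect on the example above can be sketched in Python, whose standard unicodedata module provides normalization. The variable names are only for illustration.

```python
import unicodedata

composed = "\u00c4"          # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "\u0041\u0308"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# Without normalization, a code point comparison sees two different strings.
raw_equal = composed == decomposed

# Mapping both sequences to the same form makes them compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))

print(raw_equal, nfc_equal, nfd_equal)
```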

Luckily, case folding is specified in such a way that a normalized
sequence of code points remains normalized if case-folded.

So, to make STRING-CI=? or, indeed, STRING=? work, one option
would be for the SRFI to provide STRING-NORMALIZE-* procedures,
and require normalized strings to be passed to the comparison
procedures for them to work correctly.
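
A minimal sketch of that option, again in Python and under stated assumptions: string_normalize_nfd and canonical_ci_equal are hypothetical names, not procedures proposed by the SRFI, and str.casefold() is assumed to stand in for full case folding. The second normalization pass after folding follows the Unicode Standard's definition of canonical caseless matching; it is a defensive step here.

```python
import unicodedata


def string_normalize_nfd(s):
    """Hypothetical STRING-NORMALIZE-* analogue: canonical decomposition."""
    return unicodedata.normalize("NFD", s)


def canonical_ci_equal(a, b):
    """Hypothetical caseless comparison on normalized input.

    Normalizes, applies full case folding, normalizes again, and then
    compares code point by code point.
    """
    def key(s):
        return string_normalize_nfd(string_normalize_nfd(s).casefold())
    return key(a) == key(b)


composed_upper = "\u00c4"     # precomposed capital A with diaeresis
decomposed_lower = "a\u0308"  # small a followed by COMBINING DIAERESIS

# Folding alone does not reconcile the two spellings ...
fold_only = composed_upper.casefold() == decomposed_lower.casefold()
# ... while normalizing first does.
normalized = canonical_ci_equal(composed_upper, decomposed_lower)

print(fold_only, normalized)
```

The design choice the sketch leaves open is the one raised above: whether the comparison procedures normalize internally, as canonical_ci_equal does, or whether callers are required to pass already-normalized strings.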

Greetings,
        -- Jorgen

-- 
((email . "forcer@xxxxxxxxx") (www . "http://www.forcix.cx/")
 (gpg   . "1024D/028AF63C")   (irc . "nick forcer on IRCnet"))