This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.
Matthew Flatt <mflatt@xxxxxxxxxxx> writes:

> So, the `char-ci' operations should use the "simple case folding" table
> from CaseFolding.txt, and the `string-ci' operations should use the
> "full case folding" table from CaseFolding.txt. After folding, the
> comparison result is determined character-by-character.

Codepoint-by-codepoint, yes. (That is what you meant, I just wanted to
clarify. The terminology is a bit confusing, as "character" is defined
differently in Unicode than it is in this SRFI)

> Meanwhile, `string-upcase' and `string-downcase' reflect the same
> improved handling at the string level (compared to the character level)
> by using SpecialCasing.txt in addition to UnicodeData.txt.
>
> Have I got that right?

Yes :-)

There's one last problem with this approach: It leaves out
normalization. In Unicode, there are multiple sequences of code points
that represent the same character. For example, the code point
sequences (#\x00C4) and (#\x0041 #\x0308) are equivalent.

  00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
  0041 LATIN CAPITAL LETTER A
  0308 COMBINING DIAERESIS

Normalization maps those sequences to a common form (either to the
composed or the decomposed form) so that comparison can be done on a
codepoint-by-codepoint basis.

Luckily, case folding is specified in such a way that a normalized
sequence of code points remains normalized if case-folded.

So, to make STRING-CI=? or, indeed, STRING=? work, one option would be
for the SRFI to provide STRING-NORMALIZE-* procedures, and require
normalized strings to be passed to the comparison procedures for them
to work correctly.

Greetings,
        -- Jorgen

-- 
((email . "forcer@xxxxxxxxx")
 (www . "http://www.forcix.cx/")
 (gpg . "1024D/028AF63C")
 (irc . "nick forcer on IRCnet"))
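
The normalization point in the message above can be illustrated with a small sketch. It is written in Python rather than Scheme, because the STRING-NORMALIZE-* procedures are only a proposal in the message; the helper names `normalized_equal` and `normalized_ci_equal` are hypothetical and are not part of the SRFI. The sketch assumes NFC as the common form and uses Python's `str.casefold()` as a stand-in for full case folding. The second normalization pass after folding is a conservative choice of this sketch, not something the message asks for.

```python
import unicodedata

# The two equivalent code point sequences from the message.
composed = "\u00C4"          # 00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "\u0041\u0308"  # 0041 LATIN CAPITAL LETTER A + 0308 COMBINING DIAERESIS


def normalized_equal(a, b, form="NFC"):
    """Hypothetical analogue of STRING=? on normalized input:
    normalize both strings, then compare code point by code point."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def normalized_ci_equal(a, b, form="NFC"):
    """Hypothetical analogue of STRING-CI=? on normalized input:
    normalize, apply full case folding, normalize again, then compare."""
    def key(s):
        folded = unicodedata.normalize(form, s).casefold()
        # Re-normalizing after folding is a conservative assumption here.
        return unicodedata.normalize(form, folded)
    return key(a) == key(b)


if __name__ == "__main__":
    print(composed == decomposed)                      # raw code point comparison
    print(normalized_equal(composed, decomposed))      # comparison after NFC
    print(normalized_ci_equal(composed, "a\u0308"))    # case-insensitive variant
```

Without normalization the raw comparison treats the two sequences as different strings, which is the gap the message describes; requiring callers to pass normalized strings, as proposed, would move the `unicodedata.normalize` calls out of the comparison helpers and into the caller.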