Re: the discussion so far

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.

To: srfi-75@xxxxxxxxxxxxxxxxx
Subject: Re: the discussion so far
From: Jorgen Schaefer <forcer@xxxxxxxxx>
Date: Sat, 16 Jul 2005 15:58:17 +0200
Delivered-to: srfi-75@xxxxxxxxxxxxxxxxx
In-reply-to: <E1DtmbQ-00028f-00@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> (Matthew Flatt's message of "Sat, 16 Jul 2005 07:21:24 -0600")
References: <E1Dtlz7-0000Mq-00@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> <871x5y296l.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxx> <E1DtmbQ-00028f-00@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.0.50 (gnu/linux)

Matthew Flatt <mflatt@xxxxxxxxxxx> writes:

> So, the `char-ci' operations should use the "simple case folding" table
> from CaseFolding.txt, and the `string-ci' operations should use the
> "full case folding" table from CaseFolding.txt. After folding, the
> comparison result is determined character-by-character.

Codepoint-by-codepoint, yes. (That is what you meant; I just
wanted to clarify. The terminology is a bit confusing, as
"character" is defined differently in Unicode than it is in this
SRFI.)
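
To make the string-level folding concrete, here is a minimal sketch in
Python rather than Scheme. It assumes that Python's str.casefold() is an
acceptable stand-in for the "full case folding" table; the name
string_ci_equal is hypothetical and is not a procedure from this SRFI.

```python
def string_ci_equal(a: str, b: str) -> bool:
    """Hypothetical analogue of string-ci=?.

    Applies full case folding to both strings, then compares the
    results code point by code point (Python's == on str does this).
    """
    return a.casefold() == b.casefold()


# Full case folding may change the number of code points:
# U+00DF LATIN SMALL LETTER SHARP S folds to the two code points "ss".
folded_sharp_s = "\u00DF".casefold()
```

With this definition, string_ci_equal("Stra\u00DFe", "STRASSE") holds,
which a strict one-code-point-to-one-code-point folding could not give.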

> Meanwhile, `string-upcase' and `string-downcase' reflect the same
> improved handling at the string level (compared to the character level)
> by using SpecialCasing.txt in addition to UnicodeData.txt.
>
> Have I got that right?

Yes :-)
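
As a rough illustration of the string-level case mapping, the following
Python sketch assumes that str.upper() applies the unconditional
SpecialCasing.txt mappings. Python has no direct counterpart to a
single-code-point case mapping, so only the string-level behaviour is
shown; string_upcase is a hypothetical name.

```python
def string_upcase(s: str) -> str:
    """Hypothetical analogue of string-upcase, via Python's str.upper()."""
    return s.upper()


# U+00DF has no single-code-point uppercase in UnicodeData.txt;
# SpecialCasing.txt maps it to the two code points "SS".
upper_sharp_s = string_upcase("\u00DF")

# U+FB01 LATIN SMALL LIGATURE FI likewise expands to two code points.
upper_fi_ligature = string_upcase("\uFB01")
```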

There's one last problem with this approach: It leaves out
normalization.

In Unicode, there are multiple sequences of code points that
represent the same character. For example, the code point
sequences (#\x00C4) and (#\x0041 #\x0308) are equivalent.

00C4  LATIN CAPITAL LETTER A WITH DIAERESIS
0041  LATIN CAPITAL LETTER A
0308  COMBINING DIAERESIS

Normalization maps those sequences to a common form (either to the
composed or the decomposed form) so that comparison can be done on
a codepoint-by-codepoint basis.
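
The example above can be checked with a short Python sketch, using the
standard unicodedata module as a stand-in for whatever normalization
procedures an implementation would provide. NFC is the composed form
and NFD the decomposed form.

```python
import unicodedata

composed = "\u00C4"          # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "\u0041\u0308"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# Without normalization the two sequences differ code point by code point.
raw_equal = composed == decomposed

# After mapping both to a common form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))
```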

Luckily, case folding is specified in such a way that a normalized
sequence of code points remains normalized if case-folded.

So, to make STRING-CI=? or, indeed, STRING=? work, one option
would be for the SRFI to provide STRING-NORMALIZE-* procedures,
and require normalized strings to be passed to the comparison
procedures for them to work correctly.
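
A minimal sketch of that option, again in Python. The names
string_normalize_nfd and string_ci_equal are hypothetical stand-ins for
the proposed STRING-NORMALIZE-* and STRING-CI=? procedures, and the
choice of NFD here is only an assumption for illustration.

```python
import unicodedata


def string_normalize_nfd(s: str) -> str:
    """Hypothetical STRING-NORMALIZE-* procedure (decomposed form)."""
    return unicodedata.normalize("NFD", s)


def string_ci_equal(a: str, b: str) -> bool:
    """Hypothetical STRING-CI=?: fold case, then compare code points.

    It does not normalize; callers are expected to pass normalized
    strings, as in the option described above.
    """
    return a.casefold() == b.casefold()


upper_composed = "\u00C4"     # A WITH DIAERESIS, one code point
lower_decomposed = "a\u0308"  # a + COMBINING DIAERESIS, two code points

# Unnormalized input: the comparison misses the equivalence.
without_normalization = string_ci_equal(upper_composed, lower_decomposed)

# Normalized input: the comparison works as intended.
with_normalization = string_ci_equal(
    string_normalize_nfd(upper_composed),
    string_normalize_nfd(lower_decomposed),
)
```

The trade-off in this design is that the burden of normalizing falls on
the caller, while the comparison procedures themselves stay simple.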

Greetings,
        -- Jorgen

-- 
((email . "forcer@xxxxxxxxx") (www . "http://www.forcix.cx/")
 (gpg   . "1024D/028AF63C")   (irc . "nick forcer on IRCnet"))
