Re: the discussion so far

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



Michael Sperber scripsit:

> US-ASCII, ISO 8859-1, and UCS-2-based [...]
> subsets are all closed with respect to the case folding in
> UnicodeData.txt.  I don't know offhand if that's also the case with
> full Unicode case folding.

It is not true of either simple or full case folding as specified in
CaseFolding.txt; in particular, the 8859-1 character MICRO SIGN (0xB5,
U+00B5) folds to a proper GREEK SMALL LETTER MU (U+03BC) as a consequence
of the compatibility equivalence between the two.

There are also encodings which are not closed even under lowercasing:
of the 123 encodings I have information for, 30 are not closed under
lowercasing, 54 are not closed under simple folding, and 60 are not
closed under full folding.  (Details on request.)
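
The closure property can be checked mechanically.  What follows is a
minimal Python sketch, not the tooling behind the counts above: the
helper name escaping_chars is hypothetical, it handles single-byte
encodings only, and it relies on Python's str.lower and str.casefold
(the latter implements full case folding).

```python
def escaping_chars(encoding, transform):
    """List the characters of a single-byte encoding whose image under
    `transform` can no longer be represented in that encoding."""
    escaped = []
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the encoding
        try:
            transform(ch).encode(encoding)
        except UnicodeEncodeError:
            escaped.append(ch)  # the transformed text left the repertoire
    return escaped


# ISO 8859-1: closed under lowercasing, but MICRO SIGN escapes under folding.
latin1_lower = escaping_chars("latin-1", str.lower)
latin1_fold = escaping_chars("latin-1", str.casefold)
print([f"U+{ord(c):04X}" for c in latin1_fold])
```

An empty list means the encoding is closed under the given operation;
for ISO 8859-1 the folding list should contain only U+00B5.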

Jorgen Schaefer scripsit:

> Luckily, case folding is specified in such a way that a normalized
> sequence of code points remains normalized if case-folded.

This is exactly backwards.  Case folding does *not* preserve normalization,
but *does* work correctly even on unnormalized input.  For example,
the sequence <01F0,0323> is in normalization form C, but its full case
folding is <006A,030C,0323>, which is not in canonical order and so is
not normalized.
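
Both halves of that claim can be tried out with Python's standard
unicodedata module.  This is a sketch under stated assumptions: the
sample string <01F0,0323> (LATIN SMALL LETTER J WITH CARON followed by
COMBINING DOT BELOW) is chosen here for illustration, and is_nfc is a
hypothetical helper, not a library function.

```python
import unicodedata


def is_nfc(s):
    """True if s is already in normalization form C."""
    return unicodedata.normalize("NFC", s) == s


# Assumed sample: j-with-caron followed by combining dot below.
sample = "\u01f0\u0323"
folded = sample.casefold()  # full folding expands U+01F0 to <006A,030C>

print(is_nfc(sample), is_nfc(folded))

# Folding unnormalized input still gives a canonically equivalent result:
# decomposed "A" + ring above versus precomposed U+00C5.
decomposed = "A\u030a".casefold()
precomposed = "\u00c5".casefold()
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

The folded string places the caron (combining class 230) before the dot
below (class 220), so it has to be renormalized if a normalized result
is wanted.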

I do agree that normalization functions are a Good Thing, though not
necessarily for the Scheme core.

-- 
Overhead, without any fuss, the stars were going out.
        --Arthur C. Clarke, "The Nine Billion Names of God"
                John Cowan <jcowan@xxxxxxxxxxxxxxxxx>