
Re: collation algorithm

This page is part of the web mail archives of SRFI 75 from before July 7th, 2015. The new archives for SRFI 75 contain all messages, not just those from before July 7th, 2015.



bear scripsit:

> The proposed semantics for collation of strings
> (using string>? & friends) by pointwise comparison
> is in direct conflict with the unicode standard
> for locale-independent collation of strings, as
> expressed in
> 
> http://www.unicode.org/reports/tr10/

Note that the Unicode Collation Algorithm is not, strictly speaking,
part of the Unicode standard; it even has its own ISO number (14651
rather than 10646).  Compliance with the Unicode Standard neither
requires nor forbids conformance to the UCA.

> The unicode collation algorithm abstracts over
> representation issues such as how characters are
> rendered as sequences of individual codepoints,
> making the test for canonical (glyph) equivalence
> rather than codepoint equivalence.

(You're misusing the term "glyph"; see the Unicode Glossary.
I assume you mean something close to "grapheme".)
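
To make the distinction between codepoint equivalence and canonical
equivalence concrete, here is a minimal Java sketch. It assumes the
standard java.text.Normalizer class; the class and method names
(CanonicalEquivalenceDemo, codepointEqual, canonicallyEqual) are
invented for illustration. It only tests equivalence and does not
implement the UCA.

```java
import java.text.Normalizer;

public class CanonicalEquivalenceDemo {
    // Codepoint-level comparison: the strings must contain the same
    // sequence of UTF-16 code units.
    static boolean codepointEqual(String a, String b) {
        return a.equals(b);
    }

    // Canonical equivalence: normalize both strings to the same form
    // (NFD here) before comparing them.
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFD);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9";   // e-acute as one codepoint
        String decomposed  = "e\u0301";  // e followed by a combining acute accent

        System.out.println(codepointEqual(precomposed, decomposed));   // false
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true
    }
}
```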

> Since I figure most language implementors will ignore
> it (and *are* ignoring it, in Java and C#) this part
> of the Unicode standard will probably eventually be
> abandoned.

That turns out not to be the case.  :-)

For Java, you can use either fast (binary) or smart (UCA) comparison
routines: the former are provided in the java.lang.String class, the
latter by java.text.Collator and related classes.  (The latter include the
UCA's provisions for tailoring collation order for specific locales: for
example, to make ä sort after z, as Swedes expect, rather than with a,
its normal place.)  UCA collation is also readily available for C and C++
programs via IBM's open-source ICU library.
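
A small sketch of the two Java routes, under some assumptions: the
class and helper names (CollationDemo, binaryCompare, collatorCompare)
are made up for the example, and the exact ordering that
java.text.Collator produces depends on the collation rules and locale
data shipped with the particular Java runtime, so the Swedish result
is printed rather than asserted.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationDemo {
    // "Fast" comparison: String.compareTo compares UTF-16 code units.
    static int binaryCompare(String a, String b) {
        return a.compareTo(b);
    }

    // "Smart" comparison: a locale-sensitive Collator.
    static int collatorCompare(Locale locale, String a, String b) {
        Collator collator = Collator.getInstance(locale);
        // Decompose accented characters so that precomposed and
        // decomposed spellings are ordered the same way.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00e4"; // a with diaeresis

        // Binary order puts every uppercase ASCII letter before every
        // lowercase one, so "Z" precedes "a".
        System.out.println(binaryCompare("Z", "a") < 0);

        // An English collator orders by letter first, so "a" precedes "Z"
        // and a-umlaut is ordered with "a", ahead of "b".
        System.out.println(collatorCompare(Locale.ENGLISH, "a", "Z") < 0);
        System.out.println(collatorCompare(Locale.ENGLISH, aUmlaut, "b") < 0);

        // With Swedish tailoring, a-umlaut is expected to sort after "z";
        // this relies on Swedish locale data being present in the runtime.
        Locale swedish = Locale.forLanguageTag("sv-SE");
        System.out.println(collatorCompare(swedish, aUmlaut, "z"));
    }
}
```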

> At the same time, I want to leave it legal for
> scheme implementors who are actually doing unicode
> support to conform to it if they want to.

That can be done by leaving the *-ci? procedures alone and allowing
implementers to provide their own UCA-compliant procedures.

-- 
John Cowan      http://www.ccil.org/~cowan      jcowan@xxxxxxxxxxxxxxxxx
Be yourself.  Especially do not feign a working knowledge of RDF where
no such knowledge exists.  Neither be cynical about RELAX NG; for in
the face of all aridity and disenchantment in the world of markup,
James Clark is as perennial as the grass.  --DeXiderata, Sean McGrath