
String comparison under Latin-1 and Unicode

This page is part of the web mail archives of SRFI 13 from before July 7th, 2015. The new archives for SRFI 13 contain all messages, not just those from before July 7th, 2015.



>... collation and string
> comparison in the wide Unicode world today. If I can't come up with
> something reasonable that works in ASCII, Latin-1 *and* a Unicode setting

The STRING>? problem under Unicode differs from the problem under Latin-1
only in degree.  (Finns and Swedes use a different collation sequence from
Danes and Norwegians.  "AE" is a ligature in English, but a distinct letter
in Danish.  Spanish vs. French vs. Traditional Spanish.  And much, much more.)
Hence even under Latin-1, STRING>? must take the domain language into
account.  Unicode merely makes more scripts - and so more languages -
convenient.
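
To make the Latin-1 case concrete, here is a rough sketch in Python rather than Scheme. It is a toy model, not a real collation algorithm: lowercase only, one code point per letter, and the rank tables (including the choice to file a-umlaut with ae and o-umlaut with o-slash) are simplifying assumptions. The names make_ranks, collation_key and string_greater are made up for the illustration.

```python
import string

# Latin-1 letters, written as escapes so the precomposed forms are certain.
A_RING = "\u00e5"   # a with ring above
A_UML = "\u00e4"    # a with diaeresis
AE = "\u00e6"       # ae
O_UML = "\u00f6"    # o with diaeresis
O_SLASH = "\u00f8"  # o with stroke

def make_ranks(tail_groups):
    """Rank a-z first, then the given groups; letters in one group tie."""
    groups = list(string.ascii_lowercase) + tail_groups
    return {ch: i for i, group in enumerate(groups) for ch in group}

# Swedish ends its alphabet a-ring, a-umlaut, o-umlaut;
# Danish ends it ae, o-slash, a-ring.
SWEDISH = make_ranks([A_RING, A_UML + AE, O_UML + O_SLASH])
DANISH = make_ranks([AE + A_UML, O_SLASH + O_UML, A_RING])

def collation_key(word, ranks):
    """Map each letter to its rank in the given language's alphabet."""
    return [ranks[ch] for ch in word]

def string_greater(a, b, ranks):
    """Toy analogue of STRING>? that depends on the language."""
    return collation_key(a, ranks) > collation_key(b, ranks)

words = [O_UML + "berg", A_RING + "berg", A_UML + "rlig"]
by_code_point = sorted(words)
by_swedish = sorted(words, key=lambda w: collation_key(w, SWEDISH))
by_danish = sorted(words, key=lambda w: collation_key(w, DANISH))
```

The same three Latin-1 strings come out in three different orders: by raw code point, by the Swedish table, and by the Danish table.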

Proposal:

The string comparators take an optional final argument that is not of type
string, but a new type, language-specifier (abbrev. langid), which specifies
the language of a block of text.  The procedure CURRENT-LANGUAGE returns the
langid for whatever language Scheme uses for string comparators lacking this
optional final argument.  Scheme initially uses some default langid that it
inherits from its host environment; the procedure DEFAULT-LANGUAGE returns
the langid for this default.  The procedures (CALL-WITH-LANGUAGE langid
proc) and (WITH-LANGUAGE langid thunk) change the value returned by
CURRENT-LANGUAGE.  Finally, the procedure LANGUAGE takes the ISO 639
language code, specified as a string, and returns the correct langid.
LANGUAGE may be extended to take other values (perhaps a numeric language
code from the host OS).
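
The interface above might be modelled roughly as follows, in Python rather than Scheme so that the sketch is easy to run. Several things here are assumptions, not part of the proposal: the two toy collation tables, the hard-coded "sv" default standing in for the host environment, and the reading that WITH-LANGUAGE changes the current language only while thunk runs. CALL-WITH-LANGUAGE is left out because the proposal does not say what proc receives.

```python
import string
from dataclasses import dataclass

def _ranks(tail_groups):
    """Rank a-z first, then the given groups; letters in one group tie."""
    groups = list(string.ascii_lowercase) + tail_groups
    return {ch: i for i, group in enumerate(groups) for ch in group}

@dataclass(frozen=True)
class LangId:
    """Opaque language specifier; deliberately not a string."""
    code: str

# Toy collation tables keyed by ISO 639 code (assumed, heavily simplified).
_TABLES = {
    "sv": _ranks(["\u00e5", "\u00e4\u00e6", "\u00f6\u00f8"]),
    "da": _ranks(["\u00e6\u00e4", "\u00f8\u00f6", "\u00e5"]),
}

def language(code):
    """LANGUAGE: ISO 639 code, given as a string, to langid."""
    if code not in _TABLES:
        raise ValueError(f"no collation table for {code!r}")
    return LangId(code)

_DEFAULT = language("sv")  # stand-in for a value inherited from the host
_current = _DEFAULT

def default_language():
    return _DEFAULT

def current_language():
    return _current

def with_language(langid, thunk):
    """WITH-LANGUAGE: run thunk with langid current, then restore."""
    global _current
    saved = _current
    _current = langid
    try:
        return thunk()
    finally:
        _current = saved

def string_greater(a, b, langid=None):
    """STRING>? with the optional final langid argument."""
    if langid is None:
        langid = current_language()
    ranks = _TABLES[langid.code]
    return [ranks[ch] for ch in a] > [ranks[ch] for ch in b]

a_ring, o_uml = "\u00e5", "\u00f6"
under_default = string_greater(a_ring, o_uml)
under_danish = string_greater(a_ring, o_uml, language("da"))
inside_with = with_language(language("da"),
                            lambda: string_greater(a_ring, o_uml))
```

With the assumed Swedish default, a-ring does not sort after o-umlaut; with the Danish langid, passed explicitly or installed through with_language, it does.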

This would allow correct collation of text using the current Scheme notion
of "string."  Building a higher-level "text" abstraction from this is purely
mechanical.

Ben