
String comparison under Latin-1 and Unicode



>... collation and string
> comparison in the wide Unicode world today. If I can't come up with
> something reasonable that works in ASCII, Latin-1 *and* a Unicode setting

The STRING>? problem under Unicode differs from the problem under Latin-1
only in degree.  (Finns and Swedes use a different collation sequence from
Danes and Norwegians.  "AE" (Æ) is a ligature in English but a distinct
letter in Danish.  Spanish, French, and Traditional Spanish each sort
differently.  And there is much, much more.)
Hence even under Latin-1, STRING>? must take the domain language into
account.  Unicode merely makes more scripts - and so more languages -
convenient.

Proposal:

The string comparators take an optional final argument that is not of type
string, but a new type, language-specifier (abbrev. langid), which specifies
the language of a block of text.  The procedure CURRENT-LANGUAGE returns the
langid for whatever language Scheme uses for string comparators lacking this
optional final argument.  Scheme initially uses some default langid that it
inherits from its host environment; the procedure DEFAULT-LANGUAGE returns
the langid for this default.  The procedures (CALL-WITH-LANGUAGE langid proc)
and (WITH-LANGUAGE langid thunk) change the value returned by
CURRENT-LANGUAGE.  Finally, the procedure LANGUAGE takes the ISO 639
language code, specified as a string, and returns the correct langid.
LANGUAGE may be extended to take other values (perhaps a numeric language
code from the host OS).
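
To make the shape of the proposal concrete, here is a rough sketch in R7RS
Scheme.  It is only an illustration under several assumptions, none of which
are part of the proposal itself: the names STRING<?/LANG and STRING>?/LANG
stand in for the extended STRING<? and STRING>?, a langid is modelled as a
record holding an ISO 639 code and a toy alphabet string, the host default is
hard-coded to Swedish, and CALL-WITH-LANGUAGE is assumed to pass the langid to
proc.  The two alphabets are hand-made toys, not real collation tables, and
the sketch assumes an implementation with Unicode strings that reads its
source as UTF-8.

```scheme
(import (scheme base))

;; A langid: an ISO 639 code plus a toy alphabet listing letters in sort order.
(define-record-type <langid>
  (make-langid code alphabet)
  langid?
  (code langid-code)
  (alphabet langid-alphabet))

;; Toy tables (assumption): Swedish sorts å before ö, Danish sorts å last.
(define toy-langids
  (list (make-langid "sv" "abcdefghijklmnopqrstuvwxyzåäæöø")
        (make-langid "da" "abcdefghijklmnopqrstuvwxyzæäøöå")))

;; LANGUAGE: ISO 639 code (a string) -> langid.
(define (language code)
  (let loop ((ls toy-langids))
    (cond ((null? ls) (error "LANGUAGE: unknown ISO 639 code" code))
          ((string=? code (langid-code (car ls))) (car ls))
          (else (loop (cdr ls))))))

;; Stand-in for the langid inherited from the host environment (assumption).
(define host-langid (language "sv"))
(define (default-language) host-langid)

;; The current language is kept in a parameter object.
(define language-parameter (make-parameter host-langid))
(define (current-language) (language-parameter))

(define (with-language langid thunk)
  (parameterize ((language-parameter langid)) (thunk)))

(define (call-with-language langid proc)
  (parameterize ((language-parameter langid)) (proc langid)))

;; Position of ch in the langid's alphabet; unknown characters sort after
;; every listed letter, ordered by code point.
(define (char-rank ch langid)
  (let* ((alphabet (langid-alphabet langid))
         (n (string-length alphabet)))
    (let loop ((i 0))
      (cond ((= i n) (+ n (char->integer ch)))
            ((char=? ch (string-ref alphabet i)) i)
            (else (loop (+ i 1)))))))

;; Two-string comparator with an optional final langid argument.
(define (string<?/lang a b . maybe-langid)
  (let ((langid (if (null? maybe-langid)
                    (current-language)
                    (car maybe-langid)))
        (na (string-length a))
        (nb (string-length b)))
    (let loop ((i 0))
      (cond ((= i nb) #f)   ; b exhausted (or both equal): a is not less
            ((= i na) #t)   ; a is a proper prefix of b
            (else
             (let ((ra (char-rank (string-ref a i) langid))
                   (rb (char-rank (string-ref b i) langid)))
               (cond ((< ra rb) #t)
                     ((> ra rb) #f)
                     (else (loop (+ i 1))))))))))

(define (string>?/lang a b . maybe-langid)
  (apply string<?/lang b a maybe-langid))

;; Usage:
(string<?/lang "ål" "öl")                     ; => #t under the Swedish default
(string<?/lang "ål" "öl" (language "da"))     ; => #f with an explicit langid
(with-language (language "da")
  (lambda () (string<?/lang "ål" "öl")))      ; => #f inside WITH-LANGUAGE
```

A real implementation would presumably get its tables from the host OS or a
collation library rather than from alphabet strings, and would have to handle
digraphs and multi-level comparison, which this sketch ignores.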

This would allow correct collation of text using the current Scheme notion
of "string."  Building a higher-level "text" abstraction from this is purely
mechanical.

Ben