Re: String comparison under Latin-1 and Unicode



I don't agree with this proposal. It seems to me that STRING<? and
the others are better left for trivial tasks like sorting strings of
digits; they have a simple definition based on CHAR<?, which in turn
is based on the internal encoding (ASCII or Unicode). Such a predicate
is still very useful as an ordering with no language-dependent
meaning: for example, if you want to implement string sets as
sorted lists, it's much better to use a fast ordering predicate,
even if the induced ordering doesn't make any linguistic sense. On
the other hand, some Schemes have already implemented
extended versions of these predicates that accept more than
two arguments, to make them similar to < and the others
(STRING<?, for instance, then tests that its arguments are in
monotonically increasing order).
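
To make the sorted-list case concrete, here is a minimal sketch of a
string set kept in STRING<? order. The names STRING-SET-ADJOIN and
STRING-SET-MEMBER? are made up for this illustration; the only thing
the code needs from the ordering is that it is total and cheap, not
that it means anything in any language.

```scheme
;; Hypothetical helpers: a "string set" is a list of distinct
;; strings kept sorted by STRING<?.

;; Return a set containing S and every element of SET.
;; SET itself is not modified.
(define (string-set-adjoin set s)
  (cond ((null? set) (list s))
        ((string<? s (car set)) (cons s set))
        ((string=? s (car set)) set)
        (else (cons (car set) (string-set-adjoin (cdr set) s)))))

;; Membership test that gives up as soon as it has passed the
;; position where S would have to be.
(define (string-set-member? set s)
  (cond ((null? set) #f)
        ((string<? s (car set)) #f)
        ((string=? s (car set)) #t)
        (else (string-set-member? (cdr set) s))))
```

For example, (string-set-adjoin '("10" "30") "20") yields a list whose
elements are "10", "20" and "30" in that order.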

I would suggest using new names for collation predicates,
especially because collation is actually a complex process
involving generation of "collation keys" which can be reused:

(string->collation-key str language-specifier) => c-key
(collation-key<? c-key1 c-key2) => bool
(collation-key<=? c-key1 c-key2) => bool
...
and then you can define your own collation predicates:

(define (esperanto-string<? s1 s2)
   (collation-key<?
      (string->collation-key s1 esperanto)
      (string->collation-key s2 esperanto)))

or make a macro to define them all at once:

(define-collation-predicates esperanto)
is expanded into
(begin
   (define esperanto-string<? ...)
   ...)
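
As a rough sketch of how this could hang together: the stand-in
definitions below exist only so that the example runs, and they are
not a real collation implementation. They assume, purely for
illustration, that a language specifier is a procedure mapping a
string to its key and that keys are themselves strings. Note also
that SYNTAX-RULES cannot build a name like ESPERANTO-STRING<? out of
ESPERANTO, so this version of the macro takes the predicate names
explicitly; DEMO is a made-up language that merely ignores case.

```scheme
;; Stand-ins (ASSUMPTION, for this sketch only): a language
;; specifier is a procedure from string to key; keys are strings.
(define (string->collation-key str language-specifier)
  (language-specifier str))
(define collation-key<? string<?)
(define (collation-key<=? k1 k2)
  (not (collation-key<? k2 k1)))

;; Define both predicates for LANGUAGE under the given names.
(define-syntax define-collation-predicates
  (syntax-rules ()
    ((_ language less-name less-or-equal-name)
     (begin
       (define (less-name s1 s2)
         (collation-key<? (string->collation-key s1 language)
                          (string->collation-key s2 language)))
       (define (less-or-equal-name s1 s2)
         (collation-key<=? (string->collation-key s1 language)
                           (string->collation-key s2 language)))))))

;; Hypothetical language: the key is the lower-cased text.
(define (demo str)
  (list->string (map char-downcase (string->list str))))

(define-collation-predicates demo demo-string<? demo-string<=?)

;; Keys are ordinary values, so they can be computed once and reused.
(define key-apple (string->collation-key "apple" demo))
(define key-zebra (string->collation-key "Zebra" demo))
```

With these definitions (demo-string<? "apple" "Zebra") is true even
though (string<? "apple" "Zebra") is false under ASCII, and
(collation-key<? key-apple key-zebra) gives the same answer without
recomputing either key.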

-- Sergei

----- Original Message -----
From: Ben Goetter <goetter@xxxxxxxxxxxxxxxx>
To: <srfi-13@xxxxxxxxxxxxxxxxx>
Sent: Friday, March 10, 2000 1:26 PM
Subject: String comparison under Latin-1 and Unicode


> > ... collation and string
> > comparison in the wide Unicode world today. If I can't come up with something
> > reasonable that works in ASCII, Latin-1 *and* a Unicode setting
>
> The STRING>? problem under Unicode differs from the problem under Latin-1
> only in degree.  (Finns and Swedes use a different collation sequence from
> Danes and Norwegians.  "AE" is a ligated character in English, but not in
> Danish.  Spanish vs. French vs Traditional Spanish.  And much, much more.)
> Hence even under Latin-1, STRING>? must take the domain language into
> account.  Unicode merely makes more scripts - and so more languages -
> convenient.
>
> Proposal:
>
> The string comparators take an optional final argument that is not of type
> string, but a new type, language-specifier (abbrev. langid), which specifies
> the language of a block of text.  The procedure CURRENT-LANGUAGE returns the
> langid for whatever language Scheme uses for string comparators lacking this
> optional final argument.  Scheme initially uses some default langid that it
> inherits from its host environment; the procedure DEFAULT-LANGUAGE returns
> the langid for this default.  The procedures CALL-WITH-LANGUAGE langid proc
> and WITH-LANGUAGE langid thunk change the value returned by
> CURRENT-LANGUAGE.  Finally, the procedure LANGUAGE takes the ISO 639
> language code, specified as a string, and returns the correct langid.
> LANGUAGE may be extended to take other values (perhaps a numeric language
> code from the host OS).
>
> This would allow correct collation of text using the current Scheme notion
> of "string."  Building a higher-level "text" abstraction from this is purely
> mechanical.
>
> Ben
>
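
For reference, the dynamic part of the quoted proposal could be
sketched roughly as follows. This is only one possible reading: it
assumes that a langid is simply a symbol and that the host default is
EN, neither of which the quoted text specifies. CALL-WITH-LANGUAGE is
left out because its exact semantics are not spelled out above.

```scheme
;; ASSUMPTION: a langid is represented as a symbol, and the
;; default inherited from the host environment is EN.
(define *default-language* 'en)
(define *current-language* *default-language*)

(define (default-language) *default-language*)
(define (current-language) *current-language*)

;; ISO 639 code, given as a string, to langid.
(define (language code) (string->symbol code))

;; Call THUNK with CURRENT-LANGUAGE returning LANGID, restoring
;; the previous value on any exit (including escapes through
;; continuations).
(define (with-language langid thunk)
  (let ((inner langid) (outer #f))
    (dynamic-wind
      (lambda ()
        (set! outer *current-language*)
        (set! *current-language* inner))
      thunk
      (lambda ()
        (set! inner *current-language*)
        (set! *current-language* outer)))))
```

For example, (with-language (language "sv") current-language) returns
the langid for "sv", and afterwards (current-language) again returns
the default.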