This page is part of the web mail archives of SRFI 13 from before July 7th, 2015. The new archives for SRFI 13 contain all messages, not just those from before July 7th, 2015.
I don't agree with this proposal. It seems to me that STRING<? and its relatives are better left for trivial tasks like sorting strings of digits. They have a simple definition based on CHAR<?, which in turn is based on the internal encoding (ASCII or Unicode). Such a predicate is still very useful as an ordering with no language-dependent meaning: for example, if you want to implement string sets as sorted lists, it is much better to use a fast ordering predicate, even if the induced ordering makes no linguistic sense.

On the other hand, some Schemes have already extended these predicates to accept more than two arguments, to make them similar to < and the other numeric comparisons (< tests that its arguments are monotonically increasing, > that they are monotonically decreasing, and so on).

I would suggest using new names for collation predicates, especially because collation is actually a complex process involving the generation of "collation keys", which can be reused:

    (string->collation-key str language-specifier) => c-key
    (collation-key<? c-key1 c-key2) => bool
    (collation-key<=? c-key1 c-key2) => bool
    ...

Then you can define your own collation predicates:

    (define (esperanto-string<? s1 s2)
      (collation-key<? (string->collation-key s1 esperanto)
                       (string->collation-key s2 esperanto)))

or make a macro to define them all at once:

    (define-collation-predicates esperanto)

which expands into

    (begin
      (define esperanto-string<? ...)
      ...)

-- Sergei

----- Original Message -----
From: Ben Goetter <goetter@xxxxxxxxxxxxxxxx>
To: <srfi-13@xxxxxxxxxxxxxxxxx>
Sent: Friday, March 10, 2000 1:26 PM
Subject: String comparison under Latin-1 and Unicode

> > ... collation and string
> > comparison in the wide Unicode world today. If I can't come up with something
> > reasonable that works in ASCII, Latin-1 *and* a Unicode setting
>
> The STRING>? problem under Unicode differs from the problem under Latin-1
> only in degree. (Finns and Swedes use a different collation sequence from
> Danes and Norwegians. "AE" is a ligated character in English, but not in
> Danish. Spanish vs. French vs Traditional Spanish. And much, much more.)
> Hence even under Latin-1, STRING>? must take the domain language into
> account. Unicode merely makes more scripts - and so more languages -
> convenient.
>
> Proposal:
>
> The string comparators take an optional final argument that is not of type
> string, but a new type, language-specifier (abbrev. langid), which specifies
> the language of a block of text. The procedure CURRENT-LANGUAGE returns the
> langid for whatever language Scheme uses for string comparators lacking this
> optional final argument. Scheme initially uses some default langid that it
> inherits from its host environment; the procedure DEFAULT-LANGUAGE returns
> the langid for this default. The procedures CALL-WITH-LANGUAGE langid proc
> and WITH-LANGUAGE langid thunk change the value returned by
> CURRENT-LANGUAGE. Finally, the procedure LANGUAGE takes the ISO 639
> language code, specified as a string, and returns the correct langid.
> LANGUAGE may be extended to take other values (perhaps a numeric language
> code from the host OS).
>
> This would allow correct collation of text using the current Scheme notion
> of "string." Building a higher-level "text" abstraction from this is purely
> mechanical.
>
> Ben
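
For illustration only, the quoted proposal might be sketched roughly as follows. This is a toy under stated assumptions, not anything either message specifies: it assumes an R7RS implementation with Unicode strings, models CURRENT-LANGUAGE as an R7RS parameter object, and reduces a langid to an ISO 639 code plus an alphabet listed in collation order. The name string<?/lang, the langid record layout, and language-table are hypothetical; the proposal itself would extend STRING<? directly, and real collation needs far more than a per-character alphabet.

```scheme
(import (scheme base))

;; Hypothetical langid: an ISO 639 code plus the alphabet in collation order.
(define-record-type langid
  (make-langid code alphabet)
  langid?
  (code langid-code)
  (alphabet langid-alphabet))

;; Toy table of known languages (an assumption made for this sketch).
(define language-table
  (list (make-langid "en" "abcdefghijklmnopqrstuvwxyz")
        (make-langid "eo" "abcĉdefgĝhĥijĵklmnoprsŝtuŭvz")))

;; LANGUAGE: ISO 639 code (a string) -> langid.
(define (language code)
  (let loop ((ls language-table))
    (cond ((null? ls) (error "unknown language code" code))
          ((string=? (langid-code (car ls)) code) (car ls))
          (else (loop (cdr ls))))))

;; DEFAULT-LANGUAGE: fixed to "en" here; a real system would ask the host.
(define default-langid (language "en"))
(define (default-language) default-langid)

;; CURRENT-LANGUAGE as a parameter; WITH-LANGUAGE rebinds it around a thunk.
(define current-language (make-parameter default-langid))
(define (with-language lang thunk)
  (parameterize ((current-language lang)) (thunk)))

;; Rank of a character: its index in the alphabet; characters outside
;; the alphabet sort after it, in code point order.
(define (char-rank c lang)
  (let* ((alphabet (langid-alphabet lang))
         (n (string-length alphabet)))
    (let loop ((i 0))
      (cond ((= i n) (+ n (char->integer c)))
            ((char=? c (string-ref alphabet i)) i)
            (else (loop (+ i 1)))))))

;; Two-string comparator with an optional trailing langid.
(define (string<?/lang s1 s2 . maybe-lang)
  (let ((lang (if (null? maybe-lang) (current-language) (car maybe-lang)))
        (n1 (string-length s1))
        (n2 (string-length s2)))
    (let loop ((i 0))
      (cond ((= i n2) #f)   ; s2 exhausted: s1 is not smaller
            ((= i n1) #t)   ; s1 is a proper prefix of s2
            (else
             (let ((r1 (char-rank (string-ref s1 i) lang))
                   (r2 (char-rank (string-ref s2 i) lang)))
               (cond ((< r1 r2) #t)
                     ((> r1 r2) #f)
                     (else (loop (+ i 1))))))))))
```

With these definitions, (string<?/lang "ĉar" "dek") uses the default "en" langid, where ĉ falls outside the alphabet and so sorts after d, while (string<?/lang "ĉar" "dek" (language "eo")) and (with-language (language "eo") (lambda () (string<?/lang "ĉar" "dek"))) both place ĉ between c and d. Note that the optional trailing argument only works here because the comparator takes exactly two strings.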