
Re: text processes vs. string procedures

This page is part of the web mail archives of SRFI 13 from before July 7th, 2015. The new archives for SRFI 13 are here. Eventually, the entire history will be moved there, including any new messages.



   From: "Sergei Egorov" <esl@xxxxxxxxxxxxxxx>
   I understand your concern; many people do use ASCII and Latin-1 case
   mapping and are happy with what they get from the good old char-upcase and
   char-downcase.  And I am not against char-upcase and char-downcase as long
   as their definition is limited to ASCII; otherwise you will have to ignore
   three problems mentioned in the Unicode book: uppercase I may map to either
   i or dotless i (in Turkish), two uppercase letters SS may map to a single
   lowercase sharp s in German, and this thing with French é. We are lucky
   that there are just three problems with case folding, but collation is
   *much* worse. My suggestion would be to restrict char-upcase,
   char-downcase, and their derivatives to ASCII and explicitly specify that
   string>? and other comparisons are based on mechanical code-point
   comparison that might not correspond to any 'natural' comparison in a real
   language. This approach makes the library reasonably useful, simple to
   implement, and really fast. I believe that attempting to define
   language-dependent interface to collation based on strings is wrong:
   collation works best when it deals with language-specific units larger than
   one character, and the 'text' abstraction suits this task much better.
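To make the quoted proposal concrete, here is a minimal sketch in Python (not Scheme, and not part of SRFI-13). The helper name ascii_upcase is hypothetical; it only illustrates an upcase restricted to ASCII, alongside ordering by raw code point.

```python
def ascii_upcase(ch):
    """Upcase a single character, touching only ASCII a-z (hypothetical helper)."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 32)   # ASCII lowercase and uppercase differ by 32
    return ch                      # everything else is left alone

# Python compares str values code point by code point, which is the
# "mechanical" comparison the quoted text describes.
words = ["zebra", "apple", "Zulu", "äpfel"]
by_code_point = sorted(words)

# "Z" (U+005A) sorts before "a" (U+0061); "ä" (U+00E4) sorts after "z" (U+007A).
print(by_code_point)
```

The resulting order is cheap to compute but matches no natural-language alphabet, which is the trade-off being proposed.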

Wait wait wait -- I am *not* proposing CHAR-UPCASE and CHAR-DOWNCASE.
These procedures are *not* part of SRFI-13. You are quite right -- they have
real problems with non-ASCII char encodings. What is in SRFI-13 is
      STRING-UPCASE
      STRING-DOWNCASE
      STRING-TITLECASE
These can handle the various issues involved in case-mapping text (e.g.,
upcasing the German eszett (ß), which expands to 2 chars; Greek sigma
downcasing in a context-dependent way; titlecasing compound chars like the
"fi" ligature or the "dz" digraph). No problem. Unicode TR 21 explains
clearly and carefully how to do it for Unicode.
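As a rough illustration of string->string case mapping, the sketch below uses Python's built-in str methods as a stand-in. This is not SRFI-13 code, and the results shown in the comments are what CPython 3 produces, not a claim about any Scheme implementation.

```python
# Upcasing: the eszett has no single-character uppercase form here, so the
# result is one character longer than the input.
upper = "straße".upper()

# Downcasing: the final capital sigma becomes the word-final form (U+03C2),
# while the sigmas in the middle become the ordinary form (U+03C3).
lower = "ΟΔΥΣΣΕΥΣ".lower()

# Titlecasing: the digraph U+01C6 (dz with caron) has its own titlecase
# form, U+01C5, distinct from its uppercase form U+01C4.
title = "\u01c6ungla".title()

print(len("straße"), len(upper), upper)
print(lower)
print(title)
```

None of these three results can be produced by mapping characters one at a time without looking at context or changing the length.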

Note also that I punted the side-effecting STRING-UPCASE! et al. because
of the one-char->two-char case mapping issues: the result can be longer
than the original string, so it can't always be written back in place.
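A small sketch of why the in-place variant is awkward, again in Python rather than Scheme. The function upcase_in_place is hypothetical, and a Python list of characters stands in for a fixed-length mutable string.

```python
def upcase_in_place(buf):
    """Upcase a fixed-length list of characters slot by slot.

    Hypothetical helper: raises ValueError when a character's uppercase
    form needs more than one slot, since the buffer is not allowed to grow.
    """
    for i, ch in enumerate(buf):
        up = ch.upper()
        if len(up) != 1:
            raise ValueError(f"{ch!r} upcases to {up!r}, which needs {len(up)} slots")
        buf[i] = up

ok = list("strasse")
upcase_in_place(ok)        # every mapping is one-to-one, so this succeeds

bad = list("straße")
try:
    upcase_in_place(bad)
    failed = False
except ValueError:
    failed = True          # the eszett needs two slots; the buffer has one

print("".join(ok), failed)
```

A pure string->string procedure avoids the problem because it is free to return a string of a different length.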

Your general point about these operations no longer being simply
char->char, but being string->string or text->text is right on the money.

However, I have nothing intelligent to say about collation and string
comparison in the wide Unicode world today. If I can't come up with something
reasonable that works in ASCII, Latin-1 *and* a Unicode setting, I'll punt the
string-comparison functions, which I think would be a huge blow to the
library.  
    -Olin