Re: text processes vs. string procedures



   From: "Sergei Egorov" <esl@xxxxxxxxxxxxxxx>
   I understand your concern; many people do use ASCII and Latin-1 case
   mapping and are happy with what they get from the good old char-upcase and
   char-downcase.  And I am not against char-upcase and char-downcase as long
   as their definition is limited to ASCII; otherwise you will have to ignore
   three problems mentioned in the Unicode book: uppercase I may map to either
   i or dotless i (in Turkish), two uppercase letters SS may map to a single
   lowercase sharp s in German, and this thing with French \'e. We are lucky
   that there are just three problems with case folding, but collation is
   *much* worse. My suggestion would be to restrict char-upcase,
   char-downcase, and their derivatives to ASCII and explicitly specify that
   string>? and other comparisons are based on mechanical code-point
   comparison that might not correspond to any 'natural' comparison in a real
   language. This approach makes the library reasonably useful, simple to
   implement, and really fast. I believe that attempting to define
   language-dependent interface to collation based on strings is wrong:
   collation works best when it deals with language-specific units larger than
   one character, and the 'text' abstraction suits this task much better.

Wait wait wait -- I am *not* proposing CHAR-UPCASE and CHAR-DOWNCASE.
These procedures are *not* part of SRFI-13. You are quite right -- they have
real problems with non-ASCII char encodings. What is in SRFI-13 is
      STRING-UPCASE
      STRING-DOWNCASE
      STRING-TITLECASE
These can handle the various issues involved in case-mapping text (e.g.,
upcasing the German eszett, which expands to two chars; Greek sigma downcasing
in a context-dependent way; titlecasing compound chars like "fi" or "dz"). No
problem. Unicode TR 21 explains clearly and carefully how to do it for
Unicode.
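
To make the three cases above concrete, here is a rough sketch. It is in
Python rather than Scheme, purely because Python's built-in str methods
already apply full Unicode case mappings; the function names below are
illustrative stand-ins, not the SRFI-13 procedures themselves, and nothing
here is taken from a SRFI-13 implementation.

```python
# Sketch only: Python's built-in str methods stand in for string-level
# case mapping. The function names are hypothetical and merely echo
# STRING-UPCASE, STRING-DOWNCASE and STRING-TITLECASE.

def string_upcase(s):
    # String -> string: the result may be longer than the input.
    return s.upper()

def string_downcase(s):
    # Context-dependent: capital sigma at the end of a word is
    # lowercased to final sigma (U+03C2), otherwise to U+03C3.
    return s.lower()

def string_titlecase(s):
    # Uses titlecase forms, which differ from uppercase for digraphs
    # and ligatures.
    return s.title()

upcased = string_upcase("stra\u00dfe")                    # eszett expands
downcased = string_downcase("\u039f\u0394\u039f\u03a3")   # Greek, ends in sigma
titled_dz = string_titlecase("\u01c6")                    # dz-with-caron digraph
titled_fi = string_titlecase("\ufb01")                    # fi ligature
```

None of these results can be produced by a char->char procedure: the first
changes the length, the second depends on the neighbouring chars, and the
last two need a titlecase form distinct from the uppercase one.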

Note also that I punted the side-effecting STRING-UPCASE! et al. because
of the one-char->two-char case mapping issues. 
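
A small sketch of why the in-place variants are awkward, again in Python as
a stand-in for Scheme. The helper name fits_in_place is hypothetical; it
just checks whether every char of a string upcases to exactly one char, which
is the condition a fixed-length, mutate-in-place upcase would need.

```python
# Sketch only: checks whether a string could be upcased slot by slot
# in a fixed-length buffer. fits_in_place is a hypothetical helper.

def fits_in_place(s):
    # True only if each char's uppercase mapping is exactly one char long.
    return all(len(ch.upper()) == 1 for ch in s)

ascii_ok = fits_in_place("strasse")       # every char maps one-to-one
eszett_ok = fits_in_place("stra\u00dfe")  # eszett needs two slots
growth = len("stra\u00dfe".upper()) - len("stra\u00dfe")
```

When the check fails, the upcased text simply does not fit in the original
storage, so a side-effecting procedure would have to either reallocate or
give a wrong answer.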

Your general point about these operations no longer being simply
char->char, but being string->string or text->text is right on the money.

However, I have nothing intelligent to say about collation and string
comparison in the wide Unicode world today. If I can't come up with something
reasonable that works in ASCII, Latin-1 *and* a Unicode setting, I'll punt the
string-comparison functions, which I think would be a huge blow to the
library.  
    -Olin