
Re: text processes vs. string procedures

This page is part of the web mail archives of SRFI 13 from before July 7th, 2015. The new archives for SRFI 13 contain all messages, not just those from before July 7th, 2015.



Olin Shivers writes:
[...] 
> - However, I think case-mapping and string-comparison are basic things, and
>   they can be given a generic, portable definition independent of the
>   underlying character encoding. Case-mapping does *not* require strings to be
>   well-formed text. ASCII, Latin-1 and Unicode all provide a clear,
>   language-independent definitions of this operation.
> 
>   I don't want the string library to be minimal. I want it to be useful.
>   People -- many of whom currently program with Latin-1 or ASCII Schemes --
>   case-map and compare strings frequently. These operations can be provided
>   with an API which is portable across ASCII, Latin-1 and Unicode. So there's
>   no barrier here.

I understand your concern; many people do use ASCII and Latin-1 case mapping
and are happy with what they get from the good old char-upcase and char-downcase.
And I am not against char-upcase and char-downcase as long as their definition
is limited to ASCII; otherwise you will have to ignore three problems
mentioned in the Unicode book: uppercase I may map to either i or dotless i
(in Turkish), two uppercase letters SS may map to a single lowercase
sharp s in German, and French \'e may be uppercased with or without
its accent. We are lucky that
there are just three problems with case folding, but collation is
*much* worse. My suggestion would be to restrict char-upcase,
char-downcase, and their derivatives to ASCII and explicitly
specify that string>? and other comparisons are based on
mechanical code-point comparison that might not correspond
to any 'natural' comparison in a real language. This approach
makes the library reasonably useful, simple to implement, and
really fast. I believe that attempting to define a language-dependent
interface to collation based on strings is wrong: collation works
best when it deals with language-specific units larger than one
character, and the 'text' abstraction suits this task much better.
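
To make the suggestion above concrete, here is a minimal sketch of
ASCII-only case mapping and mechanical code-point comparison. It is
written in Python rather than Scheme purely for illustration, and the
names ascii_upcase, ascii_downcase, and codepoint_less are hypothetical;
none of them is part of SRFI 13 or any Scheme report.

```python
def ascii_upcase(s: str) -> str:
    """Upcase a-z only; every other code point passes through unchanged."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def ascii_downcase(s: str) -> str:
    """Downcase A-Z only; every other code point passes through unchanged."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def codepoint_less(a: str, b: str) -> bool:
    """Mechanical comparison: lexicographic order on code-point sequences."""
    return [ord(c) for c in a] < [ord(c) for c in b]


# Restricted to ASCII, case mapping never changes length or needs a locale.
print(ascii_upcase("stra\u00dfe"))   # the sharp s is left alone
print(ascii_downcase("ISTANBUL"))    # always plain i, never dotless i

# Code-point order is well defined, but it is not dictionary order.
print(codepoint_less("Zebra", "apple"))       # 'Z' (90) precedes 'a' (97)
print(codepoint_less("zebra", "\u00e9cole"))  # 'z' (122) precedes e-acute (233)
```

Both operations work one character at a time with no locale and no
lookahead, which is what keeps them simple and fast. The price is that
the results are only meaningful for ASCII letters, and the ordering
need not match what a reader of any particular language would expect.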