[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: case mappings

Alex Shinn <alexshinn@xxxxxxxxx> writes:

> Because the proper definition is so complicated and slow, yet there
> are many uses of strict ASCII case mapping in computer languages
> and protocols, I think it makes sense to define the core case-mapping
> procedures as ASCII-specific.  Full linguistic case-handling should be
> provided by specialized library procedures which optionally accept locale,
> and only work at the string level, since single-char case-mappings are
> ill-defined.

I agree with almost everything you say, and then you say this part,
with which I earnestly disagree.  As you say, we do not want 90%
support in the core language, but defining ASCII-specific procedures
like this is exactly a 90% solution.  Actually, more like 70%, if

We should provide case-mapping procedures with optional locale
arguments.  The omission of a locale argument should be the same as
giving "current-locale" or "(current-locale)" or some other global
reference of the sort.

We should require a locale which represents the case mappings in
UnicodeData.txt and SpecialCasing.txt.  If we want another locale
which represents "ASCII-specific" case mapping, fine.  But don't make
that normal!

If you have two ways to do things, one standard and always-supported,
and one which is optional and loose, then everyone will use the
standard as a matter of course.  So don't make the standard the Wrong
Thing.  ASCII-specific case mappings are the Wrong Thing.

It would be much better to have no case mapping procedures in the
standard at all than to have half-assed ones.

I would therefore strongly urge the following approach:

All case-related procedures take an optional locale argument.
If the locale argument is omitted, it defaults to some global variable 
  (or the return value of a global procedure).
Specify as standard locales the unicode-default locale, and the
  ascii-only locale.
Specify that the default locale is initialized from the operating
  environment in a system-specific way.  (On Posix systems, it should
  come from LANG.)

Network protocol agents working in known-to-be-restricted character
sets can then just go ahead and switch to the ascii locale.  But the
default should be to just silently do the right thing for the current