
Re: strings draft

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.



> Tom Lord wrote:
>> You have a choice.
>> 
>> 1) Standard Scheme becomes case-sensitive.  May as well drop the case
>>    mappings from the standard entirely, in this case.
>> 
>> 2) Standard Scheme specifies a deterministic case mapping for the
>>    portable character set in which portable programs may be written.
>>
>> 3) Standard Scheme does not provide for portable Scheme source texts.
>> 
>> I pick (2) ....

Alex Shinn wrote:
> As do I, I certainly was not advocating (3) .... I'm not arguing
> either way as to using a default (current-locale), I'm just pointing
> it out as a likely possibility ....

I think to really do a good job of text handling, a procedure must know
the language and encoding for both the source text (parameter values)
and the context (returned values). For example, the rules for embedding
Arabic text (right to left) in a Latin document (left to right) are
slightly different from the converse, IIRC. This suggests an encoding
and processing scheme where every text has an associated locale and
every text-processing procedure has a locale context parameter. For
convenience's sake, that information may be implicit or supplied via
global parameters (e.g., CURRENT-LOCALE), although there are
disadvantages to doing it that way (e.g., changing a global locale can
cause subtle data corruption or information loss problems).
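
To make the "every text carries its locale" idea concrete, here is a
minimal sketch (in Python rather than Scheme, purely for illustration).
The Text record, the upcase name, and the two-letter language tags are
all hypothetical, not part of any proposal; the Turkish dotted/dotless i
is just a well-known case where the language changes the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """A string that carries its own language tag (hypothetical record)."""
    chars: str
    lang: str  # e.g. "en" or "tr"; the tag format is an assumption


def upcase(text: Text) -> Text:
    """Upper-case using the language attached to the text, not a global."""
    if text.lang == "tr":
        # Turkish: dotted i upper-cases to U+0130, dotless U+0131 to I.
        mapped = text.chars.replace("i", "\u0130").replace("\u0131", "I")
        return Text(mapped.upper(), text.lang)
    # Every other language falls back to the language-neutral mapping.
    return Text(text.chars.upper(), text.lang)
```

Because the language travels with the value, changing some global
setting between two calls cannot silently change what upcase returns
for the same text.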

On a slightly different note, there's also the issue of program source vs
program data. Some languages, like C, separate the two. In principle,
that makes it easier to use different environments for compiler hosting,
program hosting, and program data. In practice, I think it causes
confusion more than it helps. Such an approach is even more dubious for
a language like Scheme, where self-hosting or metacircularity is
extremely common (i.e., the compiler uses the same reader both for
interpreting programs and reading program data).

Rather than taking cues from languages like C, it might be better to
look at the prior art in languages where the boundary between "program"
and "data" is less sharp. XML might be a good example. An XML reader may
recognize many languages and encodings, but the reader always begins in
a default, "standard" state that only recognizes a few. That default
state includes a way to specify a different locale as a kind of
"metadata." With this approach, you can write XML code/data in other
locales; the file begins in the standard locale, but you can then
"bootstrap" the reader into a different locale.

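For reference, a small sketch of the XML behavior using Python's
standard xml.etree.ElementTree module. Note that XML splits what I'm
calling "locale" into two pieces: the encoding goes in the XML
declaration, and the language goes in the xml:lang attribute. The sample
document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Bytes as they would sit in a file: the declaration itself is plain ASCII,
# and it tells the parser how to decode everything after it.
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><p xml:lang="fr">caf\xe9</p>'

root = ET.fromstring(doc)

# ElementTree exposes the predeclared xml: prefix by its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

text = root.text           # decoded using the declared encoding
lang = root.get(XML_LANG)  # language tag carried as metadata on the element
```

The byte 0xE9 is only readable as "e with acute accent" because the
ASCII-only declaration told the parser which encoding to switch to.
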
A Scheme reader could use the same technique. External representations
are in the "default Scheme source locale" by default, but they can
include metadata sexps to boot the reader into a different locale.
(Implementations may also provide extensions to change the default
locale.) This gives users a few options for making their source code and
program data portable between systems (in order of decreasing
portability):

1. Always use the standard Scheme locale. Any Scheme reader should be
   able to process your code/data, so long as the system supports a few
   basic assumptions (e.g., files are readable as octet streams).

2. Use your native language, and include the locale metadata at the
   start of the file. For example, wrap the file with something like

       #,(LOCALE UTF-8 EN-US ( ... ))

3. Use your native language, and rely on local system conventions to
   change the default Scheme locale. For example, a Scheme interpreter
   on a Linux system might recognize

       LANG=en_US.UTF-8 scheme program ....
   
   as a valid way to start the interpreter with its reader in
   UTF-8-encoded US English mode. This method is tricky, because it
   makes it harder to specify different locales for program source and
   program data.
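
A rough sketch of how a reader might bootstrap itself under these three
options, again in Python for illustration only. Everything here is an
assumption: the header pattern is modeled loosely on the #,(LOCALE ...)
example above, ("ascii", "en") is only a placeholder for whatever the
"standard Scheme locale" turns out to be, and the rule that an in-file
header overrides the default is my guess at sensible precedence.

```python
import re

# Hypothetical header syntax; only the first line is examined, and the
# header itself must be pure ASCII so any reader can find it.
HEADER = re.compile(rb"#,\(LOCALE\s+([A-Za-z0-9_-]+)\s+([A-Za-z0-9_-]+)")

STANDARD = ("ascii", "en")  # placeholder for the standard locale


def bootstrap(octets: bytes, default=STANDARD):
    """Return (encoding, language, decoded_text) for a source file."""
    first_line = octets.split(b"\n", 1)[0]
    match = HEADER.match(first_line)
    if match:
        # Option 2: the file declares its own locale.
        encoding = match.group(1).decode("ascii").lower()
        language = match.group(2).decode("ascii")
    else:
        # Option 1 (or option 3, if the caller changed the default).
        encoding, language = default
    return encoding, language, octets.decode(encoding)
```

Option 3 would then amount to the caller passing a different default,
e.g. bootstrap(data, default=("utf-8", "en_US")), derived from LANG or
whatever the local system prefers.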

The XML "locale metadata" technique isn't perfect, but it seems like a
reasonable way to provide locale flexibility in program code and
data. Unfortunately, I haven't had much experience with it; any comments
from people who have actually used this facility?
-- 
Bradd W. Szonye
http://www.szonye.com/bradd