Re: upcoming revision, need feedback

Derick Eddington wrote:
> I think the pathname component separators do need to be defined.
> [...] if they're undefined, the encoded set would not be clearly,
> precisely, completely specified.

The current draft defines the encoded set as <a list of chars and the
path separator>. The set of path separators depends on the platform [1],
but the set of encoded characters should not (for portability reasons).
So you must include every possible separator from every supported
platform in the encoded set -- after that, specifying each of them
separately serves no purpose.

But in the end this point is of little importance; I will not object
either way.

[1] E.g. Windows uses both forward and back slashes as path separators.
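
To make the separator argument concrete, here is a minimal Python sketch
(not part of the draft) that collects the separators of every supported
platform into one platform-independent set. It assumes the supported
platforms are just POSIX and Windows; all_separators is a hypothetical
name.

```python
import ntpath
import posixpath

def all_separators():
    """Union of path separators over the assumed platforms (POSIX, Windows)."""
    seps = set()
    for mod in (posixpath, ntpath):
        seps.add(mod.sep)
        # altsep is None on platforms that have only one separator.
        if mod.altsep is not None:
            seps.add(mod.altsep)
    return seps
```

Once the encoded set contains this union on every platform, the result
no longer depends on where the encoder runs.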

>>> 7) Add #\; to the set of encoded characters, because a directory could be both
>>> in the SCHEME_LIB_PATH sequence and correspond to a library name component.
>>> Such a directory with a name including #\; is unusual but must be supported,
>>> otherwise an unencoded #\; would be misinterpreted in SCHEME_LIB_PATH.
>> I heard that when you strive to fail safety it's best to enumerate
>> allowed things, not the forbidden ones. 
> I don't think that justifies what you suggest below.

It is generally hard to list all the failure conditions, but easy to
list success conditions.

Let me illustrate: ~ is missing from the encoded set, although Windows
treats that character specially (e.g. "PROGRA~1" is a short alias for
the first file whose name starts with "Progra").

Another example is ¥ (U+00A5). When represented in Japanese cp-932 it
maps to #x5C (just as \ does in ASCII), which is treated as a path
separator. Because of this some programs (e.g. Cygwin) will choke on
filenames with U+00A5 when cp-932 is your local codepage, even though
U+00A5 itself is perfectly legal. This also applies to ₩ (U+20A9) in
Korean (cp-949), and possibly more.

>> How about "Encode everything
>> except for [a-zA-Z0-9_.-]"? It's safe, short, simple and works for 99%
>> of libraries without any encoding at all.
> Other cultures' characters must be usable unencoded, especially since
> the targeted file systems support using them, and we want other
> cultures' use of Scheme to not be discriminated against growing to be
> more than 1% of libraries.

FWIW, using non-ASCII symbols in source files is widely considered bad
manners in my culture. So while I do recognize value in not needing to
encode these symbols, I won't complain much about the discrimination.
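
For reference, the "encode everything except [a-zA-Z0-9_.-]" rule can be
sketched in a few lines of Python. This is only an illustration: the
escape syntax used here (% followed by two hex digits for each UTF-8
byte) is my assumption, not something the draft specifies, and
encode_component is a hypothetical name.

```python
import re

# The allowlist from the proposal; any character outside it is encoded.
SAFE = re.compile(r"[a-zA-Z0-9_.-]")

def encode_component(name: str) -> str:
    """Encode one library name component using the allowlist rule."""
    out = []
    for ch in name:
        if SAFE.fullmatch(ch):
            out.append(ch)
        else:
            # Assumed escape syntax: %XX for each UTF-8 byte of the character.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)
```

Under this rule #\;, ~, both slashes, ¥ and the escape character %
itself are all encoded without anyone having to think of them first,
while a name such as "srfi-1" passes through unchanged.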

Also note that file system support for localized characters in Windows
is (was?) problematic since it uses the local codepage in many places.
Due to this a filename with a Ukrainian 'і' (U+0456) is not accessible
via an SMB mount from a Windows machine with Russian settings [2].

[2] Once upon a time this bit a fair share of accountants in Ukraine.