Re: upcoming revision, need feedback

This page is part of the web mail archives of SRFI 103 from before July 7th, 2015. The new archives for SRFI 103 contain all messages, not just those from before July 7th, 2015.

Derick Eddington wrote:
> I think the pathname component separators do need to be defined.
> [...] if they're undefined, the encoded set would not be clearly,
> precisely, completely specified.

The current draft defines the encoded set as <a list of chars and the
path separator>. The set of path separators depends on the platform [1],
but the set of encoded characters should not (for portability reasons).
So you must include all the possible separators from all the supported
platforms in the encoded set -- after that, specifying each of them
separately serves no purpose.

But in the end this point is of little importance; I will not object
either way.

[1] E.g. Windows uses both forward and back slashes as path separators.
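
To illustrate the point about platform-dependent separators, here is a
minimal Python sketch. It assumes only two supported platforms (POSIX
and Windows) and takes their separators from the standard posixpath and
ntpath modules; the function name is hypothetical. The portable encoded
set would have to contain the union:

```python
import ntpath
import posixpath


def all_path_separators():
    """Union of the path separators of every supported platform.

    Assumption: "supported platforms" means POSIX and Windows only.
    """
    separators = set()
    for flavour in (posixpath, ntpath):
        separators.add(flavour.sep)
        # altsep is None on platforms that have a single separator.
        if flavour.altsep is not None:
            separators.add(flavour.altsep)
    return separators
```

With these two platforms the result contains both the forward slash and
the backslash, regardless of which platform runs the code.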

>>> 7) Add #\; to the set of encoded characters, because a directory could be both
>>> in the SCHEME_LIB_PATH sequence and correspond to a library name component.
>>> Such a directory with a name including #\; is unusual but must be supported,
>>> otherwise an unencoded #\; would be misinterpreted in SCHEME_LIB_PATH.
>> I heard that when you strive to fail safety it's best to enumerate
>> allowed things, not the forbidden ones. 
> I don't think that justifies what you suggest below.

It is generally hard to list all the failure conditions, but easy to
list success conditions.

Let me illustrate: ~ is missing from the encoded set, even though Windows
treats that character specially (e.g. "PROGRA~1" is a shortcut to the
first file starting with "Progra").

Another example is ¥ (U+00A5). When represented in Japanese cp-932 it
maps to #x5C (just as \ does in ascii), which is treated as a path
separator. Because of this some programs (e.g. Cygwin) will choke on
filenames with U+00A5 when cp-932 is your local codepage, even though
U+00A5 itself is perfectly legal. This also applies to ₩ (U+20A9) in
Korean (cp-949), and possibly more.

>> How about "Encode everything
>> except for [a-zA-Z0-9_.-]"? It's safe, short, simple and works for 99%
>> of libraries without any encoding at all.
> Other cultures' characters must be usable unencoded, especially since
> the targeted file systems support using them, and we want other
> cultures' use of Scheme to not be discriminated against growing to be
> more than 1% of libraries.

FWIW, using non-ascii symbols in source files is widely considered bad
manners in my culture. So while I do recognize value in not needing to
encode these symbols, I won't complain much about the discrimination.
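
For concreteness, the "encode everything except [a-zA-Z0-9_.-]" rule
quoted above could look like the following Python sketch. The function
name is hypothetical, and the output format (percent sign plus two hex
digits for each UTF-8 byte) is an assumption made for illustration, not
a quotation from the draft:

```python
import re

# Allowlist from the proposal: only these characters stay unencoded.
_SAFE = re.compile(r"[A-Za-z0-9_.-]")


def encode_component(name: str) -> str:
    """Encode every character outside the allowlist.

    Assumption: an encoded character is written as %XX for each byte
    of its UTF-8 representation.
    """
    out = []
    for ch in name:
        if _SAFE.fullmatch(ch):
            out.append(ch)
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)
```

Since % is not in the allowlist it is itself encoded, so the mapping
stays unambiguous. Problem characters such as ; ~ / \ and U+00A5 are
covered without anyone having to enumerate them.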

Also note that file system support for localized characters in Windows
is (was?) problematic since it uses the local codepage in many places. Due
to this a filename with a Ukrainian 'і' (U+0456) is not accessible via
an SMB mount from a Windows machine with Russian settings [2].

[2] Once upon a time this bit a fair share of accountants in Ukraine.