
Re: Parsing Scheme [was Re: strings draft]



Tom Lord <lord@xxxxxxx> writes:

> We should also point readers in general to:
> 
>   http://www.unicode.org/reports/tr15/#Programming_Language_Identifiers
> 
> which is Annex 7 ("Programming Language Identifiers") of Unicode
> Technical Report 15 ("Unicode Normalization Forms").

Yes.  I think the Unicode suggestions for programming language
identifiers are good ones, and we should both point to them and
strongly suggest their use.  I'm not quite prepared to say that we
should standardize Scheme to require them (even in Unicode-based
implementations).

> * (identifier? s) => <bool>

This is fine.  An implementation should be allowed to always return #t
from this function, even though not every such string could be parsed
as an identifier by the reader.  (This is for the sake of eval, at least.)
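Under that reading, even the degenerate implementation is conforming -- a
sketch, assuming the allowance above is adopted (the definition below is
hypothetical, not from the draft):

```scheme
;; A minimal, conforming (if unhelpful) implementation: every string is
;; accepted, even ones the reader could not parse as an identifier.
(define (identifier? s) #t)
```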

>      The definition of FOLD-IDENTIFIER must be consistent with the
>      recommendations of Annex 7 ("Programming Language Identifiers") of
>      Unicode Technical Report 15 for identifier names comprised
>      entirely of Unicode characters.  

Again, I would suggest that we merely advocate this, but not require it.

>      For this purpose, the characters
>      of the portable Scheme character set are considered to be Unicode
>      characters.  (A short summary of the implications of this
>      requirement for portable identifiers is that given a portable
>      identifier, FOLD-IDENTIFIER must map #\A..#\Z to #\a..#\z.)

On the other hand, we should certainly specify exactly the behavior of
the function for the required character set, agreed.

>      (FOLD-IDENTIFIER is preferable to STRING-ID=? because it 
>      produces a canonical form of each identifier explicitly 
>      rather than implicitly.   The canonical form is useful because
>      it can be hashed, stored in a trie, etc.   It would be
>      impractical to implement, for example, a symbol table in a
>      compiler given only STRING-ID=?.)

I think my worry is that it is not obvious that an implementation even
has an implicit folding available -- at least, not cheaply.  There
should perhaps be a hash function to go with string-id=? to help.

Many implementations will of course implement these things by
folding.  But if you really think that string-id=? should be allowed
to implement arbitrary equivalence classes (provided that the standard
character set works right), then it isn't obvious to me that
fold-identifier can be cheap; it might well be more expensive than
whatever straightforward test is used.
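For the implementations that do fold, a hash consistent with
string-id=? falls out of fold-identifier directly -- a sketch, assuming
fold-identifier and a plain string-hash exist under those names:

```scheme
;; Hypothetical: hashing the canonical (folded) form is automatically
;; consistent with string-id=?, i.e.
;;   (string-id=? a b)  implies  (= (string-id-hash a) (string-id-hash b)).
(define (string-id-hash s)
  (string-hash (fold-identifier s)))
```

The point of the worry above is that an implementation using some other
equivalence test has no such shortcut.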

> * (concatenate-identifiers s0 s1 ...) => id
> 
>      Return a string ID, containing an identifier name which
>      is the concatenation of the arguments which must themselves
>      be identifier names.

>      (As nearly as I can tell, CONCATENATE-IDENTIFIERS is needed
>      because IDENTIFIER? won't be closed under STRING-APPEND -- but
>      I could be mistaken about that.  More research is needed.)

In the cases where identifier? isn't closed under string-append,
concatenate-identifiers might need to do more work than just
concatenate.  (What does "the concatenation of the arguments" mean, if
not string-append?)
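One reading is that concatenate-identifiers appends and then
re-canonicalizes -- a sketch, assuming fold-identifier is the
canonicalizer (the definition is hypothetical):

```scheme
;; Append, then restore the canonical form, since the naive
;; concatenation of two folded identifiers need not itself be folded
;; (e.g. where folding is context-sensitive).
(define (concatenate-identifiers . ids)
  (fold-identifier (apply string-append ids)))
```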

> * (char-id-start? c) => <bool>
>   Return #t if C is a valid first character in an identifier.
> 
> * (char-id-extend? c) => <bool>
>   Return #t if C is a valid non-first character in an identifier.

These may be contextual.  A character may be allowed at the beginning
of an identifier, but only if something else is true later on.
(Consider the "if it's not a number, it's an identifier" rule of the
current standard.)

Perhaps a system might want to have functions like this, but I'd like
to see more experience before standardizing something.
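To make the context-dependence concrete: under the current lexical
rules, whether a token is an identifier can depend on the whole token,
not just its first character -- a sketch (token-identifier? is a
hypothetical name):

```scheme
;; #\+ would satisfy any reasonable char-id-start? ("+" is an
;; identifier), yet "+1" is a number, not an identifier.  So
;; char-id-start? alone cannot decide; the rest of the token matters.
(define (token-identifier? s)
  (and (positive? (string-length s))
       (not (string->number s))))   ; "if it's not a number..."
```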

> What about case independent character ordering (e.g., CHAR-CI<? and
> STRING-CI<?)?  I see no compelling reason to eliminate them at this
> stage -- they're still useful.  I think they should be specified to be
> consistent with the single-character default case foldings of Unicode,
> where the portable character set is considered to consist of Unicode
> characters.  This will allow portable Scheme programs to use these
> procedures to write programs which accurately manipulate Scheme
> programs that use nothing but the portable character set.  

string-ci<? is fine, but must have a locale argument.  If you want to
have a standardly specified "default case foldings of Unicode" locale,
that's fine with me.  Ditto for char-ci<?.

> What about case mappings (CHAR-UPCASE and CHAR-DOWNCASE).  Again:
> retain them;  specify them as using the Unicode single character
> mappings; permit implementations to add parameters or new procedures
> -- the result allows portable Scheme programs to handle portable
> Scheme program texts and captures a useful Unicode text process.

No, no, no.  Don't make functions that are known to be wrong.  This is
a bad idea.  It's like requiring < to work for complex numbers, and
then comparing magnitude, and saying "well, that's close enough".
It's not.

You can case map strings, and this should certainly be allowed.  It
should also have a locale argument.

You cannot sensibly case-map characters except in the "unicode single
character mappings" locale; and why should we have specially
privileged functions there?  It will only encourage people to *use*
the functions, and their code will then be non-portable precisely when
it matters.

At the very least, make it allowed for char-upcase to simply fail to
give any answer, and provide a locale argument.  Or allow char-upcase
to return a string.
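The classic case is U+00DF (ß), whose uppercase is the two-character
sequence "SS" -- a sketch of the string-returning variant suggested
above (char-upcase* is a hypothetical name, and the table is reduced to
the one case for illustration):

```scheme
;; A case-mapping that is honest about expansion: one character may
;; uppercase to a multi-character string.
(define (char-upcase* c)
  (if (char=? c (integer->char #xDF))   ; U+00DF LATIN SMALL LETTER SHARP S
      "SS"                              ; one character in, two out
      (string (char-upcase c))))
```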

> A final note: the desirability of the -CI, -UPCASE, and -DOWNCASE
> procedures hinges on the assumption that the portable Scheme character
> set is a proper subset of Unicode.   

I'm assuming that (or at least, I want to make it possible), but I do
*not* think that char-upcase and char-downcase are good ideas.

string-upcase and string-downcase, by contrast, are unobjectionable,
provided they get a locale argument.

Thomas