
Re: Parsing Scheme [was Re: strings draft]

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.





    > From: tb@xxxxxxxxxx (Thomas Bushnell, BSG)

    > There should be string-id=? (or some other name) which implements the
    > Scheme identifier matching rules, which should be specified for the
    > required character set, and left unspecified for all other
    > characters.  

    > None of this requires or even implicitly uses a case mapping function.

    >> The standard would still need to specify CHAR-DOWNCASE.   

    > Why?  Is there some government bureau that will shut us down if the
    > next RnRS eliminates it?

    > I don't mind STRING-DOWNCASE, of course, which should have a locale
    > argument and be specified to permit the Correct Unicode Thing.

Ok -- I think we can agree on some things.   You're roughly right, I
think.

We should also point readers in general to:

  http://www.unicode.org/reports/tr15/#Programming_Language_Identifiers

which is Annex 7 ("Programming Language Identifiers") of Unicode
Technical Report 15 ("Unicode Normalization Forms").

Enclosed is a more fleshed-out and improved description of the
approach you're advocating, plus its reconciliation with my
suggestions for R6RS (which, frankly, don't need to change very much
-- mostly this just involves adding new material).

For SRFI-50 list relevance: let me point out that this doesn't change
the proposed char/string FFI at all.   On the other hand, the fact that
the recommendations for R6RS continue to work out nicely is confirmation
that the analysis that leads to those FFI recommendations is sound.
So far we've more or less made peace with R5RS, my recommendations for
R6RS, Thomas Bushnell's thoughts on supporting linguistically sane
Scheme identifiers, Shiro's concerns about implementations using
character sets other than Unicode and its subsets/extensions, Bear's
work on infinite character sets, and the emerging design of Pika.

I think what Thomas B. is suggesting is better provided by this:


* (identifier? s) => <bool>

    Return #f unless `s' is a legal identifier name.

    It is required that:

	(identifier? (symbol->string s)) => #t 

    for all symbols s.


* (fold-identifier name) => folded

     Where NAME is a string containing an identifier
     name and FOLDED is a string containing an equivalent
     identifier name.

     Two identifiers are equivalent if and only if:

	(string=? (fold-identifier a)
                  (fold-identifier b))

     FOLD-IDENTIFIER is required to be idempotent:

	(string=? (fold-identifier a)
                  (fold-identifier (fold-identifier a)))
        => #t   ; for all identifiers a

     and, of course, IDENTIFIER? is closed under FOLD-IDENTIFIER:

	(or (not (identifier? s))
            (identifier? (fold-identifier s)))
        => #t  ; for all strings s

     The definition of FOLD-IDENTIFIER must be consistent with the
     recommendations of Annex 7 ("Programming Language Identifiers") of
     Unicode Technical Report 15 for identifier names comprised
     entirely of Unicode characters.  For this purpose, the characters
     of the portable Scheme character set are considered to be Unicode
     characters.  (A short summary of the implications of this
     requirement for portable identifiers is that given a portable
     identifier, FOLD-IDENTIFIER must map #\A..#\Z to #\a..#\z.)

     (FOLD-IDENTIFIER is preferable to STRING-ID=? because it 
     produces a canonical form of each identifier explicitly 
     rather than implicitly.   The canonical form is useful because
     it can be hashed, stored in a trie, etc.   It would be
     impractical to implement, for example, a symbol table in a
     compiler given only STRING-ID=?.)
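
     For identifiers drawn only from the portable character set, the
     folding requirement above reduces to lower-casing A..Z.  Here is a
     minimal sketch of that case and of using the canonical form as a
     lookup key.  The names PORTABLE-FOLD-IDENTIFIER,
     PORTABLE-IDENTIFIER-EQUIVALENT? and SYMBOL-TABLE-LOOKUP are
     hypothetical and not part of the proposal, and an association list
     stands in for a real symbol table:

```scheme
;; Hypothetical stand-in for FOLD-IDENTIFIER, valid only for identifier
;; names made of portable Scheme characters: maps #\A..#\Z to #\a..#\z.
(define (portable-fold-identifier name)
  (list->string (map char-downcase (string->list name))))

;; Equivalence of two identifier names, defined via the canonical form.
(define (portable-identifier-equivalent? a b)
  (string=? (portable-fold-identifier a)
            (portable-fold-identifier b)))

;; The folded name is an ordinary string, so it can be used directly as
;; a key.  ASSOC compares keys with EQUAL?, i.e. by string contents.
(define (symbol-table-lookup table name)
  (let ((entry (assoc (portable-fold-identifier name) table)))
    (and entry (cdr entry))))

(define example-table (list (cons "car" 'builtin-car)))

;; Looks up the entry stored under the folded key "car".
(symbol-table-lookup example-table "CAR")
```

     Given only an equivalence predicate, the lookup above would have to
     compare NAME against every key in turn; with the folded string it
     could equally be a hash table or trie key.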




* (concatenate-identifiers s0 s1 ...) => id

     Return a string ID, containing an identifier name which
     is the concatenation of the arguments which must themselves
     be identifier names.

     If all of the arguments are portable Scheme identifiers, then
     this function must behave identically to STRING-APPEND.

     (As nearly as I can tell, CONCATENATE-IDENTIFIERS is needed
     because IDENTIFIER? won't be closed under STRING-APPEND -- but
     I could be mistaken about that.  More research is needed.)



Now, what becomes of the character class procedures such as
CHAR-NUMERIC?  I think that these should be retained and corrected so
that one can write a portable Scheme lexical analyzer which can accept
as input programs using the character set extensions of its host
implementation.  From what I can tell, that would require the new
procedures:


* (char-id-start? c) => <bool>

  Return #t if C is a valid first character in an identifier.


* (char-id-extend? c) => <bool>

  Return #t if C is a valid non-first character in an identifier.


* (canonicalize-identifier s) => ID | #f

  Given a string S comprised of at least one CHAR-ID-START? character
  followed by any number of CHAR-ID-EXTEND? characters, return a
  valid identifier name (in the sense of IDENTIFIER?) corresponding 
  to S or #f if no such identifier name can be constructed.

  If S consists only of portable Scheme characters, the result must
  be STRING=? to S and not EQ? to S.
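
  To illustrate how a portable lexical analyzer might use the two
  character predicates, here is a rough sketch of an identifier
  scanner.  PORTABLE-ID-START? and PORTABLE-ID-EXTEND? are hypothetical
  stand-ins that follow the R5RS <initial> and <subsequent> classes; a
  real CHAR-ID-START? / CHAR-ID-EXTEND? would also cover the host's
  character set extensions.  The peculiar identifiers (+, - and ...)
  are not handled, and SCAN-IDENTIFIER is likewise an assumed name:

```scheme
;; Stand-in for CHAR-ID-START?: a letter or an R5RS special initial.
(define (portable-id-start? c)
  (or (char-alphabetic? c)
      (and (memv c (string->list "!$%&*/:<=>?^_~")) #t)))

;; Stand-in for CHAR-ID-EXTEND?: an initial, a digit, or an R5RS
;; special subsequent.
(define (portable-id-extend? c)
  (or (portable-id-start? c)
      (char-numeric? c)
      (and (memv c (string->list "+-.@")) #t)))

;; Return the index just past the identifier that starts at START in S,
;; or #f if no identifier starts there.
(define (scan-identifier s start)
  (and (< start (string-length s))
       (portable-id-start? (string-ref s start))
       (let loop ((i (+ start 1)))
         (if (and (< i (string-length s))
                  (portable-id-extend? (string-ref s i)))
             (loop (+ i 1))
             i))))

;; Scans "list->string" and stops at the space that follows it.
(scan-identifier "list->string x" 0)
```

  The substring between START and the returned index is what would
  then be handed to CANONICALIZE-IDENTIFIER.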


* (string->parsed-symbol s)

  S must be an IDENTIFIER? string.  Return the symbol that identifier
  would denote if it were used in a quoted context in a Scheme
  expression.  (Note how this differs from STRING->SYMBOL.)


* (string->parsed-character s) => <char> | #f

  Given a string S whose contents are syntactically a character
  constant, return the character that constant denotes or #f if
  there is no such character.
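
  As a rough illustration, restricted to the character syntax R5RS
  already defines ("#\" followed by one character, or by the names
  "space" or "newline", case-insensitively), such a procedure could
  look like the sketch below.  The name
  PORTABLE-STRING->PARSED-CHARACTER is hypothetical, and this sketch
  also returns #f for strings that are not character constants at all,
  which goes beyond what the description above requires:

```scheme
;; Sketch of STRING->PARSED-CHARACTER for R5RS character syntax only.
(define (portable-string->parsed-character s)
  (let ((n (string-length s)))
    (cond ((or (< n 3)
               (not (char=? (string-ref s 0) #\#))
               (not (char=? (string-ref s 1) #\\)))
           #f)                              ; not of the form #\...
          ((= n 3) (string-ref s 2))        ; #\<any character>
          (else
           (let ((name (substring s 2 n)))  ; #\<character name>
             (cond ((string-ci=? name "space") #\space)
                   ((string-ci=? name "newline") #\newline)
                   (else #f)))))))

;; Note the doubled backslash: the string holds the 7 characters #\space.
(portable-string->parsed-character "#\\space")
```

  An implementation with extended character names would add its own
  names to the inner COND.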

If we want to permit extended string syntaxes, at least this is
needed:

* (string->parsed-string s) => <string> | #f

  Given a string S whose contents are syntactically a string
  constant, return the string that constant denotes or #f if there
  is no such string.

Perhaps we'd also want similar procedures for other areas of syntactic
extensibility.

Now, what about the character ordering procedures (e.g. CHAR<?,
STRING<? etc.)?   I think these should remain unchanged -- they should
relate to the integer mappings of characters.  (Implementations or
future standards are free to add locale parameters or introduce
alternative procedures which are linguistically sensitive.)

What about case independent character ordering (e.g., CHAR-CI<? and
STRING-CI<?)?  I see no compelling reason to eliminate them at this
stage -- they're still useful.  I think they should be specified to be
consistent with the single-character default case foldings of Unicode,
where the portable character set is considered to consist of Unicode
characters.  This will allow portable Scheme programs to use these
procedures to write programs which accurately manipulate Scheme
programs that use nothing but the portable character set.  It would,
for example, allow a portable-character-set implementation of
FOLD-IDENTIFIER.  It also reifies into Scheme a sanctioned (even if
non-preferred) sense of Unicode character case -- while Scheme should
_also_ evolve facilities for linguistically preferable case handling,
these facilities will be useful for Scheme programs communicating with
other systems that use only the single-character case mappings.
(Again, implementations and future standards are not precluded from
adding additional parameters or new procedures for default or
locale-specific case handling).

What about case mappings (CHAR-UPCASE and CHAR-DOWNCASE)?  Again:
retain them;  specify them as using the Unicode single-character
mappings; permit implementations to add parameters or new procedures
-- the result allows portable Scheme programs to handle portable
Scheme program texts and captures a useful Unicode text process.

In terms of my "strings draft" -- there is one R6RS recommendation
that should change more substantially than the tweaks suggested above.

I wanted to modify 6.3.4 to say:

     These procedures [the character classes] return #t if their
     arguments are alphabetic, numeric, whitespace, upper case, or
     lower case characters, respectively, otherwise they return #f.
     These procedures _must_ be consistent with the procedure READ
     provided by the implementation.  For example, if a character is
     CHAR-ALPHABETIC?, then it must also be suitable for use as the
     first character of an identifier.

     `a..z' and `A..Z' _must_ be alphabetic and _must_ be respectively
     lower and upper case.  

     #\space, #\tab, and #\formfeed _must_ be CHAR-WHITESPACE?.

     `0..9' _must_ be CHAR-NUMERIC?.

     No character may cause more than one of the procedures
     CHAR-ALPHABETIC?, CHAR-NUMERIC? and CHAR-WHITESPACE? to return
     #t.

     No character may cause more than one of the procedures
     CHAR-UPPER-CASE? and CHAR-LOWER-CASE? to return #t.

     Programmers are advised that these procedures are unlikely to be
     suitable for linguistic programming in portable code while
     implementors are strongly encouraged to define them in ways that
     make them a reasonable approximation of their linguistic
     counterparts.  


It should say:

     These procedures [the character classes] return #t if their
     arguments are valid identifier start characters, valid identifier
     extension characters, alphabetic, numeric, whitespace, upper
     case, or lower case characters, respectively, otherwise they
     return #f.  These procedures _must_ be consistent with the
     procedure READ provided by the implementation.  For example, if a
     character is CHAR-ID-START?, then it must also be suitable for
     use as the first character of an identifier.

     `a..z' and `A..Z' _must_ be id-start and id-extend characters and
     _must_ be respectively lower and upper case.

     `a..z' and `A..Z' _must_ be alphabetic.  If the argument to 
     CHAR-ALPHABETIC? is a Unicode character, then it must return #t
     if and only if the character is in one of the Unicode general
     categories

	Lu Ll Lt Lm Lo Nl


     #\space, #\tab, and #\formfeed _must_ be CHAR-WHITESPACE?.

     `0..9' _must_ be CHAR-NUMERIC?.

     No character may cause more than one of the procedures
     CHAR-ID-START?, CHAR-NUMERIC? and CHAR-WHITESPACE? to return
     #t.

     No character may cause more than one of the procedures
     CHAR-UPPER-CASE? and CHAR-LOWER-CASE? to return #t.

     Programmers are advised that these procedures are unlikely to be
     suitable for linguistic programming in portable code while
     implementors are strongly encouraged to define them in ways that
     make them a reasonable approximation of their linguistic
     counterparts.  
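
The mutual-exclusion constraints in the text above are easy to check
mechanically.  A small sketch follows; COUNT-TRUE and
EXCLUSIVITY-VIOLATIONS are hypothetical names, and the sketch assumes
that INTEGER->CHAR accepts the integers 0..127 and maps them to the
ASCII characters, which the standard does not guarantee.  The standard
predicates are used here, but CHAR-ID-START? could be passed in the
same way once an implementation provides it:

```scheme
;; Count how many of the predicates in PREDS accept character C.
(define (count-true preds c)
  (let loop ((ps preds) (n 0))
    (cond ((null? ps) n)
          (((car ps) c) (loop (cdr ps) (+ n 1)))
          (else (loop (cdr ps) n)))))

;; Return the characters with codes in [0, LIMIT) that satisfy more
;; than one predicate in PREDS.  An empty list means the constraint
;; holds over that range.
(define (exclusivity-violations preds limit)
  (let loop ((i 0) (bad '()))
    (if (= i limit)
        (reverse bad)
        (let ((c (integer->char i)))
          (loop (+ i 1)
                (if (> (count-true preds c) 1) (cons c bad) bad))))))

;; Both results should be empty on an implementation that satisfies
;; the constraints.
(exclusivity-violations
 (list char-alphabetic? char-numeric? char-whitespace?) 128)
(exclusivity-violations
 (list char-upper-case? char-lower-case?) 128)
```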


A final note: the desirability of the -CI, -UPCASE, and -DOWNCASE
procedures hinges on the assumption that the portable Scheme character
set is a proper subset of Unicode.   One can imagine a Scheme standard
that insisted on Unicode, and that required a much larger set of valid
identifier characters.    Though abstractly attractive, such
requirements would preclude tiny implementations of Scheme.   Having a
small and simply structured portable character set, and then adding on
to that a level of _optional_ conformance for all of Unicode, is a far
more practical idea.

-t