
Re: Parsing Scheme [was Re: strings draft]

This page is part of the web mail archives of SRFI 50 from before July 7th, 2015. The new archives for SRFI 50 contain all messages, not just those from before July 7th, 2015.




    > From: tb@xxxxxxxxxx (Thomas Bushnell, BSG)

    > > * (identifier? s) => <bool>

    > This is fine.  An implementation should be allowed to always return #t
    > from this function, even though not every such string could be parsed
    > as an identifier by the reader.  (This for the sake of eval, at least.)

Hmm.... I don't think so.   It should deal with source texts --
eval'able forms being something else.

Which makes me realize, incidentally, that this requirement that I
stated:

    It is required that:

	(identifier? (symbol->string s)) => #t 

    for all symbols s.

is wrong (and should just be dropped).



    > >      The definition of FOLD-IDENTIFIER must be consistent with the
    > >      recommendations of Annex 7 ("Programming Language Identifiers") of
    > >      Unicode Technical Report 15 for identifier names comprised
    > >      entirely of Unicode characters.  

    > Again, I would suggest that we merely advocate this, but not require it.

Things like that can be split into the R6RS part and parts for SRFIs
or later standards.   The key thing is to make sure that nothing R6RS
requires is inconsistent with that report.   The secondary thing is to
guide implementors towards that report.

    > >      (FOLD-IDENTIFIER is preferable to STRING-ID=? because it 
    > >      produces a canonical form of each identifier explicitly 
    > >      rather than implicitly.   The canonical form is useful because
    > >      it can be hashed, stored in a trie, etc.   It would be
    > >      impractical to implement, for example, a symbol table in a
    > >      compiler given only STRING-ID=?.)

    > I think my worry is that it is not obvious that an implementation even
    > has an implicit folding available, at least, not cheaply.  There
    > should perhaps be a hash function to go with string-id=? to help.  

    > Many implementations will of course implement these things by
    > folding.  But if you think that really string-id=? should be allowed
    > to implement arbitrary equivalence classes (provided that the standard
    > character set works right), it isn't obvious to me that
    > fold-identifier can be cheap, and that it might well be more expensive
    > than whatever straightforward test is used.

I'm having trouble imagining an implementation that doesn't have or
couldn't trivially implement a FOLD-IDENTIFIER procedure.
Mathematically, such a procedure is always possible (pick one
representative from each STRING-ID=? equivalence class).  Together,
those two points lead me to prefer the more general FOLD-IDENTIFIER.
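
As a rough illustration of why an explicit canonical form is
convenient, here is a minimal sketch of a symbol table keyed by folded
names.  The FOLD-IDENTIFIER below is only a stand-in (plain
CHAR-DOWNCASE over each character, adequate at best for the portable
character set), and TABLE-INSERT and TABLE-LOOKUP are hypothetical
names, not part of any proposal:

```scheme
;; Stand-in for FOLD-IDENTIFIER (assumption): downcase every character.
;; Only meant to be adequate for the portable character set.
(define (fold-identifier s)
  (list->string (map char-downcase (string->list s))))

;; With a canonical form available, STRING-ID=? is a one-liner.
(define (string-id=? a b)
  (string=? (fold-identifier a) (fold-identifier b)))

;; Toy symbol table: an association list keyed by the folded name.
;; The folded string could just as well be hashed or stored in a trie.
(define (table-insert table name value)
  (cons (cons (fold-identifier name) value) table))

(define (table-lookup table name)
  (let ((entry (assoc (fold-identifier name) table)))
    (if entry (cdr entry) #f)))

(define table (table-insert '() "Foo-Bar" 1))

(table-lookup table "foo-bar")   ; => 1
(table-lookup table "baz")       ; => #f
```

Going the other way, from STRING-ID=? alone, gives no key to hash or
index on, so every lookup would have to compare against each entry.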


    > > * (concatenate-identifiers s0 s1 ...) => id

    > >      Return a string ID, containing an identifier name which
    > >      is the concatenation of the arguments which must themselves
    > >      be identifier names.

    > >      (As nearly as I can tell, CONCATENATE-IDENTIFIERS is needed
    > >      because IDENTIFIER? won't be closed under STRING-APPEND -- but
    > >      I could be mistaken about that.  More research is needed.)

    > In the cases where identifier? isn't closed under string-append,
    > concatenate-identifiers might need to do more work than just
    > concatenate.  

That's right.  That's the rationale for having it instead of relying
on STRING-APPEND.

    > (What does "the concatenation of the arguments" mean, if
    > not string-append?)

It means "do those extra things".  I specifically want to ensure a
mechanism for doing things like making structure access procedure
names derived from structure names.  Absent CONCATENATE-IDENTIFIERS,
this does not appear to be possible except over the portable character
set.
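
For example, a structure-definition macro might build accessor names
roughly as sketched here.  CONCATENATE-IDENTIFIERS is defined as a
bare STRING-APPEND, which is an assumption that holds only for the
portable character set; an implementation with a richer identifier
syntax would do whatever extra work it needs at that point.
ACCESSOR-NAME is a hypothetical helper:

```scheme
;; Assumption: over the portable character set, plain STRING-APPEND
;; is enough.  Other implementations would add their extra steps here.
(define (concatenate-identifiers . ids)
  (apply string-append ids))

;; Hypothetical helper: derive an accessor name from a structure name
;; and a field name.
(define (accessor-name struct-name field-name)
  (concatenate-identifiers struct-name "-" field-name))

(accessor-name "point" "x")   ; => "point-x"
```

The caller never needs to know whether plain concatenation is
sufficient; that knowledge stays inside CONCATENATE-IDENTIFIERS.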


    > > * (char-id-start? c) => <bool>
    > >   Return #t if C is a valid first character in an identifier.

    > > * (char-id-extend? c) => <bool>
    > >   Return #t if C is a valid non-first character in an identifier.

    > These may be contextual.  A character may be allowed in the beginning
    > of an identifier but only if something else is true later on.
    > (Consider the "if it's not a number, it's an identifier" rule of the
    > current standard.)

    > Perhaps a system might want to have functions like this, but I'd like
    > to see more experience before standardizing something.

Disagree.   These are consistent both with Unicode "best practice" and
Scheme syntax.    Recall that CANONICALIZE-IDENTIFIER is permitted to
return #f (analogously to STRING->NUMBER).

(It might be worth explicitly requiring that any numeric syntax
extensions made by an implementation are such that they are consistent
with these definitions.  It's not absolutely necessary but it would
simplify lexing.   In other words:

	(or (not (string->number s))
            (= 0 (string-length s))
            (not (char-id-start? (string-ref s 0)))
            (not (map-and char-id-extend?
                          (string->list (substring s 1 (string-length s))))))
        => #t

for all strings s.)
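
A minimal sketch of how that check could be spelled out, assuming the
two predicates simply follow the R5RS lexical grammar for the portable
character set (initial and subsequent characters).  The definitions of
CHAR-ID-START? and CHAR-ID-EXTEND? here are that assumption, and ALL?,
IDENTIFIER-SHAPED? and CONSISTENT? are hypothetical helper names:

```scheme
;; Assumption: R5RS <initial> is a letter or one of ! $ % & * / : < = > ? ^ _ ~
(define (char-id-start? c)
  (or (char-alphabetic? c)
      (if (memv c (string->list "!$%&*/:<=>?^_~")) #t #f)))

;; Assumption: R5RS <subsequent> adds digits and + - . @
(define (char-id-extend? c)
  (or (char-id-start? c)
      (char-numeric? c)
      (if (memv c (string->list "+-.@")) #t #f)))

;; True when PRED holds for every element of LST.
(define (all? pred lst)
  (or (null? lst)
      (and (pred (car lst)) (all? pred (cdr lst)))))

;; Does S look like an identifier going by the two predicates alone?
(define (identifier-shaped? s)
  (and (> (string-length s) 0)
       (char-id-start? (string-ref s 0))
       (all? char-id-extend? (cdr (string->list s)))))

;; The property: no string is both a number and identifier-shaped.
(define (consistent? s)
  (not (and (string->number s) (identifier-shaped? s))))

(identifier-shaped? "foo->bar")   ; => #t
(identifier-shaped? "42")         ; => #f
(consistent? "42")                ; => #t
```

With that property in hand, a lexer can classify a token by its first
character and never needs to try STRING->NUMBER on something that
already scanned as an identifier.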


    > > What about case independent character ordering (e.g., CHAR-CI<? and
    > > STRING-CI<?)?  I see no compelling reason to eliminate them at this
    > > stage -- they're still useful.  I think they should be specified to be
    > > consistent with the single-character default case foldings of Unicode,
    > > where the portable character set is considered to consist of Unicode
    > > characters.  This will allow portable Scheme programs to use these
    > > procedures to write programs which accurately manipulate Scheme
    > > programs that use nothing but the portable character set.  

    > string-ci<? is fine, but must have a locale argument.  If you want to
    > have a standardly specified "default case foldings of Unicode" locale,
    > that's fine with me.  Ditto for char-ci<?.

Unicode provides roughly three classes of case
conversion/folding/matching:

	~ default length preserving -- linguistically suboptimal but
	   have useful closure properties and compatibility
	   properties

	~ default length varying -- locale independent, linguistically 
          very good.

	~ locale length varying -- locale dependent, linguistically 
          perfect wrt. a given locale.

(I suppose in theory there are also implied locale-specific,
single-character mappings -- these can be seen, for my purposes here,
as a special case of locale length varying.)

Scheme's STRING-CI<? should use the first (default length preserving)
because it is maximally upward compatible with R5RS, sufficient for
processing programs that use only the portable character set, is a
needed tool to put in the Unicode toolbox, and is the interpretation
that best preserves the simple quasi-algebraic properties relating
character and string orderings (such as one might want for
implementing a trie of identifiers).
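
To make the "quasi-algebraic" point concrete, here is a minimal sketch
of a case-insensitive string ordering built purely from a
length-preserving, per-character folding.  FOLD-CHAR (here just
CHAR-DOWNCASE) and STRING-CI-LESS? are hypothetical names, and folding
to lower case rather than upper case is an assumption of the sketch:

```scheme
;; Assumption: fold each character on its own, one character in,
;; one character out.
(define (fold-char c) (char-downcase c))

;; Lexical ordering of strings derived from the ordering of folded
;; characters.  Position I in the folded text is position I in the
;; original, which is what a trie of identifiers relies on.
(define (string-ci-less? a b)
  (let loop ((i 0))
    (cond ((= i (string-length b)) #f)   ; B exhausted: A is not smaller
          ((= i (string-length a)) #t)   ; A is a proper prefix of B
          ((char<? (fold-char (string-ref a i))
                   (fold-char (string-ref b i))) #t)
          ((char<? (fold-char (string-ref b i))
                   (fold-char (string-ref a i))) #f)
          (else (loop (+ i 1))))))

(string-ci-less? "abc" "ABD")   ; => #t
(string-ci-less? "ABC" "abc")   ; => #f
(string-ci-less? "ab" "ABC")    ; => #t
```

A length-varying folding would break the position-by-position
correspondence that the loop depends on.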

Nothing about that requirement precludes adding additional parameters
or procedures to handle the other two (or three) kinds of case mapping.


    > > What about case mappings (CHAR-UPCASE and CHAR-DOWNCASE).  Again:
    > > retain them;  specify them as using the Unicode single character
    > > mappings; permit implementations to add parameters or new procedures
    > > -- the result allows portable Scheme programs to handle portable
    > > Scheme program texts and captures a useful Unicode text process.

    > No, no, no.  Don't make functions that are known to be wrong.  This is
    > a bad idea.  It's like requiring < to work for complex numbers, and
    > then comparing magnitude, and saying "well, that's close enough".
    > It's not.

It's not like complex numbers.  Characters are, at best,
quasi-algebraic.  Numbers are algebraic.  Comparing complex numbers
that way is usually nonsensical; comparing characters this way is a
standardized text process with many uses.

Character and string orderings over the portable character set relate
on the basis of a partial ordering of characters (defined in terms of
the case of the portable characters) serving as the basis of a lexical
ordering of strings.  Regardless of any linguistic interpretation,
these are handy things to keep around for processing portable Scheme
source texts.

The Unicode extension (via single-character default case mappings) of
the partial order that applies to the portable Scheme character set is
the one that is both maximally upward compatible and the most
carefully thought-about/negotiated for approximating text processing.

A "systems programming" Scheme with full Unicode support will _need_
the default length preserving case mappings --- to talk with other
systems, if nothing else.   Any Scheme with full Unicode support and
length-varying case mappings can provide the default length preserving
mappings nearly for free.

At _most_, while we _should_ presumably be in full agreement about
what functionality should be available (all three kinds of case
mapping), we're arguing over the ridiculous question of which of those
functionalities forms like:

	(string-ci<? a b)

refer to.   The choice I'm advocating is the most upward compatible
one, by far.


    > You can case map strings, and this should certainly be allowed.  It
    > should also have a locale argument.

That functionality should be present in a good Unicode Scheme, I
agree.  My R6RS recommendations are perfectly consistent with that.


    > You cannot sensibly case-map characters except in the "unicode single
    > character mappings" locale; and why should we have special privileged
    > functions there?  It will only encourage people to *use* the
    > functions, and their code will then be non-portable precisely when it
    > matters.  

    > At the very least, make it allowed for char-upcase to simply fail to
    > give any answer, and provide a locale argument.  Or allow char-upcase
    > to return a string.

I haven't precluded char-upcase from being extended to accept an
optional locale argument, or from returning strings when that argument
is provided.

Of the behaviors one might request with a locale argument, I've picked
precisely the one defined by the Unicode standards for situations
where casemapping a character must return a character.



    > > A final note: the desirability of the -CI, -UPCASE, and -DOWNCASE
    > > procedures hinges on the assumption that the portable Scheme character
    > > set is a proper subset of Unicode.   

    > I'm assuming that (or at least, I want to make it possible), but I do
    > *not* think that char-upcase and char-downcase are good ideas.

They are valuable because they provide a simple model for processing
texts written using the portable Scheme character set and because they
can be compatibly extended to implement a standard Unicode text
process.

    > string-upcase and string-downcase, by contrast, are unobjectionable,
    > provided they get a locale argument.

Linguistic text processing is a separate matter from character-based
text processing and from processing portable Scheme source texts.

Character-based text processing is computationally useful and makes
perfectly good sense wrt. Unicode.   By non-coincidence, it is a
superset of what's needed for processing portable Scheme source texts.

Meanwhile, extensions such as FOLD-IDENTIFIER provide sufficient
mechanism for implementations and future standards to extend their
lexical syntax in linguistically sensitive ways without, at the same
time, requiring linguistic text processing facilities in the core of
Scheme.

Meanwhile, linguistic text processing facilities can be added as
libraries and extensions to standard procedures.

-t