
Re: More comments, and the ANTLR code is too complex



Mark H Weaver:
> In the interest of encouraging implementors, I'd recommend making a
> serious effort to rewrite the grammar to be as conceptually simple and
> clear as possible.

Agree.  As I replied earlier, we definitely want it as conceptually simple
and clear as possible.

Is part of the problem just the really long multiple-branch
construct of it_expr (and maybe some similar cases)?
I wrote it that way because that makes it easier to
hand-translate to a recursive-descent implementation.
But if that's confusing, it'd be trivial to split those up into smaller rules
(where there are more BNF rules, but each one is simpler).

If splitting the rules into more-and-smaller rules would help, that's important to know.

Below are my responses to the specific points.  BTW, I *really* appreciate
the specifics; they make it much easier to *act* on them!



> * APOSW, QUASIQUOTEW, UNQUOTEW, and UNQUOTE_SPLICEW are not defined.

They're defined in words above, e.g.:
"a traditional abbreviation followed by space, tab, or end-of-line is represented as
APOSW, QUASIQUOTEW, UNQUOTE_SPLICEW, and UNQUOTEW respectively."

We could define them more formally. The problem is an ANTLR limitation -
I don't think ANTLR supports look-ahead assertions once a rule has started.
If it doesn't, nothing prevents us from using the "ANTLR plus useful goodies" notation
in the SRFI if it makes things clearer.  Then we could write:

WWHITESPACE : SPACE | TAB | '\n' | '\r' ;

APOSW           : {indent_processing()}? => '\'' (?= WWHITESPACE ) ;
QUASIQUOTEW     : {indent_processing()}? => '\`'  (?= WWHITESPACE) ;
...
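
To make the intended look-ahead concrete, here is a minimal Python sketch
of the same idea (not part of the SRFI).  The non-W token names (APOS,
QUASIQUOTE, UNQUOTE_SPLICE, UNQUOTE) and the function name are assumptions
for illustration, and the indent_processing() predicate is left out.  Like
the WWHITESPACE rule above, it does not treat end of input as whitespace.

```python
import re

# Abbreviation prefixes and hypothetical base token names (assumed here;
# only the *W names appear in the discussion above).
ABBREVIATIONS = [
    (",@", "UNQUOTE_SPLICE"),  # must be tried before the shorter ","
    ("'", "APOS"),
    ("`", "QUASIQUOTE"),
    (",", "UNQUOTE"),
]

# Zero-width look-ahead: whitespace must follow, but it is not consumed.
FOLLOWED_BY_WHITESPACE = re.compile(r"(?=[ \t\r\n])")


def abbreviation_token(text, pos=0):
    """Return (token_name, next_pos) for an abbreviation at pos, else None."""
    for prefix, name in ABBREVIATIONS:
        if text.startswith(prefix, pos):
            end = pos + len(prefix)
            if FOLLOWED_BY_WHITESPACE.match(text, end):
                return name + "W", end  # e.g. APOSW
            return name, end
    return None
```

The whitespace is only inspected, never consumed, so whatever rule handles
the following space, tab, or end-of-line still sees it.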


> * Inconsistent syntax is used within {} in the ANTLR.  In most places
>   standard Scheme syntax is used, but in 'collecting_tail', the syntax
>   is more like C.

As I mentioned in my previous post, that's my fault, sorry.
Basically, my translator... didn't.  We can fix that.  Let me know of any
other places where you noticed a screw-up, and we'll fix it.

BTW, I really appreciate you catching that.  That's what peer review is all about.

> * Why are the action rules in 'n_expr' simply expressions that refer to
>   values such as '$n1', but the action rules of 'collecting_tail' are
>   instead assignment statements that refer to values such as '$more.v'?

Same problem, a screwed-up automated translator.

> * Why is there special handling of (FF | VT)+ EOL ?

The FF|VT line is to allow lines with formfeeds or vertical tabs... but only if
they are the ONLY things on the line.  It'd be confusing if they were combined
with other things on a line.

The collecting_tail and it_expr rules are separate because they are ended
differently (it_expr is ended by a blank line, while collecting_tail is ended with "*>").
Since FF|VT lines are handled the same way, the rule shows up both places.
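
As a small illustration of the "only things on the line" condition, here
is a Python sketch (the function name is made up for this example, and the
line is assumed to be passed without its end-of-line sequence):

```python
import re

# One or more form feed / vertical tab characters and nothing else.
FF_VT_ONLY = re.compile(r"[\f\v]+")


def is_ff_vt_line(line):
    """True if line consists solely of FF and/or VT characters."""
    return FF_VT_ONLY.fullmatch(line) is not None
```

A line that mixes FF or VT with anything else, even a space, does not
match, which is the case described above as confusing.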

> * What does 'isperiodp' do exactly?  What if the datum really is "." or
>   the symbol whose name is a single period? (written #{.}# in Guile).

Handling "." is always a challenge; it sounds like we need to define
isperiodp more carefully.  "isperiodp" is only true when you have EXACTLY a
period character, not preceded by any other characters and not followed by
additional symbol characters.  Thus, isperiodp would return FALSE for |.| or #{.}#.

The goal is to ensure that:
aa1
! bb
! .
! cc
=> (aa1 bb . cc), an improper list.

while this is also true:
aa2
! bb
! |.|
! cc
=> (aa2 bb |.| cc), a proper list.
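
A minimal Python sketch of that distinction follows.  It assumes the
reader keeps the raw source text of each datum; the names is_period and
combine are made up for this example, and only a "." in the
second-to-last position is handled.

```python
def is_period(raw_token):
    """True only for a bare period; |.| and #{.}# are ordinary symbols."""
    return raw_token == "."


def combine(raw_tokens):
    """Return (items, tail); tail is None when the list is proper."""
    if len(raw_tokens) >= 3 and is_period(raw_tokens[-2]):
        # "x ... . y": y becomes the tail of an improper list.
        return list(raw_tokens[:-2]), raw_tokens[-1]
    return list(raw_tokens), None
```

With this, ["aa1", "bb", ".", "cc"] yields items aa1 bb with tail cc,
while ["aa2", "bb", "|.|", "cc"] stays a proper four-element list.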


> * The non-terminals 'body' and 'it_expr' use the symbol 'same' even
>   though the text implies that no extra symbol is generated by the
>   preprocessing step in that case.  Where does 'same' come from?

It comes from the paragraph beginning "Indentation is not directly represented
in the following syntax definition.".  But I think you're right, it's not well-defined.

In the FULL ANTLR grammar it's defined as:
same  : ;  // Emphasizes where neither indent nor dedent has occurred

Transitions from one line to the next, with the same indentation,
don't actually generate a token.  I added the "same" non-token
as a comment to make it clear that's what is happening.
This also makes it easy to implement in a recursive descent parser.
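
Here is a rough Python sketch of that preprocessing step, to show where
"same" produces nothing.  It is a simplification under stated assumptions:
the input is just the sequence of per-line indentation strings, the
function name is made up, and blank lines, "!" handling, and initial
indent are all ignored.

```python
def indentation_tokens(indents):
    """Return the INDENT/DEDENT tokens for a sequence of indentation strings."""
    stack = [""]  # indentation strings of the enclosing levels
    tokens = []
    for indent in indents:
        if indent == stack[-1]:
            continue  # "same": no token is generated at all
        if indent.startswith(stack[-1]):
            stack.append(indent)
            tokens.append("INDENT")
            continue
        # Shorter indentation: pop levels until we find a matching one.
        while stack[-1] != indent:
            if len(stack) == 1 or not stack[-1].startswith(indent):
                raise ValueError("inconsistent indentation: %r" % indent)
            stack.pop()
            tokens.append("DEDENT")
    return tokens
```

For the indentations "", "  ", "  ", "    ", "" this returns INDENT,
INDENT, DEDENT, DEDENT; the repeated "  " line contributes no token.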

> And here are some comments about the tutorial:
> 
> * "Scheme's datum comments (#;datum) comment out the next neoteric
>   expression, not the next sweet expression (and please don't follow the
>   semicolon with whitespace)."
> 
>    I often put "#;" on the preceding line, which you're now asking me
>    not to do.  What is the purpose of this request?  Also, "#;" becomes
>    much less useful if it cannot comment out an entire sweet expression.
>    Perhaps "#;" should have a similar rule as the traditional
>    abbreviations: if it is followed by whitespace, then the following
>    /sweet expression/ is ignored, otherwise the following /neoteric
>    expression/ is ignored.  What do you think?

The "please don't follow" text was because I wanted to *reserve*
"#; WHITESPACE" to comment out sweet-expressions, without
requiring their implementation.

Sounds like maybe we should just implement them and require them.
I didn't expect that to be particularly *useful*, but I don't know of a
reason we can't include them.


> * I'd like to see a few more examples for improper lists, such as:
> 
>      f
>        a .
>        b

Currently that's not an improper list.
Since the "." is followed by EOL, the presumption is that this
CAN'T be an improper list (nothing follows!), so it's interpreted as
(in guile syntax):

(f (a #{.}#) b)

>      f
>        a b
>        . c

That would also not be an improper list, but instead:
(f (a b) c)


In neoteric-expressions, (. x) maps to "x", so that constructs like
"port(. options)" make sense.  We mapped "." at the beginning of a line,
but followed by something else on that line, to have the same semantics.

That doesn't mean they MUST have these semantics, discussion welcome,
but we were trying to be consistent...!

> * In the tutorial, I found the examples of $ (SUBLIST) a bit confusing:
> 
>     a b $ c d          ==>   (a b (c d))
> 
>     a b $ c d e f $ g  ==>   (a b (c d e f g))
>                              ; Not (a b (c d e f (g)))
> 
>    This leaves me uncertain of whether the second case is somehow
>    caused by two $'s on one line, or because there's only one item
>    after the $.  I'd like to see an example like "a b $ c" or
>    "a b $ c d e $ f g" to clarify.

Great point.
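
As a rough illustration (a Python sketch, not part of the SRFI), both
tutorial examples quoted above fall out of one reading of the rule:
everything after a $ is parsed as its own expression and becomes the last
element of the list to its left, and a lone item stands for itself rather
than being wrapped.  The function name and the flat token-list input are
assumptions for this sketch; it handles a single line only, with no
parentheses or indentation.

```python
def sublist(tokens):
    """Turn a flat token list containing "$" markers into nested lists."""
    if not tokens:
        raise ValueError("expected at least one token")
    if "$" not in tokens:
        # A lone item stands for itself; several items form a list.
        return tokens[0] if len(tokens) == 1 else list(tokens)
    i = tokens.index("$")  # split at the FIRST marker
    head, rest = tokens[:i], tokens[i + 1:]
    # The whole right-hand side becomes one final element of head.
    return list(head) + [sublist(rest)]
```

Under the same reading, the sketch gives (a b c) for "a b $ c" and
(a b (c d e (f g))) for "a b $ c d e $ f g".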

> * "A sweet-expression reader MUST support three modes: indentation
>   processing, enclosed, and initial indent."
>   [...]
>   "A marker MUST only have its special meaning when indentation
>   processing is enabled,"
> 
>    This sounds as if "*>" MUST not be recognized, because the reader
>    will be in "enclosed" mode at that point, no?

No, not when indentation processing is enabled.
"Enclosed" turns on only inside (...), [...], or {...}.  Once you
close all matching parens, you're normally back to indent processing.
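
A simplified Python sketch of that bookkeeping is below.  The function
names are made up, and it deliberately ignores strings, character
literals, comments, matching of bracket types, and initial-indent mode.

```python
def nesting_depth(text):
    """Count the brackets still open after reading text."""
    depth = 0
    for ch in text:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            if depth == 0:
                raise ValueError("closing bracket without an opener")
            depth -= 1
    return depth


def mode_after(text):
    """Name the mode in effect once text has been read."""
    if nesting_depth(text) > 0:
        return "enclosed"
    return "indentation processing"
```

So after reading "<* f(a b) " every paren is closed again, the mode is
indentation processing, and a following "*>" is recognized as a marker.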


> * "2. If top is the empty string and the indentation length is nonzero,
>    symbol INITIAL_INDENT is generated and the reader changes to initial
>    indent mode. When an end-of-line sequence is reached the mode changes
>    back to indentation processing."
> 
>    If the reader was in "enclosed" mode, then presumably the mode
>    should not change back to indentation processing, right?

Yes, correct.  All that indentation processing stuff is ignored
once you're in enclosed mode.

Hm, that clearly needs clarifying, thanks.


> * "1. If an end-of-line sequence immediately follows the indentation and
>       the indentation length is nonzero:
>        a. If the indentation contains "!", it is ignored; an
>           implementation MUST consume the end-of-line sequence and start
>           applying these rules again, from the beginning, with the next
>           line.
>        b. If the indentation does not contain "!", it is considered a
>           line with no characters (thus indentation has zero length) and
>           the rest of these rules are applied."
> 
>    I vaguely recall that the distinction here was going to be removed
>    as a simplification of the rules.  Was that idea scrapped?

Yes.  Originally any all-indent-char lines were treated the same.

But recently there were several posted examples where it was pointed out that
a line containing only indent chars, but at least one "!", was CLEARLY
distinct from a blank line and it seemed odd to treat them the same way.
The goal of the new rule is to "do what I expect it to do" even if the
rule is more complicated.
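
In code the extra rule is small.  This Python sketch assumes the
indentation characters are space, tab, and "!", and that each line is
passed without its end-of-line sequence; the function name and the three
result labels are made up for illustration.

```python
INDENT_CHARS = " \t!"  # assumed set of indentation characters


def classify_line(line):
    """Classify a line as "blank", "ignored", or "content"."""
    if line == "":
        return "blank"
    if line.strip(INDENT_CHARS) != "":
        return "content"  # something other than indentation is present
    # Indentation only: a "!" makes the line ignored instead of blank.
    return "ignored" if "!" in line else "blank"
```

An "ignored" line is skipped and the rules restart on the next line, while
a "blank" line is treated as a line with no characters.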


> * "A marker MUST only have its special meaning when indentation
>   processing is enabled, it is preceded by indentation or hspace, it is
>   followed by an hspace or end-of-line, and when it starts with the
>   character shown (e.g., neither |$| nor '$ contains a marker)."
> 
>    The last clause here, "when it starts with the character shown", is
>    poorly worded IMO, and redundant with the requirement that "it is
>    preceded by indentation or hspace".

It's not redundant, but it may very well be poorly worded.
The point is that |$| is not a marker, nor is #{$}#.  The "$" marker
MUST begin with the character "$".
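
To show how the conditions fit together, here is a Python sketch that
checks them against the raw characters of one line.  It assumes hspace is
just space or tab, that indentation characters are space, tab, and "!",
and that indentation processing is already known to be enabled; the
function name is made up.

```python
INDENT_CHARS = " \t!"  # assumed indentation characters
HSPACE = " \t"         # assumed horizontal whitespace


def is_marker_at(line, pos, marker="$"):
    """True if marker at line[pos:] satisfies the conditions quoted above."""
    if not line.startswith(marker, pos):
        return False  # must begin with the marker's own character
    before = line[:pos]
    # Preceded by indentation (possibly empty) or by hspace.
    preceded = before.strip(INDENT_CHARS) == "" or before[-1] in HSPACE
    end = pos + len(marker)
    # Followed by hspace or by the end of the line.
    followed = end == len(line) or line[end] in HSPACE
    return preceded and followed
```

Because the check is on the source characters, the "$" inside |$| or '$
fails the "preceded by" test, so neither is treated as a marker.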

Thanks for the critique!  It may be a few days before I can do very much
with this (real life and all that), but I'll definitely take this to heart.

--- David A. Wheeler