LOOKAHEAD Clarifications
                     Ken Beesley
              Xerox Research Centre Europe


Notational conventions that I've tried to follow:

        "LOOKAHEAD" refers to explicit, user-defined LOOKAHEAD()

        "lookahead" refers to explicit or implicit lookahead

        LOOKAHEAD can involve syntactic and/or semantic lookahead

        "syntactic lookahead" can be either explicit or implicit

        "semantic LOOKAHEAD" is always explicit, indicated in
                a LOOKAHEAD() specification


1.  Embedded _syntactic_ LOOKAHEAD specifications are ignored.

That is, if your parser is already at a choice
point, and is in the process of performing syntactic
lookahead (either explicit or implicit) to see which choice to follow,
and that lookahead comes to a another choice point, itself marked
with explicit syntactic LOOKAHEAD, that looks something like this,

        ...
        LOOKAHEAD ( <X> <Y> <Z> )
        Xpath ()
      |
        LOOKAHEAD ( <X> <M> <Q> )
        XPrimePath()
        ...

then these "embedded" syntactic LOOKAHEADs will be ignored.

2.  Syntactic lookahead can be initiated in two ways:

    a.  If your grammar has an explicit syntactic
        LOOKAHEAD specification, e.g.

        LOOKAHEAD ( Xexpansion() )
        Xexpansion()

        then any syntactic LOOKAHEADs inside Xexpansion()
        will be ignored during the lookahead process.

    b.  Your grammar may also initiate implicit syntactic
        lookaheads of 1 token, in particular at choice points
        involving the (...)?, (...)* and (...)+  constructions, e.g.

        ( Yproduction() )?   Zproduction()

        Here the parser is at a choice point: the choice is
        either to start parsing a Yproduction() or to skip it
        and proceed to parse the next token or production.

        When faced with such optional elements, JavaCC performs a default
        syntactic lookahead of 1 TOKEN to see if that next token is
        a possible beginning of a Yproduction().
        If the next token could indeed be the start of a Yproduction(),
        then the parser enters the parentheses and commits
        itself to parsing a Yproduction(); this implicit syntactic lookahead
        looks only at one token--any embedded syntactic
        LOOKAHEADs inside Yproduction() are ignored.

        *********

        ( Yproduction() )*  Zproduction()

        This is like the case above, except that the parser will
        repeatedly do a 1-token syntactic lookahead before committing
        itself to parsing a(nother) Yproduction().  Any embedded syntactic
        LOOKAHEADs inside Yproduction() will be ignored during this
        lookahead.

        *********

        ( Yproduction() )+   Zproduction()

        Like the case above, except that the parser will always
        commit itself to parsing a first Yproduction(),
        and then it repeatedly faces the usual choice: either
        parse another Yproduction() or move on.  So  ( Yproduction() )+
        is equivalent to

        Yproduction()  ( Yproduction() )*

        Again, whenever such implicit syntactic lookahead is
        performed, e.g. to see if the parser should parse
        another Yproduction(), it is a default syntactic
        lookahead of 1-token that is performed; and any embedded
        syntactic LOOKAHEADs inside Yproduction() are ignored.


        [ Of course, inside (...)?, (...)*, and (...)+, you
        can also specify explicit LOOKAHEAD of more than the
        default 1 token, e.g.

                (LOOKAHEAD(2) Yproduction())*

        This overrides the default 1-token lookahead depth,
        but any embedded syntactic LOOKAHEADs within Yproduction()
        are ignored as before.
        ]


3.  _Semantic_ lookaheads are always performed, even if they are
embedded.

If, for some reason, you _really_ need to have
lookahead that is always performed, even when embedded, then it must
be expressed as pure semantic lookahead.  Mixing syntactic lookahead
and semantic lookahead is always a bit suspicious; so make sure that
you really need it before trying the following trick.

For example, suppose that the intent of the following LOOKAHEAD
example is to succeed only if the next tokens are <A> and <B> and <C>,
AND if the <C> token starts with the upper-case letter 'A'.

        // dangerous mix of syntactic and semantic lookahead; when
        // embedded, the syntactic lookahead will be ignored, but
        // the semantic lookahead will still be performed
        LOOKAHEAD ( <A> <B> <C>, { getToken(3).image.charAt(0) == 'A' } )
        Mproduction()

[In my own project, I had a similar case that involved looking
up the third token in a symbol table.]

If this LOOKAHEAD were embedded, then the syntactic lookahead part
would be ignored, but the semantic lookahead would still be performed,
insisting on seeing the third token ahead, and testing the .image of that
third token, whatever that token might be.  Once the syntactic lookahead
is ignored, there is nothing to constrain that this third token be a
C, or that the previous tokens be A and B.
But the intent here is to first make sure that the syntax (the indicated
tokens) is there,
and then perform an addition semantic check on that syntax.  So
if such lookahead is _really_ necessary, even when embedded, it
should be recast as pure semantic lookahead like this:

        LOOKAHEAD ( {  getToken(1).kind == A
                    && getToken(2).kind == B
                    && getToken(3).kind == C
                    && getToken(2).image.charAt(0) == 'A'
                    }
                  )
        Mproduction()

This lookahead, even when embedded, will first make sure that
the next three tokens are of type A, B, and C, in that order,
and then it will perform the additional semantic check.  The
AND (&&) expression if performed from beginning to end and
fails (shortstops) as soon as any one of the conjuncts
fails.

Again, mixing syntactic and semantic lookahead is always a bit
suspicious.  [I used it in a project that depends heavily on
sophisticated lookahead to distinguish two completely different
kinds of expressions, regular expressions and arithmetic
expressions, in the same language.  That's a highly unusual
requirement.]


4.  Syntactic lookahead, unlike the parsing itself, will backtrack
as necessary, trying to succeed.

Mind Tuning:

        a.  The parser itself does not backtrack.  Once the
                parser has committed itself to parsing
                something, it plows ahead and either succeeds
                or fails, with failure generating an error.

        b.  The Whole Point of LOOKAHEAD (syntactic and/or semantic)
                is to tell the parser, at choice points, which
                path to commit to.

        c.  While the parser itself does not backtrack, syntactic
                LOOKAHEAD itself _does_ backtrack if necessary,
                when looking ahead for justification to commit
                the parser to a particular choice.  (Syntactic
                lookahead does backtrack, if necessary, so that
                the parser never has to backtrack.)

                So, for example, when the parser is at a top-level
                choice point such as

                ...
                LOOKAHEAD ( Foo() )
                Foo()
              |
                Bar()
                ...

                And the Foo() expansion itself looks like this,
                with two possible paths (and so a choice)

                void Foo() :
                {}
                {
                   LOOKAHEAD(3)
                   <A> <B> <M>       // first choice
                |
                   <A> <B> <C>       // second choice
                }

                Then the top-level syntactic LOOKAHEAD is effectively
                peeking ahead at the token stream to see
                if the parser should commit itself to parsing a Foo().
                As always, the syntactic LOOKAHEAD in Foo(),
                here indicating 3 tokens, will be ignored when it
                is embedded inside another lookahead.
                If the next tokens are in fact <A>, <B> and <C>,
                then the lookahead will effectively fail
                to match the first choice inside Foo() and
                will backtrack to succeed on the second choice.
                Then with the (backtracking) syntactic LOOKAHEAD
                satisfied, the non-backtracking parser will
                commit itself to parsing a Foo().


5.  Even though any syntactic LOOKAHEAD inside the Foo()
expansion just above would be ignored when it is embedded
inside any higher-level syntactic lookahead, the
JavaCC parser recognizes that the Foo() expansion itself
has a choice point, needing LOOKAHEAD when the actual
parsing is done.  If you don't include the LOOKAHEAD() inside
Foo(), JavaCC will complain and suggest that you add it.

[When lookahead succeeds, and tells the parser to commit
to a certain path, the lookahead itself may have explored
several ramifications of that path before succeeding.
However, it has no way of telling the parser which particular
ramifications to follow; the parser has to find its own
way down the choice path.]

6.  In some grammars, e.g. examples/JavaGrammars/Java1.0.2.jj,
you find (nice) examples like the following

void ClassBodyDeclaration() :
{}
{
  LOOKAHEAD(2)
  StaticInitializer()
|
  LOOKAHEAD( [ "public" | "protected" | "private" ] Name() "(" )
  ConstructorDeclaration()
|
  LOOKAHEAD( MethodDeclarationLookahead() )
  MethodDeclaration()
|
  FieldDeclaration()
}

// This production is to determine lookahead only.
void MethodDeclarationLookahead() :
{}
{
  ( "public" | "protected" | "private" | "static" | "abstract" | "final" |
"native" | "synchronized" )*
  ResultType() <IDENTIFIER> "("
}

where one production invokes a dedicated "lookahead production"

        LOOKAHEAD( MethodDeclarationLookahead() )

where MethodDeclarationLookahead() is designed and
used _only_ for lookahead,
never for actual parsing.  This can be an effective way of
making syntactic lookahead more efficient when dealing with
complex syntactic structures.  If you performed

        LOOKAHEAD( MethodDeclaration() )

(which is completely legal)
then LOOKAHEAD would have to confirm that an entire MethodDeclaration()
could be parsed before actually starting the parsing.  That could
involve looking ahead to examine an arbitarily large number of tokens,
which could of course be very inefficient.  In contrast,
the dedicated lookahead production MethodDeclarationLookahead() can
be written, as in this case, to succeed as soon as it sees some
token(s) that indicate unambiguously that a whole MethodDeclaration()
_should_ be parsable from the choice point.  (For efficiency,
lookahead should examine as few tokens as possible.)  And once you
know that a MethodDeclaration() _should_ follow, then you commit
the parser.

If they are used as intended, such dedicated lookahead productions will
always be "embedded" in an upper-level LOOKAHEAD, and so any
syntactic LOOKAHEADs inside the dedicated lookahead production will
be ignored.

However, JavaCC has no way of knowing that a dedicated lookahead
production is to be used only for LOOKAHEAD; it is named ...Lookahead()
only as a helpful mnemonic for the programmer.  In cases like
the following, from the same grammar

void UnaryExpressionNotPlusMinus() :
{}
{
  ( "~" | "!" ) UnaryExpression()
|
  LOOKAHEAD( CastLookahead() )
  CastExpression()
|
  PostfixExpression()
}

// This production is to determine lookahead only.  The LOOKAHEAD
specifications
// below are not used, but they are there just to indicate that we know
about
// this.
void CastLookahead() :
{}
{
  LOOKAHEAD(2)
  "(" PrimitiveType()
|
  LOOKAHEAD("(" Name() "[")
  "(" Name() "[" "]"
|
  "(" Name() ")" ( "~" | "!" | "(" | <IDENTIFIER> | "this" | "super" | "new"
|
Literal() )
}

the production CastLookahead() is in fact used only for lookahead, and
the syntactic LOOKAHEADs inside CastLookahead() will therefore
always be ignored; syntactic lookahead will explore the three
choices, backtracking as necessary, before succeeding or
failing.
But JavaCC will handle the CastLookahead() production like any
other, assuming that it might be used for parsing.  So if you
leave out the syntactic LOOKAHEADs shown in the CastLookahead()
production, JavaCC will complain that it can't resolve the _parsing_
choice on the basis of the default 1-token lookahead (all the
choices start with "(" ), and it will suggest that you specify
LOOKAHEAD of 2 or more tokens.  The comment in this example
reflects this behavior--specifying the LOOKAHEADs overtly
"indicates [to JavaCC] that we know about" the choice, and
specifying the LOOKAHEADs overtly therefore suppresses the
JavaCC warning messages.

So in cases like this, in dedicated lookahead productions, you
may want to specify syntactic LOOKAHEADs even if you know
perfectly well that they will always be ignored.

*************** End of Documentation ***********************