LOOKAHEAD Clarifications Ken Beesley Xerox Research Centre Europe Notational conventions that I've tried to follow: "LOOKAHEAD" refers to explicit, user-defined LOOKAHEAD() "lookahead" refers to explicit or implicit lookahead LOOKAHEAD can involve syntactic and/or semantic lookahead "syntactic lookahead" can be either explicit or implicit "semantic LOOKAHEAD" is always explicit, indicated in a LOOKAHEAD() specification 1. Embedded _syntactic_ LOOKAHEAD specifications are ignored. That is, if your parser is already at a choice point, and is in the process of performing syntactic lookahead (either explicit or implicit) to see which choice to follow, and that lookahead comes to a another choice point, itself marked with explicit syntactic LOOKAHEAD, that looks something like this, ... LOOKAHEAD ( ) Xpath () | LOOKAHEAD ( ) XPrimePath() ... then these "embedded" syntactic LOOKAHEADs will be ignored. 2. Syntactic lookahead can be initiated in two ways: a. If your grammar has an explicit syntactic LOOKAHEAD specification, e.g. LOOKAHEAD ( Xexpansion() ) Xexpansion() then any syntactic LOOKAHEADs inside Xexpansion() will be ignored during the lookahead process. b. Your grammar may also initiate implicit syntactic lookaheads of 1 token, in particular at choice points involving the (...)?, (...)* and (...)+ constructions, e.g. ( Yproduction() )? Zproduction() Here the parser is at a choice point: the choice is either to start parsing a Yproduction() or to skip it and proceed to parse the next token or production. When faced with such optional elements, JavaCC performs a default syntactic lookahead of 1 TOKEN to see if that next token is a possible beginning of a Yproduction(). If the next token could indeed be the start of a Yproduction(), then the parser enters the parentheses and commits itself to parsing a Yproduction(); this implicit syntactic lookahead looks only at one token--any embedded syntactic LOOKAHEADs inside Yproduction() are ignored. ********* ( Yproduction() )* Zproduction() This is like the case above, except that the parser will repeatedly do a 1-token syntactic lookahead before committing itself to parsing a(nother) Yproduction(). Any embedded syntactic LOOKAHEADs inside Yproduction() will be ignored during this lookahead. ********* ( Yproduction() )+ Zproduction() Like the case above, except that the parser will always commit itself to parsing a first Yproduction(), and then it repeatedly faces the usual choice: either parse another Yproduction() or move on. So ( Yproduction() )+ is equivalent to Yproduction() ( Yproduction() )* Again, whenever such implicit syntactic lookahead is performed, e.g. to see if the parser should parse another Yproduction(), it is a default syntactic lookahead of 1-token that is performed; and any embedded syntactic LOOKAHEADs inside Yproduction() are ignored. [ Of course, inside (...)?, (...)*, and (...)+, you can also specify explicit LOOKAHEAD of more than the default 1 token, e.g. (LOOKAHEAD(2) Yproduction())* This overrides the default 1-token lookahead depth, but any embedded syntactic LOOKAHEADs within Yproduction() are ignored as before. ] 3. _Semantic_ lookaheads are always performed, even if they are embedded. If, for some reason, you _really_ need to have lookahead that is always performed, even when embedded, then it must be expressed as pure semantic lookahead. Mixing syntactic lookahead and semantic lookahead is always a bit suspicious; so make sure that you really need it before trying the following trick. For example, suppose that the intent of the following LOOKAHEAD example is to succeed only if the next tokens are and and , AND if the token starts with the upper-case letter 'A'. // dangerous mix of syntactic and semantic lookahead; when // embedded, the syntactic lookahead will be ignored, but // the semantic lookahead will still be performed LOOKAHEAD ( , { getToken(3).image.charAt(0) == 'A' } ) Mproduction() [In my own project, I had a similar case that involved looking up the third token in a symbol table.] If this LOOKAHEAD were embedded, then the syntactic lookahead part would be ignored, but the semantic lookahead would still be performed, insisting on seeing the third token ahead, and testing the .image of that third token, whatever that token might be. Once the syntactic lookahead is ignored, there is nothing to constrain that this third token be a C, or that the previous tokens be A and B. But the intent here is to first make sure that the syntax (the indicated tokens) is there, and then perform an addition semantic check on that syntax. So if such lookahead is _really_ necessary, even when embedded, it should be recast as pure semantic lookahead like this: LOOKAHEAD ( { getToken(1).kind == A && getToken(2).kind == B && getToken(3).kind == C && getToken(2).image.charAt(0) == 'A' } ) Mproduction() This lookahead, even when embedded, will first make sure that the next three tokens are of type A, B, and C, in that order, and then it will perform the additional semantic check. The AND (&&) expression if performed from beginning to end and fails (shortstops) as soon as any one of the conjuncts fails. Again, mixing syntactic and semantic lookahead is always a bit suspicious. [I used it in a project that depends heavily on sophisticated lookahead to distinguish two completely different kinds of expressions, regular expressions and arithmetic expressions, in the same language. That's a highly unusual requirement.] 4. Syntactic lookahead, unlike the parsing itself, will backtrack as necessary, trying to succeed. Mind Tuning: a. The parser itself does not backtrack. Once the parser has committed itself to parsing something, it plows ahead and either succeeds or fails, with failure generating an error. b. The Whole Point of LOOKAHEAD (syntactic and/or semantic) is to tell the parser, at choice points, which path to commit to. c. While the parser itself does not backtrack, syntactic LOOKAHEAD itself _does_ backtrack if necessary, when looking ahead for justification to commit the parser to a particular choice. (Syntactic lookahead does backtrack, if necessary, so that the parser never has to backtrack.) So, for example, when the parser is at a top-level choice point such as ... LOOKAHEAD ( Foo() ) Foo() | Bar() ... And the Foo() expansion itself looks like this, with two possible paths (and so a choice) void Foo() : {} { LOOKAHEAD(3) // first choice | // second choice } Then the top-level syntactic LOOKAHEAD is effectively peeking ahead at the token stream to see if the parser should commit itself to parsing a Foo(). As always, the syntactic LOOKAHEAD in Foo(), here indicating 3 tokens, will be ignored when it is embedded inside another lookahead. If the next tokens are in fact , and , then the lookahead will effectively fail to match the first choice inside Foo() and will backtrack to succeed on the second choice. Then with the (backtracking) syntactic LOOKAHEAD satisfied, the non-backtracking parser will commit itself to parsing a Foo(). 5. Even though any syntactic LOOKAHEAD inside the Foo() expansion just above would be ignored when it is embedded inside any higher-level syntactic lookahead, the JavaCC parser recognizes that the Foo() expansion itself has a choice point, needing LOOKAHEAD when the actual parsing is done. If you don't include the LOOKAHEAD() inside Foo(), JavaCC will complain and suggest that you add it. [When lookahead succeeds, and tells the parser to commit to a certain path, the lookahead itself may have explored several ramifications of that path before succeeding. However, it has no way of telling the parser which particular ramifications to follow; the parser has to find its own way down the choice path.] 6. In some grammars, e.g. examples/JavaGrammars/Java1.0.2.jj, you find (nice) examples like the following void ClassBodyDeclaration() : {} { LOOKAHEAD(2) StaticInitializer() | LOOKAHEAD( [ "public" | "protected" | "private" ] Name() "(" ) ConstructorDeclaration() | LOOKAHEAD( MethodDeclarationLookahead() ) MethodDeclaration() | FieldDeclaration() } // This production is to determine lookahead only. void MethodDeclarationLookahead() : {} { ( "public" | "protected" | "private" | "static" | "abstract" | "final" | "native" | "synchronized" )* ResultType() "(" } where one production invokes a dedicated "lookahead production" LOOKAHEAD( MethodDeclarationLookahead() ) where MethodDeclarationLookahead() is designed and used _only_ for lookahead, never for actual parsing. This can be an effective way of making syntactic lookahead more efficient when dealing with complex syntactic structures. If you performed LOOKAHEAD( MethodDeclaration() ) (which is completely legal) then LOOKAHEAD would have to confirm that an entire MethodDeclaration() could be parsed before actually starting the parsing. That could involve looking ahead to examine an arbitarily large number of tokens, which could of course be very inefficient. In contrast, the dedicated lookahead production MethodDeclarationLookahead() can be written, as in this case, to succeed as soon as it sees some token(s) that indicate unambiguously that a whole MethodDeclaration() _should_ be parsable from the choice point. (For efficiency, lookahead should examine as few tokens as possible.) And once you know that a MethodDeclaration() _should_ follow, then you commit the parser. If they are used as intended, such dedicated lookahead productions will always be "embedded" in an upper-level LOOKAHEAD, and so any syntactic LOOKAHEADs inside the dedicated lookahead production will be ignored. However, JavaCC has no way of knowing that a dedicated lookahead production is to be used only for LOOKAHEAD; it is named ...Lookahead() only as a helpful mnemonic for the programmer. In cases like the following, from the same grammar void UnaryExpressionNotPlusMinus() : {} { ( "~" | "!" ) UnaryExpression() | LOOKAHEAD( CastLookahead() ) CastExpression() | PostfixExpression() } // This production is to determine lookahead only. The LOOKAHEAD specifications // below are not used, but they are there just to indicate that we know about // this. void CastLookahead() : {} { LOOKAHEAD(2) "(" PrimitiveType() | LOOKAHEAD("(" Name() "[") "(" Name() "[" "]" | "(" Name() ")" ( "~" | "!" | "(" | | "this" | "super" | "new" | Literal() ) } the production CastLookahead() is in fact used only for lookahead, and the syntactic LOOKAHEADs inside CastLookahead() will therefore always be ignored; syntactic lookahead will explore the three choices, backtracking as necessary, before succeeding or failing. But JavaCC will handle the CastLookahead() production like any other, assuming that it might be used for parsing. So if you leave out the syntactic LOOKAHEADs shown in the CastLookahead() production, JavaCC will complain that it can't resolve the _parsing_ choice on the basis of the default 1-token lookahead (all the choices start with "(" ), and it will suggest that you specify LOOKAHEAD of 2 or more tokens. The comment in this example reflects this behavior--specifying the LOOKAHEADs overtly "indicates [to JavaCC] that we know about" the choice, and specifying the LOOKAHEADs overtly therefore suppresses the JavaCC warning messages. So in cases like this, in dedicated lookahead productions, you may want to specify syntactic LOOKAHEADs even if you know perfectly well that they will always be ignored. *************** End of Documentation ***********************