Theodore S. Norvell (C) 2016
An earlier article by me explored a number of approaches to recursive descent parsing of expressions, including a particularly simple, efficient, and flexible algorithm invented (as nearly as I can tell) by Martin Richards [7], which I called "precedence climbing".
Recently it was pointed out to me by Andy Chu that precedence climbing is really a special case of Pratt parsing, an algorithm (or approach) invented by Vaughan Pratt [6] in the early 70s. Andy wrote a blog post about it [1]. He goes as far as to say they are the same algorithm; I'd prefer to say that Pratt parsing is a generalization of precedence climbing.
The purpose of this post is to explore this connection by starting with precedence climbing and refactoring it to use the command pattern until we arrive at a Pratt parser. We can then expand the language that can be parsed; this expansion is extremely modular in the sense that we can add new syntactic constructs just by adding a few entries to tables.
For background, read my earlier article on expression parsing [5], especially the part on precedence climbing.
If your purpose is to learn about Pratt parsing, this post might or might not be for you.
This section is a quick review. For a full explanation see my earlier article. We'll start with an ambiguous left-recursive grammar.
S   --> E0 end
E0  --> E10
E10 --> E20 | E20 "=" E20
E20 --> E30 | E20 "+" E30 | E20 "-" E30
E30 --> E40 | E30 "*" E40 | E30 "/" E40
E40 --> E50 | E40 "!"
E50 --> E60 | E60 "^" E50
E60 --> P
P   --> "-" E30 | "(" E0 ")" | v
We have (in order of increasing precedence)
- a nonassociative binary operator "=",
- left associative binary operators "+" and "-",
- left associative binary operators "*" and "/",
- a postfix operator "!", and
- a right associative binary operator "^".
There is also a prefix operator "-", whose precedence is discussed below.
The set of tokens includes "=", "+", "-", "*", "/", "!", "^", "(", ")", and a set of variables represented by the symbol v in the grammar. A special token "end" marks the end of the input.
To see that the grammar is ambiguous, notice that the string -a*b can be derived in two ways. The ambiguity is resolved by saying that -, as a prefix operator, should have lower precedence than the binary operators *, /, and ^ and the postfix operator !. For example -a*b should mean the same as -(a*b), not (-a)*b. Unlike the precedence between binary operators, this precedence is not fully captured by the grammar. The reason is that I don't know a nice way to do that.
Note that the postfix operator ! has lower precedence than the binary operator ^, so that "a^b!" means the same as "(a^b)!" and "a! ^ b" is not an expression. We'll look at how to add expressions like "a! ^b" to the language later.
The interface to the lexical analyser is via two procedures: next returns the next token as a string, and consume advances the lexical analyser so that next returns the subsequent token. When there are no more tokens, next will return the special token value "end"; further calls to consume have no effect on next.
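For concreteness, here is one way such a lexical analyser might look in Python. This is only a sketch under my own assumptions: the class name, the regular expression (which treats identifiers as tokens and every other non-space character as a one-character token), and the handling of the "end" sentinel are choices made for the sketch, not something the algorithm depends on.

import re

class Lexer:
    # Tokens are identifiers or single other non-space characters.
    TOKEN = re.compile(r"\s*([A-Za-z_]\w*|[^\sA-Za-z_])")

    def __init__(self, text):
        self.tokens = self.TOKEN.findall(text) + ["end"]
        self.pos = 0

    def next(self):
        # Return the current token without advancing; "end" once input is exhausted.
        return self.tokens[self.pos]

    def consume(self):
        # Advance past the current token; has no effect once "end" is reached.
        if self.next() != "end":
            self.pos += 1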
After a few grammar transformations, converting the grammar to code, some code transformations, and adding some tree building code, we get the following algorithm, which I call precedence climbing. The "climbing" is because, if you build a parse tree above the sequence of input tokens, the loop in procedure E climbs up (toward the root) along the left edge of the tree, and each recursive call to E climbs up the mini-tree that forms the right operand of a binary or prefix operator.
S is
    const t := E(0) ; expect( end ) ; output(t)

E(p) is
    precondition 0 ≤ p
    var t := P
    var r := +inf
    loop
        const l := next
        exit unless p ≤ prec(next) ≤ r
        consume
        if isBinary(l) then
            const y := E( rightPrec(l) )
            t := mknode(binary(l), t, y)
        else
            t := mknode(postfix(l), t)
        r := nextPrec(l)
    return t

P is
    const n := next
    consume
    if n = "-" then
        const t := E(30)
        return mknode(prefix("-"), t)
    else if n = "(" then
        const t := E(0)
        expect(")")
        return t
    else if n is a variable then
        const t := mkleaf(n)
        return t
    else
        error( "Unexpected token '" +n+ "'" )

expect( tok ) is
    if next = tok
        consume
    else
        error( "Unexpected token '" +next+ "'; a '" +tok+ "' was expected." )
and four tables
b | = | + | - | * | / | ! | ^ |
---|---|---|---|---|---|---|---|
prec(b) | 10 | 20 | 20 | 30 | 30 | 40 | 50 |
rightPrec(b) | 20 | 21 | 21 | 31 | 31 | NA | 50 |
nextPrec(b) | 9 | 20 | 20 | 30 | 30 | 40 | 49 |
isBinary(b) | true | true | true | true | true | false | true |
Additionally, every other token should have a prec of -1. Since p is nonnegative, this will force an exit from the loop in E when a token is encountered that can't be a binary or postfix operator.
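To make the pseudocode and the tables concrete, here is the same algorithm sketched in Python. The details are mine, not part of the algorithm: trees are nested tuples rather than mknode/mkleaf calls, the input is taken as a list of tokens to keep the sketch self-contained, and prefix minus is labelled "neg" so that it can be told apart from binary "-".

# The four tables; any token missing from PREC is treated as having precedence -1.
PREC      = {"=": 10, "+": 20, "-": 20, "*": 30, "/": 30, "!": 40, "^": 50}
RIGHTPREC = {"=": 20, "+": 21, "-": 21, "*": 31, "/": 31, "^": 50}
NEXTPREC  = {"=": 9,  "+": 20, "-": 20, "*": 30, "/": 30, "!": 40, "^": 49}
ISBINARY  = {"=": True, "+": True, "-": True, "*": True, "/": True,
             "!": False, "^": True}

class ClimbParser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["end"]
        self.pos = 0

    def next(self):
        return self.tokens[self.pos]

    def consume(self):
        if self.next() != "end":
            self.pos += 1

    def expect(self, tok):
        if self.next() != tok:
            raise SyntaxError("expected %r, saw %r" % (tok, self.next()))
        self.consume()

    def parse(self):                            # the S procedure
        t = self.E(0)
        self.expect("end")
        return t

    def E(self, p):
        t = self.P()
        r = float("inf")
        while p <= PREC.get(self.next(), -1) <= r:
            op = self.next()
            self.consume()
            if ISBINARY[op]:
                t = (op, t, self.E(RIGHTPREC[op]))
            else:
                t = (op, t)                     # the postfix operator "!"
            r = NEXTPREC[op]
        return t

    def P(self):
        n = self.next()
        self.consume()
        if n == "-":                            # prefix minus
            return ("neg", self.E(30))
        elif n == "(":
            t = self.E(0)
            self.expect(")")
            return t
        elif n.isalpha() and n != "end":        # a variable
            return n
        else:
            raise SyntaxError("unexpected token %r" % n)

print(ClimbParser(["a", "+", "b", "*", "c", "!"]).parse())
# prints ('+', 'a', ('*', 'b', ('!', 'c')))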
The algorithm handles prefix, postfix, and infix operators in a compact way. We can modify the algorithm to handle other operators such as, say, a ternary
y if x else z
operator, or an array indexing operation like
x [ y ]
I have done this, for example for my C++ expression parser. However, these extensions are ad hoc and complicate the E and P procedures. To accommodate operators like these and additional prefix operators in a modular way, we can use Pratt parsing.
Before going on, let's look at a way of classifying occurrences of tokens into three categories:
- N occurrences, which begin an expression or subexpression and so have no operand to their left;
- L occurrences, which take the complete subexpression just parsed to their left as an operand; and
- O occurrences, which are neither of these.
Let's look at an example
a  +  -  b  *  (  c  -  b  !  )  end
N  L  N  N  L  N  N  L  N  L  O  O
In the precedence climbing algorithm: N occurrences are those that are used to make the initial choice in the P procedure and that are then consumed, as opposed to leading to an error being reported; and L occurrences are those directly consumed by the E procedure. In the example, the first "-" is consumed by P, so it is an N, while the second "-" is consumed by E, so it is an L.
O token occurrences are usually associated with some earlier L or N token occurrence. E.g., each ")" is associated with an earlier occurrence of "(", which is an N occurrence. When we look at expressions like y if x else z, we will see that the "if" is an L occurrence and the "else" is an O occurrence. The "end" token is a bit of an exception in that it's not connected to anything; but if we had a special "begin" token to mark the start of the token stream, then the "begin" would be an N occurrence and the "end" would be connected to the "begin".
So that's a three-way classification of token occurrences. We can also classify tokens:
- N tokens are those that can have N occurrences,
- L tokens are those that can have L occurrences, and
- O tokens are those all of whose occurrences are O occurrences.
So a token can be both an N and an L token (for example "-"), but O tokens can't be anything else, and every token will be in (at least) one of these three categories.
First we'll refactor the treatment of binary and postfix operations. The idea is to make procedure E more data-driven by treating binary and postfix operators uniformly. We'll map each token to a command object via a table, which I'll call leftComm.
Each command object has three methods
abstract class LeftCommand
    abstract LeD( x, op ) returns Tree
    abstract LBP returns int
    NBP returns int is return LBP
Let's look at each method:
- LeD( x, op ) -- the "left denotation" -- is called after the operator token op has been consumed and the subexpression to its left has been parsed into a tree x; it parses any further operands and builds the tree for the whole operation.
- LBP -- the "left binding power" -- is the precedence the token presents to its left; the loop in E compares it against p and r to decide whether the token may take the expression parsed so far as its left operand.
- NBP -- the "next binding power" -- is the highest LBP that a subsequent operator may have if the expression just built is to become its left operand; by default it equals LBP.
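In Python, the interface might be sketched as below. I pass the parser to LeD explicitly, since command methods will need to call E; the lower-case method names and the use of the abc module are Python conventions chosen for this sketch, not part of the design.

from abc import ABC, abstractmethod

class LeftCommand(ABC):
    @abstractmethod
    def led(self, parser, left, op):
        """Parse whatever follows op, given the tree `left` already parsed to its left."""

    @abstractmethod
    def lbp(self):
        """Left binding power: the precedence this token presents to its left."""

    def nbp(self):
        """Next binding power: the highest LBP that may take the result as its left operand."""
        return self.lbp()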
We'll rewrite the E procedure to use commands. The old E is
E(p) is
    precondition 0 ≤ p
    var t := P
    var r := +inf
    loop
        const l := next
        exit unless p ≤ prec(next) ≤ r
        consume
        if isBinary(l) then
            const y := E( rightPrec(l) )
            t := mknode(binary(l), t, y)
        else
            t := mknode(postfix(l), t)
        r := nextPrec(l)
    return t
The new E is
E(p) is
    precondition 0 ≤ p
    var t := P
    var r := +inf
    loop
        const l := next
        const c := leftComm[l]
        exit unless p ≤ c.LBP ≤ r
        consume
        t := c.LeD( t, l )
        r := c.NBP
    return t
As happens often in OO programming, we've replaced a choice made by an if command with dynamic dispatch. Here we've replaced the lines
if isBinary(l) then
    const y := E( rightPrec(l) )
    t := mknode(binary(l), t, y)
else
    t := mknode(postfix(l), t)
with the line t := c.LeD( t, l ).
So the responsibility of the LeD method is to parse any operands that come after the operator and to build a representation (denotation) of the operation.
For every token that isn't an L token, the leftComm table uses a direct instance of DefaultLeftCommand (below), ensuring that its LBP is -1. This will force an exit from the loop in E when a non-L token is encountered, as the p parameter is never negative.
class DefaultLeftCommand extends LeftCommand
    override LBP is return -1
    override LeD( x, op ) is
        // Will not be called!
        assert false
leftComm[ "(" ] := new DefaultLeftCommand leftComm[ ")" ] := new DefaultLeftCommand leftComm[ "end"] := new DefaultLeftCommand leftComm[ v ] := new DefaultLeftCommand
for all variables v.
Binary and postfix operators have an LBP that is nonnegative
abstract class BinaryOrPost( lbp ) extends LeftCommand
    precondition 0 ≤ lbp
    override LBP is return lbp
I should explain that the lbp variable used in the precondition and the implementation of LBP refers to the value of the constructor parameter.
Binary operators have an LeD method that parses the right operand. There are 3 variations. The RBP method indicates the "right binding power" and determines the left binding power of the lowest precedence operator that can be in a right operand.
abstract class BinaryOperator( lbp ) extends BinaryOrPost( lbp )
    protected abstract RBP
    override LeD( x, op ) is
        const y := E( this.RBP )
        return mknode( binary(op), x, y )

class BopLeftAssoc( lbp ) extends BinaryOperator( lbp )
    override RBP is return 1 + this.LBP

class BopRightAssoc( lbp ) extends BinaryOperator( lbp )
    override RBP is return this.LBP

class BopNonAssoc( lbp ) extends BinaryOperator( lbp )
    override RBP is return 1 + this.LBP
    override NBP is return this.LBP - 1
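Continuing the Python sketch (LeftCommand is the abstract class sketched above), the associativity variations come out as small subclasses. Here parser is assumed to provide the E procedure as a method, and trees are again plain tuples rather than mknode calls; both are choices of the sketch.

class BinaryOrPost(LeftCommand):
    def __init__(self, lbp):
        assert lbp >= 0
        self._lbp = lbp

    def lbp(self):
        return self._lbp

class BinaryOperator(BinaryOrPost):
    def rbp(self):
        raise NotImplementedError           # "protected abstract RBP"

    def led(self, parser, left, op):
        right = parser.E(self.rbp())        # parse the right operand
        return (op, left, right)

class BopLeftAssoc(BinaryOperator):
    def rbp(self):
        return self.lbp() + 1

class BopRightAssoc(BinaryOperator):
    def rbp(self):
        return self.lbp()

class BopNonAssoc(BinaryOperator):
    def rbp(self):
        return self.lbp() + 1

    def nbp(self):
        return self.lbp() - 1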
We can declare some binary operators
leftComm[ "=" ] := new BopNonAssoc( 10 ) leftComm[ "+" ] := new BopLeftAssoc( 20 ) leftComm[ "-" ] := new BopLeftAssoc( 20 ) leftComm[ "*" ] := new BopLeftAssoc( 30 ) leftComm[ "/" ] := new BopLeftAssoc( 30 ) leftComm[ "^" ] := new BopRightAssoc( 50 )
Let's try some examples. With these declarations, "a - b - c" parses as "(a - b) - c", since a left associative operator's RBP is one more than its LBP, so the right operand of the first "-" can not contain another "-". "a ^ b ^ c" parses as "a ^ (b ^ c)", since a right associative operator's RBP equals its LBP, so the right operand of the first "^" extends as far as possible. And "a = b = c" is an error: the NBP of a nonassociative operator is one less than its LBP, so, after "a = b" is parsed, the second "=" can not take it as a left operand.
Postfix operators don't have a right operand and so they don't parse one in their LeD method. Otherwise, they are similar to binary operators.
class Postfix( lbp ) extends BinaryOrPost( lbp )
    override LeD( x, op ) is
        return mknode( postfix(op), x )
We'll add our postfix operator to the table
leftComm[ "!" ] := new Postfix( 40 )
Note that "a! ^ b" is not allowed, just as it is not allowed in the original grammar. Recall that the NBP value for a binary or postfix operator gives the highest precedence of an operator that this operator can be a left operand of. Since the NBP of ! is 40 --the default is that it is the same as the LBP -- the "a!" can not be the left operand of a ^. If we want to allow expressions like "a! ^ b", we just need to arrange that leftComm[ "!" ].NBP = +inf. That allows expressions that end with a bang to be left operands of any binary operator, which seems reasonable.
class Postfix( lbp ) extends BinaryOrPost( lbp )
    override LeD( x, op ) is
        return mknode( postfix(op), x )
    override NBP is return +inf
Now "a! ^ b" will be parsed. So will "a ^ b ! ^ c", which will have the same meaning as "(a ^ b)! ^ c" since ! has lower precedence.
At this point, we can see how to implement the indexing (x[y]) operation and the conditional (y if x else z) operation. I'll give these precedences of 60 and 5 respectively.
leftComm[ "[" ] := new IndexOp( 60 ) leftComm[ "if"] := new ConditionalOp( 5 )
The other new tokens are O tokens and so have the default left command.
leftComm[ "]" ] := new DefaultLeftCommand leftComm[ "then"] := new DefaultLeftCommand leftComm[ "else"] := new DefaultLeftCommand class IndexOp( lbp ) extends BinaryOrPost( lbp ) override LeD( x, op ) is const y := E( 0 ) expect( "]" ) return mknode( binary( "index" ), x, y ) class ConditionalOp( lbp ) extends BinaryOrPost( lpb ) override LeD( y, op ) is const x := E( 0 ) expect( "else" ) const z := E( RPB ) return mknode( ternary( "if" ), x, y, z ) override NBP is return this.LBP - 1 protected RBP is return this.LBP + 1
I've made the conditional nonassociative to rule out "confusing" expressions like x if a else y if b else z. (OK, so you don't find that confusing? Then make it right associative by making NBP=LBP=RBP or left associative by making NBP=LBP=RBP-1.)
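In the running Python sketch these two commands come out as follows. As before, parser is assumed to provide E and expect as methods, and the tree shapes (the "index" and "if" tuples) are just one possible representation.

class IndexOp(BinaryOrPost):
    def led(self, parser, left, op):
        index = parser.E(0)
        parser.expect("]")
        return ("index", left, index)

class ConditionalOp(BinaryOrPost):
    def rbp(self):
        return self.lbp() + 1               # nonassociative to the right ...

    def nbp(self):
        return self.lbp() - 1               # ... and to the left

    def led(self, parser, then_value, op):  # the left operand is the "then" value
        condition = parser.E(0)
        parser.expect("else")
        else_value = parser.E(self.rbp())
        return ("if", condition, then_value, else_value)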
Let's treat = as a "chaining operator" rather than nonassociative. That is x=y=z should mean (x=y) and (y=z) (supposing we had an "and" operator).
class Chaining( lbp ) extends BinaryOrPost( lbp )
    override LeD( x, op ) is
        const y := E( 1 + this.LBP )
        const t := mknode( binary( op ), x, y )
        if leftComm[ next ].LBP = this.LBP then
            const nextOp := next
            consume
            const t1 := this.LeD( y, nextOp )
            return mknode( binary( "and" ), t, t1 )
        else
            return t
    override NBP is return this.LBP - 1
and we change the entry for = in the table to
leftComm[ "=" ] := new Chaining( 10 )
Now let's give "=" some friends
leftComm[ "<" ] := new Chaining( 10 ) leftComm[ "≤" ] := new Chaining( 10 ) leftComm[ ">" ] := new Chaining( 10 ) leftComm[ "≥" ] := new Chaining( 10 )
Now "a ≤ b = c < d" means just what it should.
From a software engineering point of view, the important point is how modular these additions and alterations are. The basic parsing procedure E did not need to change; we just added some new classes and altered the table.
Can we make the treatment of prefix operators and other N tokens equally modular? Well, I wouldn't ask if we couldn't. Remember the N tokens are "-", "(", and the variables v.
Here is P again
P is
    const n := next
    consume
    if n = "-" then
        const t := E(30)
        return mknode(prefix("-"), t)
    else if n = "(" then
        const t := E(0)
        expect(")")
        return t
    else if n is a variable then
        const t := mkleaf(n)
        return t
    else
        error( "Unexpected token '" +n+ "'" )
Our next change is to make the P procedure data driven; i.e. to use dynamic dispatch to choose the course of action.
P is
    const n := next
    consume
    return nullComm[n].NuD( n )
nullComm is a table analogous to leftComm, but used for N tokens.
nullComm[ "-" ] := new UnaryOp( 30 ) nullComm[ "(" ] := new Parens nullComm[ v ] := new Leaf
The last line needs to be repeated for each possible variable -- although the same command object can be shared for all entries. All other tokens (all non-N tokens) should map to an instance of ErrorNullCommand.
nullComm[ "end" ] := new ErrorNullCommand nullComm[ ")" ] := new ErrorNullCommand nullComm[ "]" ] := new ErrorNullCommand nullComm[ "if" ] := new ErrorNullCommand nullComm[ "then" ] := new ErrorNullCommand nullComm[ "else" ] := new ErrorNullCommand nullComm[ "=" ] := new ErrorNullCommand ... nullComm[ "^" ] := new ErrorNullCommand
Now we need some classes for commands
abstract class NullCommand
    abstract NuD( op ) returns Tree

class ErrorNullCommand extends NullCommand
    override NuD( op ) is
        error( "Unexpected token '" +op+ "'" )

class UnaryOp( rbp ) extends NullCommand
    protected const rbp := rbp
    override NuD( op ) is
        const y := E( this.rbp )
        return mknode( prefix( op ), y )

class Parens extends NullCommand
    override NuD( op ) is
        const t := E( 0 )
        expect( ")" )
        return t

class Leaf extends NullCommand
    override NuD( op ) is
        return mkleaf( op )
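The null commands in the Python sketch look much the same; as with the left commands, passing the parser in explicitly and using tuples for trees are choices of the sketch.

from abc import ABC, abstractmethod

class NullCommand(ABC):
    @abstractmethod
    def nud(self, parser, op):
        """Parse a (sub)expression beginning with the already consumed token op."""

class ErrorNullCommand(NullCommand):
    def nud(self, parser, op):
        raise SyntaxError("unexpected token %r" % op)

class UnaryOp(NullCommand):
    def __init__(self, rbp):
        self._rbp = rbp

    def nud(self, parser, op):
        return (op, parser.E(self._rbp))

class Parens(NullCommand):
    def nud(self, parser, op):
        t = parser.E(0)
        parser.expect(")")
        return t

class Leaf(NullCommand):
    def nud(self, parser, op):
        return op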
Now it's easy to add new kinds of prefix operators and such. For example, we can add expressions like if x then y else z to the language like this.
nullComm[ "if" ] := new IfThenElse( ) class IfThenElse extends NullCommand override NuD( op ) is const x := Exp( 0 ) expect( "then" ) const y := Exp( 0 ) expect( "else" ) const z := Exp( 0 ) return mknode( ternary( "if" ), x, y, z )
It might not be the best language design choice, but it is interesting that we can use both syntaxes for conditional expressions in the same language; there is no difficulty for the parser to handle, for example
if a if b else c then d if e else f else if g then h else i if j else k
This works because the parser can classify an occurrence as N or L based only on what the token is and what came before it. This works for "if" for the same reason that it works for "-". Similarly we can add a prefix ! operator. There is no problem understanding "!a!!!" as long as you know the relative precedence of prefix and postfix !.
So far "-" and possibly "if" are can be either L and N tokens depending on context. Identifiers are N tokens only.
The Scala language allows identifiers as binary operators. For example you can write "s contains b" instead of "s.contains(b)". Since an N occurrence can't follow an identifier, the parser can classify "contains" as an L occurrence. (I don't know if the Scala compiler uses Pratt parsing and I'm not advocating this as a language design choice.) Let's, for fun, allow identifiers to be binary operators with the highest precedence. Add to the table
leftComm[ v ] := new BopLeftAssoc( 70 )
for all variables v.
Alternatively, Haskell takes a different approach from Scala. In Haskell an expression followed by another expression represents application, so, for example, "f (g x) y" should parse to a tree apply[apply[f, apply[g, x]], y]. Again we need to treat variables as L tokens as well as N tokens; and this time also "(", and in fact any N token that isn't already also an L token. The problem is that the occurrences of "(", "x", and "y" in the example are both L occurrences and N occurrences. They are L occurrences because they have a complete subexpression to the left and N occurrences because they start a complete subexpression. We can manage this in two ways.
The first is, provided we can put a token back into the token stream, we can treat it as an L token first, then put it back, and then treat it as an N token. Let's suppose the procedure "unconsume" puts a token back into the stream of incoming tokens. We can design a command
class Application( lbp ) extends BinaryOperator( lbp )
    override RBP is return 1 + this.LBP
    override LeD( x, op ) is
        unconsume( op )
        const y := E( this.RBP )
        return mknode( binary("apply"), x, y )
We add to the table
leftComm[ "(" ] := new Application( 70 ) leftComm[ v ] := new Application( 70 )
for every v and so on for every N token that isn't already also an L token. In particular we leave "-" alone, so that "x - y" results in -[x, y] rather than apply[x, -[y]].
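The unconsume procedure assumed above could be provided by giving the token stream one token of push-back. A minimal sketch (the class and its details are mine, not anything the technique requires):

class PushBackLexer:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["end"]
        self.pos = 0

    def next(self):
        return self.tokens[self.pos]

    def consume(self):
        if self.next() != "end":
            self.pos += 1

    def unconsume(self, token):
        # Undo the most recent consume; the caller hands back the token it took.
        assert self.pos > 0 and self.tokens[self.pos - 1] == token
        self.pos -= 1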
A second way to achieve the same end is to remove the call to consume from the loop of the E procedure. Then we don't need the "unconsume" call in the LeD for application, but we need to add a consume to every other implementation of LeD.
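As a sketch of this second design, the loop in E might become the following. Here the command methods take only the lexer, since each NuD and LeD is now responsible for consuming its own tokens (so the signatures differ from the sketches above), and null_comm and left_comm are assumed to be the table lookups.

def E(p, lexer, null_comm, left_comm):
    # The N token is *not* consumed here; its NuD consumes it.
    t = null_comm(lexer.next()).nud(lexer)
    r = float("inf")
    while True:
        c = left_comm(lexer.next())
        if not (p <= c.lbp() <= r):
            break
        # No consume here either; each LeD consumes the operator token itself.
        t = c.led(lexer, t)
        r = c.nbp()
    return t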
These examples, I think, illustrate how flexible the parsing technique it.
What about the "end" token. Did I forget about it? Yes I did. Let's address that now. "end" is always an O token. Any token that can't be an L token or an N token is an O token and has the default LeftCommand and the default NullCommand. Any O token has the default LeftCommand and so it has a LBP of -1, meaning that if it is encountered where an L token is expected it ends any expression. It has the default NullCommand, meaning that if it's encountered where an N token is expected it is an error. So the 'end' token is just like any other O token. The only thing that makes "end" special is that our main routine uses it as follows and it is not used anywhere else.
S is
    var t := E(0)
    expect("end")
    return t
Here is the final version of the algorithm after inlining P.
S is
    var t := E(0)
    expect("end")
    return t

E(p) is
    precondition 0 ≤ p
    const n := next
    consume
    var t := nullComm[n].NuD( n )
    var r := +inf
    loop
        const l := next
        const c := leftComm[l]
        exit unless p ≤ c.LBP ≤ r
        consume
        t := c.LeD( t, l )
        r := c.NBP
    return t
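To pull the pieces together, here is the whole parser for the little expression language as one runnable Python sketch. It follows the structure above, but the details are mine: trees are tuples, the lexer is a small regular expression, duck typing stands in for the abstract command classes, the variables row of nullComm is handled by a helper function rather than one entry per identifier (anticipating a remark below), and Postfix already has the NBP = +inf change.

import re

class Parser:
    TOKEN = re.compile(r"\s*([A-Za-z_]\w*|[^\sA-Za-z_])")

    def __init__(self, text):
        self.tokens = self.TOKEN.findall(text) + ["end"]
        self.pos = 0

    def next(self):
        return self.tokens[self.pos]

    def consume(self):
        if self.next() != "end":
            self.pos += 1

    def expect(self, tok):
        if self.next() != tok:
            raise SyntaxError("expected %r, saw %r" % (tok, self.next()))
        self.consume()

    def parse(self):                      # the S procedure
        t = self.E(0)
        self.expect("end")
        return t

    def E(self, p):                       # the final E, with P inlined
        n = self.next()
        self.consume()
        t = null_comm(n).nud(self, n)
        r = float("inf")
        while True:
            l = self.next()
            c = left_comm(l)
            if not (p <= c.lbp() <= r):
                break
            self.consume()
            t = c.led(self, t, l)
            r = c.nbp()
        return t

# Left commands (for L tokens).
class DefaultLeft:
    def lbp(self): return -1
    def nbp(self): return -1
    def led(self, parser, left, op): raise AssertionError("not an L token")

class BinaryOrPost:
    def __init__(self, lbp): self._lbp = lbp
    def lbp(self): return self._lbp
    def nbp(self): return self._lbp

class BopLeftAssoc(BinaryOrPost):
    def led(self, parser, left, op): return (op, left, parser.E(self.lbp() + 1))

class BopRightAssoc(BinaryOrPost):
    def led(self, parser, left, op): return (op, left, parser.E(self.lbp()))

class BopNonAssoc(BinaryOrPost):
    def nbp(self): return self.lbp() - 1
    def led(self, parser, left, op): return (op, left, parser.E(self.lbp() + 1))

class Postfix(BinaryOrPost):
    def nbp(self): return float("inf")    # so that "a! ^ b" is allowed
    def led(self, parser, left, op): return (op, left)

# Null commands (for N tokens).
class ErrorNull:
    def nud(self, parser, op): raise SyntaxError("unexpected token %r" % op)

class Prefix:
    def __init__(self, rbp): self._rbp = rbp
    def nud(self, parser, op): return (op, parser.E(self._rbp))

class Parens:
    def nud(self, parser, op):
        t = parser.E(0)
        parser.expect(")")
        return t

class Leaf:
    def nud(self, parser, op): return op

# The two tables, written as a pair of functions so that the infinitely
# many variables do not each need an entry.
LEFT_TABLE = {"=": BopNonAssoc(10), "+": BopLeftAssoc(20), "-": BopLeftAssoc(20),
              "*": BopLeftAssoc(30), "/": BopLeftAssoc(30), "!": Postfix(40),
              "^": BopRightAssoc(50)}
NULL_TABLE = {"-": Prefix(30), "(": Parens()}
DEFAULT_LEFT, ERROR_NULL, LEAF = DefaultLeft(), ErrorNull(), Leaf()

def left_comm(tok):
    return LEFT_TABLE.get(tok, DEFAULT_LEFT)

def null_comm(tok):
    if tok in NULL_TABLE: return NULL_TABLE[tok]
    if tok.isalpha() and tok != "end": return LEAF    # variables
    return ERROR_NULL

print(Parser("-a * (b - c) !").parse())
# prints ('-', ('*', 'a', ('!', ('-', 'b', 'c'))))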
Pratt parsing works great when there is essentially only one kind of thing that you are parsing. Consider a grammar for a simple imperative programming language
...TODO...
Pratt's paper was not written in an object-oriented style. Instead he puts procedures directly in tables. For this reason, he has 3 tables (NuD, LeD, LBP) rather than 2. (The addition of NBP is my own; if Pratt had had it he'd have had 4 tables.) Using classes and objects makes it easy to share code, for example sharing the LeD method between the 3 subclasses of binary operators.
Pratt assumes dynamic scoping of variables, which would have been the norm in LISP implementations at the time. He also describes the main algorithm with a diagram that seems to miss some of the details.
I took the names LeD, NuD, and LBP from Pratt. I would have used other names myself, but I think it's worth the price of poor naming to be consistent with Pratt's and other presentations.
One thing that might bother you is the need to have tables that map a large (perhaps infinite) set of tokens to commands. For most applications, there is no need to change the tables at run time, so the tables can be implemented as a pair of procedures. (In OO terms we need a factory that has methods leftComm and nullComm.) And in most cases all we need to know is the category of a token, not the actual token string; so we can use an array indexed by the token categories.
In some cases we do want to be able to change the tables during run time. For example, Haskell lets the programmer declare the precedence of binary operators; and most Prolog dialects allow one to declare new operators along with associativity and precedence. If we do want to change the table on the fly, a hash map can be used.
Having reduced the number of tables from 3 in Pratt's paper to 2, the next step might be to have one table and only one kind of command. I.e. ensure that every concrete command has all 4 methods (LeD, LBP, NBP, and NuD). I didn't do this, because we might want to vary the tables independently. Consider adding function application syntax x(y) to the language. What does this have to do with parenthesization? Nothing really. Yet if I only had one command class for "(", it would have responsibilities for both parenthesization and function application. The treatment of a token when it's used as an N token and when it's used as an L token are two different responsibilities and shouldn't be in the same class.
Another approach is to do away with the tables and use objects as tokens that have the LeD, LBP, NBP, and NuD methods built into them. The comments of the previous paragraph apply.
As mentioned above, the NBP method is my own addition. It's used for making binary operators --and the like-- nonassociative. If you don't have need of it, don't implement it and change the E procedure to
E(p) is
    var token := next
    consume
    var t := nullComm[token].NuD( token )
    loop
        token := next
        const c := leftComm[token]
        exit unless p ≤ c.LBP
        consume
        t := c.LeD( t, token )
    return t
[0] Eli Bendersky, "Top-Down operator precedence parsing", January 2010, http://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing
[1] Andy Chu, "Pratt Parsing and Precedence Climbing Are the Same Algorithm", Nov 2016. http://www.oilshell.org/blog/2016/11/01.html
[2] Andy Chu, "Review of Pratt/TDOP Parsing Tutorials", Nov 2016. http://www.oilshell.org/blog/2016/11/02.html
[3] Andy Chu, "Pratt Parsing Without Prototypal Inheritance, Global Variables, Virtual Dispatch, or Java", Nov 2016. http://www.oilshell.org/blog/2016/11/03.html
[4] Keith Clarke, "The top-down parsing of expressions", Research Report 383, Dept of Computer Science, Queen Mary College. Archived at http://antlr.org/papers/Clarke-expr-parsing-1986.pdf.
[5] Theodore S. Norvell, "Parsing Expressions by Recursive Descent", 1999 (revised 2013), http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm
[6] Vaughan R. Pratt, "Top down operator precedence", Proceedings of the 1st Symposium on Principles of Programming Languages (POPL), 1973, http://dl.acm.org/citation.cfm?id=512931
[7] Martin Richards and Colin Whitby-Strevens, BCPL -- the language and its compiler, Cambridge University Press, 1979.
Thanks to Andy Chu for pointing out the connection between precedence climbing and Pratt parsing.