From Precedence Climbing to Pratt Parsing

Theodore S. Norvell (C) 2016

An earlier article by me explored a number of approaches to recursive descent parsing of expressions, including a particularly simple, efficient, and flexible algorithm invented (as nearly as I can tell) by Martin Richards [7], which I called "precedence climbing".

Recently it was pointed out to me by Andy Chu that precedence climbing is really a special case of Pratt parsing, an algorithm (or approach) invented by Vaughan Pratt [6] in the early 70s. Andy wrote a blog post about it [1]. He goes as far as to say they are the same algorithm; I'd prefer to say that one is a fairly straightforward generalization of the other.

The purpose of this post is to explore this connection by starting with precedence climbing and refactoring it to use the command pattern until we arrive at a Pratt parser. We can then expand the language that can be parsed in a modular way.

For background read my earlier article on expression parsing [5], especially the part on precedence climbing.

If your purpose is to learn about Pratt parsing, this post might or might not be for you. If you already know about precedence climbing or are willing to learn about it first, then this article provides a gentle introduction to Pratt parsing, since it shows how Pratt parsing is a refactoring and generalization of the precedence climbing algorithm you already (will) know.

But if you just want to learn Pratt parsing and don't already know about and don't want to learn about precedence climbing first, this article is probably not the best choice. There are many tutorials on the web; I haven't read many of them, since I had access to Pratt's original paper. A guide to some of them can be found at http://www.oilshell.org/blog/2016/11/02.html [2].

The precedence climbing algorithm

This section is a quick review. For a full explanation see my earlier article. We'll start with an ambiguous left-recursive grammar.

S --> E0 end
E0 --> E10
E10 --> E20 | E20 "=" E20
E20 --> E30 | E20 "+" E30 | E20 "-" E30
E30 --> E40 | E30 "*" E40 | E30 "/" E40
E40 --> E50 | E40 "!"
E50 --> E60 | E60 "^" E50
E60 --> P
P --> "-" E30 | "(" E0 ")" | v

We have, in order of increasing precedence,

    =          non-associative
    + and -    left associative
    * and /    left associative
    !          postfix
    ^          right associative

The set of tokens includes "=", "+", "-", "*", "/", "!", "^", "(", ")", and a set of variables represented by the symbol v in the grammar. A special token "end" marks the end of the input.

The ambiguity is resolved by saying that, for example, -a*b should parse as -(a*b). This choice is not fully captured by the grammar, since I don't know a nice way to express it there.

The interface to the lexical analyser is via two procedures: next returns the next token as a string, and consume advances the lexical analyser so that next returns the subsequent token. When there are no more tokens, next returns the special value "end"; further calls to consume have no effect on next.

After a few grammar transformations, converting the grammar to code, some code transformations, and adding some tree building code, we get the following algorithm:

S is const t := E(0) ; expect( end ) ; output(t)

E(p) is  
    precondition 0 ≤ p
    var t := P
    var r := +inf
    while p≤prec(next)≤r
        const b := next
        consume
        if isBinary(b) then 
            const y := E( rightPrec(b) )
            t := mknode(binary(b), t, y)
        else t := mknode(postfix(b), t) 
        r := nextPrec(b) 
    return t

P is 
    if next="-" then ( consume ; const t:= E(30) ; return mknode(prefix('-'), t) )
    else if next = "(" then ( consume ; const t := E(0) ; expect(")") ; return t )
    else if next is a v then ( const t :=  mkleaf(next) ; consume ;  return t )
    else error( "Unexpected token '" +next+ "'" ) 

expect( tok ) is
   if next = tok 
       consume
   else
       error( "Unexpected token '" +next+ "'; a '" +tok+ "' was expected." ) 

and four tables

b              =     +     -     *     /     !     ^
prec(b)        10    20    20    30    30    40    50
rightPrec(b)   11    21    21    31    31    NA    50
nextPrec(b)     9    20    20    30    30    40    49
isBinary(b)    true  true  true  true  true  false true

Additionally every other token should have a prec of -1. Since p is nonnegative, this will force an exit from the loop in E when a token is encountered that can't be a binary or postfix operator.
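As a concrete check on the algorithm, here is a minimal executable sketch in Python. The token interface, the tuple representation of trees, the "neg" tag that distinguishes prefix minus from binary minus, and the single-letter variables are my own assumptions, not part of the original presentation.

```python
# Tables transcribed from the article; rightPrec and nextPrec encode
# associativity, and nextPrec also enforces the non-associativity of "=".
PREC       = {"=": 10, "+": 20, "-": 20, "*": 30, "/": 30, "!": 40, "^": 50}
RIGHT_PREC = {"=": 11, "+": 21, "-": 21, "*": 31, "/": 31, "^": 50}
NEXT_PREC  = {"=": 9,  "+": 20, "-": 20, "*": 30, "/": 30, "!": 40, "^": 49}
IS_BINARY  = {"=": True, "+": True, "-": True, "*": True, "/": True,
              "!": False, "^": True}

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def next(self):                    # peek at the next token
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "end"

    def consume(self):                 # advance; no effect once input is done
        if self.pos < len(self.tokens):
            self.pos += 1

    def expect(self, tok):
        if self.next() != tok:
            raise SyntaxError("expected %r, saw %r" % (tok, self.next()))
        self.consume()

    def parse(self):                   # S --> E0 end
        t = self.E(0)
        self.expect("end")
        return t

    def E(self, p):
        t = self.P()
        r = float("inf")
        # any token with no precedence entry gets -1, forcing an exit
        while p <= PREC.get(self.next(), -1) <= r:
            b = self.next()
            self.consume()
            if IS_BINARY[b]:
                t = (b, t, self.E(RIGHT_PREC[b]))   # binary node
            else:
                t = (b, t)                          # postfix node
            r = NEXT_PREC[b]
        return t

    def P(self):
        if self.next() == "-":         # prefix minus; operand parsed at 30
            self.consume()
            return ("neg", self.E(30))
        elif self.next() == "(":
            self.consume()
            t = self.E(0)
            self.expect(")")
            return t
        elif self.next().isalpha() and self.next() != "end":  # a variable v
            t = self.next()
            self.consume()
            return t
        else:
            raise SyntaxError("unexpected token %r" % self.next())
```

For instance, Parser("- a * b".split()).parse() yields ("neg", ("*", "a", "b")), matching the rule that -a*b parses as -(a*b).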

The algorithm handles prefix, postfix, and infix operators in a compact way. We can extend it to handle other operators such as say a ternary

y if x else z

operator, or an array indexing operation like

x [ y ]

I have done this, for example for my C++ expression parser. However, these extensions are ad hoc and complicate the E and P procedures. To accommodate operators like these and additional prefix operators in a modular way, we can use Pratt parsing.

Categories of tokens

Before going on, let's look at a way of classifying tokens into three categories. "L" tokens represent operators and such that take some operand on the left. So far this includes binary and postfix operators. Later it will include "[" and "if". "N" tokens represent tokens that have some meaning, but don't have a left operand. These include prefix operations and variables. I'll also include "(" as an "N" token, since it functions a bit like a prefix operator. "O" tokens are all others. So far only ")" and the end of input marker are "O" tokens; later we'll add "]" and "then". The "-" operator may be L or N depending on how it is used. Here is an example of classifying tokens

     a + - b * ( c - b ! ) end
     N L N N L N N L N L O O

In the precedence climbing algorithm, N tokens are those that are used to make the initial choice in the P procedure and are then consumed (as opposed to leading to an error being reported), while L tokens are those directly consumed by the E procedure. In the example, the first - is consumed by P, so it is an N, while the second is consumed by E, so it is an L.
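This classification can be computed mechanically by tracking whether the parser currently expects an operand (an N position) or has just finished one (an L position). The following small Python function is a hypothetical illustration of that idea, not part of the algorithm itself; it assumes well-formed input and single-letter variables.

```python
# Token sets for the running example; note that "-" appears in both
# the N set and the L set, since it can be prefix or binary.
L_TOKENS = {"=", "+", "-", "*", "/", "!", "^"}   # may follow an operand
N_TOKENS = {"-", "("}                            # may start an operand
POSTFIX  = {"!"}                                 # L tokens with no right operand

def classify(tokens):
    cats = []
    expecting_operand = True        # at the start, an operand must come next
    for tok in tokens:
        if expecting_operand and (tok in N_TOKENS
                                  or (tok.isalpha() and tok != "end")):
            cats.append("N")
            # "(" and prefix "-" still await an operand; a variable does not
            expecting_operand = tok in N_TOKENS
        elif not expecting_operand and tok in L_TOKENS:
            cats.append("L")
            # after a postfix operator, another L or O token may follow
            expecting_operand = tok not in POSTFIX
        else:
            cats.append("O")        # ")" and "end" in well-formed input
            expecting_operand = False
    return cats
```

Running it on the token stream above reproduces the N L N N L N N L N L O O classification.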

Dealing with L tokens

First we'll refactor the treatment of binary and postfix operations. The idea is to make procedure E even more data-driven by treating binary and postfix operators uniformly. We'll map each token to a command object via a table, which I'll call leftComm.

Each command object has three methods

    abstract class LeftCommand

        abstract LeD( x, op ) returns Tree

        abstract LBP returns int

        NBP returns int is
            return LBP

Let's look at each method.

The LeD method ("left denotation") receives the tree x built for the left operand, together with the operator token op; its job is to parse any further operands and to return a tree for the whole operation.

The LBP method ("left binding power") plays the role that the prec table played in the precedence climbing algorithm.

The NBP method ("next binding power") plays the role that the nextPrec table played. Its default implementation returns LBP, which is the appropriate value for left-associative operators.

We'll rewrite the E procedure to use commands.

E(p) is  
    precondition 0 ≤ p
    var t := P
    var r := +inf
    var b
    loop
        b := next
        var c := leftComm[b]
        exit unless p ≤ c.LBP ≤ r
        consume
        t := c.LeD( t, b )
        r := c.NBP
    return t

As happens often in OO programming, we've replaced a choice made by an if command with dynamic dispatch. Here we've replaced the lines

    if isBinary(b) then
        const y := E( rightPrec(b) )
        t := mknode(binary(b), t, y)
    else
        t := mknode(postfix(b), t) 

with the single line t := c.LeD( t, b ).

So the responsibility of the LeD method is to parse any operands that come after the operator and to build a representation (denotation) of the operation.

Binary and postfix operators have an LBP that is nonnegative

abstract class BinaryOrPost( lbp ) extends LeftCommand
    
    protected const lbp := lbp

    override LBP is
        return lbp

What I mean by the second line is that the field named lbp is initialized to the value of the constructor parameter named lbp. In Java we'd implement this with a line in the constructor that says this.lbp = lbp.

Binary operators have a LeD method that parses the right operand. There are 3 variations

abstract class BinaryOperator( lbp ) extends BinaryOrPost( lbp )

    abstract RBP

    override LeD( x, op ) is
        const y := E( this.RBP )
        return mknode( binary(op), x, y )

class BopLeftAssoc( lbp ) extends BinaryOperator( lbp )

    override RBP is return 1 + this.LBP

class BopRightAssoc( lbp ) extends BinaryOperator( lbp )

    override RBP is return this.LBP

class BopNonAssoc( lbp ) extends BinaryOperator( lbp )

    override RBP is return 1 + this.LBP

    override NBP is return this.LBP - 1

Postfix operators don't have a right operand and so they don't parse one in their LeD method

class Postfix( lbp ) extends BinaryOrPost( lbp )

    override LeD( x, op ) is
        return mknode( postfix(op), x )

We fill the table like this

leftComm[ "=" ] := new BopNonAssoc( 10 )
leftComm[ "+" ] := new BopLeftAssoc( 20 )
leftComm[ "-" ] :=  new BopLeftAssoc( 20 )
leftComm[ "*" ] :=  new BopLeftAssoc( 30 )
leftComm[ "/" ] :=  new BopLeftAssoc( 30 )
leftComm[ "!" ] :=  new Postfix( 40 )
leftComm[ "^" ] :=  new BopRightAssoc( 50 )

For every other token, the leftComm table uses an instance of DefaultLeftCommand (below), ensuring that its LBP is -1. This will force an exit from the loop in E when a non-L token is encountered, since the p parameter is never less than 0.

    class DefaultLeftCommand extends LeftCommand

        override LBP is return -1

        override LeD( x, op ) is // Should not be called!
            assert false

At this point, we can see how to implement the indexing (x[y]) operation and the conditional (y if x else z) operation. I'll give these precedences of 60 and 5 respectively.

leftComm[ "[" ] := new IndexOp( 60 )
leftComm[ "if"] := new ConditionalOp( 5 )

class IndexOp( lbp ) extends BinaryOrPost( lbp )

    override LeD( x, op ) is
        const y := E( 0 )
        expect( "]" )
        return mknode( binary( "index" ), x, y )

class ConditionalOp( lbp ) extends BinaryOrPost( lbp )

    override LeD( y, op ) is
        const x := E( 0 )
        expect( "then" )
        const z := E( 1 + this.LBP )
        return mknode( ternary( "if" ), x, y, z )

    override NBP is
        return this.LBP - 1

I've made the conditional nonassociative to rule out confusing expressions like x if a else y if b else z.

Let's treat = as a "chaining operator" rather than a non-associative one. That is, x=y=z should mean (x=y) and (y=z). While we're at it, we can introduce operators <, ≤, >, and ≥ that can also be used in chains of comparisons.

leftComm[ "=" ] := new Chaining( 10 )
leftComm[ "<" ] := new Chaining( 10 )
leftComm[ "≤" ] := new Chaining( 10 )
leftComm[ ">" ] := new Chaining( 10 )
leftComm[ "≥" ] := new Chaining( 10 )

class Chaining( lbp ) extends BinaryOrPost( lbp )

    override LeD( x, op ) is
        const y := E( 1 + this.LBP )
        const t := mknode( binary( op ), x, y ) 
        if next ∈ {"=", "<", "≤", ">", "≥"}
            const nextOp := next
            consume
            const t1 := this.LeD( y, nextOp )
            return mknode( binary( "and" ), t, t1 )
        else
            return t

    override NBP is
        return this.LBP - 1

From a software engineering point of view, the important point is how modular these additions and alterations are. The basic parsing procedure E did not need to change, only the table.
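To see the chaining behaviour concretely, here is a standalone Python sketch of just the Chaining command's LeD, with only enough of E around it to run. The tuple trees, the reduced operator set, and bare-variable operands are my simplifications.

```python
CHAIN_OPS = {"=", "<", ">"}                    # a reduced chaining set
CHAIN_LBP = 10                                 # all chaining ops share LBP 10

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0
    def next(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "end"
    def consume(self):
        if self.pos < len(self.tokens):
            self.pos += 1
    def E(self, p):
        t = self.next()                        # operands are bare variables
        self.consume()
        r = float("inf")
        while self.next() in CHAIN_OPS and p <= CHAIN_LBP <= r:
            op = self.next()
            self.consume()
            t = self.chaining_led(t, op)
            r = CHAIN_LBP - 1                  # NBP = LBP - 1: no re-entry
        return t
    def chaining_led(self, x, op):             # the LeD of class Chaining
        y = self.E(1 + CHAIN_LBP)              # E( 1 + this.LBP )
        t = (op, x, y)
        if self.next() in CHAIN_OPS:
            next_op = self.next()
            self.consume()
            t1 = self.chaining_led(y, next_op) # the chain continues from y
            return ("and", t, t1)
        return t
```

Note that the recursive call passes y, so the shared middle operand appears in both conjuncts, exactly as x=y=z meaning (x=y) and (y=z) requires.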

Dealing with N tokens

Can we make the treatment of prefix operators and other N tokens equally modular? Remember the N tokens are "-", "(", and the variables v.

These are currently dealt with in the P procedure.

P is 
    if next="-" then ( consume ; const t:= E(30) ; return mknode(prefix('-'), t) )
    else if next = "(" then ( consume ; const t := E(0) ; expect(")") ; return t )
    else if next is a v then ( const t :=  mkleaf(next) ; consume ;  return t )
    else error( "Unexpected token '" +next+ "'" ) 

Our next change is to make the P procedure data-driven, i.e., to use dynamic dispatch to choose the course of action.

P is 
    const token := next
    consume
    return nullComm[token].NuD( token )

nullComm is a table analogous to leftComm, but used for N tokens.

nullComm[ "-" ] := new UnaryOp( 30 )
nullComm[ "(" ] := new Parens
nullComm[ v ] := new Leaf

The last line needs to be repeated for each possible variable -- although the same command object can be shared for all entries. All other tokens (all non-N tokens) should map to an instance of ErrorNullCommand.

Now we need some classes for commands

abstract class NullCommand
    abstract NuD( op ) returns Tree

class ErrorNullCommand extends NullCommand

    override NuD( op ) is
        error( "Unexpected token '" +op+ "'" )

class UnaryOp( rbp ) extends NullCommand

    protected const rbp := rbp

    override NuD( op ) is
        const y := E( this.rbp )
        return mknode( prefix( op ), y )

class Parens extends NullCommand

    override NuD( op ) is
        const t := E( 0 ) 
        expect( ")" )
        return t

class Leaf extends NullCommand

    override NuD( op ) is
        return mkleaf( op )

Now it's easy to add new kinds of prefix operators and such. For example we can add expressions like if x then y else z to the language like this.

nullComm[ "if" ] := new IfThenElse( )

class IfThenElse extends NullCommand

    override NuD( op ) is
        const x := E( 0 )
        expect( "then" )
        const y := E( 0 )
        expect( "else" )
        const z := E( 0 )
        return mknode( ternary( "if" ), x, y, z )

It might not be the best language design choice, but it is interesting that we can use both syntaxes for conditional expressions in the same language; the parser has no difficulty handling, for example,

    if a if b else c then d if e else f else if g then h else i if j else k

Summary

Here is the final version of the core algorithm after inlining P.

E(p) is  
    precondition 0 ≤ p
    var token := next
    consume
    var t := nullComm[token].NuD( token )
    var r := +inf
    loop
        token := next
        const c := leftComm[token]
        exit unless p ≤ c.LBP ≤ r
        consume
        t := c.LeD( t, token )
        r := c.NBP
    return t

Comparing with Pratt's paper

Pratt's paper was not written in an object-oriented style. Instead he puts procedures directly in tables. For this reason he has 3 tables rather than 2: NuD, LeD, LBP. (The addition of NBP is my own; see below.)  Using classes and objects makes it easy to share code, for example sharing the LeD method between the 3 subclasses of binary operators.

Pratt assumes dynamic scoping of variables, which would have been the norm in LISP implementations at the time. He also describes the main algorithm with a diagram that seems to miss some of the details.

I took the names LeD, NuD, and LBP from Pratt. I would have used other names myself, but it's worth being consistent with Pratt and other presentations.

Implementation details and other design choices

One thing that might bother the reader is the need for tables that map a large (perhaps infinite) set of tokens to commands. For most applications, there is no need to change the tables at run time, so the tables can be implemented as a pair of procedures. (In OO terms, we need a factory that has methods leftComm and nullComm.) And in most cases all we need to know is the category of a token, not the actual token string; so we can use an array indexed by the token categories.

In some cases we do want to be able to change the tables during run time. For example, Haskell lets the programmer declare the precedence of binary operators; and most Prolog dialects allow one to declare new operators along with associativity and precedence. If we do want to change the table on the fly, a hash map can be used.
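As a sketch of that last point, here the leftComm table is a plain dict consulted through a function with a default, so operators can be declared at run time, much as Haskell's infixl/infixr or Prolog's op declarations allow. The (LBP, RBP) pairs standing in for command objects and the stripped-down parsing loop over bare-variable operands are my simplifications.

```python
left_comm = {}                                   # mutable operator table

def declare_binary(op, prec, right_assoc=False):
    # a run-time declaration, in the spirit of Haskell's infixl/infixr
    left_comm[op] = (prec, prec if right_assoc else prec + 1)  # (LBP, RBP)

def lbp(token):
    return left_comm.get(token, (-1, -1))[0]     # default LBP is -1

def parse(tokens):
    pos = [0]                                    # mutable cursor
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else "end"
    def E(p):
        t = tokens[pos[0]]; pos[0] += 1          # NuD: operands are leaves
        while p <= lbp(peek()):
            op = peek(); pos[0] += 1
            t = (op, t, E(left_comm[op][1]))     # LeD of a binary operator
        return t
    return E(0)

declare_binary("+", 20)
declare_binary("*", 30)
declare_binary("^", 50, right_assoc=True)
```

Declaring a new operator later, say declare_binary("%", 25), takes effect on the very next call to parse with no change to the parsing loop.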

Having reduced the number of tables from 3 in Pratt's paper to 2, the next step might be to have one table and only one kind of command; that is, to ensure that every concrete command has all 4 methods (LeD, LBP, NBP, and NuD). I didn't do this, because we might want to vary the tables independently. Consider adding function application syntax x(y) to the language. What does this have to do with parenthesization? Nothing really. Yet if I only had one command class for "(", it would have responsibilities for both parenthesization and function application. The treatment of a token when it's used as an N token and when it's used as an L token are two different responsibilities and shouldn't be in the same class.

Another approach is to do away with the tables and use objects as tokens that have the LeD, LBP, NBP, and NuD methods built into them. The comments of the previous paragraph apply.

As mentioned above, the NBP method is my own addition. It's used for making binary operators --and the like-- nonassociative. If you have no need of it, don't implement it, and change the E procedure to

E(p) is  
    var token := next
    consume
    var t := nullComm[token].NuD( token )
    loop
        token := next
        const c := leftComm[token]
        exit unless p ≤ c.LBP
        consume
        t := c.LeD( t, token )
    return t

 

References

[0] Eli Bendersky, "Top-Down operator precedence parsing", January 2010, http://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing

[1] Andy Chu, "Pratt Parsing and Precedence Climbing Are the Same Algorithm", Nov 2016. http://www.oilshell.org/blog/2016/11/01.html

[2] Andy Chu, "Review of Pratt/TDOP Parsing Tutorials", Nov 2016. http://www.oilshell.org/blog/2016/11/02.html

[3] Andy Chu, "Pratt Parsing Without Prototypal Inheritance, Global Variables, Virtual Dispatch, or Java", Nov 2016. http://www.oilshell.org/blog/2016/11/03.html

[4] Keith Clarke, "The top-down parsing of expressions", Research Report 383, Dept of Computer Science, Queen Mary College, 1986. Archived at http://antlr.org/papers/Clarke-expr-parsing-1986.pdf

[5] Theodore S. Norvell, "Parsing Expressions by Recursive Descent", 1999 (revised 2013), http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm

[6] Vaughan R. Pratt, "Top down operator precedence", Proceedings of the 1st symposium on principles of programming languages (POPL), 1973, http://dl.acm.org/citation.cfm?id=512931

[7] Martin Richards and Colin Whitby-Strevens, BCPL -- the language and its compiler, Cambridge University Press, 1979.

Acknowledgement

Thanks to Andy Chu for pointing out the connection between precedence climbing and Pratt parsing.