From recursive descent to LR parsing
December 31, 2021
Recursive descent parsers are attractive for their simplicity. A collection of functions, one for each part of your language, looks at the next token to decide what to parse next. The structure and control flow of the program matches the structure of the grammar, so you can write a parser by hand in your favorite language and use familiar tools for testing and debugging.
Infix expressions are a common pain point for recursive descent parsers, but there is a common solution that mostly preserves this direct style: Pratt parsing or “topdown operator precedence.” My goal here is to take your familiarity with recursive descent and Pratt parsing to build an intuition for the moregeneral but oftenmysterious LR parsing.
If you’re new to Pratt parsing, I’ll cover some of it in this post, but here are some introductions I’ve appreciated:
Parser control flow and pushdown automata
There are two important kinds of control flow in a recursive descent parser:
 Within a function, the sequence of checks, branches, and loops corresponds to the righthand sides of the rules in the language’s grammar.
 Calls between functions correspond to occurrences of nonterminal symbols (symbols on the lefthand sides of rules) in the language’s grammar.
For example, consider this grammar (extended with regular expressions, just for fun) for JSON numbers and arrays, and pseudo Rust code for a recursive descent parser:
value = NUMBER
 array
array = "[" (value ("," value)*)? "]"
fn value() > Json {
match peek() {
NUMBER => { Json::Number(bump(NUMBER)) }
L_BRACKET => { array() }
_ => { /* Unexpected token */ }
}
}
fn array() > Json {
let mut values = vec![];
bump(L_BRACKET);
while peek() != R_BRACKET {
values.push(value());
if !eat(COMMA) { break; }
}
bump(R_BRACKET);
Json::Array(values)
}
We can also express this more abstractly as a pushdown automaton, or a finite state machine whose edges can match input symbols and push and pop elements from a stack. There are a lot of ways to construct a PDA for a grammar, but our approach here will mirror the control flow and call stack of our recursive descent pseudocode.
To begin with, here are the functionlocal control flow graphs:
Solid edges match input symbols, while dashed edges match nonterminals. Final states have a double border. To turn this into a PDA, we’ll need to convert the dashed edges into explicit stack operations. Edges are now labeled x ; A / B
, to read an input symbol x
, pop a stack symbol A
, and push a stack symbol B
, with ϵ (“epsilon”) standing for the empty string. Parsing is complete when we reach a final state and have an empty stack.
This is a bit of a rat’s nest, but if you pick it apart you can see that any state at the source of a dashed edge now has a new edge to push a symbol and transition to the called function, and the end of that function has a new edge to pop the symbol and return to the target of the dashed edge.
For example, parsing the input [1, [], 3]
looks like this:
Input  Stack  Action 

[1, [], 3] 
Push V , read [ 

1, [], 3] 
V 
Push A , read 1 
, [], 3] 
A V 
Pop A , read , 
[], 3] 
V 
Push A , push V , read [ 
], 3] 
V A V 
Read ] 
, 3] 
V A V 
Pop V , pop A , read , 
3] 
V 
Push A , read 3 
] 
A V 
Pop A , read ] 
V 
Pop V , done 
In the general case, a function might be called from multiple places. That’s why we have multiple distinct stack symbols—they tell us where to return to, just like the return address on the call stack of the recursive descent parser.
Pratt parsing
Pratt parsing solves two distinct problems faced by recursive descent parsers:
 Naive recursive descent doesn’t support left recursion (rules which start with their own nonterminal symbol, like
expr = expr "+" NUMBER
), which are the most natural way to describe infix operators.  Naive recursive descent needs to know which rule it is parsing with only one token of lookahead, but infix operators don’t appear until arbitrarily far into the expression.
I say “naive” because Pratt parsing isn’t the only place these problems are solved—most recursive descent parsers deal with them somehow. It’s just a nice illustrative example where they’re all solved in one place.
In this JSON PDA, we can already see signs of the first problem: the automaton must take some number of ϵ edges that just push or pop stack symbols (in recursive descent this means calling or returning from functions) before seeing all the possible inputs it needs to consider. For example, from the first state it may immediately read a NUMBER
, or it may push a V
before reading a "["
.
This is not actually an issue for JSON, where this extra exploration is limited enough that we can hardcode it at the call sites of array
and value
by mechanically following those ϵ edges ahead of time. But consider this grammar and PDA for leftassociative addition expressions:
expr = expr "+" NUMBER
 NUMBER
Using the same approach of mechanically following the ϵ edges to find a set of tokens to read next, we can see that the first state must read a NUMBER
, and the last state must read a "+"
or the end of the input. And indeed, we expect an alternating sequence of NUMBER
and "+"
tokens. This PDA even makes it obvious that the sequence must start with a NUMBER
.
However, we no longer have enough information to determine what to do with the stack. Seeing a NUMBER
might mean we should parse the expr = NUMBER
rule, or it might mean we should push an E
to start parsing the expr = expr "+" NUMBER
rule. And if we pick the second option, we’re back where we started and need to make the same decision again! For this to work out, the number of E
symbols we should push depends on the number of unprocessed "+"
tokens in the input.
In these terms, Pratt parsing tackles this problem by relaxing the association between the parser’s call stack and the PDA’s stack. We can’t know up front how many calls to make, so instead we’ll just pretend we made exactly the right number and then discover what that number was as we go—each time we parse a complete expr
, if we see another "+"
, it’s as if we have just returned from one more expr()
call and are ready to continue the expr = expr "+" NUMBER
rule:
fn expr() > Expr {
// Make some unknown number of imaginary "calls" to `expr()`, then parse the
// `expr = NUMBER` rule.
let mut lhs = Expr::Number(bump(NUMBER));
// Each time there is a `"+"`, "return" from one more imaginary "call," then
// continue parsing a `expr = expr "+" NUMBER` one level up.
while eat(PLUS) {
let rhs = Expr::Number(bump(NUMBER));
lhs = Expr::Add(lhs, rhs);
}
// When there are no more `"+"` tokens, return from the toplevel "call."
// Depending on how many `"+"` tokens we read, the result may be either a
// number or an add expression.
lhs
}
Despite this stack trickery, the resulting pseudocode looks just like what we would write from scratch, without thinking about PDAs.
In general, Pratt parsing supports multiple operators. If they have different precedence levels, we can handle that using actual function calls like normal. But at the same precedence level, we run into the second problem—consider what happens when we add subtraction to the grammar:
expr = expr "+" NUMBER
 expr "" NUMBER
 NUMBER
Not only do we need to take an arbitrary number of ϵ edges up front, we also need to decide for each one whether it corresponds to addition or subtraction, so the automaton can return to the right place.
Pratt parsing resolves this by exploiting the fact that, while the two rules share a common prefix (which is why the PDA needs to choose P
vs M
up front), they diverge in the middle with different operators. We can use the operator to decide which rule to “return” into:
fn expr() > Expr {
let mut lhs = Expr::Number(bump(NUMBER));
loop {
match peek() {
PLUS => {
bump(PLUS);
let rhs = Expr::Number(bump(NUMBER));
lhs = Expr::Add(lhs, rhs);
}
MINUS => {
bump(MINUS);
let rhs = Expr::Number(bump(NUMBER));
lhs = Expr::Sub(lhs, rhs);
}
_ => { break }
}
}
lhs
}
The full version of Pratt parsing uses two more tricks, which make it nicer to work with but aren’t directly relevant to this post. A quick summary for completeness:
 Infix operator rules have a common suffix as well as a common prefix, so Pratt parsing can combine them instead of handling them separately peroperator.
 Naive recursive descent would use a separate function for each operator precedence level, but they all use the same control flow, so Pratt parsing combines them into one function with a parameter for the current precedence.
LR parsing
PDAs may or may not be the most intuitive way to think about Pratt parsing, but they are a good way to understand LR parsing, which gets its power from the same two tricks! In fact, I would argue that the name “topdown operator precedence” is a misnomer, and that Pratt parsing is actually bottomup just like LR: it parses the leaves of the AST first, then combines them into larger and larger trees.
LR parsing is usually presented as a tablebased algorithm, constructing a PDA similar to ours ahead of time, then parsing by looking up its edges based on the current state and input token. It’s also possible to generate code that runs the PDA directly, which is sometimes called recursive ascent. (I tried writing one of these by hand once.) This is one of the benefits of thinking in terms of PDAs—you can apply what you figure out to multiple implementation styles.
There are three major differences between the PDAs we constructed above (which mirror the control flow of recursive descent) and the automata used for LR parsing. One is a tweak to how the stack is used, and the other two are fairly standard state machine transformations that implement the Pratt parsing tricks.
The LR stack
Above, we used the PDA stack to keep track of where to continue after finishing a rule, with an edge back to each call site. LR uses a different scheme, which is simpler in some ways but doesn’t match recursive descent as closely.
Instead of percall site return edges, an LR automaton pushes each state it enters onto the stack, and when it reaches the end of a rule it pops the group of states from that rule all at once, in a “reduce” action. This means rules need to have fixed lengths so we know how many states to pop, but it also means the edge labels no longer need to include stack operations. We also bring back the nonterminal edges, so we know where to go after reduce actions.
The automaton for the JSON grammar (now lightly refactored to give the rules fixed lengths) looks like this:
value = NUMBER
 array
array = "[" elems "]"
 "[" "]"
elems = elems "," value
 value
At this point, the dashed ϵ edges still match recursive descent’s function calls, which has one subtle complication: to make the fixedlength rule scheme work, you would need to avoid pushing states reached via those ϵ edges. The actual LR algorithm doesn’t need to worry about this, though, because of how it implements the first Pratt parsing trick:
ϵ closure
The first trick was to skip “call” edges and rely only on the “return” edges (or reduce actions) edges to keep track of what rule we’re in. A finite state machine with ϵ edges can always be converted to an equivalent state machine without them, which will bake this trick directly into the automaton.
This is done by finding the “ϵ closure” of each state, or the set of all states reachable by following ϵ edges. Then we can introduce shortcut edges past them like this:
This is, again, a bit of a rat’s nest. The main thing to notice here is that, except for the value
rule, all the rules’ initial states have disappeared. We skip past them every time we enter a new rule, directly from its “call” site.
Determinization
The second trick was to merge common prefixes between rules. A nondeterministic finite state machine, or one with multiple edges leaving a state with the same symbol, can always be converted to an equivalent deterministic state machine, which again will bake this trick directly into the automaton.
This is done by introducing a new state for each combination of states that the nondeterministic state machine might be in at the same time. For our JSON grammar, this actually shrinks the automaton a bit:
The result in this case is still probably too messy to be worth writing by hand, and in my opinion these two transformations are what make LR parsers so opaque—this state machine no longer corresponds directly to the grammar. On the other hand, in the world of PDAs they are the same transformations we used for Pratt parsing, where applying them selectively is a powerful tool.
LR(k) parsing
What we’ve built so far is an LR(0) automaton—one that uses 0 tokens of lookahead to determine whether to reduce. This is sufficient for the grammars we have seen so far, but it is not enough to handle operator precedence. Let’s take a look at one final grammar, and the automaton we get from ϵ closure and determinization:
expr = expr "+" term
 term
term = term "*" NUMBER
 NUMBER
Something funny is going on with the final state with the two incoming term
edges. Imagine parsing the expression 1 + 1
. After taking the first NUMBER
edge, we reduce back to the first state and take the term
edge. Without lookahead, we have no way of knowing whether we are in the middle of a term = term "*" NUMBER
rule, or have just finished an expr = term
rule!
This is called a “shift/reduce” conflict—an ambiguity between reading more input (which LR calls “shifting” a token) or reducing the state stack. LR’s clever approach to the PDA stack has come back to bite us. What we want to do, and what Pratt parsing does, is peek at the next token when faced with this choice. If it is a "*"
, then we are in the middle of a term = term "*" NUMBER
rule and we continue at the same precedence level. Otherwise, it must be a "+"
and we can “return” up to an expr = expr "+" term
rule.
There is a lot of variety in how LR parsers pull this off, which is outside the scope of this post. So here are some places you might start if you want to learn more:
“Canonical” LR(k) parsing resolves this conflict by tracking which strings of k tokens we expect after each “call” to a rule. (k is almost always 1, and in fact any grammar that LR(k) supports can be made to work with LR(1). In the other direction, there is an extension called “LRregular” where these strings are generalized into regular languages!) Each “call” with a different set of expected lookahead tokens gets its own separate copy of the states for the associated rules, which can propagate down through any additional “calls.”
All these copies can lead to a much larger state machine, so there are several flavors of LR(k) that give up some of this disambiguation power to avoid making quite so many copies. “Simple LR” or “SLR” does not copy any states, and disambiguates purely based on whether the lookahead token can follow the rule’s nonterminal anywhere in the grammar. “Lookahead LR” or “LALR” is slightly more precise, and propagates the same information as Canonical LR but merges lookahead information rather than copying states. There are also algorithms like “lane tables” or “IELR” that can detect when copying will help, and fall back to LALR otherwise.
Yet another approach is to accept these ambiguities in the automaton, and process them all. This class of algorithms can even parse ambiguous grammars, producing a parse “forest” instead of a tree.
I hope this made sense! Thinking in terms of abstract state machines like this is a great tool for understanding and refactoring all kinds of programs at a high level.