Taming Left Recursion in PEG Parsers: A Practical Guide
This article explains why left‑recursive rules break recursive‑descent parsers, demonstrates how naïve grammar rewrites alter parse trees, and introduces an oracle‑based memoization technique with a custom @memoize_left_rec decorator to correctly handle left recursion in PEG parsers.
Left recursion is a classic obstacle for recursive‑descent parsers because it causes stack overflow. A typical left‑recursive rule such as expr: expr '+' term | term cannot be translated directly into a recursive‑descent function.
The traditional remedy is to rewrite the grammar, for example to expr: term '+' expr | term , but this changes the shape of the parse tree and can break associativity when operators like ‘-’ are added.
PEG parsers offer stronger features; the rule can be expressed as expr: term ('+' term)* , which matches Python’s current grammar. However, this still requires post‑processing to build a left‑associative tree for binary operators.
To preserve the original left‑recursive structure, an “oracle” function is introduced. The oracle decides whether to take the left‑recursive alternative or the term alternative based on the current recursion depth compared to the number of operators remaining.
<code>def expr():
if oracle() and expr() and expect('+') and term():
return True
if term():
return True
return False</code>A global variable saved_result stores the most recent successful parse, and the oracle_expr() function returns this cached result on subsequent calls.
The core of the solution is the @memoize_left_rec decorator. It primes the memo cache with a failure, then repeatedly calls the undecorated function, updating the cache with longer results until no longer improvement is possible.
<code>def memoize_left_rec(func):
def wrapper(self, *args):
pos = self.mark()
memo = self.memos.get(pos) or {}
key = (func, args)
if key in memo:
res, endpos = memo[key]
self.reset(endpos)
else:
memo[key] = (None, pos) # prime with failure
while True:
self.reset(pos)
res = func(self, *args)
endpos = self.mark()
if endpos <= memo[key][1]:
break
memo[key] = (res, endpos)
return res
return wrapper</code>When parsing the expression foo + bar + baz , the decorator first returns foo , then iteratively expands to foo + bar , and finally to (foo + bar) + baz , each time caching the longer result. Once no further ‘+’ is found, the loop stops and the longest left‑associative tree is returned.
The article concludes that left recursion can be successfully tamed in a PEG‑ish parser using this oracle‑driven memoization approach, and hints at future work adding actions to customize node construction.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.