An Introduction to Parser Combinators
If you’ve ever had to write a parser before, you know that creating parsers can be a tedious and complicated process. The good news is that it doesn’t have to be this way. In this post, I’m going to introduce parser combinators - a technique for building parsers that I’ve found to be both practical and fun to play around with [1].
But first, a demo of what you can do with parser combinators.
In this post, we’re going to build up parser combinators from first principles. We’ll use them to create a command parser inspired by old-school text adventures. Although we’ll focus on matching simple English phrases, parser combinators can just as easily be used for parsing code, query languages, and more.
This blog post consists of interactive, runnable code cells. Each cell is an ES module - use the `export` keyword to define values that can be accessed by other cells. Other cells can access exported values using the `import` statement.
Table of Contents
- What is A Parser, Exactly?
- Making Things A Little More Abstract
- Matching the End of a Stream
- Fuzzy Matching
- Making Things Happen with `onSuccess`
- Putting it All Together
What is A Parser, Exactly?
The core idea behind parser combinators is first to create very simple parsers. These parsers will only parse a single word and thus are not particularly useful on their own. However, we will then “combine” these simple parsers to create more complex parsers that are actually useful.
To start, we first need to define what a parser actually is. Let’s assume that we start with a stream of tokens [2], which means that we have already run a tokenizer on our input. For a text-adventure, the tokenizer can be really simple - just split the entered command by whitespace [3].
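A whitespace tokenizer like that can be sketched in a few lines (the name `tokenize` is my own choice here):

```javascript
// Split a command on runs of whitespace, dropping the empty strings
// left over by leading or trailing spaces.
export function tokenize(input) {
  return input.trim().split(/\s+/).filter((token) => token.length > 0);
}

// tokenize("take   the book") → ["take", "the", "book"]
```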
For the purposes of this tutorial, we will define a parser as something that takes in a token stream and either accepts it (meaning the parse has succeeded), or rejects it (meaning the parse has failed) [4].
Let’s take a stab at implementing a parser that checks if the first word in a token stream is `"hello"`.
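A minimal version might look like this - at this stage it simply returns a boolean:

```javascript
// Accept the stream only if its very first token is "hello".
export function hello(tokens) {
  return tokens[0] === "hello";
}

// hello(["hello", "world"]) → true
// hello(["goodbye"]) → false
```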
Simple enough. But there are some ways we can improve this.
Right now we can only apply our `hello` parser to the beginning of a token stream. We might want to apply it to later parts of the stream without redefining the parser - for example, to check if the second word in a phrase is `"hello"`. To implement this, we’ll take in an index as an argument that tells us where in the token stream to start.
Secondly, in the case where a parse was successful, it will be useful for the caller to know how many tokens were successfully matched. Instead of just returning a boolean result, we’ll return an object. This object will have both a field that indicates if the parse was successful, and a field that stores what the “next” index in the token stream is (indicating where the parser left off).
Let’s implement these extensions.
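Here is one way those two extensions might look: on failure we only report `success: false`, while on success we also report the `next` index.

```javascript
// The extended "hello" parser: start matching at `index`, and report
// where the next parser should pick up.
export function hello(tokens, index) {
  if (tokens[index] === "hello") {
    return { success: true, next: index + 1 };
  }
  return { success: false };
}

// hello(["say", "hello"], 1) → { success: true, next: 2 }
```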
Okay, we now have a parser for the token `"hello"`. Let’s create one for `"world"` using the exact same structure.
Now let’s say we want to parse the entire phrase `["hello", "world"]`. Instead of creating an entirely new parser from scratch, we can simply use our previously created parsers. All we do is run the `hello` and `world` parsers in sequence, each picking up from the index where the other left off.
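Wired together by hand, that might look like the following (with `hello` and `world` repeated here so the cell stands alone):

```javascript
const hello = (tokens, index) =>
  tokens[index] === "hello" ? { success: true, next: index + 1 } : { success: false };
const world = (tokens, index) =>
  tokens[index] === "world" ? { success: true, next: index + 1 } : { success: false };

// Run `hello` first; if it succeeds, run `world` from where it left off.
export function helloWorld(tokens, index) {
  const first = hello(tokens, index);
  if (!first.success) {
    return { success: false };
  }
  return world(tokens, first.next);
}

// helloWorld(["hello", "world"], 0) → { success: true, next: 2 }
```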
So far so good. The two extensions we made to our parser (taking in an index and returning a new index) have made it possible to directly wire multiple parsers together.
Making Things A Little More Abstract
Now is where things start to get interesting. You may have noticed that our `hello` and `world` parsers have the exact same structure. It turns out all one-word parsers look exactly the same.
Instead of repeating the same code for every one-word parser that we create, we can take this pattern and abstract it away into a constructor. We will create a function called `word` which takes in a word (as a string) and returns a parser that parses only that word.
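Sketched as a higher-order function, `word` closes over the target string and returns a brand-new parser:

```javascript
// Build a parser that matches exactly one given token.
export function word(str) {
  return (tokens, index) =>
    tokens[index] === str
      ? { success: true, next: index + 1 }
      : { success: false };
}

// Our earlier parsers become one-liners:
export const hello = word("hello");
export const world = word("world");
```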
Let’s look at our `helloWorld` parser again. It turns out it also follows a very predictable structure - it runs each parser in sequence, failing early if any of the parsers fail. We can also implement this pattern generically.
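A generic sequencing combinator (I’ll call it `sequence` here) threads the `next` index through each parser in turn:

```javascript
const word = (str) => (tokens, index) =>
  tokens[index] === str ? { success: true, next: index + 1 } : { success: false };

// Run each parser in order, each starting where the previous one
// stopped; fail fast as soon as any parser fails.
export function sequence(...parsers) {
  return (tokens, index) => {
    let next = index;
    for (const parser of parsers) {
      const result = parser(tokens, next);
      if (!result.success) {
        return { success: false };
      }
      next = result.next;
    }
    return { success: true, next };
  };
}

export const helloWorld = sequence(word("hello"), word("world"));
```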
This is progress - we can now write a parser for an entire phrase in a single statement! We’ve defined our first “combinator” - a function that takes in parsers and returns a new, combined parser.
Usually parsers don’t just parse a single phrase. Typically, they can parse one out of many choices. For example, in a text adventure, a player may be able to type one of several distinct commands, such as `look around`.
We can easily parse this by simply creating parsers for each distinct command. We then try each parser until one succeeds. Rather than implementing this pattern from scratch each time, we will define an abstract `anyOf` combinator which implements this generically.
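A sketch of `anyOf`: try each alternative from the same starting index, returning the first success (the command names below are just examples):

```javascript
const word = (str) => (tokens, index) =>
  tokens[index] === str ? { success: true, next: index + 1 } : { success: false };

// Try each parser at the same index; the first success wins.
export function anyOf(...parsers) {
  return (tokens, index) => {
    for (const parser of parsers) {
      const result = parser(tokens, index);
      if (result.success) {
        return result;
      }
    }
    return { success: false };
  };
}

// Example verbs - placeholders, not a fixed command set.
export const verb = anyOf(word("look"), word("take"), word("push"));
```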
Imagine, in our text-adventure scenario, the player wants to refer to an object. They might type either `book` or `the book` - the article here is optional. We want to be able to handle both of these cases. To do this, we will define an optional combinator. This combinator is simple - it returns a new parser that optionally matches a given parser.
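`optional` is arguably the simplest combinator of all - on failure it still succeeds, just without consuming any tokens:

```javascript
const word = (str) => (tokens, index) =>
  tokens[index] === str ? { success: true, next: index + 1 } : { success: false };

// Succeed whether or not the inner parser matches; on a miss, stay put.
export function optional(parser) {
  return (tokens, index) => {
    const result = parser(tokens, index);
    return result.success ? result : { success: true, next: index };
  };
}

export const article = optional(word("the"));

// article(["the", "book"], 0) → { success: true, next: 1 }
// article(["book"], 0) → { success: true, next: 0 }
```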
Matching the End of a Stream
Right now, all of our parsers ignore tokens after the matched phrase. In some cases we want to match only if our phrase is the only thing in the input. To do this, we will create a parser that matches the end of the token sequence.
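Such an end-of-stream parser consumes nothing - it simply checks that the index has reached the end:

```javascript
// Succeed only when every token has already been consumed.
export function end(tokens, index) {
  return index === tokens.length
    ? { success: true, next: index }
    : { success: false };
}
```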
Fuzzy Matching

One way we can extend our parser is to allow for misspellings of long words. We can do this using “fuzzy” matching. The central idea is that, if a given token is “close enough” to the desired word, we will accept it. We will measure “close enough” using something called Levenshtein edit distance, which counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one word into another.
I won’t go into detail about implementing Levenshtein here. If you want to know more, the Wikipedia page is a good place to start.
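For completeness, here is a compact dynamic-programming Levenshtein alongside a fuzzy variant of `word`. The `maxDistance` threshold is a choice I’m making for illustration - scaling it with word length is another reasonable option:

```javascript
// Classic dynamic-programming Levenshtein distance.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// A fuzzy one-word parser: accept the token if it is within
// `maxDistance` edits of the target word.
export function fuzzyWord(str, maxDistance = 1) {
  return (tokens, index) =>
    tokens[index] !== undefined && levenshtein(tokens[index], str) <= maxDistance
      ? { success: true, next: index + 1 }
      : { success: false };
}
```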
Making Things Happen with `onSuccess`
Before we can return to our text-adventure example, we have one last thing we need to implement. For a game, we don’t just want to return `true` or `false` from our parsers - we actually need to perform actions in response to certain commands.
We can implement this pattern using a simple function called `onSuccess` that wraps a parser and invokes a callback when that parser succeeds.
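`onSuccess` is a thin wrapper - it forwards the parse result unchanged and fires the callback only on success:

```javascript
// Wrap a parser with a side effect: when the wrapped parser succeeds,
// run the callback before returning the (unmodified) result.
export function onSuccess(parser, callback) {
  return (tokens, index) => {
    const result = parser(tokens, index);
    if (result.success) {
      callback();
    }
    return result;
  };
}
```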
Putting it All Together
Now that we have all the pieces, let’s return to our text-adventure example. We’ll build a very simple room with some objects and navigational commands.
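Here is a condensed sketch of what that room might look like, with the combinators from earlier repeated inline so the cell is self-contained. The room description, the `book` object, and the `makeGame`/`print` wiring are all invented for illustration:

```javascript
// The combinators from above, condensed.
const word = (str) => (tokens, i) =>
  tokens[i] === str ? { success: true, next: i + 1 } : { success: false };
const sequence = (...ps) => (tokens, i) => {
  let next = i;
  for (const p of ps) {
    const r = p(tokens, next);
    if (!r.success) return { success: false };
    next = r.next;
  }
  return { success: true, next };
};
const anyOf = (...ps) => (tokens, i) => {
  for (const p of ps) {
    const r = p(tokens, i);
    if (r.success) return r;
  }
  return { success: false };
};
const optional = (p) => (tokens, i) => {
  const r = p(tokens, i);
  return r.success ? r : { success: true, next: i };
};
const end = (tokens, i) =>
  i === tokens.length ? { success: true, next: i } : { success: false };
const onSuccess = (p, cb) => (tokens, i) => {
  const r = p(tokens, i);
  if (r.success) cb();
  return r;
};

// A tiny room with one object and two commands.
export function makeGame(print) {
  const lookCommand = onSuccess(
    sequence(word("look"), word("around"), end),
    () => print("You are in a dusty room. A book lies on a table.")
  );
  const takeCommand = onSuccess(
    sequence(word("take"), optional(word("the")), word("book"), end),
    () => print("You take the book.")
  );
  const command = anyOf(lookCommand, takeCommand);
  // Tokenize the raw input and return whether any command matched.
  return (input) => command(input.trim().split(/\s+/), 0).success;
}
```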
This post covers the basics of parser combinators, but there are many more aspects to explore. In future posts, I’ll show how we can parse real programming languages. In particular, we will look at:
- Creating abstract syntax trees (ASTs).
- Reporting human-readable errors when parsing fails.
- Handling infix operator precedence.
- Limiting backtracking to improve performance.
1. I’ve used parser combinators previously in my text-adventure-inspired game FeedVid Live, as well as in a parser for StarCraft II’s Galaxy scripting language.
2. Parser combinators don’t actually need to be used with tokenizers - they can operate directly on streams of characters - but the code is much cleaner if we operate over streams of tokens.
3. Tokenizers are themselves an interesting topic worthy of their own blog post.
4. This is a simplified notion of what a parser is - at this point it’s really just a “matcher.” Usually parsers construct and return some kind of result representing the data that they parsed.