Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ericseppanen/json-parser-toy
Let's write a parser in Rust!
https://github.com/ericseppanen/json-parser-toy
Last synced: 3 months ago
JSON representation
Let's write a parser in Rust!
- Host: GitHub
- URL: https://github.com/ericseppanen/json-parser-toy
- Owner: ericseppanen
- Created: 2020-05-14T05:56:32.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-01-06T23:58:49.000Z (10 months ago)
- Last Synced: 2024-06-30T14:47:31.289Z (5 months ago)
- Language: Rust
- Size: 82 KB
- Stars: 28
- Watchers: 1
- Forks: 1
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Let's build a parser!
[*Read the original blog post at* **codeandbitters**.](https://codeandbitters.com/lets-build-a-parser/)
> Updated 10/2021 to use `nom 7.0`!
This is a demonstration of building a parser in Rust using the
[`nom`](https://docs.rs/nom/) crate. I recently built a parser for
the [`cddl-cat`](https://docs.rs/cddl-cat/) crate using nom,
and I found it a surprisingly pleasant experience, much better than my past
experiences with other parser-generators in other languages.Since I like Rust a lot, and I need an excuse to do more writing about Rust, I
thought I'd do another demonstration project. I decided to choose a simple
syntax, to keep this a short project. So I'm going to build a parser for JSON.There are a million JSON parsers in the world already, so I don't expect this
code to have much non-educational value. But, hey, you never know.All of the source code and markdown source for this post is
[available on GitHub](https://github.com/ericseppanen/json-parser-toy). If you
see anything wrong, please let me know by raising an issue there.## Part 1. Introduction.
A few details, before I write the first lines of code:
1. I'm going to use [RFC8259](https://tools.ietf.org/html/rfc8259) as my
authoritative reference for the JSON grammar.
2. I'm not going to build a JSON serializer. My goal will only be to consume
JSON text and output a structured tree containing the data (a lot like
[`serde_json::Value`](https://docs.serde.rs/serde_json/value/enum.Value.html) ).
3. I'll be using [`nom` 7.0](https://docs.rs/nom/7.0/nom/). I'll try to keep
this post updated when new major versions are released.
4. Some of the code I write will violate the usual `rustfmt` style. This isn't
because I hate `rustfmt`; far from it! But as you'll see, `nom` code can look a
little weird, so it's sometimes more readable if we bend the styling rules a
little bit. Do what you like in your own code.
5. All of my source code will be
[available on GitHub](https://github.com/ericseppanen/json-parser-toy). If you
have comments or suggestions, or see a bug or something wrong in this post,
please open an issue there.Let's start with a few words about `nom`. It can take a little bit of time to
adjust to writing a parser with `nom`, because it doesn't work by first
tokenizing the input and then parsing those tokens. Both of those steps can be
tackled at once.Older versions of `nom` used a lot of macros. Starting with `nom 7.0`, the
macros are gone, and the only way to use nom is with the function combinators.
This is a nice change, because while `nom` combinators can be tricky, the
function-based style is a lot friendlier to work with than the old macros.A bit of advice for reading the
[`nom` documentation](https://docs.rs/nom/7.0.0/nom/), if you're following
along with this implementation:
- Start from the [modules](https://docs.rs/nom/7.0.0/nom/#modules) section of
the documentation.
- We'll be starting with the
[character](https://docs.rs/nom/7.0.0/nom/character/index.html) and
[number](https://docs.rs/nom/7.0.0/nom/number/index.html) modules.
- We'll use the
[combinator](https://docs.rs/nom/7.0.0/nom/combinator/index.html),
[multi](https://docs.rs/nom/7.0.0/nom/multi/index.html),
[sequence](https://docs.rs/nom/7.0.0/nom/sequence/index.html),
and [branch](https://docs.rs/nom/7.0.0/nom/branch/index.html) modules to tie
things together. I'll try to link to the relevant documentation as we go.## Part 2. Our first bit of parser code.
I've started a new library project (`cargo init --lib json-parser-toy`), and
added the `nom 7.0` dependency in `Cargo.toml`. Let's add a very simple parser
function, just to verify that we can build and test our code. We'll try to
parse the strings "true" and "false". In other words, the grammar for our json
subset is:```txt
value = "false" / "true"
```Here's our first bit of code:
```rust
use nom::{branch::alt, bytes::complete::tag, IResult};fn json_bool(input: &str) -> IResult<&str, &str> {
alt((
tag("false"),
tag("true")
))
(input)
}#[test]
fn test_bool() {
assert_eq!(json_bool("false"), Ok(("", "false")));
assert_eq!(json_bool("true"), Ok(("", "true")));
assert!(json_bool("foo").is_err());
}
```I got the [`tag`](https://docs.rs/nom/7.0.0/nom/bytes/complete/fn.tag.html)
function from `nom::bytes`, though it's not specific to byte-arrays; it works
just fine with text strings as well. It's not a big deal; it's just a minor
quirk of the way `nom` is organized.We use [`alt`](https://docs.rs/nom/7.0.0/nom/branch/fn.alt.html) to express
"one of these choices". This is a common style in `nom`, and we'll see it
again when we use other combinators from `nom::sequence`.There are a few other things that should be explained.
[`IResult`](https://docs.rs/nom/7.0.0/nom/type.IResult.html) is an important
part of working with `nom`. It's a specialized `Result`, where an `Ok` always
returns a tuple of two values. In this case, `IResult<&str, &str>` returns two
string slices. The first is the "remainder": this is everything that wasn't
parsed. The second part is the output from a successful parse; in this case we
just return the string we matched. For example, I could add this to my test,
and it would work:```rust
assert_eq!(json_bool("false more"), Ok((" more", "false")));
```The `json_bool` function consumed the `false` part of the string, and left the
rest for somebody else to deal with.When `json_bool` returns an error, that doesn't necessarily mean that something
is wrong. Our top-level parser isn't going to give up. It just means that
this particular bit of grammar didn't match. Depending on how we write our
code, other parser functions might be called instead. You can actually see
this in action if you look at how the `alt` combinator works. It first calls a
parser function `tag("false")`, and if that returns an error, it instead feeds
the same input into `tag("true")`, to see if it might succeed instead.This probably still looks kind of strange, because `tag("false")` isn't a
complete parser function; it's a function that returns a parser function. See
how our code calls `alt` and `tag` (twice)? The return value from that code is
another function, and that function gets called with the argument `(input)`.Don't be scared off by the intimidating-looking parameters of the `tag`
function in the documentation— look at the
[examples](https://docs.rs/nom/7.0.0/nom/bytes/complete/fn.tag.html#example).
Despite the extra layer of indirection, it's still pretty easy to use.## Part 3. Returning structs.
We don't want to just return the strings that we matched; we want to return
some Rust structs that we can put into a tree form.We could copy the previous function to add another simple JSON element:
```rust
fn json_null(input: &str) -> IResult<&str, &str> {
tag("null")
(input)
}
```That would work, but let's rewrite our two parser functions to return enums or
structs instead.
```rust
use nom::combinator::map;#[derive(PartialEq, Debug)]
pub enum JsonBool {
False,
True,
}#[derive(PartialEq, Debug)]
pub struct JsonNull {}fn json_bool(input: &str) -> IResult<&str, JsonBool> {
let parser = alt((
tag("false"),
tag("true")
));
map(parser, |s| {
match s {
"false" => JsonBool::False,
"true" => JsonBool::True,
_ => unreachable!(),
}
})
(input)
}fn json_null(input: &str) -> IResult<&str, JsonNull> {
map(tag("null"), |_| JsonNull {})
(input)
}#[test]
fn test_bool() {
assert_eq!(json_bool("false"), Ok(("", JsonBool::False)));
assert_eq!(json_bool("true"), Ok(("", JsonBool::True)));
assert!(json_bool("foo").is_err());
}#[test]
fn test_null() {
assert_eq!(json_null("null"), Ok(("", JsonNull {})));
}```
First, notice that the parser functions' return value has changed. The first
part of the `IResult` tuple is still the remainder, so it's still `&str`. But
the second part now returns one of our new data structures.To change the return value, we use `nom`'s
[`map`](https://docs.rs/nom/7.0.0/nom/combinator/fn.map.html) combinator
function. It allows us to apply a closure to convert the matched string into
something else: in the `json_bool` case, one of the `JsonBool` variants. You
will probably smell something funny about that code, though: we already matched
the `"true"` and `"false"` strings once in the parser generated by the `tag`
function, so why are we doing it again? Your instincts are right on— we should
probably back up and fix that, but let's wrap up this discussion first.The `json_null` function does almost exactly the same thing, though it doesn't
need a `match` because it could only have matched one thing.We need to derive `PartialEq` and `Debug` for our structs and enums so that the
`assert_eq!` will work. Our tests are now using the new data structures
`JsonBool` and `JsonNull`.## Part 4. Another way of doing the same thing.
In `nom`, there are often multiple ways of achieving the same goal. In our
case, `map` is a little bit overkill for this use case. Let's instead use the
[`value`](https://docs.rs/nom/7.0.0/nom/combinator/fn.value.html) combinator
instead, which is specialized for the case where we only care that the child
parser succeeded.We'll also refactor `json_bool` so that we don't need to do extra work: we'll
apply our combinator a little earlier, before we lose track of which branch
we're on.```rust
use nom::combinator::value;#[derive(PartialEq, Debug, Clone, Copy)]
pub enum JsonBool {
False,
True,
}#[derive(PartialEq, Debug, Clone, Copy)]
pub struct JsonNull {}fn json_bool(input: &str) -> IResult<&str, JsonBool> {
alt((
value(JsonBool::False, tag("false")),
value(JsonBool::True, tag("true")),
))
(input)
}fn json_null(input: &str) -> IResult<&str, JsonNull> {
value(JsonNull {}, tag("null"))
(input)
}
```Hopefully this is pretty straightforward. The `value` combinator returns its
first argument (e.g. `JsonNull {}`), if the second argument succeeds
(`tag("null")`). That description is a bit of a lazy mental shortcut,
because `value` doesn't do any parsing itself. Remember, it's a function that
consumes one parser function and returns another parser function. But because
`nom` makes things so easy, it's sometimes a lot easier to use the lazy way of
thinking when you're plugging combinators together like Lego bricks.Note that I added `Clone` to the data structures, because `value` requires it.
I also added `Copy` because these are trivially small structs & enums.## Part 5. Prepare to tree.
Our final output should be some tree-like data structure, similar to
[`serde_json::Value`](https://docs.serde.rs/serde_json/value/enum.Value.html).
I'm partial to the word "node" to describe the parts of a tree, so let's start
here:```rust
pub enum Node {
Null(JsonNull),
Bool(JsonBool),
}
```Right away, I don't like where this is going. Here are all the things I'm
unhappy with:1. The redundant naming. I have `Node::Null` and `JsonNull`, for a value that
contains no additional data.
2. The null and bool types don't really seem like they need their own data
structure name, outside of the tree node. If this were a complex value type
that I might want to pass around on its own, sure. But for this simple case, I
think this is a lot simpler:```rust
#[derive(PartialEq, Debug, Clone)]
pub enum Node {
Null,
Bool(bool),
}fn json_bool(input: &str) -> IResult<&str, Node> {
alt((
value(Node::Bool(false), tag("false")),
value(Node::Bool(true), tag("true")),
))
(input)
}fn json_null(input: &str) -> IResult<&str, Node> {
value(Node::Null, tag("null"))
(input)
}#[test]
fn test_bool() {
assert_eq!(json_bool("false"), Ok(("", Node::Bool(false))));
assert_eq!(json_bool("true"), Ok(("", Node::Bool(true))));
assert!(json_bool("foo").is_err());
}#[test]
fn test_null() {
assert_eq!(json_null("null"), Ok(("", Node::Null)));
}
```We got rid of JsonNull and JsonBool entirely. For your parser you can choose
any output structure that makes sense; different grammars have different
properties, and they may not map easily onto Rust's prelude types.## Part 6. Parsing numbers is hard.
The other remaining literal types in JSON are strings and numbers. Let's
tackle numbers first. Referring to
[RFC8259](https://tools.ietf.org/html/rfc8259), the grammar for a JSON number
is:```txt
number = [ minus ] int [ frac ] [ exp ]decimal-point = %x2E ; .
digit1-9 = %x31-39 ; 1-9
e = %x65 / %x45 ; e E
exp = e [ minus / plus ] 1*DIGIT
frac = decimal-point 1*DIGIT
int = zero / ( digit1-9 *DIGIT )
minus = %x2D ; -
plus = %x2B ; +
zero = %x30 ; 0
```That grammar can represent any integer or floating point value; it would be
grammatically correct to have an integer a thousand digits long, or a floating
point value with huge exponent. It's our decision how to handle these values.JSON (like JavaScript) is a bit unusual in not distinguishing integers from
floating-point values. To make this tutorial a little more widely useful,
let's output integers and floats as separate types:```rust
pub enum Node {
Null,
Bool(bool),
Integer(i64),
Float(f64),
}
```We'll need to do something when we encounter values that are grammatically
correct (e.g. 1000 digits), that we can't handle. This is a common problem,
since most grammars don't attempt to set limits on the size of numbers. Often
there will be a limit set somewhere, but it's not part of the formal grammar.
JSON doesn't set such limits, which can lead to compatibility problems between
implementations.It will be important in most parsers to set limits and make sure things fail
gracefully. In Rust you're not likely to have problems with buffer overruns,
but it might be possible to trigger a denial of service, or perhaps even a
crash by triggering excessive recursion.Let's start by making the parser functions we need, and we'll see where we need
error handling.Let's build a little helper function for the `digit1-9` part, since `nom` only
offers `digit`, which includes `0-9`.```rust
fn digit1to9(input: &str) -> IResult<&str, &str> {
one_of("123456789")
(input)
}
```Unfortunately, it doesn't compile:
```txt
error[E0308]: mismatched types
--> src/lib.rs:21:5
|
21 | / one_of("123456789")
22 | | (input)
| |___________^ expected `&str`, found `char`
|
= note: expected enum `std::result::Result<(&str, &str), nom::internal::Err<(&str, nom::error::ErrorKind)>>`
found enum `std::result::Result<(&str, char), nom::internal::Err<_>>`
```This is a pretty easy mistake to make— we tried to create a parser function
that returns a string slice, but it's returning `char` instead, because, well,
that's how `one_of` works. It's not a big problem for us; just fix the return
type to match:```rust
fn digit1to9(input: &str) -> IResult<&str, char> {
one_of("123456789")
(input)
}
```We can now build the next function, one that recognizes integers:
```rust
fn uint(input: &str) -> IResult<&str, &str> {
alt((
tag("0"),
recognize(
pair(
digit1to9,
digit0
)
)
))
(input)
}
```Again, we use `alt` to specify that an integer is either `0`, or a nonzero
digit, possibly followed by more additional digits.The new combinator here is `recognize`. Let's back up and look at the return
type of this hypothetical function:```rust
fn nonzero_integer(input: &str) -> IResult<&str, ____> {
pair(
digit1to9,
digit0
)
(input)
}
```Because we used `pair`, the return type would be a 2-tuple. The first element
would be a `char` (because that's what we returned from `digit1to9`), and the
other element would be a `&str`. So the blank above would be filled in like
this:```rust
fn nonzero_integer(input: &str) -> IResult<&str, (char, &str)> {
...
}
```In this context, not very helpful. What we'd like to say is, "match this bunch
of stuff, but just return the string slice that covers what we matched."
That's exactly what `recognize` does.Because we're going to store integers in a different `Node` variant, we should
also do one last call to `map`. But that might make life difficult if we want
to re-use this code as part of a token that's representing a floating-point
number.So let's leave the `uint` function alone; we'll use it as a building block of
another function.Note also that we can't finish parsing an integer until we've consumed the
optional leading "minus" symbol.```rust
fn json_integer(input: &str) -> IResult<&str, &str> {
recognize(
pair(
opt(tag("-")),
uint
)
)
(input)
}
```The `opt` function is another `nom` combinator; it means "optional", and
unsurprisingly it will return an `Option` where `T` in this case is `&str`
(because that's what `tag("-")` will returns. But that return type is ignored;
`recognize` will throw it away and just give us back the characters that were
consumed by the successful match.Let's add one more step to our function: convert the resulting string into a
`Node::Integer`.```rust
fn json_integer(input: &str) -> IResult<&str, Node> {
let parser = recognize(
pair(
opt(tag("-")),
uint
)
);
map(parser, |s| {
// FIXME: unwrap() may panic if the value is out of range
let n = s.parse::().unwrap();
Node::Integer(n)
})
(input)
}
```Finally, we discover a point where we'll need some error handling.
[`str::parse`](https://doc.rust-lang.org/std/primitive.str.html#method.parse)
returns a `Result`, and will certainly return `Err` if we try to parse
something too big.I am going to leave proper error handling until the end, so for now I will just
`unwrap` the result. This means the parser will panic if we give it a huge
integer, so we definitely need to come back and fix this later.For now we'll finish up this section with a few unit tests:
```rust
#[test]
fn test_integer() {
assert_eq!(json_integer("42"), Ok(("", Node::Integer(42))));
assert_eq!(json_integer("-123"), Ok(("", Node::Integer(-123))));
assert_eq!(json_integer("0"), Ok(("", Node::Integer(0))));
assert_eq!(json_integer("01"), Ok(("1", Node::Integer(0))));
}
```Note the fourth test case— this might not be what you expected. We know that
integers with a leading zero aren't allowed by this grammar— so why did the
call to `json_integer` succeed? It has to do with the way `nom` operates. Each
parser only consumes the part of the string it matches, and leaves the rest for
some other parser. So attempting to parse `01` results in a success, returning
a result `Node::Integer(0)` along with a remainder string `1`.`nom` does have ways for parsers to trigger a fatal error if they're unhappy
with the sequence of characters, but this grammar probably won't need them.## Part 7. Parsing numbers some more.
Let's piece together the bits we need to parse floating point numbers.
```rust
fn frac(input: &str) -> IResult<&str, &str> {
recognize(
pair(
tag("."),
digit1
)
)
(input)
}fn exp(input: &str) -> IResult<&str, &str> {
recognize(
tuple((
tag("e"),
opt(alt((
tag("-"),
tag("+")
))),
digit1
))
)
(input)
}fn json_float(input: &str) -> IResult<&str, Node> {
let parser = recognize(
tuple((
opt(tag("-")),
uint,
opt(frac),
opt(exp)
))
);
map(parser, |s| {
// FIXME: unwrap() may panic if the value is out of range
let n = s.parse::().unwrap();
Node::Float(n)
})
(input)
}
```The only new parts here are:
- `nom::character::complete::digit1`: just like `digit0`, except this matches
one-or-more digits.
- `nom::sequence::tuple` is a lot like `pair`, but accepts an arbitrary number
of other parsers. Each sub-parser must match in sequence, and the return value
is a tuple of results.I added some straightforward unit tests here, and they all pass. Despite that,
I've made a significant mistake, but one that we won't notice until we start
stitching the various parts together. Let's do that now.When a parser executes, it obviously won't know which elements are arriving in
which order, so we need a parser function to handle everything we've built so
far. Thanks to the magic of `nom`, this part is really easy.```rust
fn json_literal(input: &str) -> IResult<&str, Node> {
alt((
json_integer,
json_float,
json_bool,
json_null
))
(input)
}
```And now we discover that something is wrong:
```rust
#[test]
fn test_literal() {
assert_eq!(json_literal("56"), Ok(("", Node::Integer(56))));
assert_eq!(json_literal("78.0"), Ok(("", Node::Float(78.0))));
}
``````txt
test test_literal ... FAILEDfailures:
---- test_literal stdout ----
thread 'test_literal' panicked at 'assertion failed: `(left == right)`
left: `Ok((".0", Integer(78)))`,
right: `Ok(("", Float(78.0)))`', src/lib.rs:163:5
```Because we put `json_integer` first, it grabbed the `78` part and declared
success, leaving `.0` for someone else to deal with. Not so big a deal,
right? Let's just swap the order of the parsers:```rust
fn json_literal(input: &str) -> IResult<&str, Node> {
alt((
json_float,
json_integer,
json_bool,
json_null
))
(input)
}
``````txt
test test_literal ... FAILEDfailures:
---- test_literal stdout ----
thread 'test_literal' panicked at 'assertion failed: `(left == right)`
left: `Ok(("", Float(56.0)))`,
right: `Ok(("", Integer(56)))`', src/lib.rs:162:5
```We've traded one problem for another. This time, `json_float` runs first,
consumes the input `56` input and declares success, returning `Float(56.0)`.
This isn't wrong, exactly. Had we decided at the beginning to treat all
numbers as floating-point (as JavaScript does) this would be the expected
outcome. But since we committed to storing integers and floats as separate
tree nodes, we have a problem.Since we can't allow either the `json_float` parser or the `json_integer`
parser to run first (at least as currently written), let's imagine what we'd
like to see happen. Ideally, we would start parsing the `[ minus ] int` part
of the grammar, and if that succeeds we have a possible integer-or-float
match. We should then continue on, trying to match the `[ frac ] [ exp ]`
part, and if _either of those_ succeeds, we have a float.There are a few different ways to implement that logic.
One way would be to get `json_float` to fail if the next character after the
integer part is _not_ a `.` or `e` character— without that it can't possibly be
a valid float (according to our grammar), so if `json_float` fails at that
point we know the `json_integer` parser will run next (and succeed).```rust
fn json_float(input: &str) -> IResult<&str, Node> {
let parser = recognize(
tuple((
opt(tag("-")),
uint,
peek(alt((
tag("."),
tag("e"),
))),
opt(frac),
opt(exp)
))
);
map(parser, |s| {
let n = s.parse::().unwrap();
Node::Float(n)
})
(input)
}
```This code has one small annoyance, though it's not a problem in the overall
JSON context. Imagine that we took this `json_float` parser code, and tried to
reuse it in another language, where this other language's grammar would allow
the input `123.size()`. This code would `peek` ahead and see the `.`
character, and because of that it would parse `123` as a float rather than an
integer. In other words, this `json_float` implementation decides that this
input is a float before it's actually finished parsing all the characters
making up that float.There is a slightly better way, though. Remember, our original problem is that
`json_float` will succeed in all of the following cases:
- `123`
- `123.0`
- `123e9`
- `123.0e9`
What we'd rather have is a parser that succeeds at the last three, but not the
first. There isn't a combinator in `nom` that implements "A or B or AB", but
it's not that hard to implement ourselves:```rust
fn json_float(input: &str) -> IResult<&str, Node> {
let parser = recognize(
tuple((
opt(tag("-")),
uint,
alt((
recognize(pair(
frac,
opt(exp)
)),
exp
)),
))
);
map(parser, |s| {
let n = s.parse::().unwrap();
Node::Float(n)
})
(input)
}
```This new logic uses `alt` to allow two choices: either a `frac` must be present
(with an optional `exp`) following, or an `exp` must be present by itself. An
input with neither a valid `frac` or `exp` will now fail, which makes
everything work the way we want it to.## Part 8. Handling string literals
So far we support literal null, boolean, integer, and float types. There's
only one more literal type left to handle: strings.In the JSON grammar, a string is basically a series of Unicode characters that
starts and ends with a quote, plus a few extra rules:1. Certain characters must be escaped (ASCII control characters, quotes, and
backslashes)
2. Any character may be escaped, using `\u` plus 4 hexadecimal digits, e.g.
`\uF903`.
3. A small number of common characters have two-character escapes:
`\"` `\\` `\/` `\b` `\f` `\n` `\r` `\t`.That's how RFC 8259 does things, anyway. Different implementations may have
subtle differences.This means there are many possible ways to represent a certain string. We're
only building a parser, so we just need to make sure we can parse all the valid
JSON representations (and hopefully return an error on all the invalid ones).The presence of escape characters makes our job more difficult. There are
different ways we might choose to address this. I'm going to choose to break
escape handling into a separate phase. This means we will only use `nom` to do
the lexing part (finding the bounds of the string literal), and we'll follow up
with an "un-escaping" pass to decode the escaped characters.Bad inputs must be rejected by one of the two phases, but we don't care which
one. For example, `"\ud800"` looks like a valid JSON string, but can't be
decoded because U+D800 is a magic "surrogate" character, meaning it's half of a
character that needs more than 16 bits to encode. We should also reject things
like `"\x"` (a nonexistent escape), `"\u001"` (not enough hex digits), and
`"\"` (which is unterminated because the trailing quote is escaped). We also
need to reject "naked" (non-escaped) control characters (ASCII 0x00-0x1F),
though for some reason 0x7F (ASCII DELETE) is legal.Let's begin by building a parser for "a string of valid non-escaped
characters": everything except control characters, backslash, and quote. We
don't need to check the upper limit 0x10FFFF because those characters will
never appear in a Rust `char`.```rust
use nom::bytes::complete::take_while1;fn is_nonescaped_string_char(c: char) -> bool {
let cv = c as u32;
(cv >= 0x20) && (cv != 0x22) && (cv != 0x5C)
}// One or more unescaped text characters
fn nonescaped_string(input: &str) -> IResult<&str, &str> {
take_while1(is_nonescaped_string_char)
(input)
}
```The `take_while1` function comes from the nom `bytes` module (which, remember,
isn't specific to byte sequences). `nom` offers a few different `take`
functions in this module; `take_while1` consumes characters that match some
condition, requiring at least 1 matching character.Next, let's add a parser that can detect one escape sequence. Actually, we're
going to be even lazier than that; we'll pretend that `\u` is an escape
sequence all by itself, and let the unescape function determine whether the
characters that follow make sense. We could easily do it differently, but
since the unescape code will need to look at those characters in detail later,
we won't waste time doing that work twice.```rust
fn escape_code(input: &str) -> IResult<&str, &str> {
recognize(
pair(
tag("\\"),
alt((
tag("\""),
tag("\\"),
tag("/"),
tag("b"),
tag("f"),
tag("n"),
tag("r"),
tag("t"),
tag("u"),
))
)
)
(input)
}
```Using those two pieces, we can now connect them together to parse the entire
body of a JSON string (minus the quotes that surround it):```rust
use nom::multi::many0;fn string_body(input: &str) -> IResult<&str, &str> {
recognize(
many0(
alt((
nonescaped_string,
escape_code
))
)
)
(input)
}
```We've seen most of the pieces here before.
`many0` tries to apply a parser function repeatedly, gathering all of the
results into a vector. This version gathers "zero or more" of whatever we were
searching for (which is desirable because `""` is a valid JSON string). There
is also a `many1`, (if you want "one or more") and several other variations.The final `recognize` throws away the output of `many0` (a vector), and instead
just returns to us the string that was matched. It's a little unfortunate that
we're throwing away the information we developed about where escapes appear—
perhaps another implementation could do the unescaping work right here. It
seems pretty typical (in my limited experience) to have to make tradeoffs like
this. We're breaking the work into multiple phases, which may require a little
bit of redundant effort, but our code gets a little simpler as a result.There's one subtle thing about these two layers that should be pointed out.
Both `nonescaped_string` and `escape_code` are parsers that return "one or more
characters". And then we use those to build a parser that returns "zero or
more characters". In fact, you can't build a "zero or more" parser using other
"zero or more" components, because that could trigger an infinite loop: the
outer parser could try to gather an infinite number of empty subparser
successes. Typically `nom` combinators will return an error instead of going
into an infinite loop.The next step is pretty simple: the string body must be wrapped in quotes.
```rust
use nom::sequence::delimited;fn json_string(input: &str) -> IResult<&str, &str> {
delimited(
tag("\""),
string_body,
tag("\"")
)
(input)
}
```This is the first time we've used `delimited`. It runs three sub-parsers,
returning the result of the middle one. The result from the first and third
arguments (the quote characters) are discarded.At this point I should plug in some code to do un-escaping. Because this code
doesn't use `nom` and doesn't really help us understand how to write a `nom`
parser, I'm going to skip the explanation and just pull the
[escape8259](https://docs.rs/escape8259/0.5.0/escape8259/) crate that does this
part. A call to un-escape a string is pretty simple:```rust
pub fn unescape(s: &str) -> Result
```So all we need to do is plug that into `json_string`. We earlier used `nom`'s
`map` combinator to do this sort of thing, but here we need something a little
different because `unescape` may fail. We need to use `map_res` to handle
`Result::Err`.```rust
use nom::combinator::map_res;
use escape8259::unescape;fn string_literal(input: &str) -> IResult<&str, String> {
let parser = delimited(
tag("\""),
string_body,
tag("\"")
);
map_res(parser, |s| {
unescape(s)
})
(input)
}
```We also need to update our `Node` enum to include a string variant (we'll call
this `Str`), and make that our final output.```rust
pub enum Node {
Null,
Bool(bool),
Integer(i64),
Float(f64),
Str(String),
}fn json_string(input: &str) -> IResult<&str, Node> {
map(string_literal, |s| {
Node::Str(s)
})
(input)
}
```Finally, we should write some tests to make sure this is working correctly.
```rust
#[test]
fn test_string() {
// Plain Unicode strings with no escaping
assert_eq!(json_string(r#""""#), Ok(("", Node::Str("".into()))));
assert_eq!(json_string(r#""Hello""#), Ok(("", Node::Str("Hello".into()))));
assert_eq!(json_string(r#""の""#), Ok(("", Node::Str("の".into()))));
assert_eq!(json_string(r#""𝄞""#), Ok(("", Node::Str("𝄞".into()))));// valid 2-character escapes
assert_eq!(json_string(r#"" \\ ""#), Ok(("", Node::Str(" \\ ".into()))));
assert_eq!(json_string(r#"" \" ""#), Ok(("", Node::Str(" \" ".into()))));// valid 6-character escapes
assert_eq!(json_string(r#""\u0000""#), Ok(("", Node::Str("\x00".into()))));
assert_eq!(json_string(r#""\u00DF""#), Ok(("", Node::Str("ß".into()))));
assert_eq!(json_string(r#""\uD834\uDD1E""#), Ok(("", Node::Str("𝄞".into()))));// Invalid because surrogate characters must come in pairs
assert!(json_string(r#""\ud800""#).is_err());
// Unknown 2-character escape
assert!(json_string(r#""\x""#).is_err());
// Not enough hex digits
assert!(json_string(r#""\u""#).is_err());
assert!(json_string(r#""\u001""#).is_err());
// Naked control character
assert!(json_string(r#""\x0a""#).is_err());
// Not a JSON string because it's not wrapped in quotes
assert!(json_string("abc").is_err());
}
```## Part 9. Arrays and Objects
Finally, all of the hard parts are complete, and we get to the fun parts:
arrays and objects (maps or dictionaries in other languages).Let's start with the changes to our `Node` enum, to give us a little better
idea how these recursive data structures should work.```rust
pub enum Node {
Null,
Bool(bool),
Integer(i64),
Float(f64),
Str(String),
Array(Vec),
Object(Vec<(String, Node)>),
}
```Since `Node` now includes types other than literal values, let's rename
`json_literal` to `json_value`:```rust
fn json_value(input: &str) -> -> IResult<&str, Node> {
spacey(alt((
json_array,
json_object,
json_string,
json_float,
json_integer,
json_bool,
json_null
)))
(input)
}
```An array can be heterogeneous (different value types, e.g. `[1, "foo", true]`).
Each object member must have a string for its key, and may have any value
type. An object might be `{"a": 1, "b": false}`. Arrays and objects can be
nested arbitrarily.Let's implement arrays first.
```rust
use nom::multi::separated_list0;fn json_array(input: &str) -> IResult<&str, Node> {
let parser = delimited(
tag("["),
separated_list0(tag(","), json_value),
tag("]")
);
map(parser, |v| {
Node::Array(v)
})
(input)
}
```That was surprisingly easy. The only new thing we needed was `separated_list0`,
which alternates between two subparsers. The first argument is the
"separator", and its result is thrown away; we get a vector of results from the
second parser. It will match zero or more elements; `nom` has a
`separated_list1` if you want one-or-more.Objects are up next; they're a little more complicated so let's implement them
as two separate functions.```rust
use nom::sequence::separated_pair;fn object_member(input: &str) -> IResult<&str, (String, Node)> {
separated_pair(string_literal, tag(":"), json_value)
(input)
}fn json_object(input: &str) -> IResult<&str, Node> {
let parser = delimited(
tag("{"),
separated_list0(
tag(","),
object_member
),
tag("}")
);
map(parser, |v| {
Node::Object(v)
})
(input)
}
```This looks a lot like the array implementation. The only difference (other
than the braces) is that where an array looks for a single value, the object
looks for a quoted string literal, then a `:` character, and then a value.And we have a JSON parser!
## Part 10. Spacing out
Well, we almost have a JSON parser. We might start testing arrays like this:
```rust
#[test]
fn test_array() {
assert_eq!(json_array("[]"), Ok(("", Node::Array(vec![]))));
assert_eq!(json_array("[1]"), Ok(("", Node::Array(vec![Node::Integer(1)]))));let expected = Node::Array(vec![Node::Integer(1), Node::Integer(2)]);
assert_eq!(json_array("[1,2]"), Ok(("", expected)));
}
```But it doesn't work if we write:
```rust
assert_eq!(json_array("[1, 2]"), Ok(("", expected)));
```The only difference is the space character after the comma. We forgot to
handle whitespace.In fact, we haven't handled whitespace anywhere. Whitespace could appear
anywhere: before or after values or any punctuation (braces, brackets, comma,
or colon).To ignore whitespace, we need a parser function that matches whitespace. We
could easily build one, but `nom` includes one that matches our needs exactly:
`nom::character::complete::multispace0`.That means we need to do a bunch of substitutions, things like:
```rust
tag("[")
```
need to become
```rust
delimited(multispace0, tag("["), multispace0)
```Which adds a lot of clutter, and is kind of hard to read. Maybe instead we
should write a combinator of our own to make this a little more compact. This
isn't absolutely necessary— the cluttered version is perfectly functional. The
only reason I'm going to tackle this is it provides a little bit of insight
into the pile of generic parameters you see if you look at the documentation
for `nom` combinators. If you don't care, feel free to skip this section.First, let's write a combinator that does nothing, other than apply a parser we
specify.```rust
fn identity(f: F) -> impl FnMut(I) -> IResult
where
F: FnMut(I) -> IResult,
{
f
}
```That looks pretty intimidating. But so do most of the built-in `nom`
combinators, so if we can understand this combinator function, we'll have a
little easier time understanding other parts of `nom`.Let's see if we can make some sense of all those generic parameters.
`F` is the type of the parser we pass in. It could be any `nom`-style parser,
and we already know what those look like; they accept one input parameter, and
return an `IResult`. This `IResult` has three generic parameters, and we've
always used two— the third has a default value, and we've been omitting it.So our `F` is a function that accepts one `I` and returns `IResult`.
`I` is our input parameter (which has been `&str` so far everywhere). `O` is
our output type (and we've used a bunch of different ones; `&str`, `Node`,
etc.) The `E` is the parser error type, and we can continue ignoring that for
now since we've only used the default.Our combinator returns a closure. So its return type is
`FnMut(I) -> IResult`. That looks the same as `F`, but for all cases
other than `identity` we'll return a different closure than the input, so we
will need to spell out the return type.A lot of `nom` combinators have even more complex type signatures
(`separated_pair` has 8 generic parameters!) but picking them apart is usually
pretty straightforward if you're patient. You'll probably only need to know
when something fails to compile.Anyway, let's write a combinator that wraps its input in a `delimited` with
`multispace0` on both sides.```rust
fn spacey(f: F) -> impl FnMut(I) -> IResult
where
F: FnMut(I) -> IResult,
{
delimited(multispace0, f, multispace0)
}
```This explodes with a huge pile of errors; many complaints about trait bounds
that aren't met for `I` and `E`. But it turns out that this is just because
`multispace0` requires those on its `I` and `E`, so we have to guarantee those
trait bounds as well. Copying those trait bounds over to our function will
work:```rust
fn spacey(f: F) -> impl FnMut(I) -> IResult
where
F: FnMut(I) -> IResult,
I: nom::InputTakeAtPosition,
::Item: nom::AsChar + Clone,
E: nom::error::ParseError,
{
delimited(multispace0, f, multispace0)
}
```Was that worth it? Maybe not for this program. But it's interesting to see
what's involved in building our own combinators. Maybe the `nom` function
documentation will look a little less scary, too.Now that we have a useful multispace-handling combinator, we can sprinkle it
around all the places where we need to ignore whitespace. For example:```rust
fn json_array(input: &str) -> IResult<&str, Node> {
let parser = delimited(
spacey(tag("[")),
separated_list0(spacey(tag(",")), json_value),
spacey(tag("]")),
);
map(parser, |v| {
Node::Array(v)
})
(input)
}
```## Part 11. Error handling.
We skipped over a few places where proper error handling is needed. For
example, numbers that are out of bounds (e.g. `1e99999`) should return some
kind of parse error.Currently we are using the `IResult` default error type, which is
`nom::internal::Err<(&str, nom::error::ErrorKind)>`. That doesn't look
promising— we can't realistically expect to be able to extend that type with
our own error variants.So let's build our own error type. We'll use macros from the
[`thiserror`](https://docs.rs/thiserror/1.0/thiserror/) crate to automatically
generate some of the boilerplate that's necessary for error types.```rust
#[derive(thiserror::Error, Debug, PartialEq)]
pub enum JSONParseError {
#[error("bad integer")]
BadInt,
#[error("bad float")]
BadFloat,
#[error("bad escape sequence")]
BadEscape,
#[error("unknown parser error")]
Unparseable,
}
```Because `nom` error handling uses generic parameters, it can be difficult to
see how to best implement a custom error type. There is a good minimal example
of custom error types in the nom 7.0 sources
([examples/custom_error.rs](https://github.com/Geal/nom/blob/7.0.0/examples/custom_error.rs))
that shows the steps needed to make things work gracefully:1. Figure out how to map a `nom` error into your error type. Usually this will
be with a dedicated enum variant.
2. Implement the trait `nom::error::ParseError` for your error type. This
will allow all of the `nom` combinators to generate your custom error type when
needed.
3. Use the 3-argument form of `IResult`, specifying your error type. You will
probably want to do this on most or all of your parser functions so combinators
work gracefully.When building a custom error type that will be generated by nom parsers,
consider how far you want to propagate the error metadata (`ErrorKind` and
input slice). If the error type is only visible internal to a crate, it can
be useful to preserve all the nom metadata (the `input` and `kind` parameters
to `ParseError::from_error_kind`) for debugging. In a public error struct, it
may be wiser to discard that information, as a user of your crate probably
doesn't care about `nom` error metadata. I will assume `JSONParseError` is
public, so I will discard the `nom` error parameters.```rust
use nom::error::{ErrorKind, ParseError};impl ParseError for JSONParseError {
fn from_error_kind(_input: I, _kind: ErrorKind) -> Self {
JSONParseError::Unparseable
}fn append(_: I, _: ErrorKind, other: Self) -> Self {
other
}
}
```For error handling on integers, we'll split the function into two parts to make
it easier to read:```rust
fn integer_body(input: &str) -> IResult<&str, &str, JSONParseError> {
recognize(
pair(
opt(tag("-")),
uint
)
)
(input)
}fn json_integer(input: &str) -> IResult<&str, Node, JSONParseError> {
let (remain, raw_int) = integer_body(input)?;
match raw_int.parse::() {
Ok(i) => Ok((remain, Node::Integer(i))),
Err(_) => Err(nom::Err::Failure(JSONParseError::BadInt)),
}
}
```Note that `json_integer` works differently from all the other parsers we've
written so far: instead of composing parsers using combinators, we actually run
the `integer_body` parser and capture its result (the remainder and the matched
string slice). We then attempt to parse the string slice into an integer, and
hand-assemble an `IResult` by hand.This can be a useful technique when the `nom` combinators don't supply exactly
what you need. Here, I first tried using `map_res` to parse the int, but it
turns out that `map_res` always throws away the error value returned by the
closure, and substitutes its own error (with kind `MapRes`).The same approach works for string escaping errors and float parsing errors,
though float overflow in Rust results in infinity, not an error. This means we
will never actually return `BadFloat` because there are no
grammatically-correct floats that can't be parsed into an `f64`.
(Though Rust versions older than 1.55 had some problems parsing
[certain edge cases](https://github.com/rust-lang/rust/issues/31407).)## Part 12. Finalization.
There's one more `nom`-specific step that we probably want. Assuming our code
is a library, meant to be used by other programs, we don't want `nom::IResult`
to show up as our public result type. We should instead return
`Result`.We can use `all_consuming` to ensure that all input was matched. Unfortunately,
there doesn't seem to be a simple `nom` shortcut for translating the error. We
can do this ourselves:```rust
use nom::combinator::all_consuming;
pub fn parse_json(input: &str) -> Result {
let (_, result) = all_consuming(json_value)(input).map_err(|nom_err| {
match nom_err {
nom::Err::Incomplete(_) => unreachable!(),
nom::Err::Error(e) => e,
nom::Err::Failure(e) => e,
}
})?;
Ok(result)
}
```We haven't talked yet about the three
[`nom::Err`](https://docs.rs/nom/7.0.0/nom/enum.Err.html) variants.- `Incomplete` is only used by `nom` streaming parsers. We don't use those, so
we can just mark that branch `unreachable!` (which would panic).
- `Error` is what we usually see when a parser has a problem. Something didn't
match the expected grammar.
- `Failure` appears less often. It means that the input could only be parsed
one way, but a parser decided that it was invalid. Unlike `Error`, this error
is propagated upward without trying any alternative paths (if something like
`alt` is present).Our code does use `Failure` in a few places: that's what we return when there
is a numeric conversion error or a bad escape code. If we use `Error` instead,
the parsers could return the wrong error type. The reason is that the nom
`alt` parser would keep trying other parsers, and if all of them fail, there's
no way for `alt` to know which error is the right one— it usually just returns
the last error.## Thanks for reading!
This ended up being a lot longer than I originally planned, and along the way I
discovered several things that I'd been doing wrong in my own parsers. There
are probably a few things that I've still missed; if you notice something, feel
free to open an issue at this page's
[GitHub repo](https://github.com/ericseppanen/json-parser-toy), or get in touch
on [twitter: @codeandbitters](https://twitter.com/codeandbitters)