
https://github.com/dtolnay/clang-ast

Deserialization logic for efficiently processing Clang's `-ast-dump=json` format

`-ast-dump=json`
================

[github](https://github.com/dtolnay/clang-ast)
[crates.io](https://crates.io/crates/clang-ast)
[docs.rs](https://docs.rs/clang-ast)
[build status](https://github.com/dtolnay/clang-ast/actions?query=branch%3Amaster)

This library provides deserialization logic for efficiently processing Clang's
`-ast-dump=json` format from Rust.

```toml
[dependencies]
clang-ast = "0.1"
```


## Format overview

An AST dump is generated by a compiler command like:


    $ clang++ -Xclang -ast-dump=json -fsyntax-only path/to/source.cc

The high-level structure is a tree of nodes, each of which has an `"id"` and a
`"kind"`, zero or more further fields depending on what the node kind is, and
finally an optional `"inner"` array of child nodes.

As an example, for an input file containing just the declaration `class S;`, the
AST would be as follows:

```js
{
  "id": "0x1fcea38", //<-- root node
  "kind": "TranslationUnitDecl",
  "inner": [
    {
      "id": "0xadf3a8", //<-- first child node
      "kind": "CXXRecordDecl",
      "loc": {
        "offset": 6,
        "file": "source.cc",
        "line": 1,
        "col": 7,
        "tokLen": 1
      },
      "range": {
        "begin": {
          "offset": 0,
          "col": 1,
          "tokLen": 5
        },
        "end": {
          "offset": 6,
          "col": 7,
          "tokLen": 1
        }
      },
      "name": "S",
      "tagUsed": "class"
    }
  ]
}
```


## Library design

By design, the clang-ast crate *does not* provide a single great big data
structure that exhaustively covers every possible field of every possible Clang
node type. There are three major reasons:

- **Performance** — these ASTs get quite large. For a reasonable mid-sized
translation unit that includes several platform headers, you can easily get an
AST that is tens to hundreds of megabytes of JSON. To maintain performance of
downstream tooling built on the AST, it's critical that you deserialize only
the few fields which are directly required by your use case, and allow Serde's
deserializer to efficiently ignore all the rest.

- **Stability** — as Clang is developed, the specific fields associated
with each node kind are expected to change over time in non-additive ways.
This is nonproblematic because the churn on the scale of individual nodes is
minimal (maybe one change every several years). However, if there were a data
structure that promised to be able to deserialize every possible piece of
information in every node, practically every change to Clang would be a
breaking change to some node *somewhere* despite your tooling not caring
anything at all about that node kind. By deserializing only those fields which
are directly relevant to your use case, you become insulated from the vast
majority of syntax tree changes.

- **Compile time** — a typical use case involves inspecting only a tiny
fraction of the possible nodes or fields, on the order of 1%. Consequently
your code will compile 100× faster than if you tried to include
everything in the data structure.


## Data structures

The core data structure of the clang-ast crate is `Node<T>`.

```rust
pub struct Node<T> {
    pub id: Id,
    pub kind: T,
    pub inner: Vec<Node<T>>,
}
```

The caller must provide their own kind type `T`, which is an enum or struct as
described below. `T` determines exactly what information the clang-ast crate
will deserialize out of the AST dump.

By convention you should name your `T` type `Clang`.


## T = enum

Most often, you'll want `Clang` to be an enum. In this case your enum must have
one variant per node kind that you care about. The name of each variant matches
the `"kind"` entry seen in the AST.

Additionally there must be a fallback variant, which must be named either
`Unknown` or `Other`, into which clang-ast will put all tree nodes not matching
one of the expected kinds.

```rust
use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl { name: Option<String> },
    EnumDecl { name: Option<String> },
    EnumConstantDecl { name: String },
    Other,
}

fn main() {
    let json = std::fs::read_to_string("ast.json").unwrap();
    let node: Node = serde_json::from_str(&json).unwrap();
}
```

The above is a simple example with variants for processing `"kind":
"NamespaceDecl"`, `"kind": "EnumDecl"`, and `"kind":
"EnumConstantDecl"` nodes. This is sufficient to extract the set of variants of
every enum in the translation unit, and the enums' namespace (possibly
anonymous) and enum name (possibly anonymous).
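Once the tree is deserialized, extracting that information is an ordinary recursive walk. The following is a minimal sketch rather than part of the crate: it declares local stand-ins for `Node` and `Clang` (omitting the `id` field) so that it compiles without dependencies, and `enum_constants` is a hypothetical helper name.

```rust
// Local stand-ins so the sketch compiles on its own. `Node` mirrors the
// shape of `clang_ast::Node<T>` minus the `id` field (an assumption made
// for brevity, not the crate's definition).
pub struct Node<T> {
    pub kind: T,
    pub inner: Vec<Node<T>>,
}

pub enum Clang {
    NamespaceDecl { name: Option<String> },
    EnumDecl { name: Option<String> },
    EnumConstantDecl { name: String },
    Other,
}

/// Walks the tree depth-first and returns `(qualified enum name, constant)`
/// pairs. Anonymous namespaces and enums are rendered as "(anonymous)".
pub fn enum_constants(root: &Node<Clang>) -> Vec<(String, String)> {
    fn walk(
        node: &Node<Clang>,
        scope: &mut Vec<String>,
        current_enum: Option<&str>,
        out: &mut Vec<(String, String)>,
    ) {
        match &node.kind {
            Clang::NamespaceDecl { name } => {
                // Enter the namespace, visit children, then leave it again.
                scope.push(name.clone().unwrap_or_else(|| "(anonymous)".to_owned()));
                for child in &node.inner {
                    walk(child, scope, None, out);
                }
                scope.pop();
            }
            Clang::EnumDecl { name } => {
                let mut path = scope.clone();
                path.push(name.clone().unwrap_or_else(|| "(anonymous)".to_owned()));
                let qualified = path.join("::");
                for child in &node.inner {
                    walk(child, scope, Some(qualified.as_str()), out);
                }
            }
            Clang::EnumConstantDecl { name } => {
                // Constants outside of any enum are ignored.
                if let Some(enum_name) = current_enum {
                    out.push((enum_name.to_owned(), name.clone()));
                }
            }
            Clang::Other => {
                // Unrecognized nodes are transparent: keep descending.
                for child in &node.inner {
                    walk(child, scope, current_enum, out);
                }
            }
        }
    }

    let mut out = Vec::new();
    walk(root, &mut Vec::new(), None, &mut out);
    out
}
```

With the real crate, the same function would take the `Node` alias defined above in place of the local stand-in.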

Newtype variants are fine too, particularly if you'll be deserializing more than
one field for some nodes.

```rust
use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl(NamespaceDecl),
    EnumDecl(EnumDecl),
    EnumConstantDecl(EnumConstantDecl),
    Other,
}

#[derive(Deserialize, Debug)]
pub struct NamespaceDecl {
    pub name: Option<String>,
}

#[derive(Deserialize, Debug)]
pub struct EnumDecl {
    pub name: Option<String>,
}

#[derive(Deserialize, Debug)]
pub struct EnumConstantDecl {
    pub name: String,
}
```


## T = struct

Rarely, it can make sense to instantiate `Node<T>` with `Clang` being a struct type,
instead of an enum. This allows for deserializing a uniform group of data out of
*every* node in the syntax tree.

The following example struct collects the `"loc"` and `"range"` of every node if
present; these fields provide the file name / line / column position of nodes.
Not every node kind contains this information, so we use `Option` to collect it
for just the nodes that have it.

```rust
use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub struct Clang {
    pub kind: String, // or clang_ast::Kind
    pub loc: Option<clang_ast::SourceLocation>,
    pub range: Option<clang_ast::SourceRange>,
}
```

If you really need to, it's also possible to store *every other piece of key/value
information about every node* via a weakly typed `Map<String, Value>` and the
Serde `flatten` attribute.

```rust
use serde::Deserialize;
use serde_json::{Map, Value};

#[derive(Deserialize)]
pub struct Clang {
    pub kind: String, // or clang_ast::Kind
    #[serde(flatten)]
    pub data: Map<String, Value>,
}
```


## Hybrid approach

To deserialize kind-specific information about a fixed set of node kinds you
care about, as well as some uniform information about every other kind of node,
you can use a hybrid of the two approaches by giving your `Other` / `Unknown`
fallback variant some fields.

```rust
use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl(NamespaceDecl),
    EnumDecl(EnumDecl),
    Other {
        kind: clang_ast::Kind,
    },
}
```
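One thing the fallback's `kind` field makes possible is reporting which node kinds your tooling does not yet handle. The sketch below is illustrative only: it uses local stand-in types, with a plain `String` standing in for `clang_ast::Kind` and no payloads on the other variants, and `unhandled_kinds` is a hypothetical helper name.

```rust
use std::collections::BTreeMap;

// Local stand-ins so the sketch compiles on its own; the real `Node` comes
// from the clang-ast crate and also carries an `id`.
pub struct Node<T> {
    pub kind: T,
    pub inner: Vec<Node<T>>,
}

pub enum Clang {
    NamespaceDecl,
    EnumDecl,
    // `String` is an assumption standing in for `clang_ast::Kind`.
    Other { kind: String },
}

/// Counts how many nodes of each kind ended up in the fallback variant.
pub fn unhandled_kinds(root: &Node<Clang>) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    // Iterative traversal with an explicit stack, since real dumps can be
    // deeply nested.
    let mut stack = vec![root];
    while let Some(node) = stack.pop() {
        if let Clang::Other { kind } = &node.kind {
            *counts.entry(kind.clone()).or_insert(0) += 1;
        }
        stack.extend(node.inner.iter());
    }
    counts
}
```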


## Source locations

Many node kinds expose the source location of the corresponding source code
tokens, which includes:

- the filepath at which they're located;
- the chain of `#include`s by which that file was brought into the translation
unit;
- line/column positions within the source file;
- macro expansion trace for tokens constructed by expansion of a C preprocessor
macro.

You'll find this information in fields called `"loc"` and/or `"range"` in the
JSON representation.

```js
{
  "id": "0x1251428",
  "kind": "NamespaceDecl",
  "loc": { //<--
    "offset": 7004,
    "file": "/usr/include/x86_64-linux-gnu/c++/10/bits/c++config.h",
    "line": 258,
    "col": 11,
    "tokLen": 3,
    "includedFrom": {
      "file": "/usr/include/c++/10/utility"
    }
  },
  "range": { //<--
    "begin": {
      "offset": 6994,
      "col": 1,
      "tokLen": 9
    },
    "end": {
      "offset": 7155,
      "line": 266,
      "col": 1,
      "tokLen": 1
    }
  },
  ...
}
```

The naive deserialization of these structures is challenging to work with
because Clang uses field omission to mean "same as previous". So if a `"loc"` is
printed without a `"file"` inside, it means the loc is in the same file as the
immediately previous loc in serialization order.
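To make the "same as previous" rule concrete, here is a rough sketch of resolving omitted fields by hand. It is not how the crate is implemented: `RawLoc`, `ResolvedLoc`, and `LocResolver` are hypothetical names, only `file`, `line`, and `col` are modeled, and it assumes an omitted `line` carries over the same way an omitted `file` does, as the sample dumps in this document suggest.

```rust
use std::sync::Arc;

/// One location as it appears in the dump; `file` and `line` may be omitted.
#[derive(Default)]
pub struct RawLoc<'a> {
    pub file: Option<&'a str>,
    pub line: Option<u32>,
    pub col: u32,
}

/// A location with every omitted field filled in.
#[derive(Debug, Clone, PartialEq)]
pub struct ResolvedLoc {
    pub file: Arc<str>,
    pub line: u32,
    pub col: u32,
}

/// Remembers the most recently seen file and line. Locations must be fed in
/// serialization order for the result to be meaningful.
#[derive(Default)]
pub struct LocResolver {
    last_file: Option<Arc<str>>,
    last_line: Option<u32>,
}

impl LocResolver {
    /// Returns `None` if a field is omitted before any value was seen.
    pub fn resolve(&mut self, raw: &RawLoc) -> Option<ResolvedLoc> {
        if let Some(file) = raw.file {
            // Allocate only when the file actually changes, so consecutive
            // locations share one `Arc<str>`.
            if self.last_file.as_deref() != Some(file) {
                self.last_file = Some(Arc::from(file));
            }
        }
        if let Some(line) = raw.line {
            self.last_line = Some(line);
        }
        Some(ResolvedLoc {
            file: self.last_file.clone()?,
            line: self.last_line?,
            col: raw.col,
        })
    }
}
```

The state carried between calls is the reason a plain derived deserializer is awkward here: each location depends on every location serialized before it.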

The clang-ast crate provides types for deserializing this source location
information painlessly, producing `Arc<str>` as the type of filepaths which may
be shared across multiple source locations.

```rust
use serde::Deserialize;

pub type Node = clang_ast::Node<Clang>;

#[derive(Deserialize)]
pub enum Clang {
    NamespaceDecl(NamespaceDecl),
    Other,
}

#[derive(Deserialize, Debug)]
pub struct NamespaceDecl {
    pub name: Option<String>,
    pub loc: clang_ast::SourceLocation, //<--
    pub range: clang_ast::SourceRange, //<--
}
```


## Node identifiers

Every syntax tree node has an `"id"`. In JSON it's the memory address of Clang's
internal memory allocation for that node, serialized to a hex string.

The AST dump uses ids as backreferences in nodes of directed acyclic graph
nature. For example the following MemberExpr node is part of the invocation of
an `operator bool` conversion, and thus its syntax tree refers to the resolved
`operator bool` conversion function declaration:

```js
{
  "id": "0x9918b88",
  "kind": "MemberExpr",
  "valueCategory": "rvalue",
  "referencedMemberDecl": "0x12d8330", //<--
  ...
}
```

The node it references, with memory address 0x12d8330, is found somewhere
earlier in the syntax tree:

```js
{
  "id": "0x12d8330", //<--
  "kind": "CXXConversionDecl",
  "name": "operator bool",
  "mangledName": "_ZNKSt17integral_constantIbLb1EEcvbEv",
  "type": {
    "qualType": "std::integral_constant<bool, true>::value_type () const noexcept"
  },
  "constexpr": true,
  ...
}
```

Due to the ubiquitous use of ids for backreferencing, it is valuable to
deserialize them not as strings but as a 64-bit integer. The clang-ast crate
provides an `Id` type for this purpose, which is cheaply copyable, hashable, and
comparable more cheaply than a string. You may find yourself with lots of
hashtables keyed on `Id`.
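As a rough illustration of why an integer id is convenient, the sketch below parses the hex strings by hand and builds a lookup table for resolving backreferences. `NodeId`, `parse_id`, and `index_names` are hypothetical names for this example; the crate's own `Id` type handles the parsing during deserialization, so none of this is needed when using it.

```rust
use std::collections::HashMap;

/// Stand-in for the crate's `Id`: a 64-bit integer that is cheap to copy,
/// hash, and compare.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct NodeId(pub u64);

/// Parses an id such as "0x12d8330". Returns `None` if the "0x" prefix is
/// missing or the remainder is not valid hexadecimal.
pub fn parse_id(text: &str) -> Option<NodeId> {
    let digits = text.strip_prefix("0x")?;
    u64::from_str_radix(digits, 16).ok().map(NodeId)
}

/// Builds an id-to-name table from `(id, name)` pairs, skipping pairs whose
/// id does not parse.
pub fn index_names<'a>(nodes: &[(&str, &'a str)]) -> HashMap<NodeId, &'a str> {
    nodes
        .iter()
        .filter_map(|&(id, name)| Some((parse_id(id)?, name)))
        .collect()
}
```

Looking up a `"referencedMemberDecl"` value is then a single hash probe on an integer key instead of a string comparison.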


#### License


Licensed under either of Apache License, Version 2.0 or MIT license at your
option.



Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in this crate by you, as defined in the Apache-2.0 license, shall
be dual licensed as above, without any additional terms or conditions.