https://github.com/lpraat/inbq

A library for parsing BigQuery queries and extracting schema-aware, column-level lineage.
https://github.com/lpraat/inbq
bigquery data-lineage parser sql
Last synced: 3 days ago
JSON representation
A library for parsing BigQuery queries and extracting schema-aware, column-level lineage.
Host: GitHub
URL: https://github.com/lpraat/inbq
Owner: lpraat
License: mit
Created: 2025-02-15T21:20:40.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2026-02-06T23:11:48.000Z (3 months ago)
Last Synced: 2026-02-17T01:54:23.554Z (2 months ago)
Topics: bigquery, data-lineage, parser, sql
Language: Rust
Homepage:
Size: 855 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # inbq

A library for parsing BigQuery queries and extracting schema-aware, column-level lineage.

### Features

- Parse BigQuery queries into well-structured ASTs with [easy-to-navigate nodes](#ast-navigation).

- Extract schema-aware, [column-level lineage](#concepts).

- Trace data flow through nested structs and arrays.

- Capture [referenced columns](#referenced-columns) and the specific query components (e.g., select, where, join) they appear in.

- Process both single and multi-statement queries with procedural language constructs.

- Built for speed and efficiency, with lightweight Python bindings that add minimal overhead.

## Python

### Install

`pip install inbq`

### Example (Pipeline API)

```python

import inbq

catalog = {"schema_objects": []}

def add_table(name: str, columns: list[tuple[str, str]]) -> None:

    catalog["schema_objects"].append({

        "name": name,

        "kind": {

            "table": {

                "columns": [{"name": name, "dtype": dtype} for name, dtype in columns]

            }

        }

    })

add_table("project.dataset.out", [("id", "int64"), ("val", "float64")])

add_table("project.dataset.t1", [("id", "int64"), ("x", "float64")])

add_table("project.dataset.t2", [("id", "int64"), ("s", "struct")])

query = """

declare default_val float64  default (select min(val) from project.dataset.out);

insert into `project.dataset.out`

select

    id,

    if(x is null or s.x is null, default_val, x + s.x)

from `project.dataset.t1` inner join `project.dataset.t2` using (id)

where s.source = "baz";

"""

pipeline = (

    inbq.Pipeline()

    .config(

        # If the `pipeline` is configured with `raise_exception_on_error=False`,

        # any error that occurs during parsing or lineage extraction is

        # captured and returned as a `inbq.PipelineError`

        raise_exception_on_error=False,

        # No effect with only one query (may provide a speedup with multiple queries)

        parallel=True,

    )

    .parse()

    .extract_lineage(catalog=catalog, include_raw=False)

)

sqls = [query]

pipeline_output = inbq.run_pipeline(sqls, pipeline=pipeline)

# This loop will iterate just once as we have only one query

for i, (ast, output_lineage) in enumerate(

    zip(pipeline_output.asts, pipeline_output.lineages)

):

    assert isinstance(ast, inbq.ast_nodes.Ast), (

        f"Could not parse query `{sqls[i][:20]}...` due to: {ast.error}"

    )

    print(f"{ast=}")

    assert isinstance(output_lineage, inbq.lineage.Lineage), (

        f"Could not extract lineage from query `{sqls[i][:20]}...` due to: {output_lineage.error}"

    )

    print("\nLineage:")

    for lin_obj in output_lineage.lineage.objects:

        print("Inputs:")

        for lin_node in lin_obj.nodes:

            print(

                f"{lin_obj.name}->{lin_node.name} <- {[f'{input_node.obj_name}->{input_node.node_name}' for input_node in lin_node.inputs]}"

            )

        print("\nSide inputs:")

        for lin_node in lin_obj.nodes:

            print(

                f"""{lin_obj.name}->{lin_node.name} <- {[f"{input_node.obj_name}->{input_node.node_name} @ {','.join(input_node.sides)}" for input_node in lin_node.side_inputs]}"""

            )

    print("\nReferenced columns:")

    for ref_obj in output_lineage.referenced_columns.objects:

        for ref_node in ref_obj.nodes:

            print(

                f"{ref_obj.name}->{ref_node.name} referenced in {ref_node.referenced_in}"

            )

# Prints:

# ast=Ast(...)

# Lineage:

# Inputs:

# project.dataset.out->id <- ['project.dataset.t2->id', 'project.dataset.t1->id']

# project.dataset.out->val <- ['project.dataset.t2->s.x', 'project.dataset.t1->x', 'project.dataset.out->val']

#

# Side inputs:

# project.dataset.out->id <- ['project.dataset.t2->s.source @ where', 'project.dataset.t2->id @ join', 'project.dataset.t1->id @ join']

# project.dataset.out->val <- ['project.dataset.t2->s.source @ where', 'project.dataset.t2->id @ join', 'project.dataset.t1->id @ join']

#

# Referenced columns:

# project.dataset.out->val referenced in ['default_var', 'select']

# project.dataset.t1->id referenced in ['join', 'select']

# project.dataset.t1->x referenced in ['select']

# project.dataset.t2->id referenced in ['join', 'select']

# project.dataset.t2->s.x referenced in ['select']

# project.dataset.t2->s.source referenced in ['where']

```

**Note:** What happens if you remove the insert and just keep the select in the query? `inbq` is designed to handle this gracefully. It will return the lineage for the last `SELECT` statement, but since the destination is no longer explicit, the output object (an anonymous query) will be assigned an anonymous identifier (e.g., `!anon_4`). Try it yourself and see how the output changes!

To learn more about the output elements (Lineage, Side Inputs, and Referenced Columns), please see the [Concepts](#concepts) section.

### Example (Individual Functions)

If you don't like the Pipeline API, you can use these functions instead:

#### `parse_sql` and `parse_sql_to_dict`

Parse a single SQL query:

```python

ast = inbq.parse_sql(query)

# You can also get a dictionary representation of the AST

ast_dict = inbq.parse_sql_to_dict(query)

```

#### `parse_sqls`

Parse multiple SQL queries in parallel:

```python

sqls = [query]

asts = inbq.parse_sqls(sqls, parallel=True)

```

#### `parse_sqls_and_extract_lineage`

Parse SQLs and extract lineage in one go:

```python

asts, lineages = inbq.parse_sqls_and_extract_lineage(

    sqls=[query],

    catalog=catalog,

    parallel=True

)

```

### AST Navigation

```python

import inbq

import inbq.ast_nodes as ast_nodes

sql = """

UPDATE proj.dataset.t1

SET quantity = quantity - 10,

    supply_constrained = DEFAULT

WHERE product like '%washer%';

UPDATE proj.dataset.t2

SET quantity = quantity - 10,

WHERE product like '%console%';

"""

ast = inbq.parse_sql(sql)

# Example: find updated tables and columns

for node in ast.find_all(

    ast_nodes.UpdateStatement,

):

    match node:

        case ast_nodes.UpdateStatement(

            table=table,

            alias=_,

            update_items=update_items,

            from_=_,

            where=_,

        ):

            print(f"Found updated table: {table.name}. Updated columns:")

            for update_item in update_items:

                for node in update_item.column.find_all(

                    ast_nodes.Identifier,

                    ast_nodes.QuotedIdentifier

                ):

                    match node:

                        case ast_nodes.Identifier(name=name) | ast_nodes.QuotedIdentifier(name=name):

                            print(f"- {name}")

# Example: find `like` filters

for node in ast.find_all(

    ast_nodes.BinaryExpr,

):

    match node:

        case ast_nodes.BinaryExpr(

            left=left,

            operator=ast_nodes.BinaryOperator_Like(),

            right=right,

        ):

            print(left, "like", right)

```

#### Variants and Variant Types in Python

The AST nodes in Python are auto-generated dataclasses from their Rust definitions.

For instance, a Rust enum `Expr` might be defined as:

```rust

pub enum Expr {

    // ... more variants here ...

    Binary(BinaryExpr),

    Identifier(Identifier),

    // ... more variants here ...

}

```

In Python, this translates to corresponding classes like `Expr_Binary(vty=BinaryExpr)`, `Expr_Identifier(vty=Identifier)`, etc.

The `vty` attribute stands for "variant type" (unit variants do not have a `vty` attribute).

You can search for any type of object using `.find_all()`, whether it's the variant (e.g., `Expr_Identifier`) or the concrete variant type (e.g., `Identifier`).

## Rust

### Install

`cargo add inbq`

### Example

```rust

use inbq::{

    lineage::{

        catalog::{Catalog, Column, SchemaObject, SchemaObjectKind},

        extract_lineage,

    },

    parser::Parser,

    scanner::Scanner,

};

fn column(name: &str, dtype: &str) -> Column {

    Column {

        name: name.to_owned(),

        dtype: dtype.to_owned(),

    }

}

fn main() -> anyhow::Result<()> {

    env_logger::init();

    let sql = r#"

        declare default_val float64 default (select min(val) from project.dataset.out);

        insert into `project.dataset.out`

        select

            id,

            if(x is null or s.x is null, default_val, x + s.x)

        from `project.dataset.t1` inner join `project.dataset.t2` using (id)

        where s.source = "baz";

    "#;

    let mut scanner = Scanner::new(sql);

    scanner.scan()?;

    let mut parser = Parser::new(scanner.tokens());

    let ast = parser.parse()?;

    println!("Syntax Tree: {:?}", ast);

    let data_catalog = Catalog {

        schema_objects: vec![

            SchemaObject {

                name: "project.dataset.out".to_owned(),

                kind: SchemaObjectKind::Table {

                    columns: vec![column("id", "int64"), column("val", "int64")],

                },

            },

            SchemaObject {

                name: "project.dataset.t1".to_owned(),

                kind: SchemaObjectKind::Table {

                    columns: vec![column("id", "int64"), column("x", "float64")],

                },

            },

            SchemaObject {

                name: "project.dataset.t2".to_owned(),

                kind: SchemaObjectKind::Table {

                    columns: vec![

                        column("id", "int64"),

                        column("s", "struct"),

                    ],

                },

            },

        ],

    };

    let lineage = extract_lineage(&[&ast], &data_catalog, false, true)

        .pop()

        .unwrap()?;

    println!("\nLineage: {:?}", lineage.lineage);

    println!("\nReferenced columns: {:?}", lineage.referenced_columns);

    Ok(())

}

```

## Command Line Interface

### Install binary

```bash

cargo install inbq

```

### Extract Lineage

1. Prepare your data catalog: create a JSON file (e.g., [catalog.json](./examples/lineage/catalog.json)) that defines the schema for all tables and views referenced in your SQL queries.

2. Run inbq: pass the catalog file and your [SQL file or directory of multiple SQL files](./examples/lineage/query.sql) to the inbq lineage command.

```bash

inbq extract-lineage \

    --pretty \

    --catalog ./examples/lineage/catalog.json  \

    ./examples/lineage/query.sql

```

The output is written to stdout.

## Concepts

### Lineage

Column-level lineage tracks how data flows from a destination column back to its original source columns. A destination column's value is derived from its direct input columns, and this process is applied recursively to trace the lineage back to the foundational source columns. For example, in `with tmp as (select a+b as tmp_c from t) select tmp_c as c from t`, the lineage for column `c` traces back to `a` and `b` as its source columns (the source table is `t`).

### Lineage - Side Inputs

Side inputs are columns that indirectly contribute to the final set of output values. As the name implies, they aren't part of the direct `SELECT` list, but are found in the surrounding clauses that shape the result, such as `WHERE`, `JOIN`, `WINDOW`, etc. Side inputs influence is traced recursively. For example, in the query:

```sql

with cte as (select id, c1 from table1 where f1>10)

select c2 as z

from table2 inner join cte using (id)

```

`table1.f1` is a side input to `z` with sides `join` and `where` (`cte.id`, later used in the join condition, is filtered by `table1.f1`). The other two side inputs are `table1.id` with side `join` and `table2.id` with side `join`.

### Referenced Columns

Referenced columns provide a detailed map of where each input column is mentioned within a query. This is the entry point for a column into the query's logic. From this initial reference, the column can then influence other parts of the query indirectly through subsequent operations.

## Limitations

While this library can parse and extract lineage for most BigQuery syntax, there are some current limitations. For example, the pipe (`|`) syntax and the recently introduced `MATCH_RECOGNIZE` clause are not yet supported. Requests and contributions for unsupported features are welcome.

## Contributing

Here's a brief overview of the project's key modules:

-   `crates/inbq/src/parser.rs`: contains the hand-written top-down parser.

-   `crates/inbq/src/ast.rs`: defines the Abstract Syntax Tree (AST) nodes.

    -   **Note**: If you add or modify AST nodes here, you must regenerate the corresponding Python nodes. You can do this by running `cargo run --bin inbq_genpy`, which will update `crates/py_inbq/python/inbq/ast_nodes.py`.

-   `crates/inbq/src/lineage.rs`: contains the core logic for extracting column-level lineage from the AST.

-   `crates/py_inbq/`: this crate exposes the Rust backend as a Python module via PyO3.

-   `crates/inbq/tests/`: this directory contains the tests. You can add new test cases for parsing and lineage extraction by editing the `.toml` files:

    -   `parsing_tests.toml`

    -   `lineage_tests.toml`
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/lpraat/inbq

Awesome Lists containing this project

README