{"id":28474061,"url":"https://github.com/lpraat/inbq","last_synced_at":"2026-04-26T08:39:04.379Z","repository":{"id":295254353,"uuid":"933398464","full_name":"lpraat/inbq","owner":"lpraat","description":"A library for parsing BigQuery queries and extracting schema-aware, column-level lineage.","archived":false,"fork":false,"pushed_at":"2026-02-06T23:11:48.000Z","size":876,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-02-17T01:54:23.554Z","etag":null,"topics":["bigquery","data-lineage","parser","sql"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lpraat.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-02-15T21:20:40.000Z","updated_at":"2026-02-08T15:46:14.000Z","dependencies_parsed_at":"2025-12-16T05:00:14.085Z","dependency_job_id":null,"html_url":"https://github.com/lpraat/inbq","commit_stats":null,"previous_names":["lpraat/inbq"],"tags_count":19,"template":false,"template_full_name":null,"purl":"pkg:github/lpraat/inbq","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpraat%2Finbq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpraat%2Finbq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpraat%2Finbq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpraat%2Finbq/manife
sts","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lpraat","download_url":"https://codeload.github.com/lpraat/inbq/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpraat%2Finbq/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32291336,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-26T08:29:33.829Z","status":"ssl_error","status_checked_at":"2026-04-26T08:29:18.366Z","response_time":129,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","data-lineage","parser","sql"],"created_at":"2025-06-07T13:00:44.461Z","updated_at":"2026-04-26T08:39:04.374Z","avatar_url":"https://github.com/lpraat.png","language":"Rust","readme":"# inbq\nA library for parsing BigQuery queries and extracting schema-aware, column-level lineage.\n\n### Features\n- Parse BigQuery queries into well-structured ASTs with [easy-to-navigate nodes](#ast-navigation).\n- Extract schema-aware, [column-level lineage](#concepts).\n- Trace data flow through nested structs and arrays.\n- Capture [referenced columns](#referenced-columns) and the specific query components (e.g., select, where, join) they appear in.\n- Process both single and multi-statement queries with procedural language constructs.\n- Built for speed and efficiency, with lightweight Python bindings that add minimal overhead.\n\n\n## 
Python\n### Install\n`pip install inbq`\n\n### Example (Pipeline API)\n```python\nimport inbq\n\ncatalog = {\"schema_objects\": []}\n\ndef add_table(name: str, columns: list[tuple[str, str]]) -\u003e None:\n    catalog[\"schema_objects\"].append({\n        \"name\": name,\n        \"kind\": {\n            \"table\": {\n                \"columns\": [{\"name\": name, \"dtype\": dtype} for name, dtype in columns]\n            }\n        }\n    })\n\nadd_table(\"project.dataset.out\", [(\"id\", \"int64\"), (\"val\", \"float64\")])\nadd_table(\"project.dataset.t1\", [(\"id\", \"int64\"), (\"x\", \"float64\")])\nadd_table(\"project.dataset.t2\", [(\"id\", \"int64\"), (\"s\", \"struct\u003csource string, x float64\u003e\")])\n\nquery = \"\"\"\ndeclare default_val float64 default (select min(val) from project.dataset.out);\n\ninsert into `project.dataset.out`\nselect\n    id,\n    if(x is null or s.x is null, default_val, x + s.x)\nfrom `project.dataset.t1` inner join `project.dataset.t2` using (id)\nwhere s.source = \"baz\";\n\"\"\"\n\npipeline = (\n    inbq.Pipeline()\n    .config(\n        # If the `pipeline` is configured with `raise_exception_on_error=False`,\n        # any error that occurs during parsing or lineage extraction is\n        # captured and returned as an `inbq.PipelineError`\n        raise_exception_on_error=False,\n        # No effect with only one query (may provide a speedup with multiple queries)\n        parallel=True,\n    )\n    .parse()\n    .extract_lineage(catalog=catalog, include_raw=False)\n)\nsqls = [query]\npipeline_output = inbq.run_pipeline(sqls, pipeline=pipeline)\n\n# This loop will iterate just once as we have only one query\nfor i, (ast, output_lineage) in enumerate(\n    zip(pipeline_output.asts, pipeline_output.lineages)\n):\n    assert isinstance(ast, inbq.ast_nodes.Ast), (\n        f\"Could not parse query `{sqls[i][:20]}...` due to: {ast.error}\"\n    )\n\n    print(f\"{ast=}\")\n\n    assert isinstance(output_lineage, 
inbq.lineage.Lineage), (\n        f\"Could not extract lineage from query `{sqls[i][:20]}...` due to: {output_lineage.error}\"\n    )\n\n    print(\"\\nLineage:\")\n    for lin_obj in output_lineage.lineage.objects:\n        print(\"Inputs:\")\n        for lin_node in lin_obj.nodes:\n            print(\n                f\"{lin_obj.name}-\u003e{lin_node.name} \u003c- {[f'{input_node.obj_name}-\u003e{input_node.node_name}' for input_node in lin_node.inputs]}\"\n            )\n\n        print(\"\\nSide inputs:\")\n        for lin_node in lin_obj.nodes:\n            print(\n                f\"\"\"{lin_obj.name}-\u003e{lin_node.name} \u003c- {[f\"{input_node.obj_name}-\u003e{input_node.node_name} @ {','.join(input_node.sides)}\" for input_node in lin_node.side_inputs]}\"\"\"\n            )\n\n    print(\"\\nReferenced columns:\")\n    for ref_obj in output_lineage.referenced_columns.objects:\n        for ref_node in ref_obj.nodes:\n            print(\n                f\"{ref_obj.name}-\u003e{ref_node.name} referenced in {ref_node.referenced_in}\"\n            )\n\n# Prints:\n# ast=Ast(...)\n\n# Lineage:\n# Inputs:\n# project.dataset.out-\u003eid \u003c- ['project.dataset.t2-\u003eid', 'project.dataset.t1-\u003eid']\n# project.dataset.out-\u003eval \u003c- ['project.dataset.t2-\u003es.x', 'project.dataset.t1-\u003ex', 'project.dataset.out-\u003eval']\n#\n# Side inputs:\n# project.dataset.out-\u003eid \u003c- ['project.dataset.t2-\u003es.source @ where', 'project.dataset.t2-\u003eid @ join', 'project.dataset.t1-\u003eid @ join']\n# project.dataset.out-\u003eval \u003c- ['project.dataset.t2-\u003es.source @ where', 'project.dataset.t2-\u003eid @ join', 'project.dataset.t1-\u003eid @ join']\n#\n# Referenced columns:\n# project.dataset.out-\u003eval referenced in ['default_var', 'select']\n# project.dataset.t1-\u003eid referenced in ['join', 'select']\n# project.dataset.t1-\u003ex referenced in ['select']\n# project.dataset.t2-\u003eid referenced in ['join', 'select']\n# 
project.dataset.t2-\u003es.x referenced in ['select']\n# project.dataset.t2-\u003es.source referenced in ['where']\n```\n\n**Note:** What happens if you remove the insert and just keep the select in the query? `inbq` is designed to handle this gracefully. It will return the lineage for the last `SELECT` statement, but since the destination is no longer explicit, the output object (an anonymous query) will be assigned an anonymous identifier (e.g., `!anon_4`). Try it yourself and see how the output changes!\n\nTo learn more about the output elements (Lineage, Side Inputs, and Referenced Columns), please see the [Concepts](#concepts) section.\n\n### Example (Individual Functions)\nIf you don't like the Pipeline API, you can use these functions instead:\n\n#### `parse_sql` and `parse_sql_to_dict`\nParse a single SQL query:\n```python\nast = inbq.parse_sql(query)\n\n# You can also get a dictionary representation of the AST\nast_dict = inbq.parse_sql_to_dict(query)\n```\n\n#### `parse_sqls`\nParse multiple SQL queries in parallel:\n\n```python\nsqls = [query]\nasts = inbq.parse_sqls(sqls, parallel=True)\n```\n\n#### `parse_sqls_and_extract_lineage`\nParse SQLs and extract lineage in one go:\n\n```python\nasts, lineages = inbq.parse_sqls_and_extract_lineage(\n    sqls=[query],\n    catalog=catalog,\n    parallel=True\n)\n```\n\n### AST Navigation\n```python\nimport inbq\nimport inbq.ast_nodes as ast_nodes\n\nsql = \"\"\"\nUPDATE proj.dataset.t1\nSET quantity = quantity - 10,\n    supply_constrained = DEFAULT\nWHERE product like '%washer%';\n\nUPDATE proj.dataset.t2\nSET quantity = quantity - 10\nWHERE product like '%console%';\n\"\"\"\n\nast = inbq.parse_sql(sql)\n\n# Example: find updated tables and columns\nfor node in ast.find_all(\n    ast_nodes.UpdateStatement,\n):\n    match node:\n        case ast_nodes.UpdateStatement(\n            table=table,\n            alias=_,\n            update_items=update_items,\n            from_=_,\n            where=_,\n        ):\n 
           print(f\"Found updated table: {table.name}. Updated columns:\")\n            for update_item in update_items:\n                for node in update_item.column.find_all(\n                    ast_nodes.Identifier,\n                    ast_nodes.QuotedIdentifier\n                ):\n                    match node:\n                        case ast_nodes.Identifier(name=name) | ast_nodes.QuotedIdentifier(name=name):\n                            print(f\"- {name}\")\n\n# Example: find `like` filters\nfor node in ast.find_all(\n    ast_nodes.BinaryExpr,\n):\n    match node:\n        case ast_nodes.BinaryExpr(\n            left=left,\n            operator=ast_nodes.BinaryOperator_Like(),\n            right=right,\n        ):\n            print(left, \"like\", right)\n```\n\n#### Variants and Variant Types in Python\nThe AST nodes in Python are auto-generated dataclasses from their Rust definitions.\nFor instance, a Rust enum `Expr` might be defined as:\n\n```rust\npub enum Expr {\n    // ... more variants here ...\n    Binary(BinaryExpr),\n    Identifier(Identifier),\n    // ... 
more variants here ...\n}\n```\n\nIn Python, this translates to corresponding classes like `Expr_Binary(vty=BinaryExpr)`, `Expr_Identifier(vty=Identifier)`, etc.\nThe `vty` attribute stands for \"variant type\" (unit variants do not have a `vty` attribute).\nYou can search for any type of object using `.find_all()`, whether it's the variant (e.g., `Expr_Identifier`) or the concrete variant type (e.g., `Identifier`).\n\n## Rust\n### Install\n`cargo add inbq`\n\n### Example\n```rust\nuse inbq::{\n    lineage::{\n        catalog::{Catalog, Column, SchemaObject, SchemaObjectKind},\n        extract_lineage,\n    },\n    parser::Parser,\n    scanner::Scanner,\n};\n\nfn column(name: \u0026str, dtype: \u0026str) -\u003e Column {\n    Column {\n        name: name.to_owned(),\n        dtype: dtype.to_owned(),\n    }\n}\n\nfn main() -\u003e anyhow::Result\u003c()\u003e {\n    env_logger::init();\n\n    let sql = r#\"\n        declare default_val float64 default (select min(val) from project.dataset.out);\n\n        insert into `project.dataset.out`\n        select\n            id,\n            if(x is null or s.x is null, default_val, x + s.x)\n        from `project.dataset.t1` inner join `project.dataset.t2` using (id)\n        where s.source = \"baz\";\n    \"#;\n    let mut scanner = Scanner::new(sql);\n    scanner.scan()?;\n    let mut parser = Parser::new(scanner.tokens());\n    let ast = parser.parse()?;\n    println!(\"Syntax Tree: {:?}\", ast);\n\n    let data_catalog = Catalog {\n        schema_objects: vec![\n            SchemaObject {\n                name: \"project.dataset.out\".to_owned(),\n                kind: SchemaObjectKind::Table {\n                    columns: vec![column(\"id\", \"int64\"), column(\"val\", \"int64\")],\n                },\n            },\n            SchemaObject {\n                name: \"project.dataset.t1\".to_owned(),\n                kind: SchemaObjectKind::Table {\n                    columns: vec![column(\"id\", \"int64\"), 
column(\"x\", \"float64\")],\n                },\n            },\n            SchemaObject {\n                name: \"project.dataset.t2\".to_owned(),\n                kind: SchemaObjectKind::Table {\n                    columns: vec![\n                        column(\"id\", \"int64\"),\n                        column(\"s\", \"struct\u003csource string, x float64\u003e\"),\n                    ],\n                },\n            },\n        ],\n    };\n\n    let lineage = extract_lineage(\u0026[\u0026ast], \u0026data_catalog, false, true)\n        .pop()\n        .unwrap()?;\n\n    println!(\"\\nLineage: {:?}\", lineage.lineage);\n    println!(\"\\nReferenced columns: {:?}\", lineage.referenced_columns);\n    Ok(())\n}\n```\n\n## Command Line Interface\n### Install binary\n```bash\ncargo install inbq\n```\n\n### Extract Lineage\n1. Prepare your data catalog: create a JSON file (e.g., [catalog.json](./examples/lineage/catalog.json)) that defines the schema for all tables and views referenced in your SQL queries.\n\n2. Run inbq: pass the catalog file and your [SQL file or directory of multiple SQL files](./examples/lineage/query.sql) to the `inbq extract-lineage` command.\n```bash\ninbq extract-lineage \\\n    --pretty \\\n    --catalog ./examples/lineage/catalog.json \\\n    ./examples/lineage/query.sql\n```\n\nThe output is written to stdout.\n\n## Concepts\n\n### Lineage\nColumn-level lineage tracks how data flows from a destination column back to its original source columns. A destination column's value is derived from its direct input columns, and this process is applied recursively to trace the lineage back to the foundational source columns. For example, in `with tmp as (select a+b as tmp_c from t) select tmp_c as c from tmp`, the lineage for column `c` traces back to `a` and `b` as its source columns (the source table is `t`).\n\n### Lineage - Side Inputs\nSide inputs are columns that indirectly contribute to the final set of output values. 
As the name implies, they aren't part of the direct `SELECT` list, but are found in the surrounding clauses that shape the result, such as `WHERE`, `JOIN`, `WINDOW`, etc. The influence of side inputs is traced recursively. For example, in the query:\n```sql\nwith cte as (select id, c1 from table1 where f1\u003e10)\nselect c2 as z\nfrom table2 inner join cte using (id)\n```\n`table1.f1` is a side input to `z` with sides `join` and `where` (`cte.id`, later used in the join condition, is filtered by `table1.f1`). The other two side inputs are `table1.id` with side `join` and `table2.id` with side `join`.\n\n### Referenced Columns\nReferenced columns provide a detailed map of where each input column is mentioned within a query. This is the entry point for a column into the query's logic. From this initial reference, the column can then influence other parts of the query indirectly through subsequent operations.\n\n## Limitations\nWhile this library can parse and extract lineage for most BigQuery syntax, there are some current limitations. For example, the pipe (`|\u003e`) syntax and the recently introduced `MATCH_RECOGNIZE` clause are not yet supported. Requests and contributions for unsupported features are welcome.\n\n## Contributing\nHere's a brief overview of the project's key modules:\n-   `crates/inbq/src/parser.rs`: contains the hand-written top-down parser.\n-   `crates/inbq/src/ast.rs`: defines the Abstract Syntax Tree (AST) nodes.\n    -   **Note**: If you add or modify AST nodes here, you must regenerate the corresponding Python nodes. You can do this by running `cargo run --bin inbq_genpy`, which will update `crates/py_inbq/python/inbq/ast_nodes.py`.\n-   `crates/inbq/src/lineage.rs`: contains the core logic for extracting column-level lineage from the AST.\n-   `crates/py_inbq/`: this crate exposes the Rust backend as a Python module via PyO3.\n-   `crates/inbq/tests/`: this directory contains the tests. 
You can add new test cases for parsing and lineage extraction by editing the `.toml` files:\n    -   `parsing_tests.toml`\n    -   `lineage_tests.toml`\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flpraat%2Finbq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flpraat%2Finbq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flpraat%2Finbq/lists"}