Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/canva-public/dbt-column-lineage-extractor
A lightweight Python-based tool for extracting and analyzing data column lineage for dbt projects
https://github.com/canva-public/dbt-column-lineage-extractor
Last synced: 4 days ago
JSON representation
A lightweight Python-based tool for extracting and analyzing data column lineage for dbt projects
- Host: GitHub
- URL: https://github.com/canva-public/dbt-column-lineage-extractor
- Owner: canva-public
- License: mit
- Created: 2024-10-15T02:09:33.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-11-24T22:10:00.000Z (3 months ago)
- Last Synced: 2025-02-01T06:41:18.539Z (11 days ago)
- Language: Python
- Size: 176 KB
- Stars: 128
- Watchers: 1
- Forks: 5
- Open Issues: 0
-
Metadata Files:
- Readme: readme.md
- License: LICENSE
Awesome Lists containing this project
- awesome-dbt - dbt-column-lineage-extractor - Extract column level linage from dbt projects. (Utilities)
README
# DBT Column Lineage Extractor
# DISCLAIMER
**WARNING:** This tool is currently in beta and has only been tested on a limited number of dbt projects using the `snowflake` dialect. It might not perform as expected in every situation. Please report any issues or suggestions in the [Repository](https://github.com/canva-public/dbt-column-lineage-extractor)
## Overview
The DBT Column Lineage Extractor is a lightweight Python-based tool for extracting and analyzing data column lineage for dbt projects. This tool utilizes the [sqlglot](https://github.com/tobymao/sqlglot) library to parse and analyze SQL queries defined in your dbt models and maps their column lineage relationships.
## GitHub Repository
[dbt Column Lineage Extractor](https://github.com/canva-public/dbt-column-lineage-extractor)## Features
- Extract column level lineage for specified model columns, including direct and recursive relationships.
- Output results in a human-readable JSON format, which can be programmatically integrated for use cases such as data impact analysis, data tagging, etc.; or visualized with other tools.## Installation
### pip installation
```
pip install dbt-column-lineage-extractor==0.1.4b1
```## Required Input Files
To run the DBT Column Lineage Extractor, you need the following files:
- **`catalog.json`**: Provides the schema of the models, including names and types of the columns.
- **`manifest.json`**: Offers model-level lineage information.These files are generated by executing the command:
```bash
dbt docs generate
```### Important Notes
- The `dbt docs generate` command does not parse your SQL syntax. Instead, it connects to the data warehouse to retrieve schema information.
- Ensure that the relevant models are materialized in your dbt project as either tables or views for accurate schema information.
- If the models aren't materialized in your development environment, you might use the `--target` flag to specify an alternative target environment with all models materialized (e.g., `--target prod`), given you have access to it.
- After modifying the schemas, update the materialized models in your warehouse before running the `dbt docs generate` command.## Example Usage and Customization
The DBT Column Lineage Extractor can be used in two ways: via the command line interface or by integrating the Python scripts into your codebase.
```bash
cd examples
```### Option 1 - Command Line Interface
First, generate column lineage relationships to model's direct parents and children using the `dbt_column_lineage_direct` command, e.g.:
```bash
dbt_column_lineage_direct --manifest ./inputs/manifest.json --catalog ./inputs/catalog.json
```
Then analyze recursive column lineage relationships for a specific model and column using the `dbt_column_lineage_recursive` command, e.g.:
```bash
dbt_column_lineage_recursive --model model.jaffle_shop.stg_orders --column order_id
```See more usage guides using `dbt_column_lineage_direct -h` and `dbt_column_lineage_recursive -h`.
### Option 2 - Python Scripts
See the [readme file](./examples/readme.md) in the `examples` directory for more detailed instructions on how to integrate the DBT Column Lineage Extractor into your python scripts.##### Example Outputs
- `seed.jaffle_shop.raw_orders -- id`
- Structured Ancestors:
```json
{}
```
- Structured Descendants:
```json
{
"model.jaffle_shop.stg_orders": {
"order_id": {
"+": {
"model.jaffle_shop.customers": {
"number_of_orders": {
"+": {}
}
},
"model.jaffle_shop.orders": {
"order_id": {
"+": {}
}
}
}
}
}
}
```- `model.jaffle_shop.stg_orders -- order_id`
- Structured Ancestors:
```json
{
"seed.jaffle_shop.raw_orders": {
"id": {
"+": {}
}
}
}
```
- Structured Descendants:
```json
{
"model.jaffle_shop.customers": {
"number_of_orders": {
"+": {}
}
},
"model.jaffle_shop.orders": {
"order_id": {
"+": {}
}
}
}
```## Example Visualization
The structured JSON outputs can be used programmatically, or loaded into visualization tools like [jsoncrack.com](https://jsoncrack.com/editor) to visualize the column lineage relationships and dependencies.
![visualize](images/visualize.png)## Limitations
- Doesn’t support parse certain syntax, e.g. lateral flatten
- Doesn’t support dbt python models
- Only tested with `snowflake` dialect so far