Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/gtoonstra/sqlineage

A parser for SQL, which gives back identifiers and a hierarchical model for lineage tracking
https://github.com/gtoonstra/sqlineage

Last synced: 1 day ago
JSON representation

A parser for SQL, which gives back identifiers and a hierarchical model for lineage tracking

Awesome Lists containing this project

README

        

# sqlineage

This is an extension module generated with ragel, which serves to scan SQL files
and extract SELECT, INSERT, CTE operations out of the file. It only extracts the table
names for now, which can be used to establish data lineage models at table level.

It probably only works on Ubuntu right now because of the way the code is compiled.

### Installation

1. Install ragel: `sudo apt-get install ragel` or `brew install ragel` on OSX.
2. Create a virtual environment venv for python3: `python3 -m venv venv`
3. Source the environment: `source venv/bin/activate`
4. Install pip, pip-tools, setuptools and wheel:

```
pip install pip --upgrade
pip install setuptools --upgrade
pip install pip-tools --upgrade
pip install wheel --upgrade
pip install pytest --upgrade
```

5. Execute "compile.sh": `./compile.sh`
6. Compile and install the module in your virtual environment: `pip install . --upgrade`

### Testing

1. Source the environment created before: `source venv/bin/activate`
2. Run pytest: `pytest`

### API and usage

```python
import sqlineage

def callback(parent, table, alias, query_alias, joins, operation, level):
print(parent, table, alias, query_alias, joins, operation, level)

sqlineage.scan('SELECT * FROM foo', callback)
```

The scanner calls the callback for each SQL 'block' that it finds. Subselects are supported.
What it does is help you build a hierarchical model of the SQL using only the table names.
This can be a first step towards building a converter that scans directories of SQL files and
outputs a file that shows how tables depend on one another. This can then be used to feed a
Graph database that records those dependencies. This graph database is then further useful for
querying and creating visualizations of these relationships.

There are thus only five elements you get back:
- parent: The parent of the current block being output
- table: The table name that is selected from or inserted into (not always available)
- alias: The alias to refer to a subselect or CTE
- query_alias: A reference to a query block
- joins: Any joins on the table in this SELECT, returned as a `|,|` string.
- operation: The operation taking place in that SQL block
- level: The hierarchical level where the code was found, useful for subselects.

### Example

A simple SQL file that looks like this:

```sql
INSERT INTO subselects( a )

SELECT
foo.a
FROM (
SELECT DISTINCT
b.a
FROM foo.bar.tablename b WITH (nolock)
WHERE b.a IS NOT NULL
UNION
SELECT DISTINCT
c.a
FROM abc.dbo.xyz c WITH (nolock)

UNION
SELECT DISTINCT
d.a
FROM abc.def.xyz d WITH (nolock)
WHERE DATEPART (weekday, GETDATE ()) = 1) foo;
```

will be output as this:

```
'ROOT', 'ROOT', 'ROOT', 'NONE', 0
'ROOT', 'subselects', 'subselects', '', 'INSERT', 1
'ROOT', '', 'foo', '', 'SELECT', 1
'foo', 'foo.bar.tablename', 'b', '', 'SELECT', 2
'foo', 'abc.dbo.xyz', 'c', '', 'SELECT', 2
'foo', 'abc.def.xyz', 'd', '', 'SELECT', 2
```