https://github.com/michaelosthege/gittrail
Context manager for enforcing links between data pipeline outputs and git history.
- Host: GitHub
- URL: https://github.com/michaelosthege/gittrail
- Owner: michaelosthege
- License: agpl-3.0
- Created: 2021-11-30T11:13:36.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-02-22T16:46:24.000Z (almost 3 years ago)
- Last Synced: 2024-10-11T22:11:21.878Z (2 months ago)
- Topics: data-lineage, data-science
- Language: Python
- Homepage:
- Size: 41 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
[![PyPI version](https://img.shields.io/pypi/v/gittrail)](https://pypi.org/project/gittrail)
[![pipeline](https://github.com/michaelosthege/gittrail/workflows/pipeline/badge.svg)](https://github.com/michaelosthege/gittrail/actions)
[![coverage](https://codecov.io/gh/michaelosthege/gittrail/branch/main/graph/badge.svg)](https://codecov.io/gh/michaelosthege/gittrail)

# `gittrail` - Linking data pipeline outputs to git history
Versioning of code with git is easy, versioning data pipeline inputs/outputs is hard.

``GitTrail`` helps you to maintain a traceable data lineage by enforcing a
link between data files and the commit history of your processing code.

Like blockchain, but easier.
## How it works
``GitTrail`` is used as a context manager around the code that executes your data processing:

```python
from gittrail import GitTrail  # assumption: the class is importable from the package root

with GitTrail(
    repo="/path/to/my_data_processing_code",
    data="/path/to/my_data_storage",
):
    # TODO: download the pipeline inputs to [data]
    ...
```

In between ``GitTrail`` sessions you may edit your pipeline code, make commits etc.
When your next data processing stage is ready:
```python
from gittrail import GitTrail  # assumption: the class is importable from the package root

with GitTrail(
    repo="/path/to/my_data_processing_code",
    data="/path/to/my_data_storage",
):
    # TODO: run data analysis on inputs from [data]
    # TODO: save results to [data]
    ...
```

Upon entering the context, ``GitTrail`` attaches a log handler that re-routes all logging into a `*.log` file in a subdirectory of [data].
When the context exits, the logger is detached and session metadata is stored in a `*.json` file.
The metadata includes the current git commit of your [repo], as well as MD5 hashes of the files inside [data].

Within the context, the following two rules are enforced:
1. The working tree of your code [repo] must be clean (no uncommitted changes).
2. All files currently found in [data] must have been created/changed in a previous ``GitTrail`` context.

Taken together this means that:
* You're not allowed to add, edit, or otherwise change anything in [data] by hand.
* Your data processing code may continue to evolve as you're moving forward through your pipeline.
* You can amend/rewind/rewrite git commits of your processing code, but you must then delete the corresponding files in [data] and the audit trail session file.
* All files in [data] are linked to the processing code that produced them.

## Limitations
``GitTrail`` can't police everything, so keep the following in mind:
- Data outside of [data], for example a database, is not tracked.
  If you're reading/writing data outside of [data], think about how you can trace that in your git history and/or [data] audit trail.
- Code outside of [repo] is not tracked.
  Unless your [repo] specifies exact dependency versions, your code may not be 100 % reproducible.
- Audit trail files are not cryptographically signed, so tampering with them is not detected.
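
One way to soften the dependency limitation is to record the installed package versions next to the pipeline outputs. The following is a minimal sketch and not part of ``GitTrail``: the helper `snapshot_environment` and the file name `environment.json` are hypothetical, and the code relies only on the standard library's `importlib.metadata`.

```python
import json
from importlib import metadata
from pathlib import Path


def snapshot_environment(data_dir: str, filename: str = "environment.json") -> Path:
    """Write the versions of all installed packages to a JSON file in data_dir.

    Hypothetical helper for illustration; GitTrail does not provide it.
    """
    versions = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        # Skip broken installations that carry no readable package name.
        name = meta.get("Name") if meta is not None else None
        if name:
            versions[name] = dist.version
    target = Path(data_dir) / filename
    target.write_text(json.dumps(versions, indent=2, sort_keys=True), encoding="utf-8")
    return target
```

Calling `snapshot_environment("/path/to/my_data_storage")` inside a ``GitTrail`` session would place the snapshot alongside the other outputs. Assuming files written during a session are handled as described under "How it works", it would then be hashed like any other file in [data].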