https://github.com/guiferviz/tuberia
Data engineering meets software engineering
- Host: GitHub
- URL: https://github.com/guiferviz/tuberia
- Owner: guiferviz
- Created: 2022-06-18T11:44:53.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-01-12T23:31:54.000Z (about 2 years ago)
- Last Synced: 2024-04-20T07:02:55.111Z (10 months ago)
- Topics: data, data-engineering, expectations, pipeline, python, spark
- Language: Python
- Homepage: https://guiferviz.com/tuberia/
- Size: 1.08 MB
- Stars: 3
- Watchers: 3
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
README
Data engineering meets software engineering

---

:books: **Documentation**:
https://aidictive.github.io/tuberia

:keyboard: **Source Code**:
https://github.com/aidictive/tuberia

---
## 🤔 What is this?
Tuberia was born from the need to bring the worlds of data and software
engineering closer together. Here is a list of common problems in data
projects:

* Loooooong SQL queries impossible to understand/test.
* A lot of duplicate code due to the difficulty of reusing it in SQL queries.
* Lack of tests, sometimes because the framework in use does not facilitate
  testing tasks.
* Lack of documentation.
* Discrepancies between the existing documentation and the latest deployed code.
* A set of notebooks deployed under the Databricks Share folder.
* A generic notebook with utility functions.
* Use of drag-and-drop frameworks that limit the developer's creativity.
* Months of intense work to migrate existing pipelines from one orchestrator to
  another (e.g. from Airflow to Prefect, from Databricks Jobs to Data
  Factory...).

Tuberia aims to solve all these problems and many others.
## 🤓 How does it work?
You can think of Tuberia as a compiler. Instead of compiling a
programming language, it compiles the steps necessary for your data pipeline to
run successfully.

Tuberia is not an orchestrator, but it allows you to run the code you write in
Python in any existing orchestrator: Airflow, Prefect, Databricks Jobs, Data
Factory...

Tuberia abstracts away where the code is executed, but defines
precisely which steps are necessary to execute it. For example, the following
code creates a PySpark DataFrame with the `range` function and stores it as a Delta
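table.

```python
import pyspark.sql.functions as F

from tuberia import PySparkTable, run


class Range(PySparkTable):
    """Table with numbers from 0 to `n - 1`.

    Attributes:
        n: Number of rows in the table (exclusive upper bound of `id`).

    """

    n: int = 10

    def df(self):
        # `spark.range(n)` yields a single `id` column with values 0..n-1.
        return self.spark.range(self.n).withColumn("id", F.col(self.schema.id))


class DoubleRange(PySparkTable):
    """Table with every `id` of `Range` multiplied by 2."""

    range: Range = Range()

    def df(self):
        # Read the upstream table and double each value.
        return self.range.read().withColumn("id", F.col("id") * 2)


run(DoubleRange())
```

!!! warning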
    The code above may not work yet and may change. Please note that this
    project is at an early stage of development.

All docstrings included in the code will be used to generate documentation
about your data pipeline. That information, together with the results of data
expectations/data quality rules, will help you always have complete and
up-to-date documentation.

Besides that, as you have seen, Tuberia is pure Python, so writing unit tests/data
tests is very easy. Programming gurus will enjoy data engineering again!
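
To make the compiler analogy more concrete, here is a minimal sketch of how a pipeline could be "compiled" into an ordered list of steps and documented from its docstrings. It is not Tuberia's implementation: the `Table` base class, `compile_plan`, `describe_plan`, and `run` below are hypothetical names that use only the Python standard library, and plain lists stand in for Spark DataFrames.

```python
import inspect
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Table:
    """Hypothetical stand-in for a pipeline table (not Tuberia's real API)."""

    def dependencies(self):
        # Assumption: any attribute that holds another Table is an upstream table.
        members = inspect.getmembers(self, lambda value: isinstance(value, Table))
        return [value for _, value in members]

    def build(self, inputs):
        raise NotImplementedError


class Numbers(Table):
    """Numbers from 0 to n - 1."""

    n = 5

    def build(self, inputs):
        return list(range(self.n))


class Doubled(Table):
    """Every value of Numbers multiplied by 2."""

    numbers = Numbers()

    def build(self, inputs):
        return [value * 2 for value in inputs[self.numbers]]


def compile_plan(table):
    """Return every table needed to build `table`, dependencies first."""
    graph = {}
    pending = [table]
    while pending:
        current = pending.pop()
        if current in graph:
            continue
        graph[current] = current.dependencies()
        pending.extend(graph[current])
    # TopologicalSorter expects a mapping of node -> predecessors.
    return list(TopologicalSorter(graph).static_order())


def describe_plan(table):
    """Generate one documentation line per step from the class docstrings."""
    return [
        f"{type(step).__name__}: {inspect.getdoc(type(step))}"
        for step in compile_plan(table)
    ]


def run(table):
    """Execute the compiled steps in order and return the final result."""
    results = {}
    for step in compile_plan(table):
        results[step] = step.build(results)
    return results[table]


print(describe_plan(Doubled()))
print(run(Doubled()))  # [0, 2, 4, 6, 8]
```

Because the plan is just an ordered list of plain Python objects, the same list could in principle be translated into tasks for whichever orchestrator you use, and each step can be checked in an ordinary unit test with a few assertions.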