https://github.com/guiferviz/tuberia

Data engineering meets software engineering

data data-engineering expectations pipeline python spark

Host: GitHub
URL: https://github.com/guiferviz/tuberia
Owner: guiferviz
Created: 2022-06-18T11:44:53.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-01-12T23:31:54.000Z (over 2 years ago)
Last Synced: 2024-04-20T07:02:55.111Z (about 1 year ago)
Topics: data, data-engineering, expectations, pipeline, python, spark
Language: Python
Homepage: https://guiferviz.com/tuberia/
Size: 1.08 MB
Stars: 3
Watchers: 3
Forks: 0
Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md

README

Data engineering meets software engineering



---

:books: **Documentation**: https://aidictive.github.io/tuberia

:keyboard: **Source Code**: https://github.com/aidictive/tuberia



---

## 🤔 What is this?

Tuberia was born from the need to bring the worlds of data and software
engineering closer together. Here is a list of common problems in data
projects:

* Loooooong SQL queries that are impossible to understand or test.
* A lot of duplicate code, because code is hard to reuse in SQL queries.
* Lack of tests, sometimes because the framework in use does not make testing
  easy.
* Lack of documentation.
* Discrepancies between the existing documentation and the latest deployed code.
* A set of notebooks deployed under the Databricks Shared folder.
* A generic notebook with utility functions.
* Use of drag-and-drop frameworks that limit the developer's creativity.
* Months of intense work to migrate existing pipelines from one orchestrator to
  another (e.g. from Airflow to Prefect, from Databricks Jobs to Data
  Factory...).

Tuberia aims to solve all these problems and many others. 

## 🤓 How does it work?

You can think of Tuberia as a compiler. Instead of compiling a programming
language, it compiles the steps necessary for your data pipeline to run
successfully.

Tuberia is not an orchestrator, but it allows you to run the code you write in
Python in any existing orchestrator: Airflow, Prefect, Databricks Jobs, Data
Factory...
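
To make the compiler analogy more concrete, here is a minimal sketch of how a set of dependent steps could be turned into an ordered execution plan that any orchestrator could then run. It is not Tuberia's actual implementation: `Step` and `compile_plan` are hypothetical names introduced for illustration, and the ordering relies on the standard library's `graphlib` (Python 3.9+).

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Step:
    """Hypothetical pipeline step: a name plus the steps it depends on."""

    name: str
    depends_on: tuple["Step", ...] = ()


def compile_plan(target: Step) -> list[str]:
    """Return step names in an order that satisfies every dependency."""
    graph: dict[str, set[str]] = {}
    pending = [target]
    # Walk the dependency graph starting from the requested step.
    while pending:
        step = pending.pop()
        if step.name in graph:
            continue
        graph[step.name] = {dep.name for dep in step.depends_on}
        pending.extend(step.depends_on)
    # A topological sort places every dependency before its dependents.
    return list(TopologicalSorter(graph).static_order())


range_step = Step("range")
double_range = Step("double_range", depends_on=(range_step,))
plan = compile_plan(double_range)
print(plan)
```

The resulting list is plain data, so in principle it could be translated into tasks for whichever orchestrator is in use.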

Tuberia abstracts away where the code is executed, but defines precisely which
steps are necessary to execute it. For example, the following code creates a
PySpark DataFrame with the `range` function and creates a Delta table from it.

```python
import pyspark.sql.functions as F

from tuberia import PySparkTable, run


class Range(PySparkTable):
    """Table with numbers from 0 to `n - 1`.

    Attributes:
        n: Number of rows in the table.

    """

    n: int = 10

    def df(self):
        # `spark.range(n)` yields a single `id` column with values 0..n-1.
        return self.spark.range(self.n).withColumn("id", F.col(self.schema.id))


class DoubleRange(PySparkTable):
    """Table with every `id` of `Range` doubled."""

    # Upstream table that this table reads from.
    range: Range = Range()

    def df(self):
        return self.range.read().withColumn("id", F.col("id") * 2)


run(DoubleRange())
```

!!! warning

    The code above may not work yet and may change. Please note that this
    project is at an early stage of development.

All docstrings included in the code will be used to generate documentation
about your data pipeline. That information, together with the results of data
expectations/data quality rules, will help you keep your documentation complete
and up to date.
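
As a rough illustration of the docstring idea, the sketch below collects class docstrings with the standard library's `inspect.getdoc` and renders them as Markdown sections. It only shows the general technique: `render_docs` and the two placeholder classes are hypothetical and are not part of Tuberia.

```python
import inspect


class Range:
    """Table with consecutive numbers in an `id` column."""


class DoubleRange:
    """Table with every `id` of `Range` doubled."""


def render_docs(tables) -> str:
    """Build one Markdown section per class from its docstring."""
    sections = []
    for table in tables:
        # `inspect.getdoc` returns the cleaned-up docstring, or None.
        doc = inspect.getdoc(table) or "(undocumented)"
        sections.append(f"## {table.__name__}\n\n{doc}")
    return "\n\n".join(sections)


docs = render_docs([Range, DoubleRange])
print(docs)
```

Because the text comes straight from the code, regenerating it on every deployment keeps the documentation aligned with what is actually running.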

Besides that, as you have seen, Tuberia is pure Python, so writing unit tests
and data tests is very easy. Programming gurus will enjoy data engineering
again!
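
To show why plain Python classes are easy to test, here is a small sketch that replaces an upstream table with a fake one. It deliberately avoids Spark and Tuberia: the classes below are hypothetical stand-ins that use Python lists instead of DataFrames, so the same pattern would need adapting to the real API.

```python
class Range:
    """Hypothetical stand-in for a table that yields 0..n-1."""

    def __init__(self, n: int = 10):
        self.n = n

    def rows(self) -> list[int]:
        return list(range(self.n))


class DoubleRange:
    """Hypothetical stand-in for a table that doubles its upstream values."""

    def __init__(self, source):
        self.source = source

    def rows(self) -> list[int]:
        return [value * 2 for value in self.source.rows()]


class FakeRange:
    """Test double with fixed values, so the test needs no real upstream."""

    def rows(self) -> list[int]:
        return [1, 2, 3]


def test_double_range_with_fake_upstream():
    assert DoubleRange(FakeRange()).rows() == [2, 4, 6]


def test_double_range_with_real_upstream():
    assert DoubleRange(Range(n=3)).rows() == [0, 2, 4]


# A test runner such as pytest would discover these; here they run directly.
test_double_range_with_fake_upstream()
test_double_range_with_real_upstream()
```

Passing the upstream table in as a dependency is what makes the fake possible: the logic under test never needs to know where its input comes from.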
