https://github.com/firelink-sh/evolve-py

A highly efficient, composable, and lightweight ETL and data integration framework.

analytics arrow big-data data data-engineering data-integration data-science duckdb elt etl ingestion ingress ml olap pipeline polars postgresql python s3

[![CI](https://github.com/firelink-sh/evolve-py/actions/workflows/ci.yml/badge.svg)](https://github.com/firelink-sh/evolve-py/actions/workflows/ci.yml)
[![Tests](https://github.com/firelink-sh/evolve-py/actions/workflows/tests.yml/badge.svg)](https://github.com/firelink-sh/evolve-py/actions/workflows/tests.yml)
[![codecov](https://codecov.io/gh/firelink-sh/evolve-py/graph/badge.svg?token=OTFIM6UICZ)](https://codecov.io/gh/firelink-sh/evolve-py)


> evolve is currently in early development and frequently undergoes breaking changes
> to the core API and functionality. Expect a more stable version to be released in a couple
> of weeks.

evolve is an **open-source** and **platform-agnostic** Python framework that enables your data teams to **efficiently integrate data** from a wide variety of **structured** or **unstructured** data sources into your **database**, **data warehouse**, or **data lake(house)**. It is **blazingly fast** with **minimal memory overhead** thanks to the Apache Arrow ecosystem.

It is **built for developers** with a **code-first** mindset. You will not find any low-code, clickops, or drag-and-drop shenanigans here.
evolve offers you full control of how your data is read, parsed, handled in-memory, transformed, and finally written to any destination you need.

- **Composable** - Design data pipelines that fit your own stack, and add any extra (possibly proprietary) sources or targets you need, all through evolve's intuitive and lightweight framework philosophy.
- **Blazing fast** - Zero-copy principles built on Apache Arrow give you very fast in-memory operations, well suited to OLAP, and easy interoperability with DuckDB, Polars, Spark, DataFusion, and many other query engines.
- **Customizable** - You choose the backend that you want to use. Do you prefer DataFrames? Use Polars! Or perhaps you prefer to work on data using SQL? Then use the DuckDB backend! It is completely up to you.
- **Platform agnostic** - Run your ETL/ELT with evolve on your own infrastructure, with no vendor lock-in, ever.

## Architecture (alpha version)

```mermaid
flowchart TD
    %% Sources (Connectors)
    subgraph Sources
        CSV[Local CSV Source]
        JSON[HDFS JSON Source]
        Parquet[S3 Parquet Source]
        SQL[SQL Source]
        Custom[Custom Source]
    end

    %% Intermediate Representation
    subgraph Backend
        Arrow[Apache Arrow / Polars / DuckDB / Custom]
    end

    %% Targets (Connectors)
    subgraph Targets
        S3[S3 object store]
        Local[Local file system]
        HDFS[Hadoop file system]
        DW[Data Warehouse]
        ML[ML Pipeline]
        CustomOut[Custom Format]
    end

    %% Mapping logic
    CSV -->|Map to Arrow| Arrow
    JSON -->|Map to Arrow| Arrow
    SQL -->|Map to Arrow| Arrow
    Custom -->|Conditional Mapping| Arrow
    Parquet -->|Direct Mapping| S3

    Arrow --> S3
    Arrow --> Local
    Arrow --> HDFS
    Arrow --> DW
    Arrow --> ML
    Arrow --> Viz
    Arrow --> CustomOut
```
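
The diagram can be read as a small connector contract: every source maps its input to a shared in-memory representation, and every target consumes that representation. The sketch below illustrates the idea with the standard library only. The `Source` and `Target` protocols, `CsvTextSource`, `JsonLinesTarget`, and `transfer` are hypothetical names chosen for illustration, not evolve's API, and a plain list of dicts stands in for the Arrow backend.

```python
import csv
import io
import json
from typing import Protocol

# A plain list of dicts stands in for the Arrow-based intermediate
# representation; this is an assumption that keeps the sketch dependency-free.
Rows = list[dict[str, str]]


class Source(Protocol):
    """Hypothetical connector contract: produce the intermediate representation."""

    def read(self) -> Rows: ...


class Target(Protocol):
    """Hypothetical connector contract: consume the intermediate representation."""

    def write(self, rows: Rows) -> None: ...


class CsvTextSource:
    """Custom source that parses CSV text held in memory."""

    def __init__(self, text: str) -> None:
        self.text = text

    def read(self) -> Rows:
        return list(csv.DictReader(io.StringIO(self.text)))


class JsonLinesTarget:
    """Custom target that serializes each row as one JSON line."""

    def __init__(self) -> None:
        self.buffer = io.StringIO()

    def write(self, rows: Rows) -> None:
        for row in rows:
            self.buffer.write(json.dumps(row) + "\n")


def transfer(source: Source, target: Target) -> int:
    """Move data from any source to any target through the shared representation."""
    rows = source.read()
    target.write(rows)
    return len(rows)


source = CsvTextSource("order_id,amount\n1,250\n2,80\n")
target = JsonLinesTarget()
moved = transfer(source, target)
```

Because `transfer` depends only on the two protocols, a new source or target can be added without touching existing connectors, which is the property the diagram's "Custom Source" and "Custom Format" nodes suggest.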

## Example usage

```python
import evolve as ev

# Pipelines are lazy: nothing is read or written until run() is called.
# DropNulls is a transform; its import is not shown in this snippet.
pipeline = (
    ev.Pipeline("parquet-ingestion")
    .with_source(ev.io.FixedWidthFile(...))
    .with_target(ev.io.ParquetFile(...))
    .with_transform(DropNulls(columns=(...,)))
)

pipeline.run() # runs the ETL
```
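
The comment in the example says pipelines are lazy. One way to picture that behaviour is a builder that only records its source, transforms, and target, and touches data only when `run()` is called. The following is a minimal sketch under that assumption; `LazyPipeline` and the callables passed to it are hypothetical stand-ins, not evolve's implementation.

```python
from typing import Callable, Optional

Rows = list[dict]


class LazyPipeline:
    """Hypothetical builder: records steps, executes nothing until run()."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._source: Optional[Callable[[], Rows]] = None
        self._transforms: list[Callable[[Rows], Rows]] = []
        self._target: Optional[Callable[[Rows], None]] = None

    def with_source(self, source: Callable[[], Rows]) -> "LazyPipeline":
        self._source = source
        return self  # returning self is what allows method chaining

    def with_transform(self, transform: Callable[[Rows], Rows]) -> "LazyPipeline":
        self._transforms.append(transform)
        return self

    def with_target(self, target: Callable[[Rows], None]) -> "LazyPipeline":
        self._target = target
        return self

    def run(self) -> None:
        if self._source is None or self._target is None:
            raise ValueError("pipeline needs both a source and a target")
        rows = self._source()
        for transform in self._transforms:
            rows = transform(rows)
        self._target(rows)


reads = []    # records every time the source is actually read
written = []  # collects what the target receives


def read_orders() -> Rows:
    reads.append("read")
    return [{"id": 1, "total": 250}, {"id": 2, "total": None}]


def drop_null_totals(rows: Rows) -> Rows:
    return [row for row in rows if row["total"] is not None]


pipeline = (
    LazyPipeline("demo")
    .with_source(read_orders)
    .with_transform(drop_null_totals)
    .with_target(written.extend)
)
reads_before_run = len(reads)  # building the pipeline has not read anything
pipeline.run()
```

Deferring all work to `run()` means a pipeline can be built, inspected, or passed around before any I/O takes place.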

You can also configure a pipeline with YAML or JSON:

```yml
source:
  type: postgres
  host: localhost
  db: prod
  user: admin
  password: secret
  schema: sales
  tables: orders

transforms:
  - type: drop_nulls
    columns: ["order_id", "amount"]
  - type: rename_columns
    mapping:
      order_id: id
      amount: total
  - type: filter_rows
    condition: "total > 100"

target:
  type: parquet
  path: s3://prod/sales/orders.parquet
```
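
To show how a declarative `transforms` list like the one above could be interpreted, here is a minimal sketch that expresses the same section as JSON and applies it to in-memory rows. The transform functions, the `TRANSFORMS` registry, the simple condition parser, and the sample rows are all assumptions made for illustration; evolve's actual configuration loader may work differently.

```python
import json
import operator

# JSON form of the `transforms` section from the YAML example.
CONFIG = json.loads("""
{
  "transforms": [
    {"type": "drop_nulls", "columns": ["order_id", "amount"]},
    {"type": "rename_columns", "mapping": {"order_id": "id", "amount": "total"}},
    {"type": "filter_rows", "condition": "total > 100"}
  ]
}
""")

OPERATORS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def drop_nulls(rows, columns):
    """Keep rows where none of the listed columns is None."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


def rename_columns(rows, mapping):
    """Rename keys according to mapping; other keys are left unchanged."""
    return [{mapping.get(k, k): v for k, v in r.items()} for r in rows]


def filter_rows(rows, condition):
    """Keep rows matching a condition of the form '<column> <op> <number>'."""
    # Deliberately tiny parser; real conditions would need a proper expression parser.
    column, op, value = condition.split()
    return [r for r in rows if OPERATORS[op](r[column], float(value))]


# Hypothetical registry mapping each config `type` to its implementation.
TRANSFORMS = {
    "drop_nulls": drop_nulls,
    "rename_columns": rename_columns,
    "filter_rows": filter_rows,
}


def apply_transforms(rows, config):
    """Apply the configured transforms in order."""
    for step in config["transforms"]:
        options = {k: v for k, v in step.items() if k != "type"}
        rows = TRANSFORMS[step["type"]](rows, **options)
    return rows


orders = [
    {"order_id": 1, "amount": 250},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 80},
]
result = apply_transforms(orders, CONFIG)
```

The order of the list matters: `filter_rows` refers to `total`, which only exists after `rename_columns` has run.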

## License

evolve is distributed under the terms of both the MIT License and the Apache License (version 2.0).

See LICENSE-APACHE and LICENSE-MIT for details.