https://github.com/firelink-sh/evolve-py

A highly efficient, composable, and lightweight ETL and data integration framework

Host: GitHub
URL: https://github.com/firelink-sh/evolve-py
Owner: firelink-sh
License: apache-2.0
Created: 2025-07-18T19:06:14.000Z (3 months ago)
Default Branch: main
Last Pushed: 2025-09-14T22:03:43.000Z (22 days ago)
Last Synced: 2025-09-14T22:16:35.808Z (22 days ago)
Topics: analytics, arrow, big-data, data, data-engineering, data-integration, data-science, duckdb, elt, etl, ingestion, ingress, ml, olap, pipeline, polars, postgresql, python, s3
Language: Python
Homepage:
Size: 2.22 MB
Stars: 5
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE-APACHE

README

A highly efficient, composable, and lightweight ETL and data integration framework

[![CI](https://github.com/firelink-sh/evolve-py/actions/workflows/ci.yml/badge.svg)](https://github.com/firelink-sh/evolve-py/actions/workflows/ci.yml)
[![Tests](https://github.com/firelink-sh/evolve-py/actions/workflows/tests.yml/badge.svg)](https://github.com/firelink-sh/evolve-py/actions/workflows/tests.yml)
[![codecov](https://codecov.io/gh/firelink-sh/evolve-py/graph/badge.svg?token=OTFIM6UICZ)](https://codecov.io/gh/firelink-sh/evolve-py)

evolve is an **open-source** and **platform agnostic** Python framework that enables your data teams to **efficiently integrate data** from a wide variety of **structured** or **unstructured** data sources into your **database**, **data warehouse**, or **data lake(house)** — **blazingly fast** with **minimal memory overhead** thanks to Apache Arrow.

It is **built for developers** with a **code-first** mindset. You will not find any low-code, clickops, or drag-and-drop shenanigans here.
evolve offers you full control of how your data is read, parsed, handled in-memory, transformed, and finally written to any destination you need.

- **Composable** - Design data pipelines that fit your own stack, and add any extra (possibly proprietary) sources or targets you need, all within evolve's intuitive and lightweight framework.
- **Blazing fast** - Zero-copy principles built on Apache Arrow give you extremely rapid in-memory operations, perfect for OLAP, and easy interoperability with DuckDB, Polars, Spark, DataFusion, and many more query engines.
- **Customizable** - You choose the backend that you want to use. Do you prefer DataFrames? Use Polars! Or perhaps you prefer to work on data using SQL? Then use the DuckDB backend! It is completely up to you.
- **Platform agnostic** - Run your ETL/ELT using evolve on your own infrastructure, with no vendor lock-in, ever.

## Architecture (alpha version)

```mermaid
flowchart TD
%% Sources (Connectors)
subgraph Sources
CSV[Local CSV Source]
JSON[HDFS JSON Source]
Parquet[S3 Parquet Source]
SQL[SQL Source]
Custom[Custom Source]
end

%% Intermediate Representation
subgraph Backend
Arrow[Apache Arrow / Polars / DuckDB / Custom]
end

%% Targets (Connectors)
subgraph Targets
S3[S3 object store]
Local[Local file system]
HDFS[Hadoop file system]
DW[Data Warehouse]
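ML[ML Pipeline]
Viz[Visualization]
CustomOut[Custom Format]
end

%% Sources feed the backend
CSV --> Arrow
JSON --> Arrow
Parquet --> Arrow
SQL --> Arrow
Custom --> Arrow

%% The backend feeds the targets
Arrow --> S3
Arrow --> Local
Arrow --> HDFS
Arrow --> DW
Arrow --> ML
Arrow --> Viz
Arrow --> CustomOut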
```
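
The diagram can be read as a three-stage contract: a source produces data, the backend holds it in memory, and a target consumes it. Below is a minimal sketch of that contract using only the Python standard library. The names `Source`, `Target`, `CsvTextSource`, `JsonLinesTarget`, and `move` are hypothetical and are not evolve's API, and a plain list of dicts stands in for the Arrow backend.

```python
from __future__ import annotations

import csv
import io
import json
from typing import Dict, List, Protocol

# Stand-in for an Arrow table: a list of row dicts (assumption for illustration).
Records = List[Dict[str, str]]


class Source(Protocol):
    """Anything that can produce records."""

    def read(self) -> Records: ...


class Target(Protocol):
    """Anything that can consume records."""

    def write(self, records: Records) -> None: ...


class CsvTextSource:
    """Hypothetical source that parses CSV text into records."""

    def __init__(self, text: str) -> None:
        self.text = text

    def read(self) -> Records:
        return list(csv.DictReader(io.StringIO(self.text)))


class JsonLinesTarget:
    """Hypothetical target that writes one JSON object per line to a buffer."""

    def __init__(self) -> None:
        self.buffer = io.StringIO()

    def write(self, records: Records) -> None:
        for record in records:
            self.buffer.write(json.dumps(record) + "\n")


def move(source: Source, target: Target) -> int:
    """Read from any source, write to any target, and return the row count."""
    records = source.read()
    target.write(records)
    return len(records)


source = CsvTextSource("order_id,amount\n1,250\n2,80\n")
target = JsonLinesTarget()
moved = move(source, target)
```

Because `move` depends only on the two protocols, a custom source or target can be added without touching the rest of the sketch, which is the kind of composability the diagram describes.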

## Why evolve?

- Ingress and ETL/ELT are, for some reason, difficult and costly for organizations to manage, and without clear standards or frameworks they rapidly become messy.
- No low-code, UI, or drag-and-drop tooling: evolve is made for data engineers, not business managers.
- No vendor lock-in. Easy to audit, extend, and run wherever you like.
- A standardized interface and framework, with room for your own custom logic.
- Arrow native:
  - Fast in-memory operations (perfect for OLAP).
  - Easy interoperability with DuckDB, Pandas, Polars, Spark, etc.
  - Potential for streaming, GPU acceleration, and real-time analytics.
- Deployment agnostic: you run it however you want to.
- Community potential :)

This is not a replacement for Fivetran or Airbyte. We are offering a **developer-first alternative** that is:

- lightweight
- transparent
- extensible
- free
- highly performant

There is no reason to reinvent the wheel for your ETL needs - use evolve!

## Example usage

```python
from evolve import Pipeline
from evolve.source import PostgresSource
from evolve.target import ParquetTarget
from evolve.transform import DropNulls

# Pipelines are lazy - only run when told to
pipeline = (
    Pipeline("ingress")
    .with_source(PostgresSource(...))
    .with_target(ParquetTarget(...))
    .with_transform(DropNulls(columns=(...,)))
)

pipeline.run() # runs the ETL
```
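
To illustrate what "lazy" means in the example above, here is a minimal sketch of a fluent builder that records its steps and touches the source only when `run()` is called. `LazyPipeline`, `drop_nulls`, and `read_orders` are hypothetical stand-ins written for this illustration. They are not evolve's implementation, and plain callables replace the real source, transform, and target classes.

```python
from __future__ import annotations

from typing import Callable, Dict, List, Optional, Sequence

Rows = List[Dict[str, object]]


class LazyPipeline:
    """Hypothetical builder: collects steps and executes them only in run()."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._source: Optional[Callable[[], Rows]] = None
        self._target: Optional[Callable[[Rows], None]] = None
        self._transforms: List[Callable[[Rows], Rows]] = []

    def with_source(self, source: Callable[[], Rows]) -> "LazyPipeline":
        self._source = source
        return self  # returning self is what allows method chaining

    def with_transform(self, transform: Callable[[Rows], Rows]) -> "LazyPipeline":
        self._transforms.append(transform)
        return self

    def with_target(self, target: Callable[[Rows], None]) -> "LazyPipeline":
        self._target = target
        return self

    def run(self) -> int:
        """Read, transform, and write; return the number of rows written."""
        if self._source is None or self._target is None:
            raise ValueError("a pipeline needs both a source and a target")
        rows = self._source()
        for transform in self._transforms:
            rows = transform(rows)
        self._target(rows)
        return len(rows)


def drop_nulls(columns: Sequence[str]) -> Callable[[Rows], Rows]:
    """Build a transform that drops rows with None in any listed column."""
    def apply(rows: Rows) -> Rows:
        return [r for r in rows if all(r.get(c) is not None for c in columns)]
    return apply


calls: List[str] = []


def read_orders() -> Rows:
    calls.append("read")  # records that the source was actually invoked
    return [{"order_id": 1, "amount": 250}, {"order_id": 2, "amount": None}]


written: Rows = []
pipeline = (
    LazyPipeline("ingress")
    .with_source(read_orders)
    .with_transform(drop_nulls(("order_id", "amount")))
    .with_target(written.extend)
)
built_without_reading = calls == []  # nothing has been read yet
count = pipeline.run()
```

Building the pipeline only stores references to the steps, so constructing it is cheap and the same definition can be inspected or reused before any data is moved.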

You can also configure a pipeline with YAML or JSON:

```yml
source:
  type: postgres
  host: localhost
  db: prod
  user: admin
  password: secret
  schema: sales
  tables: orders

transforms:
  - type: drop_nulls
    columns: ["order_id", "amount"]
  - type: rename_columns
    mapping:
      order_id: id
      amount: total
  - type: filter_rows
    condition: "total > 100"

target:
  type: parquet
  path: s3://prod/sales/orders.parquet
```
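
As a rough illustration of how a declarative `transforms` list like the one above could be turned into executable steps, the sketch below loads a JSON rendering of that section and looks up each `type` in a registry. Everything here is an assumption made for illustration: the JSON layout, the `REGISTRY` mapping, and the tiny `column operator value` condition parser are hypothetical, and evolve's actual config loader may work differently.

```python
import json
import operator

# JSON rendering of the `transforms` section (assumed to mirror the YAML keys).
CONFIG = json.loads("""
{
  "transforms": [
    {"type": "drop_nulls", "columns": ["order_id", "amount"]},
    {"type": "rename_columns", "mapping": {"order_id": "id", "amount": "total"}},
    {"type": "filter_rows", "condition": "total > 100"}
  ]
}
""")

# Operators supported by this sketch's condition parser.
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def drop_nulls(columns):
    """Drop rows that have None in any of the listed columns."""
    return lambda rows: [
        r for r in rows if all(r.get(c) is not None for c in columns)
    ]


def rename_columns(mapping):
    """Rename keys according to the mapping, leaving other keys as they are."""
    return lambda rows: [
        {mapping.get(k, k): v for k, v in r.items()} for r in rows
    ]


def filter_rows(condition):
    """Keep rows matching a 'column operator number' condition."""
    column, op, value = condition.split()
    compare, threshold = OPS[op], float(value)
    return lambda rows: [r for r in rows if compare(r[column], threshold)]


# Maps each config `type` to the factory that builds the transform.
REGISTRY = {
    "drop_nulls": drop_nulls,
    "rename_columns": rename_columns,
    "filter_rows": filter_rows,
}


def build_transforms(config):
    """Turn each config entry into a callable, in the order declared."""
    steps = []
    for spec in config["transforms"]:
        params = {k: v for k, v in spec.items() if k != "type"}
        steps.append(REGISTRY[spec["type"]](**params))
    return steps


rows = [
    {"order_id": 1, "amount": 250},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 80},
]
for step in build_transforms(CONFIG):
    rows = step(rows)
```

Order matters in this sketch: the filter refers to `total`, which exists only after `rename_columns` has run, matching the order of the entries in the config.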

## License

evolve is distributed under the terms of both the MIT License and the Apache License (version 2.0).

See LICENSE-APACHE and LICENSE-MIT for details.
