Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mharrisb1/clickhouse-transform
Build custom transformation pipelines for Clickhouse using an intuitive and expressive dataframe API
Last synced: 20 days ago
- Host: GitHub
- URL: https://github.com/mharrisb1/clickhouse-transform
- Owner: mharrisb1
- License: MIT
- Created: 2022-08-06T19:04:31.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-08-06T19:07:10.000Z (over 2 years ago)
- Last Synced: 2023-03-20T09:22:36.476Z (almost 2 years ago)
- Language: Python
- Size: 6.84 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# 🏡 Clickhouse-Transform (⚠️ Experimental)
Build custom transformation pipelines for Clickhouse using standard SQL or an intuitive and expressive dataframe API.
Clickhouse-Transform uses the [Ibis-Project](https://ibis-project.org/docs/3.0.2/ibis-for-sql-programmers/) as the
DataFrame API. See their usage guides for a detailed overview of how to create expressions with Ibis.
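As a quick orientation, here is a minimal sketch of Ibis expression building on its own, using a hypothetical unbound table (no ClickHouse connection involved); the table and column names are made up for illustration and are not part of this project:

```python
import ibis

# Hypothetical unbound table, purely to illustrate how Ibis expressions are built.
opportunities = ibis.table(
    [("owner_id", "string"), ("stage", "string")],
    name="opportunities",
)

# Expressions compose lazily; nothing is sent to a database here.
lost = opportunities.filter(opportunities.stage == "Closed: lost")
lost_per_owner = lost.groupby("owner_id").size()
```
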
## Installation

```shell
pip install clickhouse_transform
```

## Usage
### Building lazily evaluated query execution models
```python
from clickhouse_transform import Session

configs = {...}
session = Session.builder.from_configs(configs).create()
opportunities = session.table(database="crm_db", table="opportunities")
lost_opportunities = opportunities.filter(opportunities.stage == "Closed: lost")
top_opportunity_losers = lost_opportunities.groupby("owner_id").size()

# retrieve results as Pandas DataFrame
df = top_opportunity_losers.execute()
```
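The expression above stays lazy until `execute()` is called. Assuming the objects returned by `session.table` are ordinary Ibis expressions bound to a ClickHouse backend (the project delegates its DataFrame API to Ibis), the generated SQL can likely be inspected without running anything; this is a sketch based on that assumption, not documented clickhouse_transform behaviour:

```python
# Assumption: `top_opportunity_losers` is a regular backend-bound Ibis expression,
# so Ibis's compile() renders the ClickHouse SQL without executing the query.
print(top_opportunity_losers.compile())
```
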
### Creating execution pipelines

```python
from clickhouse_transform import Session, Pipeline
from clickhouse_transform.types import Model

configs = {...}
session = Session.builder.from_configs(configs).create()
pipeline = Pipeline(session)

pipeline.add_source(database="sales", table="purchases")
pipeline.add_source(database="sales", table="customers")

EU_COUNTRY_ISO_CODES = ["GBR", "DEU", "FRA", "ITA"]

@pipeline.model()
def eu_purchases(sales_purchases: Model) -> Model:
    return sales_purchases.filter(sales_purchases.region.isin(EU_COUNTRY_ISO_CODES))


@pipeline.model()
def eu_customers(sales_customers: Model) -> Model:
    return sales_customers.filter(sales_customers.region.isin(EU_COUNTRY_ISO_CODES))


@pipeline.model()
def purchases_with_customer_info(eu_purchases: Model, eu_customers: Model) -> Model:
    return (
        eu_purchases
        .left_join(eu_customers, eu_purchases.customer_id == eu_customers.id)
        .select([
            eu_purchases.purchase_amount,
            eu_customers.id.name("customer_id"),
            eu_customers.first_name.concat(" ", eu_customers.last_name).name("customer_full_name"),
            eu_customers.email
        ])
    )


@pipeline.model()
def customers_lifetime_spend(purchases_with_customer_info: Model) -> Model:
    return (
        purchases_with_customer_info
        .groupby(["customer_id", "customer_full_name", "email"])
        .aggregate(purchases_with_customer_info.purchase_amount.sum().name("lifetime_spend"))
    )


expr = pipeline.run()
df = expr.execute()
```
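
For comparison, here is a rough, unofficial sketch of the same transformation written directly against the session's table expressions, without the `Pipeline`/`@pipeline.model()` wiring. It reuses the tables and column names from the example above and assumes `session.table` returns ordinary Ibis table expressions; the decorator-based pipeline above is the project's intended interface.

```python
from clickhouse_transform import Session

configs = {...}
session = Session.builder.from_configs(configs).create()

purchases = session.table(database="sales", table="purchases")
customers = session.table(database="sales", table="customers")

EU_COUNTRY_ISO_CODES = ["GBR", "DEU", "FRA", "ITA"]

eu_purchases = purchases.filter(purchases.region.isin(EU_COUNTRY_ISO_CODES))
eu_customers = customers.filter(customers.region.isin(EU_COUNTRY_ISO_CODES))

# Join EU purchases to customer info and keep a few columns.
purchases_with_customer_info = (
    eu_purchases
    .left_join(eu_customers, eu_purchases.customer_id == eu_customers.id)
    .select([
        eu_purchases.purchase_amount,
        eu_customers.id.name("customer_id"),
        eu_customers.first_name.concat(" ", eu_customers.last_name).name("customer_full_name"),
        eu_customers.email,
    ])
)

# Aggregate lifetime spend per customer and materialize as a Pandas DataFrame.
customers_lifetime_spend = (
    purchases_with_customer_info
    .groupby(["customer_id", "customer_full_name", "email"])
    .aggregate(purchases_with_customer_info.purchase_amount.sum().name("lifetime_spend"))
)

df = customers_lifetime_spend.execute()
```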