https://github.com/glassflow/glassflow-python-sdk
GlassFlow Python SDK to publish and consume data to your pipelines at Glassflow.dev
- Host: GitHub
- URL: https://github.com/glassflow/glassflow-python-sdk
- Owner: glassflow
- License: mit
- Created: 2024-02-26T13:26:28.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-10-06T14:21:43.000Z (4 months ago)
- Last Synced: 2025-10-06T15:51:32.861Z (4 months ago)
- Topics: data, data-processing, datastreaming, python, real-time, sdk, stream-processing
- Language: Python
- Homepage: https://glassflow.dev/
- Size: 4.81 MB
- Stars: 9
- Watchers: 4
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.md
# GlassFlow Python SDK

A Python SDK for creating and managing data pipelines between Kafka and ClickHouse.
## Features
- Create and manage data pipelines between Kafka and ClickHouse
- Deduplication of events during a time window based on a key
- Temporal joins between topics based on a common key with a given time window
- Schema validation and configuration management
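To make the deduplication feature concrete, here is a minimal pure-Python sketch of the idea: keep the first event seen for each key and drop repeats that arrive within the time window. It only illustrates the concept and is not GlassFlow's implementation; the `WindowDeduplicator` class and its method names are hypothetical.

```python
from datetime import datetime, timedelta


class WindowDeduplicator:
    """Hypothetical helper: drops events whose key was already kept within `window`."""

    def __init__(self, key_field, window):
        self.key_field = key_field
        self.window = window
        self._last_kept = {}  # key -> time an event with that key was last kept

    def accept(self, event, now):
        """Return True if the event should be kept, False if it is a duplicate."""
        key = event[self.key_field]
        kept_at = self._last_kept.get(key)
        if kept_at is not None and now - kept_at < self.window:
            return False  # same key inside the window: duplicate
        self._last_kept[key] = now
        return True


dedup = WindowDeduplicator(key_field="event_id", window=timedelta(hours=1))
start = datetime(2025, 1, 1, 12, 0)
events = [
    ({"event_id": "a"}, start),
    ({"event_id": "a"}, start + timedelta(minutes=10)),  # duplicate inside the window
    ({"event_id": "b"}, start + timedelta(minutes=20)),
    ({"event_id": "a"}, start + timedelta(hours=2)),  # window expired, kept again
]
kept = [event["event_id"] for event, ts in events if dedup.accept(event, ts)]
print(kept)
```

In a pipeline configuration, the same idea is expressed declaratively through the `deduplication` settings of a topic (`id_field` and `time_window`).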
## Installation
```bash
pip install glassflow
```
## Quick Start
### Initialize client
```python
from glassflow.etl import Client
# Initialize GlassFlow client
client = Client(host="your-glassflow-etl-url")
```
### Create a pipeline
```python
pipeline_config = {
    "pipeline_id": "my-pipeline-id",
    "source": {
        "type": "kafka",
        "connection_params": {
            "brokers": [
                "http://my.kafka.broker:9093"
            ],
            "protocol": "PLAINTEXT",
            "skip_auth": True
        },
        "topics": [
            {
                "consumer_group_initial_offset": "latest",
                "name": "users",
                "schema": {
                    "type": "json",
                    "fields": [
                        {
                            "name": "event_id",
                            "type": "string"
                        },
                        {
                            "name": "user_id",
                            "type": "string"
                        },
                        {
                            "name": "name",
                            "type": "string"
                        },
                        {
                            "name": "email",
                            "type": "string"
                        },
                        {
                            "name": "created_at",
                            "type": "string"
                        }
                    ]
                },
                "deduplication": {
                    "enabled": True,
                    "id_field": "event_id",
                    "id_field_type": "string",
                    "time_window": "1h"
                }
            }
        ]
    },
    "join": {
        "enabled": False
    },
    "sink": {
        "type": "clickhouse",
        "host": "http://my.clickhouse.server",
        "port": "9000",
        "database": "default",
        "username": "default",
        "password": "c2VjcmV0",
        "secure": False,
        "max_batch_size": 1000,
        "max_delay_time": "30s",
        "table": "users_dedup",
        "table_mapping": [
            {
                "source_id": "users",
                "field_name": "event_id",
                "column_name": "event_id",
                "column_type": "UUID"
            },
            {
                "source_id": "users",
                "field_name": "user_id",
                "column_name": "user_id",
                "column_type": "UUID"
            },
            {
                "source_id": "users",
                "field_name": "created_at",
                "column_name": "created_at",
                "column_type": "DateTime"
            },
            {
                "source_id": "users",
                "field_name": "name",
                "column_name": "name",
                "column_type": "String"
            },
            {
                "source_id": "users",
                "field_name": "email",
                "column_name": "email",
                "column_type": "String"
            }
        ]
    }
}
# Create a pipeline
pipeline = client.create_pipeline(pipeline_config)
```
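Because the sink's `table_mapping` refers back to fields declared in the topic schema, a typo in a field name is easy to miss. The sketch below is a hypothetical pre-flight check written against the plain config dictionary; it does not call the SDK. It assumes that `source_id` in a mapping refers to the `name` of a topic, as in the example above; `check_table_mapping` is an illustrative name, not part of the SDK.

```python
def check_table_mapping(config):
    """Hypothetical check: list sink mappings that point at undeclared fields."""
    # Collect the declared field names for each topic, keyed by topic name.
    declared = {
        topic["name"]: {field["name"] for field in topic["schema"]["fields"]}
        for topic in config["source"]["topics"]
    }
    problems = []
    for mapping in config["sink"]["table_mapping"]:
        source_id = mapping["source_id"]
        field_name = mapping["field_name"]
        if source_id not in declared:
            problems.append(f"unknown source_id: {source_id}")
        elif field_name not in declared[source_id]:
            problems.append(f"{source_id}.{field_name} is not in the topic schema")
    return problems


# A trimmed-down config with one deliberate mistake ("emial").
sample_config = {
    "source": {
        "topics": [
            {
                "name": "users",
                "schema": {
                    "type": "json",
                    "fields": [
                        {"name": "event_id", "type": "string"},
                        {"name": "email", "type": "string"},
                    ],
                },
            }
        ]
    },
    "sink": {
        "table_mapping": [
            {"source_id": "users", "field_name": "event_id",
             "column_name": "event_id", "column_type": "UUID"},
            {"source_id": "users", "field_name": "emial",
             "column_name": "email", "column_type": "String"},
        ]
    },
}
problems = check_table_mapping(sample_config)
print(problems)
```

Running a check like this before `client.create_pipeline(...)` surfaces mapping mistakes locally, before the configuration is sent to the server.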
### Get pipeline
```python
# Get a pipeline by ID
pipeline = client.get_pipeline("my-pipeline-id")
```
### List pipelines
```python
pipelines = client.list_pipelines()
for pipeline in pipelines:
    print(f"Pipeline ID: {pipeline['pipeline_id']}")
    print(f"Name: {pipeline['name']}")
    print(f"Transformation Type: {pipeline['transformation_type']}")
    print(f"Created At: {pipeline['created_at']}")
    print(f"State: {pipeline['state']}")
```
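Since each listed pipeline is accessed like a dictionary with `pipeline_id` and `state` keys, the listing can be summarized with plain Python. The following sketch groups pipeline IDs by state. The sample data stands in for the result of `client.list_pipelines()`, and the state values shown are made up for illustration.

```python
from collections import defaultdict


def group_by_state(pipelines):
    """Group pipeline IDs by their reported state."""
    grouped = defaultdict(list)
    for pipeline in pipelines:
        grouped[pipeline["state"]].append(pipeline["pipeline_id"])
    return dict(grouped)


# Stand-in for client.list_pipelines(); the state strings are placeholders.
sample = [
    {"pipeline_id": "orders", "state": "Running"},
    {"pipeline_id": "users", "state": "Paused"},
    {"pipeline_id": "clicks", "state": "Running"},
]
by_state = group_by_state(sample)
print(by_state)
```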
### Pause / resume pipeline
```python
pipeline = client.get_pipeline("my-pipeline-id")
pipeline.pause()
print(pipeline.status)
```
```python
pipeline = client.get_pipeline("my-pipeline-id")
pipeline.resume()
print(pipeline.status)
```
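If a state change takes a moment to complete, a script may need to wait until the pipeline reports the expected status. The sketch below is a generic polling helper that takes any callable returning the current status. `wait_for_status` is a hypothetical name, the status strings are placeholders, and whether pause and resume complete asynchronously is an assumption, not something this README states.

```python
import time


def wait_for_status(get_status, target, timeout=60.0, interval=2.0):
    """Poll get_status() until it returns `target`; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demo with a fake status source instead of a real pipeline.
fake_statuses = iter(["Pausing", "Pausing", "Paused"])
reached = wait_for_status(lambda: next(fake_statuses), "Paused", timeout=5.0, interval=0.01)
print(reached)
```

With the SDK, the callable could be something like `lambda: client.get_pipeline("my-pipeline-id").status`, assuming that fetching the pipeline again returns its current status.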
### Stop pipeline
```python
# Stop a pipeline gracefully
client.stop_pipeline("my-pipeline-id")
# Stop a pipeline ungracefully (terminate)
client.stop_pipeline("my-pipeline-id", terminate=True)
# Or stop via pipeline instance
pipeline.stop()
```
### Delete pipeline
```python
# Delete a pipeline
client.delete_pipeline("my-pipeline-id")
# Or delete via pipeline instance
pipeline.delete()
```
## Pipeline Configuration
For detailed information about the pipeline configuration, see [GlassFlow docs](https://docs.glassflow.dev/pipeline/pipeline-configuration).
## Tracking
The SDK includes anonymous usage tracking to help improve the product. Tracking is enabled by default but can be disabled in two ways:
1. Using an environment variable:
```bash
export GF_TRACKING_ENABLED=false
```
2. Programmatically using the `disable_tracking` method:
```python
from glassflow.etl import Client
client = Client(host="my-glassflow-host")
client.disable_tracking()
```
Tracking collects the following anonymous information:
- SDK version
- Platform (operating system)
- Python version
- Pipeline ID
- Whether joins or deduplication are enabled
- Kafka security protocol, auth mechanism used and whether authentication is disabled
- Errors during pipeline creation and deletion
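In notebooks or test suites where exporting a shell variable is inconvenient, the `GF_TRACKING_ENABLED` variable described above can also be set from Python. This is a minimal sketch; it assumes the SDK reads the variable after this point, so the assignment is placed before the SDK is imported or a client is created.

```python
import os

# Set the flag early, before importing glassflow or creating a Client
# (assumed to be early enough for the SDK to pick it up).
os.environ["GF_TRACKING_ENABLED"] = "false"

print(os.environ["GF_TRACKING_ENABLED"])
```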
## Development
### Setup
1. Clone the repository
2. Create a virtual environment
3. Install dependencies:
```bash
uv venv
source .venv/bin/activate
# Quote the extras spec so shells such as zsh do not expand the brackets
uv pip install -e ".[dev]"
```
### Testing
```bash
pytest
```