https://github.com/ansh-info/databridge
End-to-end financial data pipeline unifying real-time and batch ingestion with PySpark ETL, BigQuery storage, DBT modeling, Kafka streaming, and Airflow/Docker orchestration.
airflow apache-spark bash big-data bigquery dbt docker docker-compose etl etl-pipeline gcp google kafka kafka-consumer kubernetes orchestration pyspark python3 real-time stock
- Host: GitHub
- URL: https://github.com/ansh-info/databridge
- Owner: ansh-info
- License: mit
- Created: 2025-04-15T08:46:23.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-05T12:44:10.000Z (about 1 year ago)
- Last Synced: 2025-05-06T15:59:55.363Z (about 1 year ago)
- Topics: airflow, apache-spark, bash, big-data, bigquery, dbt, docker, docker-compose, etl, etl-pipeline, gcp, google, kafka, kafka-consumer, kubernetes, orchestration, pyspark, python3, real-time, stock
- Language: Python
- Homepage:
- Size: 493 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# DataBridge
## Project Overview
DataBridge is a financial data platform that ingests, processes, and analyzes both real-time and static data sources.
Many financial data workflows separate batch and real-time processing, lack unified transformation logic, and require manual intervention. DataBridge instead provides a single end-to-end pipeline that combines batch (Kaggle) and streaming (Alpha Vantage) ingestion, scalable Spark ETL, BigQuery storage, DBT transformations into a star schema, and workflow orchestration via Docker Compose and Airflow. Its main components are:
- One primary real-time data source from Alpha Vantage for intraday stock prices
- Automated ingestion of static datasets from Kaggle (e.g., S&P 500, global economy, cryptocurrency)
- Unified ETL processing using PySpark for scalable data loading into BigQuery
- Kafka integration (producer & consumer) for streaming data pipelines
- Data modeling and transformations with DBT, resulting in a star schema (fact and dimension tables) in BigQuery
- Workflow orchestration using Apache Airflow
- Containerized local development and deployment via Docker Compose (Kafka, Python services)
- Comprehensive testing suite with pytest for pipeline validation
## Prerequisites
- Python 3.12
- Java JRE (for Spark)
- Google Cloud account with BigQuery & GCS access
- Kaggle account for static pipelines
- Docker & Docker Compose (optional, for Kafka and containerized services)
- DBT Core & DBT BigQuery plugin (for data modeling)
## Installation
1. Clone the repository:
```bash
git clone https://github.com/ansh-info/databridge.git
cd databridge
```
2. Create and activate a Python virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate
```
3. Install dependencies:
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
## Configuration
1. Copy environment file templates:
```bash
cp .env.example .env
cp config/dbt-user-creds.example.json config/dbt-user-creds.json
```
2. Edit `.env` and fill in:
- `ALPHA_VANTAGE_KEYS` (comma-separated API keys)
- `PROJECT_ID` (your GCP project ID), `DATASET_NAME` (BigQuery dataset name), `GCS_BUCKET` (temporary GCS bucket)
- `PARQUET_OUTPUT_PATH` (optional; GCS or local path to write Parquet exports)
- `STOCK_SYMBOLS` (comma-separated tickers)
- `KAGGLE_USERNAME`, `KAGGLE_KEY`
- `KAFKA_BOOTSTRAP_SERVERS`, `KAFKA_TOPIC`, `KAFKA_CONSUMER_GROUP`
3. Populate `config/dbt-user-creds.json` with your GCP service account key.
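As a rough illustration of how these settings might be consumed, the sketch below reads the variables from an environment mapping and splits the comma-separated ones. The helper names (`split_csv`, `load_settings`) are hypothetical and not taken from the repository, and the sketch assumes the `.env` values have already been exported into the process environment.

```python
import os
from itertools import cycle


def split_csv(value):
    """Split a comma-separated string into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_settings(env=os.environ):
    """Collect pipeline settings from an environment mapping.

    Hypothetical helper: the variable names follow the .env list above.
    """
    return {
        "api_keys": split_csv(env.get("ALPHA_VANTAGE_KEYS", "")),
        "symbols": split_csv(env.get("STOCK_SYMBOLS", "")),
        "project_id": env.get("PROJECT_ID"),
        "dataset": env.get("DATASET_NAME"),
        "bucket": env.get("GCS_BUCKET"),
    }


if __name__ == "__main__":
    settings = load_settings()
    # Rotating through several keys is one way to spread API calls (an assumption).
    key_pool = cycle(settings["api_keys"] or ["<no-key-set>"])
    for symbol in settings["symbols"]:
        print(symbol, "->", next(key_pool))
```

Passing the mapping as an argument keeps the helper easy to exercise with a plain dict instead of the real environment.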
## GCP Setup
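Use the helper functions in `config/` to provision the GCS bucket and BigQuery dataset:
```bash
python - <<EOF
# Import the bucket and dataset helpers from the GCP setup module in config/ first.
# Call the bucket helper with your GCS bucket name, then create the dataset:
create_bigquery_dataset("<your-dataset-name>")
EOF
```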
## Pipelines
### Static Data Pipeline
Run all static pipelines (S&P 500, global economy, crypto, etc.):
```bash
python static/run_all.py
```
Or run individual modules, e.g.:
```bash
python static/sandp500.py
```
### Real-Time Pipeline
- **Last N records** (one-off):
```bash
python streaming/realtime_stock_recent.py
```
- **Continuous stream** (every 5 minutes):
```bash
python streaming/realtime_stock_stream.py
```
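To make the five-minute cadence concrete, here is a minimal sketch of a polling loop. It is not the code in `streaming/realtime_stock_stream.py`: `build_intraday_url` and `poll` are hypothetical names, and the query parameters follow Alpha Vantage's public `TIME_SERIES_INTRADAY` endpoint. The fetch and sleep functions are injected so the loop can be exercised without network access.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"


def build_intraday_url(symbol, api_key, interval="5min"):
    """Build a TIME_SERIES_INTRADAY request URL for one ticker."""
    params = {
        "function": "TIME_SERIES_INTRADAY",
        "symbol": symbol,
        "interval": interval,
        "apikey": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def poll(symbols, api_key, fetch, handle, cycles=None, period_s=300, sleep=time.sleep):
    """Fetch every symbol, hand each payload to `handle`, then wait `period_s`.

    `cycles=None` loops forever; a number stops after that many rounds.
    """
    done = 0
    while cycles is None or done < cycles:
        for symbol in symbols:
            handle(symbol, fetch(build_intraday_url(symbol, api_key)))
        done += 1
        # Only sleep when another round will follow.
        if cycles is None or done < cycles:
            sleep(period_s)
```

In a real run, `fetch` would perform the HTTP request and return the decoded JSON, and `handle` would pass the payload on to the BigQuery write step.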
### Kafka Test Stream & Consumer
1. Produce test data to Kafka and write to BigQuery:
```bash
python streaming/realtime_test_kafka_stream.py
```
2. Consume from Kafka and load to BigQuery:
```bash
python kafka_consumer/consumer.py
```
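The producer and consumer need to agree on a message format, and a consumer typically buffers records before loading them. The sketch below shows one way to do that with JSON-encoded values and a small batch buffer. All names here (`encode_record`, `decode_record`, `BatchBuffer`) are hypothetical, the Kafka client itself is left out, and `flush` stands in for whatever function loads rows into BigQuery.

```python
import json


def encode_record(record):
    """Serialize a dict to UTF-8 JSON bytes for use as a Kafka message value."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def decode_record(value):
    """Inverse of encode_record: bytes from a Kafka message back to a dict."""
    return json.loads(value.decode("utf-8"))


class BatchBuffer:
    """Collect decoded records and flush them in groups of `size`."""

    def __init__(self, size, flush):
        self.size = size
        self.flush = flush  # callable taking a list of dicts, e.g. a BigQuery loader
        self.rows = []

    def add(self, value):
        """Decode one message value and flush once the batch is full."""
        self.rows.append(decode_record(value))
        if len(self.rows) >= self.size:
            self.close()

    def close(self):
        """Flush whatever is buffered (call once more at shutdown)."""
        if self.rows:
            self.flush(self.rows)
            self.rows = []
```

Batching trades a little latency for fewer, larger load jobs; the right batch size depends on topic volume and is not something the README specifies.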
### ETL Module
Use the Alpha Vantage ETL module in `etl/alpha_vantage.py`:
```python
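import os
from pyspark.sql import SparkSession
from etl.alpha_vantage import fetch_intraday_data, parse_alpha_vantage_json, write_to_bigquery
# Spark session (app name is arbitrary) and settings, assuming the .env values are exported
spark = SparkSession.builder.appName("databridge-etl").getOrCreate()
PROJECT_ID, DATASET_NAME, GCS_BUCKET = (os.environ[k] for k in ("PROJECT_ID", "DATASET_NAME", "GCS_BUCKET"))
# fetch, parse into a Spark DataFrame, then write to BigQuery:
data = fetch_intraday_data("AAPL")
df = parse_alpha_vantage_json(data, "AAPL", spark)
write_to_bigquery(df, "intraday", DATASET_NAME, PROJECT_ID, GCS_BUCKET)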
```
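For readers unfamiliar with the Alpha Vantage payload, the sketch below flattens an intraday response into plain rows, the kind of step a parser such as `parse_alpha_vantage_json` has to perform before building a DataFrame. It is an illustration only: `rows_from_intraday` is a hypothetical name, the output column names are assumptions, the sample values are made up, and the key layout (`Time Series (5min)`, `1. open`, and so on) follows Alpha Vantage's documented JSON format.

```python
def rows_from_intraday(payload, symbol, interval="5min"):
    """Flatten an intraday payload into a list of row dicts, oldest first."""
    series = payload.get(f"Time Series ({interval})", {})
    rows = []
    # Timestamps are "YYYY-MM-DD HH:MM:SS" strings, so string sort is chronological.
    for timestamp in sorted(series):
        bar = series[timestamp]
        rows.append({
            "symbol": symbol,
            "timestamp": timestamp,
            "open": float(bar["1. open"]),
            "high": float(bar["2. high"]),
            "low": float(bar["3. low"]),
            "close": float(bar["4. close"]),
            "volume": int(bar["5. volume"]),
        })
    return rows


# Made-up sample in the shape of an Alpha Vantage intraday response.
sample = {
    "Time Series (5min)": {
        "2024-01-02 16:00:00": {"1. open": "10.0", "2. high": "11.0", "3. low": "9.5",
                                "4. close": "10.5", "5. volume": "1200"},
        "2024-01-02 15:55:00": {"1. open": "9.8", "2. high": "10.1", "3. low": "9.7",
                                "4. close": "10.0", "5. volume": "800"},
    }
}
rows = rows_from_intraday(sample, "AAPL")
print(rows[0]["timestamp"], rows[-1]["close"])
```

A list of dicts like this can then be handed to Spark (for example via `spark.createDataFrame`) or to any other loader; an error response without the time-series key simply yields an empty list here.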
## DBT Models
DBT is used for transforming raw tables and modeling marts:
```bash
# Ensure your profile is set (profile: default picks up env vars)
dbt deps
dbt seed
dbt run
dbt test
```
Models live under `models/` with `staging/`, `intermediate/`, and `marts/`.

## Airflow (Optional)
DAG definitions reside in `airflow/dags/`. To run Airflow locally:
```bash
export AIRFLOW_HOME=$(pwd)/airflow
airflow db migrate
airflow standalone
```
## Docker & Docker Compose
Kafka & test-producer/consumer can be launched via Docker Compose:
```bash
docker-compose up -d
```
Services:
- `zookeeper`, `kafka`
- `test-producer` (runs Kafka test stream)
- `kafka-consumer` (loads Kafka topic to BigQuery)
## Running Tests
Run Python unit tests with `pytest`:
```bash
pytest
```
## Directory Structure
```
├── config/ # configuration and GCP setup
├── etl/ # Alpha Vantage ETL module
├── kafka_utils/ # Kafka producer config utility
├── kafka_consumer/ # Kafka consumer script
├── static/ # static data pipelines (Kaggle)
├── streaming/ # real-time & test streaming scripts
├── models/ # DBT models (staging, marts, etc.)
├── airflow/ # Airflow DAGs & logs
├── tests/ # unit tests
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── README.md
```
## License
This project is licensed under the [MIT License](LICENSE).