https://github.com/ansh-info/databridge

End-to-end financial data pipeline unifying real-time and batch ingestion with PySpark ETL, BigQuery storage, DBT modeling, Kafka streaming, and Airflow/Docker orchestration.
airflow apache-spark bash big-data bigquery dbt docker docker-compose etl etl-pipeline gcp google kafka kafka-consumer kubernetes orchestration pyspark python3 real-time stock

Host: GitHub
URL: https://github.com/ansh-info/databridge
Owner: ansh-info
License: mit
Created: 2025-04-15T08:46:23.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-05-05T12:44:10.000Z (about 1 year ago)
Last Synced: 2025-05-06T15:59:55.363Z (about 1 year ago)
Topics: airflow, apache-spark, bash, big-data, bigquery, dbt, docker, docker-compose, etl, etl-pipeline, gcp, google, kafka, kafka-consumer, kubernetes, orchestration, pyspark, python3, real-time, stock
Language: Python
Homepage:
Size: 493 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# DataBridge

## Project Overview

DataBridge is a comprehensive financial data platform designed to ingest, process, and analyze both real-time and static data sources.

Many financial data workflows today separate batch and real-time processes, lack unified transformation logic, and require manual intervention. DataBridge addresses this with an end-to-end, unified pipeline that combines batch (Kaggle) and streaming (Alpha Vantage) ingestion, scalable Spark ETL, BigQuery storage, DBT transformations into a star schema, and workflow orchestration via Docker Compose and Airflow.

The project's primary goals are:

- Real-time ingestion of intraday stock prices from Alpha Vantage, the primary real-time data source

- Automated ingestion of static datasets from Kaggle (e.g., S&P 500, global economy, cryptocurrency)

- Unified ETL processing using PySpark for scalable data loading into BigQuery

- Kafka integration (producer & consumer) for streaming data pipelines

- Data modeling and transformations with DBT, resulting in a star schema (fact and dimension tables) in BigQuery

- Workflow orchestration using Apache Airflow

- Containerized local development and deployment via Docker Compose (Kafka, Python services)

- Comprehensive testing suite with pytest for pipeline validation

A demo video is available on YouTube.

## Prerequisites

- Python 3.12

- Java JRE (for Spark)

- Google Cloud account with BigQuery & GCS access

- Kaggle account for static pipelines

- Docker & Docker Compose (optional, for Kafka and containerized services)

- DBT Core & DBT BigQuery plugin (for data modeling)

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/ansh-info/databridge.git
   cd databridge
   ```

2. Create and activate a Python virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

## Configuration

1. Copy environment file templates:

   ```bash
   cp .env.example .env
   cp config/dbt-user-creds.example.json config/dbt-user-creds.json
   ```

2. Edit `.env` and fill in:

   - `ALPHA_VANTAGE_KEYS` (comma-separated API keys)

   - `PROJECT_ID` (your GCP project ID), `DATASET_NAME` (BigQuery dataset name), `GCS_BUCKET` (temporary GCS bucket)

   - `PARQUET_OUTPUT_PATH` (optional; GCS or local path to write Parquet exports)

   - `STOCK_SYMBOLS` (comma-separated tickers)

   - `KAGGLE_USERNAME`, `KAGGLE_KEY`

   - `KAFKA_BOOTSTRAP_SERVERS`, `KAFKA_TOPIC`, `KAFKA_CONSUMER_GROUP`

3. Populate `config/dbt-user-creds.json` with your GCP service account key.
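
As a rough illustration of how the comma-separated values in `.env` might be consumed, the sketch below splits them into clean lists. The helper name `split_csv_env` is hypothetical and not part of the repository; it only assumes the variable names listed above.

```python
import os


def split_csv_env(name, environ=None):
    """Return the comma-separated value of an environment variable as a list.

    Hypothetical helper for illustration only. Blank entries and surrounding
    whitespace are dropped, and a missing variable yields an empty list.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name, "")
    return [item.strip() for item in raw.split(",") if item.strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "STOCK_SYMBOLS": "AAPL, MSFT,,GOOG ",
    "ALPHA_VANTAGE_KEYS": "key1,key2",
}
symbols = split_csv_env("STOCK_SYMBOLS", example_env)
api_keys = split_csv_env("ALPHA_VANTAGE_KEYS", example_env)
print(symbols)   # ['AAPL', 'MSFT', 'GOOG']
print(api_keys)  # ['key1', 'key2']
```

Passing the mapping explicitly keeps the helper easy to test; in a real run it would fall back to `os.environ`.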

## GCP Setup

Use the helper scripts to provision the GCS bucket and BigQuery dataset:

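```bash
python - <<EOF
# Assumed reconstruction: the module path and the bucket helper name are guesses;
# check the setup script under config/ for the actual names.
from config.gcp_setup import create_gcs_bucket, create_bigquery_dataset
create_gcs_bucket("<GCS_BUCKET>")
create_bigquery_dataset("<DATASET_NAME>")
EOF
```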

## Pipelines

### Static Data Pipeline

Run all static pipelines (S&P 500, global economy, crypto, etc.):

```bash
python static/run_all.py
```

Or run individual modules, e.g.:

```bash
python static/sandp500.py
```

### Real-Time Pipeline

- **Last N records** (one-off):

  ```bash
  python streaming/realtime_stock_recent.py
  ```

- **Continuous stream** (every 5 minutes):

  ```bash
  python streaming/realtime_stock_stream.py
  ```
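
The continuous mode above amounts to a polling loop. The sketch below shows one way such a loop could look, assuming the keys in `ALPHA_VANTAGE_KEYS` are used round-robin; that rotation policy, the function name `poll_symbols`, and the injected `fetch` and `sleep` callables are illustrative assumptions, not the repository's actual implementation.

```python
import itertools
import time


def poll_symbols(symbols, api_keys, fetch, interval_seconds=300,
                 max_cycles=None, sleep=time.sleep):
    """Yield fetch(symbol, api_key) for every symbol, once per cycle.

    API keys are handed out round-robin across requests. With
    max_cycles=None the generator runs forever; otherwise it stops after
    that many cycles. The default interval of 300 seconds is 5 minutes.
    """
    keys = itertools.cycle(api_keys)
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for symbol in symbols:
            yield fetch(symbol, next(keys))
        cycle += 1
        # Only wait if another cycle is going to run.
        if max_cycles is None or cycle < max_cycles:
            sleep(interval_seconds)


# Dry run: a fake fetch and a no-op sleep, so nothing hits the network.
sleeps = []
calls = list(
    poll_symbols(
        ["AAPL", "MSFT", "GOOG"],
        ["k1", "k2"],
        fetch=lambda symbol, key: (symbol, key),
        max_cycles=2,
        sleep=sleeps.append,
    )
)
print(calls)
```

Injecting `fetch` and `sleep` keeps the loop testable without waiting five minutes or spending API quota.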

### Kafka Test Stream & Consumer

1. Produce test data to Kafka and write to BigQuery:

   ```bash
   python streaming/realtime_test_kafka_stream.py
   ```

2. Consume from Kafka and load to BigQuery:

   ```bash
   python kafka_consumer/consumer.py
   ```
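
A consumer that loads into BigQuery typically groups messages into batches rather than writing one row at a time. The following is a minimal sketch of that idea in plain Python; `batch_records` is a hypothetical helper, the JSON message format is an assumption, and the repository's `consumer.py` may work differently.

```python
import json


def batch_records(messages, batch_size=500):
    """Decode JSON payloads and yield them in lists of up to batch_size.

    `messages` is any iterable of bytes or str payloads, for example the
    values read from a Kafka topic. Malformed payloads are skipped.
    """
    batch = []
    for payload in messages:
        try:
            batch.append(json.loads(payload))
        except (ValueError, TypeError):
            continue  # skip payloads that are not valid JSON
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly smaller, batch


# Example payloads standing in for Kafka message values.
messages = [
    b'{"symbol": "AAPL", "close": 1.0}',
    b"not json",
    b'{"symbol": "MSFT", "close": 2.0}',
    b'{"symbol": "GOOG", "close": 3.0}',
]
batches = list(batch_records(messages, batch_size=2))
print(batches)
```

Each yielded batch could then be handed to whatever load step writes to BigQuery.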

### ETL Module

Use the Alpha Vantage ETL module in `etl/alpha_vantage.py`:

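```python
import os

from pyspark.sql import SparkSession

from etl.alpha_vantage import fetch_intraday_data, parse_alpha_vantage_json, write_to_bigquery

# Create (or reuse) a Spark session; the app name is arbitrary.
spark = SparkSession.builder.appName("databridge-etl").getOrCreate()

# Fetch raw JSON, parse it into a Spark DataFrame, then write to BigQuery.
data = fetch_intraday_data("AAPL")
df = parse_alpha_vantage_json(data, "AAPL", spark)
write_to_bigquery(
    df, "intraday",
    os.environ["DATASET_NAME"], os.environ["PROJECT_ID"], os.environ["GCS_BUCKET"],
)
```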
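
To make the parsing step more concrete, here is a sketch of flattening an Alpha Vantage intraday response into row dictionaries without Spark. It relies on the public response layout (a `Time Series (...)` object keyed by timestamp, with fields such as `1. open`); the function `flatten_intraday` is hypothetical and is not the repository's `parse_alpha_vantage_json`.

```python
def flatten_intraday(payload, symbol):
    """Turn an Alpha Vantage intraday response into a list of row dicts.

    Returns an empty list when the response carries no time series,
    for example when it only contains a note or an error message.
    """
    series_key = next((k for k in payload if k.startswith("Time Series")), None)
    if series_key is None:
        return []
    rows = []
    # Sort by timestamp so rows come out in chronological order.
    for timestamp, bar in sorted(payload[series_key].items()):
        rows.append({
            "symbol": symbol,
            "timestamp": timestamp,
            "open": float(bar["1. open"]),
            "high": float(bar["2. high"]),
            "low": float(bar["3. low"]),
            "close": float(bar["4. close"]),
            "volume": int(bar["5. volume"]),
        })
    return rows


# Made-up sample values in the documented response shape.
sample = {
    "Meta Data": {"2. Symbol": "AAPL"},
    "Time Series (5min)": {
        "2025-01-02 10:05:00": {"1. open": "101.0", "2. high": "102.0",
                                "3. low": "100.5", "4. close": "101.5",
                                "5. volume": "1200"},
        "2025-01-02 10:00:00": {"1. open": "100.0", "2. high": "101.2",
                                "3. low": "99.8", "4. close": "101.0",
                                "5. volume": "1500"},
    },
}
rows = flatten_intraday(sample, "AAPL")
print(rows[0]["timestamp"], rows[0]["close"])
```

Rows in this shape could be turned into a Spark DataFrame with `spark.createDataFrame(rows)` before loading.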

## DBT Models

DBT is used for transforming raw tables and modeling marts:

```bash
# Ensure your profile is set (profile: default picks up env vars)
dbt deps
dbt seed
dbt run
dbt test
```

Models live under `models/` with `staging/`, `intermediate/`, and `marts/`.

![dbt-diagram](images/dbt_lineage.png)

## Airflow (Optional)

DAG definitions reside in `airflow/dags/`. To run Airflow locally:

```bash
export AIRFLOW_HOME=$(pwd)/airflow
airflow db migrate
airflow standalone
```

## Docker & Docker Compose

Kafka and the test producer and consumer can be launched via Docker Compose:

```bash
docker-compose up -d
```

Services:

- `zookeeper`, `kafka`

- `test-producer` (runs Kafka test stream)

- `kafka-consumer` (loads Kafka topic to BigQuery)

## Running Tests

Run Python unit tests with `pytest`:

```bash
pytest
```

## Directory Structure

```
├── config/                  # configuration and GCP setup
├── etl/                     # Alpha Vantage ETL module
├── kafka_utils/             # Kafka producer config utility
├── kafka_consumer/          # Kafka consumer script
├── static/                  # static data pipelines (Kaggle)
├── streaming/               # real-time & test streaming scripts
├── models/                  # DBT models (staging, marts, etc.)
├── airflow/                 # Airflow DAGs & logs
├── tests/                   # unit tests
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── README.md
```

## License

This project is licensed under the [MIT License](LICENSE).
