{"id":27914875,"url":"https://github.com/ansh-info/databridge","last_synced_at":"2026-02-28T17:31:06.645Z","repository":{"id":291580941,"uuid":"966650900","full_name":"ansh-info/DataBridge","owner":"ansh-info","description":"End-to-end financial data pipeline unifying real-time and batch ingestion with PySpark ETL, BigQuery storage, DBT modeling, Kafka streaming, and Airflow/Docker orchestration.","archived":false,"fork":false,"pushed_at":"2025-05-05T12:44:10.000Z","size":505,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-06T15:59:55.363Z","etag":null,"topics":["airflow","apache-spark","bash","big-data","bigquery","dbt","docker","docker-compose","etl","etl-pipeline","gcp","google","kafka","kafka-consumer","kubernetes","orchestration","pyspark","python3","real-time","stock"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ansh-info.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-15T08:46:23.000Z","updated_at":"2025-05-05T12:43:47.000Z","dependencies_parsed_at":"2025-05-05T13:51:24.906Z","dependency_job_id":"57bd69ce-a0d4-4f04-8d6b-909b642580ca","html_url":"https://github.com/ansh-info/DataBridge","commit_stats":null,"previous_names":["ansh-info/databridge"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ansh-info/DataBridge","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ansh-info%2FDataBridge","tags_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ansh-info%2FDataBridge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ansh-info%2FDataBridge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ansh-info%2FDataBridge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ansh-info","download_url":"https://codeload.github.com/ansh-info/DataBridge/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ansh-info%2FDataBridge/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262702336,"owners_count":23350644,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","apache-spark","bash","big-data","bigquery","dbt","docker","docker-compose","etl","etl-pipeline","gcp","google","kafka","kafka-consumer","kubernetes","orchestration","pyspark","python3","real-time","stock"],"created_at":"2025-05-06T15:31:15.224Z","updated_at":"2026-02-28T17:31:06.585Z","avatar_url":"https://github.com/ansh-info.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DataBridge\n\n## Project Overview\n\nDataBridge is a comprehensive financial data platform designed to ingest, process, and analyze both real-time and static data sources. The project's primary goals are:\n\nMany financial data workflows today separate batch and real-time processes, lack unified transformation logic, and require manual intervention. 
DataBridge provides an end-to-end, unified pipeline combining batch (Kaggle) and streaming (Alpha Vantage) ingestion, scalable Spark ETL, BigQuery storage, DBT transformations into a star schema, and workflow orchestration via Docker Compose and Airflow.\n\n- One primary real-time data source from Alpha Vantage for intraday stock prices\n- Automated ingestion of static datasets from Kaggle (e.g., S\u0026P 500, global economy, cryptocurrency)\n- Unified ETL processing using PySpark for scalable data loading into BigQuery\n- Kafka integration (producer \u0026 consumer) for streaming data pipelines\n- Data modeling and transformations with DBT, resulting in a star schema (fact and dimension tables) in BigQuery\n- Workflow orchestration using Apache Airflow\n- Containerized local development and deployment via Docker Compose (Kafka, Python services)\n- Comprehensive testing suite with pytest for pipeline validation\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://youtu.be/BXZvJIVycHE\" target=\"_blank\"\u003e\n    \u003cimg src=\"https://img.youtube.com/vi/BXZvJIVycHE/maxresdefault.jpg\" alt=\"Watch the video\" width=\"640\" style=\"border-radius:12px;\" /\u003e\n  \u003c/a\u003e\n  \u003cbr /\u003e\n  \u003ca href=\"https://youtu.be/BXZvJIVycHE\" target=\"_blank\"\u003e -\u003e Click to watch the demo video on YouTube\u003c/a\u003e\n\u003c/p\u003e\n\n## Prerequisites\n\n- Python 3.12\n- Java JRE (for Spark)\n- Google Cloud account with BigQuery \u0026 GCS access\n- Kaggle account for static pipelines\n- Docker \u0026 Docker Compose (optional, for Kafka and containerized services)\n- DBT Core \u0026 DBT BigQuery plugin (for data modeling)\n\n## Installation\n\n1. Clone the repository:\n   ```bash\n   git clone https://github.com/your-org/DataBridge.git\n   cd DataBridge\n   ```\n2. Create and activate a Python virtual environment:\n   ```bash\n   python -m venv .venv\n   source .venv/bin/activate\n   ```\n3. 
Install dependencies:\n   ```bash\n   pip install --upgrade pip\n   pip install -r requirements.txt\n   ```\n\n## Configuration\n\n1. Copy environment file templates:\n   ```bash\n   cp .env.example .env\n   cp config/dbt-user-creds.example.json config/dbt-user-creds.json\n   ```\n2. Edit `.env` and fill in:\n   - `ALPHA_VANTAGE_KEYS` (comma-separated API keys)\n   - `PROJECT_ID` (your GCP project ID), `DATASET_NAME` (BigQuery dataset name), `GCS_BUCKET` (temporary GCS bucket)\n   - `PARQUET_OUTPUT_PATH` (optional; GCS or local path to write Parquet exports)\n   - `STOCK_SYMBOLS` (comma-separated tickers)\n   - `KAGGLE_USERNAME`, `KAGGLE_KEY`\n   - `KAFKA_BOOTSTRAP_SERVERS`, `KAFKA_TOPIC`, `KAFKA_CONSUMER_GROUP`\n3. Populate `config/dbt-user-creds.json` with your GCP service account key.\n\n## GCP Setup\n\nUse the helper functions in `config/gcp_setup.py` to provision the GCS bucket and the BigQuery dataset:\n\n```bash\npython - \u003c\u003cEOF\nfrom config.gcp_setup import create_gcs_bucket, create_bigquery_dataset\ncreate_gcs_bucket(\"\u003cyour-gcs-bucket\u003e\")\ncreate_bigquery_dataset(\"\u003cyour-dataset-name\u003e\")\nEOF\n```\n\n## Pipelines\n\n### Static Data Pipeline\n\nRun all static pipelines (S\u0026P 500, global economy, crypto, etc.):\n\n```bash\npython static/run_all.py\n```\n\nOr run an individual module, e.g.:\n\n```bash\npython static/sandp500.py\n```\n\n### Real-Time Pipeline\n\n- **Last N records** (one-off):\n  ```bash\n  python streaming/realtime_stock_recent.py\n  ```\n- **Continuous stream** (every 5 minutes):\n  ```bash\n  python streaming/realtime_stock_stream.py\n  ```\n\n### Kafka Test Stream \u0026 Consumer\n\n1. Produce test data to Kafka and write to BigQuery:\n   ```bash\n   python streaming/realtime_test_kafka_stream.py\n   ```\n2. 
Consume from Kafka and load to BigQuery:\n   ```bash\n   python kafka_consumer/consumer.py\n   ```\n\n### ETL Module\n\nUse the Alpha Vantage ETL module in `etl/alpha_vantage.py`:\n\n```python\nfrom etl.alpha_vantage import fetch_intraday_data, parse_alpha_vantage_json, write_to_bigquery\n# Assumes `spark` is an active SparkSession and that DATASET_NAME, PROJECT_ID,\n# and GCS_BUCKET hold the values configured in .env.\n# Fetch, parse into a Spark DataFrame, then write:\ndata = fetch_intraday_data(\"AAPL\")\ndf = parse_alpha_vantage_json(data, \"AAPL\", spark)\nwrite_to_bigquery(df, \"intraday\", DATASET_NAME, PROJECT_ID, GCS_BUCKET)\n```\n\n## DBT Models\n\nDBT is used for transforming raw tables and modeling marts:\n\n```bash\n# Ensure your profile is set (profile: default picks up env vars)\ndbt deps\ndbt seed\ndbt run\ndbt test\n```\n\nModels live under `models/` with `staging/`, `intermediate/`, and `marts/`.\n\n![dbt-diagram](images/dbt_lineage.png)\n\n## Airflow (Optional)\n\nDAG definitions reside in `airflow/dags/`. To run Airflow locally:\n\n```bash\nexport AIRFLOW_HOME=$(pwd)/airflow\nairflow db migrate\nairflow standalone\n```\n\n## Docker \u0026 Docker Compose\n\nKafka and the test producer/consumer can be launched via Docker Compose:\n\n```bash\ndocker-compose up -d\n```\n\nServices:\n\n- `zookeeper`, `kafka`\n- `test-producer` (runs the Kafka test stream)\n- `kafka-consumer` (loads the Kafka topic into BigQuery)\n\n## Running Tests\n\nRun Python unit tests with `pytest`:\n\n```bash\npytest\n```\n\n## Directory Structure\n\n```\n├── config/                  # configuration and GCP setup\n├── etl/                     # Alpha Vantage ETL module\n├── kafka_utils/             # Kafka producer config utility\n├── kafka_consumer/          # Kafka consumer script\n├── static/                  # static data pipelines (Kaggle)\n├── streaming/               # real-time \u0026 test streaming scripts\n├── models/                  # DBT models (staging, marts, etc.)\n├── airflow/                 # Airflow DAGs \u0026 logs\n├── tests/                   # unit tests\n├── Dockerfile\n├── docker-compose.yml\n├── 
requirements.txt\n└── README.md\n```\n\n## License\n\nThis project is licensed under the [MIT License](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fansh-info%2Fdatabridge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fansh-info%2Fdatabridge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fansh-info%2Fdatabridge/lists"}