{"id":32522095,"url":"https://github.com/gregorykogan/crypto-trading-data-pipeline","last_synced_at":"2026-04-08T21:32:41.576Z","repository":{"id":319938745,"uuid":"1077874934","full_name":"GregoryKogan/crypto-trading-data-pipeline","owner":"GregoryKogan","description":"Real-time crypto trading data pipeline using Apache Spark, Kafka, and Airflow. Containerized microservices architecture for streaming analytics.","archived":false,"fork":false,"pushed_at":"2025-10-21T01:36:47.000Z","size":22,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-28T06:52:46.747Z","etag":null,"topics":["airflow","airflow-dags","airflow-docker","apache-airflow","apache-kafka","apache-spark","data-engineering","data-engineering-pipeline","data-pipeline","data-processing","docker","docker-compose","kafka","postgresql","pyspark","python","spark","streaming-data","streaming-data-pipelines","streaming-data-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GregoryKogan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-16T21:44:20.000Z","updated_at":"2025-10-22T16:51:59.000Z","dependencies_parsed_at":"2025-10-21T03:32:58.751Z","dependency_job_id":"fb3b2397-bcd9-480d-8460-d5a4139895f1","html_url":"https://github.com/GregoryKogan/crypto-trading-data-pipeline","commit_stats":null,"previous_names":["gregorykogan/cr
ypto-trading-data-pipeline"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/GregoryKogan/crypto-trading-data-pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GregoryKogan%2Fcrypto-trading-data-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GregoryKogan%2Fcrypto-trading-data-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GregoryKogan%2Fcrypto-trading-data-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GregoryKogan%2Fcrypto-trading-data-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GregoryKogan","download_url":"https://codeload.github.com/GregoryKogan/crypto-trading-data-pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GregoryKogan%2Fcrypto-trading-data-pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31575598,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T14:31:17.711Z","status":"ssl_error","status_checked_at":"2026-04-08T14:31:17.202Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","airflow-dags","airflow-docker","apache-airflow","apache-kafka","apache-spark","data-engineering","data-engineering-pipeline","data-pipeline","data-processing","docker","docker-compose","kafka","postgresql","pyspark","python","spark","streaming-data","streaming-data-pipelines","streaming-data-processing"],"created_at":"2025-10-28T06:52:08.308Z","updated_at":"2026-04-08T21:32:41.571Z","avatar_url":"https://github.com/GregoryKogan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Real-Time Crypto Trading Analytics Pipeline\n\n[![Python](https://img.shields.io/badge/Python-3.9+-3776AB?style=flat-square\u0026logo=python\u0026logoColor=white)](https://www.python.org/)\n[![Apache Spark](https://img.shields.io/badge/Apache%20Spark-3.5.1-E25A1C?style=flat-square\u0026logo=apachespark\u0026logoColor=white)](https://spark.apache.org/)\n[![Apache Kafka](https://img.shields.io/badge/Apache%20Kafka-3.7.0-231F20?style=flat-square\u0026logo=apachekafka\u0026logoColor=white)](https://kafka.apache.org/)\n[![Apache 
Airflow](https://img.shields.io/badge/Apache%20Airflow-2.8+-017CEE?style=flat-square\u0026logo=apacheairflow\u0026logoColor=white)](https://airflow.apache.org/)\n[![PostgreSQL](https://img.shields.io/badge/PostgreSQL-17-336791?style=flat-square\u0026logo=postgresql\u0026logoColor=white)](https://www.postgresql.org/)\n[![Docker](https://img.shields.io/badge/Docker-Containerized-2496ED?style=flat-square\u0026logo=docker\u0026logoColor=white)](https://www.docker.com/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=flat-square)](https://opensource.org/licenses/MIT)\n\nA streaming data pipeline for real-time cryptocurrency trade data processing. This project demonstrates modern data engineering concepts using a containerized microservices architecture with Apache Spark, Kafka, and Airflow.\n\n## Table of Contents\n\n- [Core Technologies](#core-technologies)\n- [Architecture](#architecture)\n- [How to Run the Pipeline](#how-to-run-the-pipeline)\n  - [Build and Start All Services](#1-build-and-start-all-services)\n  - [Configure Airflow Connections](#2-configure-airflow-connections-one-time-setup)\n  - [Start the Pipeline](#3-start-the-pipeline)\n- [Monitoring \u0026 Verification](#monitoring--verification)\n- [How to Stop the Pipeline](#how-to-stop-the-pipeline)\n- [Project Structure](#project-structure)\n- [License](#license)\n\n## Core Technologies\n\n- **Data Ingestion**: Python, Binance WebSocket API\n- **Message Broker**: Apache Kafka\n- **Stream Processing**: Apache Spark (Structured Streaming)\n- **Data Storage**: PostgreSQL\n- **Orchestration \u0026 Monitoring**: Apache Airflow\n- **Containerization**: Docker \u0026 Docker Compose\n\n## Architecture\n\nThe data flows from a live exchange feed through Kafka, is processed in real-time by Spark, and is stored in a PostgreSQL database. 
Airflow manages the submission and monitoring of the Spark job.\n\n```mermaid\ngraph TD\n    subgraph External Source\n        API[\"Crypto Exchange API\u003cbr/\u003e(WebSocket Endpoint)\"]\n    end\n\n    subgraph Docker Compose\n        Producer[\"Data Ingestion Producer\u003cbr/\u003e(producer.py)\"]\n        Kafka[\"Apache Kafka\u003cbr/\u003e(Topic: raw_trades)\"]\n        Airflow[\"Apache Airflow\u003cbr/\u003e(Scheduling \u0026 Monitoring)\"]\n        Spark[\"Stream Processor\u003cbr/\u003e(Spark Streaming)\u003cbr/\u003e1-min Tumbling Windows\"]\n        Postgres[\"Data Mart (PostgreSQL)\u003cbr/\u003e(Table: trades_1min_agg)\"]\n    end\n\n    API -- \"1. Live Trade Data (JSON)\" --\u003e Producer\n    Producer -- \"2. Raw Trade Events (JSON)\" --\u003e Kafka\n    Kafka -- \"3a. Consume Raw Events\" --\u003e Spark\n    Airflow -. \"3b. Submits \u0026 Monitors Job\" .-\u003e Spark\n    Spark -- \"4. Aggregated Data\" --\u003e Postgres\n```\n\n## How to Run the Pipeline\n\n### 1. Build and Start All Services\n\nFrom the project's root directory:\n\n```bash\ndocker-compose up --build -d\n```\n\nThis will start the following services:\n\n- **Zookeeper**: Kafka coordination service\n- **Kafka**: Message broker for streaming data\n- **PostgreSQL**: Database for storing aggregated data\n- **pgAdmin**: Web interface for database management\n- **Producer**: Python service that ingests live crypto data\n- **Spark Master**: Spark cluster master node\n- **Spark Worker**: Spark cluster worker node\n- **Airflow Init**: Initializes the Airflow database and creates an admin user\n- **Airflow Webserver**: Web interface for pipeline monitoring\n- **Airflow Scheduler**: Schedules and monitors DAGs\n\n### 2. Configure Airflow Connections (One-Time Setup)\n\nThe Airflow DAGs need to know how to connect to PostgreSQL.\n\n1. Navigate to the **Airflow UI**: [http://localhost:8081](http://localhost:8081)\n2. Log in with username `admin` and password `admin`.\n3. 
Go to **Admin -\u003e Connections**.\n\n4. **Create the PostgreSQL Connection:**\n    - Click the `+` button to add a new connection.\n    - **Connection Id:** `crypto_pipeline_postgres`\n    - **Connection Type:** `Postgres`\n    - **Host:** `postgres`\n    - **Database:** `crypto_data`\n    - **Login:** `user`\n    - **Password:** `password`\n    - **Port:** `5432`\n    - Click **Save**.\n\n### 3. Start the Pipeline\n\n1. In the Airflow UI, go to the **DAGs** view.\n2. Find the `crypto_pipeline_submit_dag` and un-pause it using the toggle on the left.\n3. Click on the DAG name, then click the \"Play\" button to trigger it manually. This will submit the Spark job.\n4. Find the `crypto_pipeline_monitor_dag` and un-pause it. This DAG will now run automatically every 5 minutes to monitor the pipeline.\n\n## Monitoring \u0026 Verification\n\n- **Producer Logs**: Check that the producer is successfully publishing messages.\n\n    ```bash\n    docker logs -f producer\n    ```\n\n    You should see logs like `Published trade to Kafka: ...`\n\n- **Spark Master UI**: [http://localhost:8080](http://localhost:8080)\n  - You should see one \"Running Application\" corresponding to our `CryptoAnalytics` job.\n\n- **Airflow UI**: [http://localhost:8081](http://localhost:8081)\n  - The `crypto_pipeline_submit_dag` should have a successful run.\n  - The `crypto_pipeline_monitor_dag` should have successful runs every 5 minutes.\n\n- **pgAdmin (Database UI)**: [http://localhost:5050](http://localhost:5050)\n  - Add a new server connection:\n    - **Host:** `postgres`\n    - **Port:** `5432`\n    - **Username:** `user`\n    - **Password:** `password`\n  - Navigate to `crypto_data -\u003e Schemas -\u003e public -\u003e Tables -\u003e trades_1min_agg`.\n  - Right-click the table and select \"View/Edit Data\" -\u003e \"All Rows\". 
You should see aggregated data appearing and updating every minute.\n\n## How to Stop the Pipeline\n\nTo stop all running containers and remove the network, run:\n\n```bash\ndocker-compose down\n```\n\nTo stop the containers AND remove all persisted data (PostgreSQL data, pgAdmin data, named volumes), use the `-v` flag:\n\n```bash\ndocker-compose down -v\n```\n\n## Project Structure\n\n```plaintext\n/crypto-trading-data-pipeline/\n|\n├── .gitignore\n├── docker-compose.yml\n├── README.md\n|\n├── producer/\n│   ├── Dockerfile\n│   ├── producer.py\n│   └── requirements.txt\n|\n├── spark_processor/\n│   ├── Dockerfile\n│   ├── processor.py\n│   └── requirements.txt\n|\n├── airflow/\n│   ├── dags/\n│   │   ├── crypto_pipeline_monitor_dag.py\n│   │   └── crypto_pipeline_submit_dag.py\n│   ├── Dockerfile\n│   └── requirements.txt\n|\n└── postgres/\n    └── init/\n        └── init.sql\n```\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgregorykogan%2Fcrypto-trading-data-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgregorykogan%2Fcrypto-trading-data-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgregorykogan%2Fcrypto-trading-data-pipeline/lists"}