{"id":17260076,"url":"https://github.com/paulescu/real-time-data-pipelines-in-python","last_synced_at":"2025-09-11T23:35:08.108Z","repository":{"id":216301024,"uuid":"740942889","full_name":"Paulescu/real-time-data-pipelines-in-python","owner":"Paulescu","description":"Real-time Feature Pipelines in Python ⚡","archived":false,"fork":false,"pushed_at":"2024-03-14T22:49:08.000Z","size":1801,"stargazers_count":267,"open_issues_count":2,"forks_count":66,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-03-30T22:11:56.336Z","etag":null,"topics":["ml","python","quix","realtime"],"latest_commit_sha":null,"homepage":"https://www.realworldml.net/subscribe","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Paulescu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-09T11:43:53.000Z","updated_at":"2025-03-17T16:25:13.000Z","dependencies_parsed_at":"2024-02-20T11:47:53.515Z","dependency_job_id":"0907afe8-38c6-4098-a35e-22d3925edc36","html_url":"https://github.com/Paulescu/real-time-data-pipelines-in-python","commit_stats":null,"previous_names":["paulescu/real-time-data-pipelines-in-python"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Paulescu%2Freal-time-data-pipelines-in-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Paulescu%2Freal-time-data-pipelines-in-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Paulescu%2Freal-time-data-pipelines-in-python
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Paulescu%2Freal-time-data-pipelines-in-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Paulescu","download_url":"https://codeload.github.com/Paulescu/real-time-data-pipelines-in-python/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247595335,"owners_count":20963943,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ml","python","quix","realtime"],"created_at":"2024-10-15T07:47:07.301Z","updated_at":"2025-04-07T05:13:27.687Z","avatar_url":"https://github.com/Paulescu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003ca href='https://www.realworldml.net/'\u003e\u003cimg src='./assets/rwml_logo.png' width='350'\u003e\u003c/a\u003e    \n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003ch1\u003eBuild and deploy a production-ready real-time feature pipeline in Python\u003c/h1\u003e\n    \u003ch2\u003eApache Kafka + Python = \u003ca href=\"https://github.com/quixio/quix-streams\"\u003eQuix Streams\u003c/a\u003e ❤️\u003c/h2\u003e\n    \n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"https://www.youtube.com/watch?v=JMQwXmlloJM\"\u003e\n    \u003cimg src=\"assets/yt_cover.png\" alt=\"Intro to the course\" style=\"width:75%;\"\u003e\n    \u003cp\u003eClick here to watch the video 🎬\u003c/p\u003e\n  \u003c/a\u003e\n\u003c/div\u003e\n\n\n#### Table of contents\n* [The 
problem](#the-problem)\n* [Example](#example)\n* [Run the pipeline locally](#run-the-pipeline-locally)\n* [Deployment](#deployment)\n* [Streamlit dashboard for monitoring](#streamlit-dashboard-for-monitoring)\n* [Wanna learn more real-time ML?](#wanna-learn-more-real-time-ml)\n\n\n## The problem\n\nImagine you want to build a trading bot for cryptocurrencies using ML.\n\nBefore you even get to work on your ML model, you need to design, develop and deploy a **real-time feature pipeline** that produces the features your model needs both at training time and at inference time.\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./assets/3_pipelines.gif\" width='400' /\u003e\n\u003c/div\u003e\n\nThis pipeline has 3 steps:\n\n- **Ingest** raw data from an external service, like raw trades from the Kraken Websocket API.\n\n- **Transform** these trades into features for your ML model, like trading indicators based on 1-minute OHLC candles, and\n\n- **Save** these features in a Feature Store, so your ML models can fetch them both to generate training data, and to generate real-time predictions.\n\nIn a real-world setting, each of these steps is implemented as a separate service, and communication between these services happens through a message broker like Kafka.\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./assets/docker_and_kafka.gif\" width='400' /\u003e\n\u003c/div\u003e\n\nThis way you make your system scalable, by spinning up more containers as needed, and leveraging Kafka consumer groups.\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./assets/scaling.gif\" width='400' /\u003e\n\u003c/div\u003e\n\nAnd this is all great, but the question now is\n\u003e How do you implement this in practice?\n\nLet's go through an example.\n\n## Example\n\nIn this repo you have a full implementation of a production-ready real-time feature pipeline for crypto trading, plus a real-time [dashboard to visualize these 
features](https://streamlit-plabartabajo-ohlcinrealtime-production.deployments.quix.io/).\n\nWe use [Quix Streams 2.0](https://github.com/quixio/quix-streams), a cloud-native library for processing data in Kafka using pure Python.\n\nWith Quix Streams we get the best of both worlds:\n\n- low-level scalability and resiliency from Apache Kafka, so our code is production-ready from day 1, and\n\n- an easy-to-use Python interface, which makes this library extremely user-friendly for data scientists and ML engineers like you and me.\n\n\nIn this repository we have implemented 3 services for our real-time pipeline:\n\n- `trade_producer` → reads trades from the Kraken Websocket API and saves them in a Kafka topic.\n- `trade_to_ohlc` → reads trades from a Kafka topic, computes Open-High-Low-Close candles (OHLC) using Stateful Window Operators, and saves them in another Kafka topic.\n\n- `ohlc_to_feature_store` → saves these final features to an external Feature Store.\n\nPlus a\n\n- Streamlit `dashboard` to visualize the saved features in real time.\n\nThe final pipeline has been deployed to [Quix Cloud](https://quix.io/), and so has the [Streamlit dashboard](https://dashboard-plabartabajo-ohlcinrealtime-production.deployments.quix.io/).\n\n\n\n## Run the pipeline locally\n\n1. Create a `.env` file and fill in the credentials to connect to the serverless Hopsworks Feature Store\n    ```\n    $ cp .env.example .env\n    ```\n\n2. Build the Docker image for each of the pipeline steps: `trade_producer`, `trade_to_ohlc` and `ohlc_to_feature_store`\n    ```\n    $ make build\n    ```\n\n3. Start the pipeline\n    ```\n    $ make start\n    ```\n\n4. Stop the pipeline locally\n    ```\n    $ make stop\n    ```\n\n## Deployment\n\nThis pipeline can run on any production environment that supports Docker and a message broker like Apache Kafka or Redpanda. 
In this example, I have deployed it to Quix Cloud.\n\n\u003e[Quix Cloud](https://quix.io/) provides fully managed containers, Kafka and observability tools to run your applications in production.\n\nTo deploy this pipeline to [Quix Cloud](https://quix.io/) you just need to\n\n- [Sign up for FREE](https://quix.io/)\n- Create a Quix Cloud Project and an environment, and\n- Fork this repository and link it to your newly created Quix Cloud environment.\n\n\u003e [This video](https://quix.io/docs/create/overview.html#next-step) will help you get up and running on Quix Cloud\n\n## Streamlit dashboard for monitoring\n\nThe Streamlit app at `/dashboard` periodically fetches the latest data from the feature store, and plots it on a dashboard.\n\nThe dashboard has been deployed to Quix Cloud and it is publicly accessible [here](https://dashboard-plabartabajo-ohlcinrealtime-production.deployments.quix.io/).\n\n\n\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"https://dashboard-plabartabajo-ohlcinrealtime-production.deployments.quix.io/\"\u003e\n    \u003cimg src=\"assets/dashboard.png\" alt=\"Intro to the course\" style=\"width:75%;\"\u003e\n    \u003cp\u003eClick here to see the dashboard in action\u003c/p\u003e\n  \u003c/a\u003e\n\u003c/div\u003e\n\n\n## Wanna learn more real-time ML?\n\nJoin more than 13k subscribers to the Real-World ML Newsletter. Every Saturday morning.\n\n[→ Subscribe for FREE 🤗](https://www.realworldml.net/subscribe)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpaulescu%2Freal-time-data-pipelines-in-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpaulescu%2Freal-time-data-pipelines-in-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpaulescu%2Freal-time-data-pipelines-in-python/lists"}