{"id":15208034,"url":"https://github.com/vubacktracking/stream-data-processing","last_synced_at":"2026-01-31T21:07:45.661Z","repository":{"id":252817227,"uuid":"836596887","full_name":"VuBacktracking/stream-data-processing","owner":"VuBacktracking","description":"Streaming data processing pipeline using Spark, PostgreSQL, Debezium, Kafka, Minio, Delta Lake, Trino and DBeaver","archived":false,"fork":false,"pushed_at":"2024-08-16T08:52:53.000Z","size":1797,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-06T17:15:36.953Z","etag":null,"topics":["dbeaver","debezium","delta-lake","kafka","spark","spark-streaming","stream-processing","trino"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VuBacktracking.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-01T07:20:14.000Z","updated_at":"2025-01-16T04:03:51.000Z","dependencies_parsed_at":null,"dependency_job_id":"f8c062ee-c2cb-4b55-b0c8-1beed3e06f84","html_url":"https://github.com/VuBacktracking/stream-data-processing","commit_stats":{"total_commits":37,"total_committers":1,"mean_commits":37.0,"dds":0.0,"last_synced_commit":"343f4d06ba900dbaca9002ff70d6d3eb449302af"},"previous_names":["vubacktracking/stream-data-processing"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/VuBacktracking/stream-data-processing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VuBacktracking%2Fstream-data-processi
ng","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VuBacktracking%2Fstream-data-processing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VuBacktracking%2Fstream-data-processing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VuBacktracking%2Fstream-data-processing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VuBacktracking","download_url":"https://codeload.github.com/VuBacktracking/stream-data-processing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VuBacktracking%2Fstream-data-processing/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260945484,"owners_count":23087021,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dbeaver","debezium","delta-lake","kafka","spark","spark-streaming","stream-processing","trino"],"created_at":"2024-09-28T07:00:58.443Z","updated_at":"2026-01-31T21:07:45.633Z","avatar_url":"https://github.com/VuBacktracking.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DATA STREAM PROCESSING \n\n## Overview\n\n* Persist data to PostgreSQL.\n* Capture data changes with the Debezium connector.\n* Stream data from a Kafka topic using PySpark (Spark Streaming).\n* Convert the streaming data to Delta Lake format.\n* Write the Delta Lake data to MinIO (S3-compatible object storage).\n* Query the data with Trino.\n* Display the results in DBeaver.\n\n## System Architecture\n\n\u003cp align = 
\"center\"\u003e\n    \u003cimg src=\"assets/architecture.png\" alt=\"workflow\"\u003e\n\u003c/p\u003e\n\n## Prerequisites\n\nBefore running this project, ensure you have the following installed.\\\n**Note**: The project was set up on Ubuntu 22.04.\n\n* Ubuntu 22.04 (preferred, but you can use Ubuntu 20.04)\n* Python 3.10\n* Apache Spark (installed locally)\n* Apache Airflow\n* Confluent Containers (Zookeeper, Kafka, Schema Registry, Connect, Control Center)\n* Docker\n* MinIO\n* Trino, DBeaver CE\n* Delta Lake\n* Debezium, Debezium UI\n\n## Start\n\n1. **Clone the repository**\n```bash\n$ git clone https://github.com/VuBacktracking/stream-data-processing.git\n$ cd stream-data-processing\n```\n\n2. **Start the data streaming infrastructure**\n```bash\n$ sudo service docker start\n$ docker compose -f storage-docker-compose.yaml -f stream-docker-compose.yaml up -d\n```\n\n3. **Set up the environment**\n```bash\n$ python3 -m venv .venv\n$ source .venv/bin/activate\n$ pip install -r requirements.txt\n```\n\nCreate a `.env` file and add your MinIO keys and `SPARK_HOME` to it.\n```ini\n# MinIO\n- MINIO_ACCESS_KEY='minio_access_key'\n- MINIO_SECRET_KEY='minio_secret_key'\n- MINIO_ENDPOINT='http://localhost:9000'\n- BUCKET_NAME='datalake'\n\n# PostgreSQL\n- POSTGRES_DB='v9'\n- POSTGRES_USER='v9'\n- POSTGRES_PASSWORD='v9'\n\n# Spark\n- SPARK_HOME=\"\"\n```\n\n4. **Services**\n\n* PostgreSQL is accessible on the default port 5432.\n* Debezium UI: http://localhost:8085.\n* Kafka Control Center: http://localhost:9021.\n* Trino: http://localhost:8084.\n* MinIO: http://localhost:9001.\n\n## How to use?\n\n- **Step 1. Start the Debezium connector**\n```bash\ncd debezium\nbash run-cdc.sh register_connector conf/products-cdc-config.json\n```\n\nThe connector should now appear as running in the Debezium UI at http://localhost:8085, as in the image below.\n\n\u003cp align = \"center\"\u003e\n    \u003cimg src=\"assets/debezium-connect.png\" width = 80%\u003e\n\u003c/p\u003e\n\n- **Step 2. 
Create the table and insert data into the database**\n\n```bash\npython3 database-operations/create_table.py\npython3 database-operations/insert_table.py\n```\n\nIn the PostgreSQL connection, you should see the database `v9` and the table `products`, as in the image below.\n\n\u003cp align = \"center\"\u003e\n    \u003cimg src=\"assets/postgres.png\" width = 80%\u003e\n\u003c/p\u003e\n\n- **Step 3. Start streaming data to MinIO**\n```bash\npython3 stream_processing/delta-to-minio.py\n```\n\nOnce the data has been written to MinIO, open http://localhost:9001 to see the result, as in the image below.\n\n\u003cp align = \"center\"\u003e\n    \u003cimg src=\"assets/minio.png\" width = 80%\u003e\n\u003c/p\u003e\n\n## Read streaming data with Trino and DBeaver\n\n### Connect to Trino in DBeaver\n\n\u003cp align = \"center\"\u003e\n    \u003cimg src=\"assets/trino_connect.png\" width = 80%\u003e\n\u003c/p\u003e\n\n### Query with DBeaver\n\nCreate your Trino schema and table in DBeaver:\n\n```sql\n-- Create the schema if it doesn't exist\nCREATE SCHEMA IF NOT EXISTS lakehouse.products\nWITH (location = 's3://datalake/');\n\n-- Create the products table\nCREATE TABLE IF NOT EXISTS lakehouse.products.products (\n    id VARCHAR,\n    name VARCHAR,\n    original_price DOUBLE,\n    price DOUBLE,\n    fulfillment_type VARCHAR,\n    brand VARCHAR,\n    review_count INTEGER,\n    rating_average DOUBLE,\n    favourite_count INTEGER,\n    current_seller VARCHAR,\n    number_of_images INTEGER,\n    category VARCHAR,\n    quantity_sold INTEGER,\n    discount DOUBLE\n) WITH (\n    location = 's3://datalake/products/'\n);\n```\n\n\u003cp align = \"center\"\u003e\n    \u003cimg src=\"assets/trino_dbeaver.png\" width = 
80%\u003e\n\u003c/p\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvubacktracking%2Fstream-data-processing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvubacktracking%2Fstream-data-processing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvubacktracking%2Fstream-data-processing/lists"}