{"id":14982297,"url":"https://github.com/vubacktracking/hdfs-stream-processing","last_synced_at":"2026-02-15T16:05:16.758Z","repository":{"id":314334152,"uuid":"831589053","full_name":"VuBacktracking/hdfs-stream-processing","owner":"VuBacktracking","description":"Streaming data processing using Hadoop HDFS, Spark, Kafka, Minio, Elasticsearch","archived":false,"fork":false,"pushed_at":"2024-08-01T05:27:27.000Z","size":506,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-11T22:48:39.866Z","etag":null,"topics":["airflow","elastic","hadoop","hdfs","kafka","kibana","minio","spark"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VuBacktracking.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-07-21T02:48:26.000Z","updated_at":"2025-08-25T11:56:12.000Z","dependencies_parsed_at":"2025-09-11T22:48:41.379Z","dependency_job_id":"f0b09e05-b5f0-43b0-b7aa-94844223df6c","html_url":"https://github.com/VuBacktracking/hdfs-stream-processing","commit_stats":null,"previous_names":["vubacktracking/hdfs-stream-processing"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/VuBacktracking/hdfs-stream-processing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VuBacktracking%2Fhdfs-stream-processing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/VuBacktracking%2Fhdfs-stream-processing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VuBacktracking%2Fhdfs-stream-processing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VuBacktracking%2Fhdfs-stream-processing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VuBacktracking","download_url":"https://codeload.github.com/VuBacktracking/hdfs-stream-processing/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VuBacktracking%2Fhdfs-stream-processing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278343071,"owners_count":25971399,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-04T02:00:05.491Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","elastic","hadoop","hdfs","kafka","kibana","minio","spark"],"created_at":"2024-09-24T14:05:05.668Z","updated_at":"2025-10-04T16:43:19.552Z","avatar_url":"https://github.com/VuBacktracking.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Stream Data Processing using Hadoop Ecosystem\n\n## Overview\n\n* Fetch compressed data from a URL.\n* Utilize PySpark for data processing, leveraging HDFS for storage and monitoring resources via Apache Hadoop YARN.\n* Employ a data 
generator to simulate streaming data and transmit it to Apache Kafka.\n* Implement PySpark (Spark Streaming) to consume and process streaming data from Kafka topics.\n* Persist streaming data into Elasticsearch for storage and subsequent visualization using Kibana.\n* Store streaming data into MinIO, a cloud-native object storage service.\n* Utilize Apache Airflow for orchestrating the entire data pipeline workflow.\n\n## System Architecture\n\u003cp align = \"center\"\u003e\n    \u003cimg src=\"assets/architecture.png\" alt=\"workflow\"\u003e\n\u003c/p\u003e\n\n## Prerequisites\nBefore running this project, ensure you have the following installed.\\\n**Note**: The project was set up on Ubuntu 22.04.\n\n* Ubuntu 22.04 (preferred, but you can use Ubuntu 20.04)\n* Apache Hadoop HDFS (installed locally)\n* Apache Spark (installed locally)\n* Apache Kafka (installed locally)\n* Apache Airflow\n* Docker\n* MinIO\n* Elasticsearch, Kibana\n\n## Getting Started\n\n1. **Clone the repository**\n```\n$ git clone https://github.com/VuBacktracking/hdfs-stream-processing.git\n$ cd hdfs-stream-processing\n```\n\n2. **Start our data streaming infrastructure**\n```\n$ sudo service docker start\n$ sudo systemctl start zookeeper\n$ sudo systemctl start kafka\n$ start-all.sh\n$ docker compose -f storage-docker-compose.yaml up -d\n```\n\n3. **Set up the environment**\n```\n$ python3 -m venv .venv\n$ source .venv/bin/activate\n$ pip install -r requirements.txt\n```\n\nCreate a `.env` file and set your `HADOOP_HOME`, `SPARK_HOME`, and `KAFKA_HOME` paths in it.\n```\nHADOOP_HOME=\"\"\nSPARK_HOME=\"\"\nKAFKA_HOME=\"\"\n```\n\n4. **Services**\n\n    * Kibana -\u003e `localhost:5601`\n    * Airflow -\u003e `localhost:8080`\n    * MinIO -\u003e `localhost:9001`\n    * Spark Jobs -\u003e `localhost:4040`\n    * Kafka -\u003e `localhost:9092`\n    * Hadoop Namenode -\u003e `localhost:9870`\n    * Hadoop YARN -\u003e `localhost:8088/cluster`\n    * Hadoop HDFS -\u003e `localhost:9000`\n\n## Steps of the project\n\n1. 
**Download the data and put it into Hadoop HDFS**\n```\n$ wget -O data/sensors.zip https://github.com/erkansirin78/datasets/raw/master/sensors_instrumented_in_an_office_building_dataset.zip \u0026\u0026 unzip ./data/sensors.zip -d ./data/ \u0026\u0026 rm data/KETI/README.txt \u0026\u0026 rm data/sensors.zip\n```\n\n```\n$ hdfs dfs -mkdir -p /user/stream_data/\n$ hdfs dfs -copyFromLocal ./data/KETI/ /user/stream_data/\n```\n\n2. **Run the Spark transformation**\n```\npython3 utils/spark_transforming.py\n```\n\n\u003cp align = \"center\"\u003e\n    \u003cimg src=\"assets/transforming_log_file.png\" alt=\"workflow\"\u003e\n\u003c/p\u003e\n\n3. **Create the Kafka topic**\n```\npython3 kafka/kafka_admin.py\n```\n\u003cp align = \"center\"\u003e\n    \u003cimg src=\"assets/kafka_topic_log.png\" alt=\"workflow\"\u003e\n\u003c/p\u003e\n\n4. **Run the data generator**\n\nThanks to [@erkansirin78](https://github.com/erkansirin78), whose repository this script builds on to simulate streaming data. You can find the `dataframe_to_kafka.py` script in the [data-generator](https://github.com/erkansirin78/data-generator) repository.\n\n```\npython3 kafka/kafka_producer.py\n```\n\nor run the bash script directly\n```\n./bash/data_generator.sh\n```\n\n5. **Read and store data in MinIO and Elasticsearch**\n```\npython3 spark_streaming/covert-to-elasticsearch.py\n```\n\nand\n\n```\npython3 spark_streaming/covert-to-minio.py\n```\n\n## Read data in Elasticsearch","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvubacktracking%2Fhdfs-stream-processing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvubacktracking%2Fhdfs-stream-processing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvubacktracking%2Fhdfs-stream-processing/lists"}