{"id":24458288,"url":"https://github.com/olamide100/distributed_computing","last_synced_at":"2026-04-24T18:06:20.601Z","repository":{"id":218983777,"uuid":"747852425","full_name":"OLAMIDE100/Distributed_Computing","owner":"OLAMIDE100","description":"Kafka, Spark, Airflow, Cassandra , Docker","archived":false,"fork":false,"pushed_at":"2024-01-24T21:38:37.000Z","size":21,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-31T06:43:28.385Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OLAMIDE100.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-01-24T19:11:33.000Z","updated_at":"2024-01-25T09:57:48.000Z","dependencies_parsed_at":"2024-01-24T20:57:53.365Z","dependency_job_id":null,"html_url":"https://github.com/OLAMIDE100/Distributed_Computing","commit_stats":null,"previous_names":["olamide100/distributed_computing"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/OLAMIDE100/Distributed_Computing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OLAMIDE100%2FDistributed_Computing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OLAMIDE100%2FDistributed_Computing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OLAMIDE100%2FDistributed_Computing/releases","manifests_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OLAMIDE100%2FDistributed_Computing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OLAMIDE100","download_url":"https://codeload.github.com/OLAMIDE100/Distributed_Computing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OLAMIDE100%2FDistributed_Computing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32234789,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-24T13:21:15.438Z","status":"ssl_error","status_checked_at":"2026-04-24T13:21:15.005Z","response_time":64,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-21T03:13:40.001Z","updated_at":"2026-04-24T18:06:15.593Z","avatar_url":"https://github.com/OLAMIDE100.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Distributed Computing\n\n- [Overview](#overview)\n- [Technologies](#technologies)\n- [Getting Started](#getting-started)\n- [Architecture](#architecture)\n\n- [License](#license)\n\n## Overview\n\nThis repository contains a comprehensive data processing pipeline leveraging Kafka, Spark, Airflow, Docker, and Cassandra. The project is designed to showcase a scalable and fault-tolerant architecture for handling large volumes of data. 
Additionally, fake data is generated using the Faker library to simulate a real-world use case.\n\n## Technologies\n\n### 1. Kafka\n\n[Kafka](https://kafka.apache.org/) is used as the messaging backbone for the project. It enables scalable and distributed data streaming between different components of the pipeline.\n\n### 2. Spark\n\n[Spark](https://spark.apache.org/) is employed for large-scale data processing. It utilizes the power of distributed computing to efficiently process and analyze data from Kafka topics.\n\n### 3. Airflow\n\n[Airflow](https://airflow.apache.org/) is utilized for orchestrating the entire data pipeline. It allows for easy scheduling, monitoring, and management of complex workflows.\n\n### 4. Docker\n\n[Docker](https://www.docker.com/) is employed for containerization, ensuring consistency across different environments. Each component of the pipeline is encapsulated in a Docker container, making deployment and scaling straightforward.\n\n### 5. Cassandra\n\n[Cassandra](https://cassandra.apache.org/) is chosen as the NoSQL database for storing processed data. It provides scalability and high availability, making it suitable for handling large amounts of data.\n\n### 6. Faker\n\n[Faker](https://faker.readthedocs.io/en/master/) is used to generate synthetic data for testing and development purposes. It helps in simulating realistic data scenarios within the pipeline.\n\n## Getting Started\n\nTo get started with the project, follow these steps:\n\n\n1. 
**Set the environment variables in secrets.env**\n```\n\n# Airflow configurations\nAIRFLOW__CORE__FERNET_KEY=\nAIRFLOW__WEBSERVER__SECRET_KEY=\n_AIRFLOW_WWW_USER_USERNAME=\n_AIRFLOW_WWW_USER_PASSWORD=\nAIRFLOW_UID=\n\n# Postgres configurations\nPOSTGRES_USER=\nPOSTGRES_PASSWORD=\nPOSTGRES_DB=\n\n# Cassandra configurations\nMAX_HEAP_SIZE=\nHEAP_NEWSIZE=\nCASSANDRA_USERNAME=\nCASSANDRA_PASSWORD=\nCASSANDRA_PORT=\n\n# Zookeeper configurations\nZOOKEEPER_CLIENT_PORT=\nZOOKEEPER_SERVER_ID=\n\n# Kafka base configurations\nREPLICATION_FACTOR=\nCONNECT_REST_PORT=\nCONNECT_REST_ADVERTISED_HOST_NAME=\nKAFKA_AUTHORIZER_CLASS_NAME=\nKAFKA_ALLOW_EVERYONE_IF_NO_ACL_FOUND=\nSCHEMA_REGISTRY_HOST_NAME=\nKAFKA_CLUSTERS_0_NAME=\nDYNAMIC_CONFIG_ENABLED=\n\n# Spark configurations\nSPARK_UI_PORT=\nSPARK_MASTER_PORT=\nSPARK_USER=\nSPARK_WORKER_MEMORY=\nSPARK_WORKER_CORES=\n\n```\n2. **Run Makefile**\n\n```\nmake run\n\n```\n3. **Cassandra Table Creation**\n```\ndocker exec -it cassandra\n\n```\n```\ncqlsh -u cassandra -p cassandra\nCREATE KEYSPACE spark_streaming WITH replication = {'class':'SimpleStrategy','replication_factor':1};\n\nCREATE TABLE spark_streaming.sentimental_analysis(id int primary key, sentence text, sentimental_analysis text);\n\n```\n\n4. **Kafka Topic Creation**\n```\ndocker exec -it kafka_ui\n\n```\n```\n# acks=all is a producer-side setting, not a topic config, so it belongs in the producer configuration.\nbin/kafka-topics.sh --bootstrap-server localhost:9092 --create \\\n                    --topic sentences \\\n                    --partitions 2 \\\n                    --replication-factor 2 \\\n                    --config max.message.bytes=64000 \\\n                    --config flush.messages=1 \\\n                    --config min.insync.replicas=2\n\n```\n\n\n5. 
**Run Spark Streaming**\n```\ndocker exec -it spark-worker \nspark-submit  --jars /opt/bitnami/spark/jars/spark-sql-kafka-0-10_2.12-3.3.0.jar,/opt/bitnami/spark/jars/spark-cassandra-connector_2.12-3.3.0.jar data/spark_streaming.py\n```\n\n\n## License\n\nThis project is licensed under the [MIT License](LICENSE).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Folamide100%2Fdistributed_computing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Folamide100%2Fdistributed_computing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Folamide100%2Fdistributed_computing/lists"}