{"id":23499627,"url":"https://github.com/same-ou/spark-hdfs-ml","last_synced_at":"2025-10-31T08:31:33.374Z","repository":{"id":268627162,"uuid":"904949907","full_name":"same-ou/spark-hdfs-ml","owner":"same-ou","description":"Spark and HDFS cluster using Docker and Docker Compose","archived":false,"fork":false,"pushed_at":"2024-12-25T14:54:10.000Z","size":286,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-07T12:52:22.528Z","etag":null,"topics":["hdfs","ml","spark"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/same-ou.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-17T21:31:24.000Z","updated_at":"2025-02-06T11:37:15.000Z","dependencies_parsed_at":"2024-12-18T00:19:01.246Z","dependency_job_id":"ac852d7b-9691-4de0-8f1d-ca2ed2b1a8ce","html_url":"https://github.com/same-ou/spark-hdfs-ml","commit_stats":null,"previous_names":["same-ou/spark-hdfs-ml"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/same-ou/spark-hdfs-ml","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/same-ou%2Fspark-hdfs-ml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/same-ou%2Fspark-hdfs-ml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/same-ou%2Fspark-hdfs-ml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/same-ou%2Fspark-hdfs-ml/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/same-ou","download_url":"https://codeload.github.com/same-ou/spark-hdfs-ml/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/same-ou%2Fspark-hdfs-ml/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281956160,"owners_count":26589782,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-31T02:00:07.401Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hdfs","ml","spark"],"created_at":"2024-12-25T06:18:00.117Z","updated_at":"2025-10-31T08:31:32.987Z","avatar_url":"https://github.com/same-ou.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark \u0026 HDFS Cluster Setup for Machine Learning Training\nThis guide provides detailed steps to set up a Spark and HDFS cluster using Docker and Docker Compose. 
The setup includes loading data into HDFS and running simple PySpark applications for machine learning.\n\n## Prerequisites\nBefore you begin, ensure that you have the following installed:\n\n * [Docker](https://docs.docker.com/get-started/get-docker/): Make sure Docker is installed and running on your machine.\n * [Docker Compose](https://docs.docker.com/compose/install/): This is required to run the multi-container setup.\n\n## Step 1: Clone the Repository\nClone the repository to your local machine:\n\n```bash\ngit clone git@github.com:same-ou/spark-hdfs-ml.git\ncd spark-hdfs-ml\n```\n## Step 2: Set Up Docker Containers with Docker Compose\nThis project uses Docker Compose to set up the Spark Master, Spark Worker, and HDFS (NameNode and DataNode) containers. To start the containers, follow these steps:\n\n1. Start Docker Compose:\n\n    From the project root, run the following command to start the cluster:\n\n    ```bash\n    docker-compose up -d\n    ```\n\n    This command starts the containers for the Spark Master, the Spark Workers, the HDFS NameNode, the HDFS DataNode, and Hue in detached mode.\n\n2. Verify the Containers:\n\n    Check that the containers are running:\n\n    ```bash\n    docker ps\n    ```\n\n    You should see containers for `spark-master`, `spark-worker-1`, `spark-worker-2`, `namenode`, `datanode`, and `hue`.\n\n## Step 3: Cluster Architecture\nBelow is a diagram of the architecture for this Spark and HDFS cluster setup.\n\n![cluster architecture](images/architecture-dark.png)\n\n * Spark Master: The central controller node that manages the cluster and schedules tasks.\n * Spark Worker: The worker nodes that execute the tasks.\n * HDFS NameNode: The master server that manages the filesystem metadata.\n * HDFS DataNode: The worker nodes that store the actual data in HDFS.\n\n## Step 4: Load Data into HDFS\n\nTo load data into HDFS, the data must first be available inside the `namenode` container. 
One way to do this is by using the `docker cp` command, which copies files or directories between the host machine and a container. The command would look like this:\n\n```bash\ndocker cp /path/to/your/data/data.csv namenode:/tmp/data/\n```\nHowever, this project uses a simpler approach. The data folder in the main project folder is mounted at `/tmp/data` in the `namenode` container. You can verify this by listing that directory from the host:\n\n```bash\ndocker exec -it namenode ls /tmp/data\n```\nThis shows the contents of the `/tmp/data` directory inside the `namenode` container. Because the data folder is mounted into the container, any file you place in the data folder on the host machine is immediately available under `/tmp/data/` inside the container, with no copy step required.\n\n### Copy the Data to HDFS\nAfter you verify that the data is correctly mounted, you can proceed to upload the data into HDFS:\n\n1. Create a directory in HDFS to store the CSV file:\n\n```bash\ndocker exec -it namenode hdfs dfs -mkdir -p /user/data\n```\n2. Upload the CSV file (e.g., tweets.csv) from the mounted `/tmp/data` folder into the newly created HDFS directory:\n\n```bash\ndocker exec -it namenode hdfs dfs -put /tmp/data/tweets.csv /user/data/\n```\n\n3. Verify that the file was uploaded successfully:\n\n```bash\ndocker exec -it namenode hdfs dfs -ls /user/data\n```\n\nYou should see tweets.csv listed in the output.\n\nAdditionally, you can verify the file upload using Hue. Hue is a web interface that allows you to interact with HDFS. To use it:\n\n1. Open your browser and navigate to http://localhost:8000.\n2. Log in with the default credentials (if this is your first time logging in, you will be asked to create an account).\n3. 
Go to the File Browser section and navigate to `/user/data/`.\n4. You should see tweets.csv listed in the directory.\n\n## Step 5: Process Data Using PySpark\n\nNow that the data is loaded into HDFS, you can use a PySpark application to process it.\n\n1. **Create your application in the `apps` folder**, which is mounted to the `spark-master` container. Write your PySpark code in a script such as `read_hdfs.py` (or any other script name) to process the data.\n\n2. **Run the application** by executing the following command from your local machine:\n\n```bash\ndocker exec -it spark-master \\\n    /opt/bitnami/spark/bin/spark-submit \\\n    --master spark://spark-master:7077 \\\n    /opt/bitnami/spark/work/read_hdfs.py\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsame-ou%2Fspark-hdfs-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsame-ou%2Fspark-hdfs-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsame-ou%2Fspark-hdfs-ml/lists"}