{"id":50115745,"url":"https://github.com/lpsouza/bigdata-exercise","last_synced_at":"2026-05-23T15:05:04.908Z","repository":{"id":354757103,"uuid":"1223022465","full_name":"lpsouza/bigdata-exercise","owner":"lpsouza","description":"This project provides a containerized environment to run Big Data processing exercises using Apache Hadoop and Apache Spark.","archived":false,"fork":false,"pushed_at":"2026-04-29T22:30:23.000Z","size":506,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-30T00:18:10.532Z","etag":null,"topics":["docker","docker-compose","exercises","hadoop","spark"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lpsouza.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-28T00:01:33.000Z","updated_at":"2026-04-29T22:30:27.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/lpsouza/bigdata-exercise","commit_stats":null,"previous_names":["lpsouza/bigdata-exercise"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/lpsouza/bigdata-exercise","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpsouza%2Fbigdata-exercise","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpsouza%2Fbigdata-exercise/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/lpsouza%2Fbigdata-exercise/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpsouza%2Fbigdata-exercise/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lpsouza","download_url":"https://codeload.github.com/lpsouza/bigdata-exercise/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lpsouza%2Fbigdata-exercise/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33400254,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-23T04:15:53.637Z","status":"ssl_error","status_checked_at":"2026-05-23T04:15:53.242Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","docker-compose","exercises","hadoop","spark"],"created_at":"2026-05-23T15:04:43.306Z","updated_at":"2026-05-23T15:05:04.666Z","avatar_url":"https://github.com/lpsouza.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Big Data Exercise: Hadoop and Spark Cluster\n\nThis project provides a containerized environment to run Big Data processing exercises using Apache Hadoop and Apache Spark. 
It sets up a local cluster to demonstrate WordCount operations using both MapReduce (Hadoop) and PySpark.\n\n## Project Structure\n\n- `docker-compose.yaml`: Orchestrates the Hadoop (Namenode, Datanode, ResourceManager, NodeManager) and Spark (Master, Worker) containers.\n- `hadoop.sh`: Shell script to automate the execution of a Hadoop MapReduce job.\n- `spark.sh`: Shell script to automate the execution of a PySpark job.\n- `data/`: Directory containing the input dataset (`moby-dick.txt`) and the Spark script (`wordcount.py`).\n\n---\n\n## Prerequisites: Installing Docker\n\nBefore running the exercises, you must have Docker and Docker Compose installed on your machine.\n\n### Windows\n\n1. **Install WSL 2**: Open PowerShell as Administrator and run:\n\n    ```powershell\n    wsl --install\n    ```\n\n2. **Download Docker Desktop**: Download the installer from the [Docker Official Website](https://www.docker.com/products/docker-desktop).\n3. **Installation**: Run the installer and ensure the \"Use the WSL 2 based engine\" option is selected.\n4. **Verification**: Open a terminal and run `docker --version`.\n\n### Linux (Distribution-Agnostic)\n\n1. **Install using the convenience script**:\n\n    ```bash\n    curl -fsSL https://get.docker.com -o get-docker.sh\n    sudo sh get-docker.sh\n    ```\n\n2. **Manage Docker as a non-root user**:\n\n    ```bash\n    sudo usermod -aG docker $USER\n    ```\n\n    *Note: Log out and log back in for the group change to take effect.*\n\n3. **Verification**: Run `docker --version`.\n\n### macOS\n\n1. **Download Docker Desktop**: Download the installer (Intel or Apple silicon version) from the [Docker Official Website](https://www.docker.com/products/docker-desktop).\n2. **Installation**: Drag and drop Docker into the Applications folder and follow the setup instructions.\n3. **Verification**: Open a terminal and run `docker --version`.\n\n---\n\n## Getting Started\n\n### 1. 
Spin up the Cluster\n\nNavigate to the project root and run:\n\n```bash\ndocker compose up -d\n```\n\nThis will start all necessary services in the background.\n\n### 2. Verify Services\n\nCheck if the containers are running:\n\n```bash\ndocker compose ps\n```\n\nYou can also access the Web UIs:\n\n- **Hadoop Namenode**: [http://localhost:9870](http://localhost:9870)\n- **Spark Master**: [http://localhost:8080](http://localhost:8080)\n\n---\n\n## Running the Exercises\n\n### Hadoop MapReduce\n\nThe script uploads the text file to HDFS, runs the built-in WordCount example, and filters the result for the word \"whale\".\n\n**Linux/macOS**:\n\n```bash\nchmod +x hadoop.sh\n./hadoop.sh\n```\n\n**Windows (PowerShell)**:\n\n```powershell\n./hadoop.ps1\n```\n\n### Apache Spark (PySpark)\n\nThe script submits the `wordcount.py` job to the Spark cluster. It processes the data and saves the output to a temporary directory inside the container.\n\n**Linux/macOS**:\n\n```bash\nchmod +x spark.sh\n./spark.sh\n```\n\n**Windows (PowerShell)**:\n\n```powershell\n./spark.ps1\n```\n\n---\n\n## Cleaning Up\n\nTo stop and remove the containers and volumes:\n\n```bash\ndocker compose down -v\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flpsouza%2Fbigdata-exercise","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flpsouza%2Fbigdata-exercise","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flpsouza%2Fbigdata-exercise/lists"}