{"id":19766189,"url":"https://github.com/akaliutau/hadoop-cluster","last_synced_at":"2026-05-14T01:40:45.537Z","repository":{"id":120056035,"uuid":"368464412","full_name":"akaliutau/hadoop-cluster","owner":"akaliutau","description":"Batch data processing on the dockerized Hadoop cluster","archived":false,"fork":false,"pushed_at":"2021-07-09T10:07:02.000Z","size":96,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-15T15:55:36.845Z","etag":null,"topics":["batch-processing","hadoop-cluster","hdf5","hdfs","java","mapreduce"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/akaliutau.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-05-18T08:58:15.000Z","updated_at":"2023-04-05T16:57:36.000Z","dependencies_parsed_at":"2023-10-01T15:09:48.649Z","dependency_job_id":null,"html_url":"https://github.com/akaliutau/hadoop-cluster","commit_stats":{"total_commits":3,"total_committers":1,"mean_commits":3.0,"dds":0.0,"last_synced_commit":"6146b1a7dcfe5b6acdad3022ae67f57c4738c4db"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/akaliutau/hadoop-cluster","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akaliutau%2Fhadoop-cluster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akaliutau%2Fhadoop-cluster/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akaliutau%2Fhadoop-cluster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/akaliutau%2Fhadoop-cluster/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/akaliutau","download_url":"https://codeload.github.com/akaliutau/hadoop-cluster/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akaliutau%2Fhadoop-cluster/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33006800,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-13T13:14:54.681Z","status":"ssl_error","status_checked_at":"2026-05-13T13:14:51.610Z","response_time":115,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["batch-processing","hadoop-cluster","hdf5","hdfs","java","mapreduce"],"created_at":"2024-11-12T04:22:59.927Z","updated_at":"2026-05-14T01:40:45.497Z","avatar_url":"https://github.com/akaliutau.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"About\n=======\n\nThis project covers the basics of batch data processing on a dockerized Hadoop cluster.\n\nThe sample MapReduce app is the classic word count: it collects statistics about the words in an input file.\n\n\nCluster installation and running\n=================================\n\nBuild the cluster using the docker compose file /hadoop-cluster/docker-compose.yml:\n\n```\ndocker-compose -f docker-compose.yml up -d\n```\n\nMake sure all containers are up and running:\n\n```\ndocker container list\n\nCONTAINER 
ID   IMAGE                                                    COMMAND                  CREATED         STATUS                   PORTS                                            NAMES\n818989699cf1   bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8          \"/entrypoint.sh /run\"   2 minutes ago   Up 2 minutes (healthy)   0.0.0.0:9000-\u003e9000/tcp, 0.0.0.0:9870-\u003e9870/tcp   namenode\n06865eccae10   bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8          \"/entrypoint.sh /run\"   2 minutes ago   Up 2 minutes (healthy)   9864/tcp                                         datanode\n85998aa7c898   bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8       \"/entrypoint.sh /run\"   2 minutes ago   Up 2 minutes (healthy)   8042/tcp                                         nodemanager\na4a1c1cba4e7   bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8     \"/entrypoint.sh /run\"   2 minutes ago   Up 2 minutes (healthy)   8188/tcp                                         historyserver\n1f6373828e94   bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8   \"/entrypoint.sh /run\"   2 minutes ago   Up 2 minutes (healthy)   8088/tcp                                         resourcemanager\n```\n\nOpen the namenode UI to check the status of the Hadoop cluster:\n\n```\nhttp://localhost:9870/dfshealth.html#tab-overview\n```\nContainers can be stopped using the following command:\n\n```\ndocker-compose -f docker-compose.yml down\n```\n\nRunning MapReduce app on the cluster\n=====================================\n\nBuild the project:\n\n```\nmvn clean package\n```\n\nFirst, copy the input file and the compiled jar to the namenode container (which acts as the orchestrator here):\n\n```\ndocker cp ./input/sonet104.txt namenode:/tmp\ndocker cp ./target/hadoop-wordcounter-1.0.jar namenode:/tmp\n```\n\nConnect a bash session to the namenode container:\n\n```\ndocker exec -it namenode /bin/bash\n\nroot@a6af3563e823:/# ls /tmp -sh\ntotal 76K\n4.0K hadoop-root-namenode.pid    4.0K 
hsperfdata_root                                        4.0K sonet104.txt\n 60K hadoop-wordcounter-1.0.jar  4.0K jetty-0.0.0.0-9870-hdfs-_-any-7028729040041975610.dir\n```\n\nCopy all necessary files to HDFS, which stores them on the datanode(s):\n\n```\n# List all directories in the HDFS root \"/\".\nhdfs dfs -ls /\n\n# Create a new directory in HDFS with -mkdir.\nhdfs dfs -mkdir -p /user/root\n\n# Copy the files to the input path in HDFS.\nhdfs dfs -put /tmp/sonet104.txt /user/root/sonet104.txt\n\n# Take a look at the content of your input file.\nhdfs dfs -cat /user/root/sonet104.txt\n```\n\nThe third command (`hdfs dfs -put`) produces the following output, which shows that Hadoop transferred the data to the datanode over the network:\n\n```\n2021-05-17 10:56:30,757 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false\n```\n\nFinally, run the MapReduce app as a Hadoop job:\n\n```\n# Run the MapReduce job using the jar file copied to /tmp.\nhadoop jar /tmp/hadoop-wordcounter-1.0.jar hadoop.mr.wordcount.WordCount /user/root/sonet104.txt /user/root/sonet104-output\n```\n\nIf the run was successful, the output should be similar to this one:\n\n```\n39,322 INFO client.RMProxy: Connecting to ResourceManager at resourcemanager/172.20.0.3:8032\n39,471 INFO client.AHSProxy: Connecting to Application History server at historyserver/172.20.0.2:10200\n39,498 INFO client.RMProxy: Connecting to ResourceManager at resourcemanager/172.20.0.3:8032\n39,499 INFO client.AHSProxy: Connecting to Application History server at historyserver/172.20.0.2:10200\n39,644 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. 
Implement the Tool interface and execute your application with ToolRunner to remedy this.\n39,660 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1621246640599_0001\n39,755 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false\n39,847 INFO mapred.FileInputFormat: Total input files to process : 1\n39,877 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false\n40,303 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false\n40,311 INFO mapreduce.JobSubmitter: number of splits:2\n40,416 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false\n40,434 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1621246640599_0001\n40,434 INFO mapreduce.JobSubmitter: Executing with tokens: []\n40,609 INFO conf.Configuration: resource-types.xml not found\n40,609 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.\n41,043 INFO impl.YarnClientImpl: Submitted application application_1621246640599_0001\n41,098 INFO mapreduce.Job: The url to track the job: http://resourcemanager:8088/proxy/application_1621246640599_0001/\n41,101 INFO mapreduce.Job: Running job: job_1621246640599_0001\n47,198 INFO mapreduce.Job: Job job_1621246640599_0001 running in uber mode : false\n47,199 INFO mapreduce.Job:  map 0% reduce 0%\n52,251 INFO mapreduce.Job:  map 50% reduce 0%\n53,258 INFO mapreduce.Job:  map 100% reduce 0%\n57,290 INFO mapreduce.Job:  map 100% reduce 100%\n57,298 INFO mapreduce.Job: Job job_1621246640599_0001 completed successfully\n57,380 INFO mapreduce.Job: Counters: 54\n        File System Counters\n                FILE: Number of bytes read=523\n                FILE: Number of bytes written=689698\n                Peak Reduce Virtual memory (bytes)=8452407296\n    
            \n                ...\n                \n        Shuffle Errors\n                BAD_ID=0\n                CONNECTION=0\n                IO_ERROR=0\n                WRONG_LENGTH=0\n                WRONG_MAP=0\n                WRONG_REDUCE=0\n        File Input Format Counters\n                Bytes Read=983\n        File Output Format Counters\n                Bytes Written=738\n```\n\nCheck the output file created by the app:\n\n```\nhdfs dfs -ls -h /user/root/sonet104-output\n\n-rw-r--r--   3 root supergroup        738 2021-05-17 11:02 /user/root/sonet104-output/part-00000\n\nhdfs dfs -cat /user/root/sonet104-output/part-00000\n\n2021-05-17 11:09:57,462 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false\n\nAh,     1\nApril   1\nEre     1\nFor     2\nHath    1\n...\n```\n\nCheck the status of completed jobs:\n\n```\nmapred job -list all\n\n51,194 INFO client.RMProxy: Connecting to ResourceManager at resourcemanager/172.21.0.4:8032\n51,352 INFO client.AHSProxy: Connecting to Application History server at historyserver/172.21.0.6:10200\n51,827 INFO conf.Configuration: resource-types.xml not found\n51,827 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.\n\nTotal jobs:2\nJobId       JobName     State         StartTime      UserName           Queue      Priority       UsedContainers   RsvdContainers  UsedMem   RsvdMem   NeededMem         AM info\njob_1621246640599_0001            wordcount     SUCCEEDED       1621249360737          root         default       DEFAULT     N/A              N/A      N/A             N/A               N/A      http://resourcemanager:8088/proxy/application_1621246640599_0001/\njob_1621254046819_0001            wordcount     SUCCEEDED       1621254310556          root         default       DEFAULT    N/A              N/A      N/A             N/A               N/A      
http://resourcemanager:8088/proxy/application_1621254046819_0001/\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakaliutau%2Fhadoop-cluster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fakaliutau%2Fhadoop-cluster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakaliutau%2Fhadoop-cluster/lists"}