# Nap
Nap: Network-Aware Data Partitions for Efficient Distributed Processing

Data partition _modification_ which considers the _network_ (nodes' downlinks) for a shorter multiway join in _Hadoop_.

## Table of Contents
* [Introduction](#introduction)
* [Features](#features)
* [Technologies](#technologies)
* [Usage](#usage)
   * [Running Hadoop Multi Node Cluster (Step by Step)](#running-hadoop-multi-node-cluster-step-by-step)
     * [Commands Before Running Daemons](#commands-before-running-daemons)
     * [Setting Cluster Configuration Files](#setting-cluster-configuration-files)
     * [Commands for Running Daemons](#commands-for-running-daemons)
   * [Running Job With and Without Network Awareness](#running-job-with-and-without-network-awareness)
   * [How to Test Cluster Links?](#how-to-test-cluster-links)
   * [How to Collect the Job's Parameters and Make Figures?](#how-to-collect-the-jobs-parameters-and-make-figures)
   * [Useful Commands](#useful-commands)
* [Documentation](#documentation)
   * [Job Modification and Partitioner Class (Java)](#job-modification-and-partitioner-class-java)
     * [Job Modification](#job-modification)
     * [Partitioner Class](#partitioner-class)
   * [How to Run Wonder Shaper?](#how-to-run-wonder-shaper)
   * [How to Access Daemons URIs?](#how-to-access-daemons-uris)
* [Sources](#sources)
* [Contact](#contact)

## Introduction
It is well known that in many distributed processing frameworks the __network__ greatly influences the task finish time, and it can even be the _bottleneck_ of the task or of the larger job. __Apache Hadoop__ (as well as _Apache Spark_), a widely accepted framework in the Big Data industry, does not take the _network_ topology into account when running a job. The problem lies in __*Apache YARN*__, which is in charge of allocating containers (map, reduce, and AM) between and inside jobs. It makes the basic assumption that processing is performed on homogeneous computers; when they are heterogeneous, system performance is inefficient. Thus, we have modified Hadoop for a better partition of the data in the network, minimizing the job's completion time.
This file includes all the information needed to use the code and reproduce the experiments from the [article](#Contact), along with more useful functionality and code from my research with a _Hadoop Multi Node cluster_.

## Features
This repository includes:
+ DownUp_SpeedTest - Tests downlink and uplink between nodes (see [How to Test Cluster Links?](#how-to-test-cluster-links))
+ hadoop - A network-aware compiled version of Hadoop 2.9.1 (without the source code), the output of compiling the project [Nap-Hadoop-2.9.1](https://github.com/razo7/Nap-Hadoop-2.9.1).
+ input - txt files for the WordCount example or a three-table (multiway) join.
+ mav-had - Maven Eclipse folder with a POM file and the relevant code under `Nap\mav-had\src\main\java\run` (see [Job Modification and Partitioner Class (Java)](#job-modification-and-partitioner-class-java))
  + `AcmMJ.java` - Old implementation of the multiway join (without secondary sort)
  + `AcmMJSort.java` - Current implementation of the multiway join (with secondary sort), __use this one__.
  + `Nap.java` - Algorithm _Nap_ from the paper in Java (it also normalizes the links).
  + `WordCountOR.java` - Old WordCount example.
  + `WordCountOR2.java` - Current WordCount example, __use this one__.
  + Other Java files under `Nap\mav-had\src\main\java` are old examples.
+ jobHistory (see [Job History](#how-to-collect-the-jobs-parameters-and-make-figures)) -
   + Python code for parsing the job history server with its REST API, `downJobHistory.py`.
   + .xlsx files with the counters information of the jobs.
   + Wolfram Mathematica code `HadoopJobStatistics3.nb` and .xlsx files for making figures.
   + An example of the counters for job_1544370435610_006 from the Job History using the REST API (see the URL), `JobHistory Counters Example.PNG`
+ .gitattributes - Git LFS file
+ ec2-proj2.pem - RSA key for EC2
+ LICENSE - Apache License 2.0
+ README.md - This README file
+ Security Group Example.jpg - An example of an AWS inbound security-group policy.

## Technologies
* Hadoop 2.9.1
* Java 8
* Python 2.7
* Wolfram Mathematica 11.0
* Git 2.20.1 and Git LFS 2.6.1

## Usage

### Running Hadoop Multi Node Cluster (Step by Step)
This was tested on two different clusters:
   + Weak cluster - three computers, each with an Intel Core 2 Duo (2 cores), 4 GB RAM, a 150 GB disk, and Ubuntu 16.04 LTS.
   + Strong cluster (AWS) - four t2.xlarge instances, each with 16 GB RAM, 4 vCPUs, a 100 GB disk, a 10 Gbps link, and Ubuntu 14.04 LTS.

   One of the instances is the master for the NameNode (NN) and Resource Manager (RM), and the rest are three slaves for the DataNode (DN) and Node Manager (NM).
#### Commands Before Running Daemons
All of these commands are recommended when setting up a new computer, such as a new cluster on AWS; otherwise, just do the _numbered_ steps and make sure the rest is covered.

+ USER (Skip)- Create a user, _hadoop2_, and grant it permissions; the name of this user will also be used as the HDFS user.
```
sudo useradd -m hadoop2
sudo passwd hadoop2

sudo usermod -aG sudo hadoop2	#Add user 'hadoop2' to the sudo group
su hadoop2

cd
```
+ GIT (Optional, to use Git)- For uploading big files we need git-lfs
```
sudo add-apt-repository ppa:git-core/ppa
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install

git config --global credential.helper cache
git clone https://github.com/razo7/Nap.git
```
+ IPV6 (Optional)- Disable IPv6; append the following to the end of the file
```
sudo nano /etc/sysctl.conf
	#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
```
1. JAVA- Download Java 8
```
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
```
2. 
ENV- Set environment variables (use .bashrc or use export each time; to find the JRE run `sudo update-alternatives --config java`)
```
sudo nano .bashrc
JAVA_HOME="/usr/lib/jvm/java-8-oracle"
JRE_HOME="/usr/lib/jvm/java-8-oracle/jre"
### HADOOP Variables ###
HADOOP_PREFIX="/usr/local/hadoop"
HADOOP_HOME="/usr/local/hadoop"
HADOOP_INSTALL="/usr/local/hadoop"
HADOOP_MAPRED_HOME="/usr/local/hadoop"
HADOOP_COMMON_HOME="/usr/local/hadoop"
HADOOP_HDFS_HOME="/usr/local/hadoop"
YARN_HOME="/usr/local/hadoop"
HADOOP_COMMON_LIB_NATIVE_DIR="/usr/local/hadoop/lib/native"
HADOOP_CONF="/usr/local/hadoop/etc/hadoop"
HDFS_COMM="/usr/local/hadoop/bin/hdfs"
HADOOP_USER_NAME="hadoop2"
COMPILED_HADOOP="/home/hadoop2/intelliji_workspace/hadoop-2.9.1-src/hadoop-dist/target/hadoop-2.9.1"
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"

. .bashrc
```
+ HADOOP (Optional)- Copy the hadoop directory from my GitHub repository (Nap)
```
sudo cp -r ~/Nap/hadoop /usr/local/
sudo chmod -R 777 $HADOOP_HOME #create the hadoop directory with permissions
sudo chown -R hadoop2:hadoop2 /usr/local/hadoop
```
+ TIME (Optional)- Choose a consistent timezone
```
echo "Asia/Jerusalem" | sudo tee /etc/timezone
sudo dpkg-reconfigure --frontend noninteractive tzdata
```
+ MySQL (Optional)- Create a database
```
mysql   -u  root   -p
root
CREATE DATABASE acm_ex;
```
+ PSSH (Optional)- Enables Parallel SSH
```
sudo apt install python-pip python-setuptools  #install PIP for PSSH – Parallel SSH
sudo apt-get install pssh
```
+ Hostname (Optional)- Update the hostname
``` sudo nano /etc/hostname ```
3. 
HOSTS- Update host names for easy ssh. The following is an example using public IPs (MASTER_NAME - master); *important* for using Hadoop.
```
sudo nano /etc/hosts #every node lists all the other nodes in the cluster by their name
127.0.0.1 localhost
23.22.43.90 master # for the master -> "172.31.24.83 master"
34.224.43.40 slave1
54.211.249.84 slave2
54.183.14.16 slave3
35.176.236.213 slave4
35.177.54.37 slave5
```
4. EXCHANGE KEYS- Set up passwordless ssh to all the slaves: create a key and exchange it, OR ssh and save the key when the terminal asks; *important* for using Hadoop.
```
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa # create an RSA key without a passphrase and save it to ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys # append it to ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys # restrict the file's permissions so not everybody can read it
ssh-copy-id -i ~/.ssh/id_rsa.pub slave1 # send the RSA key to all the slaves...
ssh-copy-id -i ~/.ssh/id_rsa.pub slave2
ssh-copy-id -i ~/.ssh/id_rsa.pub slave3
ssh-copy-id -i ~/.ssh/id_rsa.pub slave4
ssh-copy-id -i ~/.ssh/id_rsa.pub slave5
```
5. SLAVES FILE- Update $HADOOP_CONF/slaves; *important* for using Hadoop.
```
slave1
slave2
slave3
slave4
slave5
```
+ CLEANING (Optional)- Clean old files and copy from Git
```
sudo rm -rf $HADOOP_HOME/*
sudo cp -r Nap/hadoop/* $HADOOP_HOME
sudo mkdir -p $HADOOP_HOME/data/hadoop-data/nn $HADOOP_HOME/data/hadoop-data/snn $HADOOP_HOME/data/hadoop-data/dn $HADOOP_HOME/data/hadoop-data/mapred/system $HADOOP_HOME/data/hadoop-data/mapred/local # create directories for future logs if they do not exist
```
6. 
COPYING- Update HADOOP_CONF, copying from one node to the rest
```
parallel-ssh -h $HADOOP_CONF/slaves "rm -rf $HADOOP_HOME/*"
scp -r $HADOOP_HOME/* slave1:$HADOOP_HOME
scp -r $HADOOP_HOME/* slave2:$HADOOP_HOME
scp -r $HADOOP_HOME/* slave3:$HADOOP_HOME
scp -r $HADOOP_HOME/* slave4:$HADOOP_HOME
scp -r $HADOOP_HOME/* slave5:$HADOOP_HOME
parallel-ssh -h $HADOOP_CONF/slaves "chmod -R 777 $HADOOP_HOME"
parallel-ssh -h $HADOOP_CONF/slaves "chown -R hadoop2:hadoop2 /usr/local/hadoop"
```
+ UPDATE JAVA ENV (Optional)- You might need to update two environment variables used by Hadoop in Nap/hadoop/etc/hadoop/hadoop-env.sh
    + FOR ORACLE
    ```
    export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
    export HADOOP_HOME="/usr/local/hadoop"
    ```
    + FOR OPENJDK
    ```
    export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
    export HADOOP_HOME="/usr/local/hadoop"
    ```
+ SECURITY GROUP in AWS (Only for an AWS cluster)- Create a security group (per region) and change the inbound rules with an All-TRAFFIC rule for each machine in the cluster (elastic IP), as in the following example.
![Image](Security_Group_Example.jpg)

#### Setting Cluster Configuration Files
An example of modifying the cluster configuration files, `Nap\hadoop\etc\hdfs-site.xml`, `Nap\hadoop\etc\core-site.xml`, `Nap\hadoop\etc\mapred-site.xml`, and `Nap\hadoop\etc\yarn-site.xml`.
Change _MASTER_NAME_ to a name defined in `/etc/hosts`, i.e., master.
1. 
hdfs-site.xml
```
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Default block replication, up to the number of datanodes</description>
    </property>
    <property>
        <name>dfs.blocksize</name>
        <value>8388608</value>
        <description>Block size in bytes; the default is 128MB = 128 * 1024 * 1024. It was dfs.block.size</description>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/data/hadoop-data/nn</value>
        <description>Directory for storing metadata by the namenode</description>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/hadoop/data/hadoop-data/dn</value>
        <description>Directory for storing blocks by the datanode</description>
    </property>
    <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>file:///usr/local/hadoop/data/hadoop-data/snn</value>
    </property>

</configuration>
```
*Add the next two properties when running in EC2.*
```
    <property>
        <name>dfs.client.use.datanode.hostname</name>
        <value>true</value>
        <description>Whether clients should use datanode hostnames when connecting to datanodes.</description>
    </property>
    <property>
        <name>dfs.datanode.use.datanode.hostname</name>
        <value>true</value>
        <description>Whether datanodes should use datanode hostnames when connecting to other datanodes for data transfer.</description>
    </property>
```

2. core-site.xml
```
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://MASTER_NAME:9000</value>
        <description>Could also be port 54310.
        The name of the default file system.
        A URI whose scheme and authority determine the FileSystem implementation.
        The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class.
        The URI's authority is used to determine the host, port, etc. for a filesystem. It was "fs.default.name".
        </description>
    </property>
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>hadoop2</value>
        <description>The user name to filter as, on static web filters while rendering content. An example use is the HDFS web UI (user to be used for browsing files). Originally it was dr.who.</description>
    </property>

</configuration>
```
3. mapred-site.xml
```
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
        <description>The framework for running mapreduce jobs</description>
    </property>
    <property>
        <name>mapreduce.job.reduce.slowstart.completedmaps</name>
        <value>0.0</value>
        <description>Reducers start shuffling based on a threshold of the percentage of mappers that have finished; 0 -> no waiting, 1 -> every reducer waits.
        https://stackoverflow.com/questions/11672676/when-do-reduce-tasks-start-in-hadoop/11673808</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>MASTER_NAME:10020</value>
        <description>Hostname of the machine where the jobhistory service is started</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>MASTER_NAME:19888</value>
    </property>
    <!-- FOR 2GB Nodes -->
    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>512</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>512</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>512</value>
    </property>

</configuration>
```
4. yarn-site.xml
   The log aggregation is *highly recommended*
```
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>MASTER_NAME</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.nodemanager.remote-app-log-dir</name>
        <value>/app-logs</value>
    </property>
    <property>
        <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
        <value>logs</value>
    </property>
    <property>
        <name>yarn.log.server.url</name>
        <value>http://MASTER_NAME:19888/jobhistory/logs</value>
    </property>
    <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>/usr/local/hadoop/logs</value>
    </property>
    <!-- 2/5/6 (1536/4536/5536) GB Memory and 2 cores
-->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>1536</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>1536</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>128</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>2</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
        <description>Whether virtual memory limits will be enforced for containers. Disabling virtual-memory checking can prevent containers from failing to be allocated properly on JDK8.
        https://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits</description>
    </property>

</configuration>
```

#### Commands for Running Daemons

+  (Optional) Clean old temporary files in the cluster `parallel-ssh -h $HADOOP_CONF/slaves "rm -rf $HADOOP_HOME/data/hadoop-data/*"`
+  (Optional) Clean old temporary files locally `rm -rf $HADOOP_HOME/data/hadoop-data/*`
1. Format the namenode and the whole HDFS space `hdfs namenode -format`
+  (Optional) Run a jobhistory daemon for reviewing jobs `$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver`
2. Run all the daemons one after another, by the slaves' (nodes') names and their roles; a node can be used only for HDFS and not for YARN and MapReduce `start-all.sh`
3. Create a folder in HDFS, /user/hadoop2, under my username, hadoop2, which was also selected in hdfs-site.xml. `hdfs dfs -mkdir /user`
4. 
Create folders for input and output; these folders will be the home directory, the prefix, for input or output used by the Hadoop job you write (Java)
```
hdfs dfs -mkdir /user/hadoop2
hdfs dfs -mkdir /user/hadoop2/input
hdfs dfs -mkdir /user/hadoop2/output
```
### Running Job With and Without Network Awareness
Do one of the following when Hadoop is up and running in your system (you can do the [above](#running-hadoop-multi-node-cluster-step-by-step) steps).

Pattern for running my hadoop code: `hadoop jar MY_JAR PACKAGE_CLASS INPUT OUTPUT SPLIT_SIZE NUM_REDUCER DOWNLINK_VEC JOB1_NAME ROUNDS`

1. Multiway join (Java code) example of three tables: Papers, Papers-Authors, and Authors.
   + Add three tables (ACM rows: 1,818,114 -> X, 4,795,532 -> Y, 3,885,527 -> Z)

```
hdfs dfs -put ~/Nap/input/TextFile_MultiwayJoin/x_article.txt /user/hadoop2/input/x
hdfs dfs -put ~/Nap/input/TextFile_MultiwayJoin/y_article_author.txt /user/hadoop2/input/y
hdfs dfs -put ~/Nap/input/TextFile_MultiwayJoin/z_persons.txt /user/hadoop2/input/z
```

   + Run the multiway join with _network awareness_- Example of running the job on a cluster of master, slave1, slave3, and slave5, where slave1, slave3, and slave5 have respective downlink rates of 7, 6, and 6. There are also three reducers and 24 mappers, s1=s2=10 (shared variables for the join), the job name is _hashLoop-Full-24-3-1010-766-4_, and the job is run ten times (for running multiple jobs one after another)
   `hadoop jar ~/Nap/mav-had/target/mav-had-0.0.2-SNapSHOT.jar run.AcmMJSort input/x input/y input/z output/hashLoop-Full-24-3-1010-766-2 24 3 "10 10" "slave1 slave3 slave5" "7 6 6" "hashLoop-Full-24-3-1010-766-4" 10`

   + Run the multiway join without _network awareness_- Same as the last command but with a change in the output directory and zeroed downlink rates (0 0 0)
   `hadoop jar ~/Nap/mav-had/target/mav-had-0.0.2-SNapSHOT.jar run.AcmMJSort input/x input/y input/z output/hashLoop-Full-24-3-1010-000-2 24 3 "10 10" "slave1 slave3 slave5" "0 0 0" "hashLoop-Full-24-3-1010-000-4" 10`

2. WordCount example
   + Add a text file- Alice in Wonderland `hdfs dfs -put ~/Nap/input/TextFile_wc/alice.txt /user/hadoop2/input/alice`

   + Run WordCount with _network awareness_- Example of running the job on a cluster of master, slave1, and slave2 nodes with respective downlink rates of 25, 10, and 20. There are also nine reducers and five mappers, the job name is _alice-000_, and the job is run twice (for running multiple jobs one after another) `hadoop jar ~/AMJ/mav-had/target/mav-had-0.0.2-SNAPSHOT.jar run.WordCountOR2 input/alice output/alice-251020-9-1 5 9 "master slave1 slave2" "25 10 20" "alice-000" 2`

   + Run WordCount without _network awareness_- Same as the last command with a change in the output directory and the downlink rates (0 0 0) `hadoop jar ~/AMJ/mav-had/target/mav-had-0.0.2-SNAPSHOT.jar run.WordCountOR2 input/alice output/alice-000-9-1 5 9 "master slave1 slave2" "0 0 0" "alice-000" 2`


### How to Test Cluster Links?
Under the _DownUp_SpeedTest_ directory there are a few options
1. 
Check the average downlink results automatically by running, from one node (master), `./testClusterD.sh slave1 slave2 slave3 slave4 slave5` or `bash testClusterD.sh slave1 slave2 slave3 slave4 slave5` when there are five more nodes in the cluster.
This creates a 50 MB file (the size can be adjusted) that is sent between each pair of nodes in the cluster using the _testNodeD.sh_ code.

   The _testClusterD.sh_ script runs _testNodeD.sh_ in a loop over the slaves given on the CLI (`slave1 slave2 slave3 slave4 slave5`) and at the end writes the average downlink for each node to the terminal as well as to the `downLink.txt` file.
   In addition, for use in Hadoop we care about the ratios between the links, so we take the normalized downlinks instead of the original results from the test.

   _testNodeD.sh_ runs on one node, which receives (using _scp_) a file from each of the other nodes (the master sends it this list of nodes).
   The node receives the same file four times (rounds) and saves the average of these results.
   At the end, the node writes to the _resFile-file_ file the average downlink from each node in the cluster and the average over all the nodes.

2. Testing downlink and uplink between a pair of nodes using `downlinkSpeed-test.sh`, `scp-speed-test.sh`, or `speedtest-cli`.

### How to Collect the Job's Parameters and Make Figures?
Run the JobHistory server `$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver` and define it also in yarn-site.xml; then you can browse to http://MASTER_IP_ADDRESS:19888 (check AWS connectivity and group policies) and see all the information regarding all the jobs of the currently running cluster, for example the counters ![Image](jobHistory/JobHistory_Counters_Example.PNG)

Before running the `Nap/jobHistory/downJobHistory.py` Python code, please install _selenium_ and _geckodriver_ using _Anaconda_
 ```
 conda install -c conda-forge selenium
 conda install -c conda-forge geckodriver
 ```

Using `Nap/jobHistory/downJobHistory.py` we can connect to the node running the Job History daemon (23.22.43.90) with cluster id _1548932257169_ and parse into xlsx files (with the prefix _jobsHistory__) the counters information we want for all the jobs we need.
For making figures such as those in the article we published, you can use `HadoopJobStatistics3.nb` and run it in the same directory as the xlsx files.

### Useful Commands

#### HDFS and Hadoop

1. JAR - Build a jar file from the current directory using the POM file `mvn clean install`
2. Initialize HDFS - Format the filesystem and clear the metadata tree; the files are lost to the master but are not deleted `hdfs namenode -format`
3. Start/Stop HDFS - Start or stop HDFS (NameNode, DataNode, SecondaryNameNode) `start-dfs.sh 		or sbin/stop-dfs.sh`
4. Start/Stop YARN - Start or stop YARN (ResourceManager and NodeManager daemons) `start-yarn.sh 		or sbin/stop-yarn.sh`
5. Start/Stop history - Start/stop the job history server on the MASTER_NAME node
```
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
```
6. 
Upload HDFS - Upload the directory DIRECTORY (not a single file) to HDFS, where it will appear as "input" `hdfs dfs -put DIRECTORY input`
7. Download HDFS - Download files `hadoop fs -get /hdfs/source/path /localfs/destination/path`
8. Delete HDFS - Delete a directory from HDFS `hadoop fs -rm -r /user/hadoop2/output/*`
9. HDFS permissions - Change the permissions of all the files and folders under /app-logs in HDFS `hdfs dfs -chmod -R 755 /app-logs`
10. Seeking - Search in the /user/hadoop2 HDFS directory `hadoop fs -ls output/book`
11. Show HDFS - Display the content of the part-r-00000 output `hadoop fs -cat output/book/part-r-00000`
12. HDFS space - Measure the space occupied on the local node or in HDFS
```
df -h
hadoop fs -du -h
```
13. Kill app - Kill an application by its ID, APP_ID `yarn application -kill APP_ID`
14. List apps - List applications from the RM `yarn application -list -appStates ALL`
15. Clean old data - Clean version files after a namenode format; fixes some unknown errors of inconsistent cluster IDs `rm -rf $HADOOP_HOME/data/hadoop-data/*`
16. Clean logs - Clean log files `rm -rf $HADOOP_HOME/logs/*`
17. Clean userlogs - Clean userlog files (stdout, stderr) `rm -rf $HADOOP_HOME/logs/userlogs/*`
18. Delete old files - Delete old directories, logs, temporary files, and input/output
```
hadoop fs -rm -r /user/*
hadoop fs -rm -r /app-logs/*
hadoop fs -rm -r /tmp/*
```
19. Updating - Update mapred-site and yarn-site with the new MB requirements for the containers or other YARN parameters
```
parallel-scp -h $HADOOP_CONF/slaves $HADOOP_CONF/mapred-site.xml $HADOOP_CONF/mapred-site.xml
parallel-scp -h $HADOOP_CONF/slaves $HADOOP_CONF/yarn-site.xml $HADOOP_CONF/yarn-site.xml
```
20. Safemode - Leave safemode `hdfs dfsadmin -safemode leave`
21. Watching - Check for DFS changes from the terminal; used when I got an error of no disk space left due to large mid-processing data
 ``` watch "hdfs dfsadmin -report | grep 'DFS Remaining'" ```

#### Linux and More

22. NCDC - Download the NCDC dataset `sudo bash ncdc.sh START_YEAR END_YEAR`
23. Permissions - Change permissions, to read and write to the directory and its subdirectories `sudo chmod -R ugo+rw FOLDER`
24. Time - Set the time `sudo date -s "03 OCT 2018 13:41:00"`

## Documentation


### Job Modification and Partitioner Class (Java)
Here I relate mostly to `Nap\mav-had\src\main\java\run\AcmMJSort.java`, where there are three options for the _Reducer class_ (IndexReduceOneLoop, HashReduceOneLoop, and SQLReduce) in the code; _HashReduceOneLoop_ has been chosen as it has proven to be the fastest.

To adapt Hadoop (YARN) to the network, we can either optimize the containers' assignment/placement (which is hard) or use the default assignment and change the data partitioning between the containers, __*when we know all the locations*__.
For that, we use the modified Hadoop code (the hadoop directory in Nap), which writes the mappers' and reducers' locations to HDFS; we reduce the waiting time for the shuffle to its minimum (zero seconds, via `mapreduce.job.reduce.slowstart.completedmaps` in _mapred-site.xml_); and we run a new Partitioner class that assigns the data (map output tuples) to the __\"right\"__ reducers.\n\n#### Job Modification\n\nI have managed to control the number of mappers (by manipulating the split size, see the _getSplitSize_ function) and to spread the mappers evenly across the nodes by changing the number of container allocations per heartbeat (the `yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled` and `yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments` fields in `hadoop/etc/hadoop/capacity-scheduler.xml`).\nFor more, see my [thread](https://stackoverflow.com/questions/54056970/how-to-suggest-a-more-balanced-allocation-of-containers-in-hadoop-cluster/54132756#54132756) on Stack Overflow.\nAn example:\n```\n  \u003cproperty\u003e\n    \u003cname\u003eyarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled\u003c/name\u003e\n    \u003cvalue\u003etrue\u003c/value\u003e\n    \u003cdescription\u003e\n        Whether to allow multiple container assignments in one NodeManager\n        heartbeat. Defaults to true.\n    \u003c/description\u003e\n  \u003c/property\u003e\n  \u003cproperty\u003e\n    \u003cname\u003eyarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments\u003c/name\u003e\n    \u003cvalue\u003e7\u003c/value\u003e\n    \u003cdescription\u003e\n        If multiple-assignments-enabled is true, the maximum amount of\n        containers that can be assigned in one NodeManager heartbeat. 
Defaults to\n        -1, which sets no limit.\n    \u003c/description\u003e\n  \u003c/property\u003e\n```\n\n#### Partitioner Class\n\nWe override the _Partitioner class_ with a _getPartition_ method that defines the partition number (reducer number) for each output tuple of the mapper. In order to use the locations from HDFS, we need a _Configuration_ structure in the Partitioner class; thus our Partitioner class implements `org.apache.hadoop.conf.Configurable`, and we can override the `public void setConf(Configuration conf)` and `public Configuration getConf()` functions.\nIn setConf we connect to HDFS to read the containers' locations and the downlink rates that we have as an input to the Hadoop job. Then, we save them in _private static_ variables for the _getPartition_ function, which assigns each tuple according to those downlink rates.\n_getPartition_ first chooses a node based on the downlink rates and the tuple's key (via a hash function).\nThen, knowing the node, it uses the tuple's key again to choose a reducer uniformly among the containers running on that node; hence _getPartition_ does not simply partition the data uniformly between the reducers. 
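The two-step choice above (downlink-weighted node selection, then a uniform pick among that node's reducers) can be sketched in plain Java without the Hadoop dependencies. This is only an illustrative sketch, not the actual Nap implementation: the class name, the three-node downlink shares, and the reducer placement below are made-up values, whereas the real code reads the locations and rates from HDFS inside setConf.

```java
import java.util.Arrays;

// Illustrative sketch of the two-step tuple assignment; all values here
// are hypothetical (the real Partitioner loads them from HDFS in setConf).
public class NapPartitionSketch {
    // Cumulative downlink shares for three nodes: node 0 should receive
    // ~50% of the tuples, node 1 ~30%, node 2 ~20%.
    static final double[] CUM_DOWNLINK = {0.5, 0.8, 1.0};
    // Partition (reducer) IDs of the containers running on each node.
    static final int[][] REDUCERS_PER_NODE = {{0, 1}, {2}, {3}};

    static int getPartition(String key) {
        int h = key.hashCode() & Integer.MAX_VALUE; // non-negative hash
        // Step 1: map the hash to [0, 1) and pick a node in proportion
        // to its downlink share.
        double u = (h % 1000) / 1000.0;
        int node = 0;
        while (u >= CUM_DOWNLINK[node]) node++;
        // Step 2: reuse the hash to choose uniformly among the reducers
        // running on the chosen node.
        int[] reducers = REDUCERS_PER_NODE[node];
        return reducers[h % reducers.length];
    }

    public static void main(String[] args) {
        // Count how many of 100,000 synthetic keys land on each partition;
        // partitions 0+1 (node 0) should together get roughly half of them.
        int[] counts = new int[4];
        for (int i = 0; i < 100_000; i++) counts[getPartition("key" + i)]++;
        System.out.println(Arrays.toString(counts));
    }
}
```

Note that the skew is deliberate: a reducer behind a fast downlink is handed a larger share of the map output, which is exactly why this partitioner is not uniform.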
\n\n### How to Run Wonder Shaper?\nInstall it from [here](https://github.com/magnific0/wondershaper); then you can run wondershaper on interface eth0 and limit the downlink to 500024 Kbps\n``` sudo wondershaper -a eth0 -d 500024 ```\n\n### How to Access the Daemons' URIs?\nWhen running on EC2, you need to be on the same LAN to access the node, and specifically these ports:\n\n+ Name_Node_URL: http://MASTER_IP_ADDRESS:50070\n+ YARN_URL: http://MASTER_IP_ADDRESS:8088\n+ Job_History: http://MASTER_IP_ADDRESS:19888\n+ Secondary_Name_Node_URL: http://MASTER_IP_ADDRESS:50090/\n+ Data_Node_1: http://SLAVE_1_IP_ADDRESS:50075/\n\n## Sources\n+ Alec Jacobson alecjacobsonATgmailDOTcom [scp-speed-test.sh](https://www.alecjacobson.com/weblog/?p=635)\n+ Nap-Hadoop-2.9.1 [repository](https://github.com/razo7/Nap-Hadoop-2.9.1) with the Hadoop source code and the network-aware changes\n+ Installing Git LFS from [here](https://github.com/git-lfs/git-lfs/wiki/Installation)\n+ Alice's Adventures in Wonderland by Lewis Carroll [text file](http://www.gutenberg.org/ebooks/11?msg=welcome_stranger)\n+ Install Java 8 - [Link1](https://tecadmin.net/install-oracle-java-8-ubuntu-via-ppa/), [Link2](https://stackoverflow.com/questions/43587635/dpkg-error-processing-package-oracle-java8-installer-configure), and [Link3](https://askubuntu.com/questions/84483/how-to-completely-uninstall-java)\n+ Hadoop counters explained and Apache documentation - [Link1](https://www.coding-daddy.xyz/node/8) and [Link2](https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html)\n+ Set the time zone in EC2 - [Link1](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html) and [Link2](https://stackoverflow.com/questions/11931566/how-to-set-the-time-zone-in-amazon-ec2)\n+ NCDC dataset from [here](https://gist.github.com/Alexander-Ignatyev/6478289)\n\n## Contact\nCreated by Or Raz (razo7) as part of his master's thesis work, which was partly published in the 
following [article](https://ieeexplore.ieee.org/abstract/document/8935013) from NCA '19 (IEEE) - feel free to contact me on [LinkedIn](https://www.linkedin.com/in/or-raz/) or by email (razo@post.bgu.ac.il)!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frazo7%2Fnap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frazo7%2Fnap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frazo7%2Fnap/lists"}