# Hadoop-CheatSheet 🐘

A cheatsheet to get you started with Hadoop

But the question is: why should we learn Hadoop? How will it make our lives easier?

Read till the end to know more.

Happy learning 👩‍🎓

## Index Of Contents
1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Configuration](#configuration)

   i) [NameNode](#namenode)

   ii) [DataNode](#datanode)

   iii) [ClientNode](#clientnode)
4. [GUI](#gui)
5. [Frequently Asked Questions](#faqs)
6. [Testing](#testing)
7. [Contributing](#contributions)

   i) [Contribution Practices](#contribution-practices)

   ii) [Pull Request Process](#pull-request-process)

   iii) [Branch Policy](#branch-policy)
8. [Cool Links to Checkout](#cool-links-to-checkout)
9. [License](#license)
10. [Contact](#contact)

## Introduction

The simple answer to the question above is: to store data. Which raises the next question: when we already have databases and drive storage, why should we use Hadoop?

TO STORE BIG DATA

So what is Big Data?
An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records about millions of people, all coming from different sources (e.g. the web, sales, customer contact centers, social media, mobile data and so on).

To store that much data we use the concept of a DISTRIBUTED STORAGE CLUSTER, and Apache Hadoop is the framework we use to implement it.



## Installation
(For a setup with 1 master node, multiple slave nodes and multiple client nodes)

**For Master, Slave and Client Nodes** (the commands below are for RedHat/RHEL):
```
# Install the Java JDK, as Hadoop depends on it
wget https://www.oracle.com/webapps/redirect/signon?nexturl=https://download.oracle.com/otn/java/jdk/8u171-b11/512cd62ec5174c3487ac17c61aaa89e8/jdk-8u171-linux-x64.rpm
rpm -i -v -h jdk-8u171-linux-x64.rpm

# Install Apache Hadoop
wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1-1.x86_64.rpm
rpm -i -v -h hadoop-1.2.1-1.x86_64.rpm --force

# Verify that both are installed correctly
java -version
hadoop version
```

![Preview Image](./assets/installing.PNG)
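The Oracle link above goes through a sign-in redirect, so the JDK RPM may need to be downloaded from a browser instead. Either way, a quick sanity check that both RPMs are registered with the package manager (a minimal sketch; exact package names depend on the versions you installed):

```
# List the installed JDK and Hadoop packages (names vary with the versions you installed)
rpm -qa | grep -i -E 'jdk|hadoop'

# Confirm the binaries are on the PATH
which java hadoop
```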

## Configuration
### NameNode
(NameNode is also called Master Node)
```
mkdir /nn
vim /etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://MasterIP:PortNo</value>
  </property>
</configuration>

vim /etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
```

*(Screenshot: the configured files)*
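If you prefer not to edit the files by hand, the same configuration can be written with a heredoc. A minimal sketch, assuming the master's IP is 192.168.1.100 and the port is 9001 (both are placeholders; substitute your own values):

```
# Write both NameNode configuration files in one go
# (192.168.1.100 and 9001 are placeholders for MasterIP and PortNo)
cat > /etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.100:9001</value>
  </property>
</configuration>
EOF

cat > /etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
EOF
```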
**Note:** Check that the port number you assigned is free; if not, change the port number in core-site.xml.

Then we have to format the /nn directory of the NameNode:
``` hadoop namenode -format ```


```
jps
netstat -tnlp
```
We see that the process has not yet started and the assigned port is free


Then we will have to start the service:
```
hadoop-daemon.sh start namenode
jps
netstat -tnlp
```
We see that the process has started and the port is assigned

To view the number of slave nodes connected:
```hadoop dfsadmin -report```

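If `jps` does not show the NameNode after starting it, the daemon log is the first place to look. A minimal sketch, assuming the 1.2.1 RPM writes its logs under /var/log/hadoop (the log directory may differ on your system):

```
# Stop the daemon if it is in a bad state, then inspect the log before retrying
hadoop-daemon.sh stop namenode

# Tail the NameNode log (/var/log/hadoop is an assumption for this RPM layout)
tail -n 50 /var/log/hadoop/hadoop-*-namenode-*.log
```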

### DataNode
(DataNode is also called Slave Node)

```
vim /etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://MasterIP:PortNo</value>
  </property>
</configuration>

mkdir /dn1
vim /etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn1</value>
  </property>
</configuration>
```
*(Screenshot: the configured files)*
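Every slave node needs the same two files, so once one DataNode is configured you can copy its configuration to the remaining slaves instead of editing each one by hand (a sketch; `slave2` is a placeholder hostname):

```
# Push the HDFS configuration from this slave to another one (slave2 is a placeholder)
scp /etc/hadoop/core-site.xml /etc/hadoop/hdfs-site.xml root@slave2:/etc/hadoop/

# Create the data directory on the other slave as well
ssh root@slave2 'mkdir -p /dn1'
```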

Then we have to start the service.
If you are doing the setup locally using VMs, make sure the firewall is stopped on the master node.
To check:
```
systemctl status firewalld
# If it is active, stop it, or disable it (if you don't want it to start again after a reboot)
systemctl stop firewalld
systemctl disable firewalld
```
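If you would rather not disable the firewall entirely, you can open just the ports the cluster needs. A sketch, assuming the NameNode port in core-site.xml is 9001 (substitute whatever port you chose):

```
# Open the NameNode RPC port (9001 is a placeholder for the PortNo in core-site.xml)
firewall-cmd --permanent --add-port=9001/tcp

# Open the NameNode web UI port used in the GUI section below
firewall-cmd --permanent --add-port=50070/tcp

# Apply the changes
firewall-cmd --reload
```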

```
hadoop-daemon.sh start datanode
jps
```
We see that the process has started.

To view the number of slave nodes connected:

```hadoop dfsadmin -report```

### ClientNode

```
vim /etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://MasterIP:PortNo</value>
  </property>
</configuration>

# To see how many files there are in the HDFS storage
hadoop fs -ls /
# To add a file: type its contents, press Ctrl+D to finish, then upload it
cat > /file1.txt
Hi I am the first file
hadoop fs -put /file1.txt /
# To read the contents of the file
hadoop fs -cat /file1.txt
# To count the directories, files and bytes under a path
hadoop fs -count /file1.txt
# To create a directory
hadoop fs -mkdir /textfiles
# To create a blank file on the fly
hadoop fs -touchz /my.txt
# To move a file (source ➡ destination)
hadoop fs -mv /lw.txt /textfiles
# To copy a file (source ➡ destination)
hadoop fs -cp /file1.txt /textfiles
# To remove a file
hadoop fs -rm /file1.txt
# To check out and explore all the available options
hadoop fs
```
*(Screenshots of the above commands)*
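Putting a few of these commands together, a typical round trip from the client node might look like this (a sketch; the file and directory names are only illustrative):

```
# Create a small local file and upload it into an HDFS directory
echo "hello from the client node" > /tmp/hello.txt
hadoop fs -mkdir /demo
hadoop fs -put /tmp/hello.txt /demo/

# Read it back from HDFS, then download a local copy
hadoop fs -cat /demo/hello.txt
hadoop fs -get /demo/hello.txt /tmp/hello-copy.txt
```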

## GUI
We can also visualize the cluster using the web GUI:
- NameNode: `http://MasterIP:50070`
- DataNode: `http://SlaveIP:50075`
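You can also check from the command line that both web UIs are reachable (a sketch; MasterIP and SlaveIP are the same placeholders as above):

```
# Each request should print an HTTP status code (200, or a redirect) if the UI is up
curl -s -o /dev/null -w "%{http_code}\n" http://MasterIP:50070/
curl -s -o /dev/null -w "%{http_code}\n" http://SlaveIP:50075/
```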
The uploaded files can be browsed through this GUI as well.

We can see that when a file is small, it occupies only a single block.
We can check the size of the name.txt file like this:
```
# To see the permissions as well as the size of the file in bytes
ls -l name.txt
# The same, but with a human-readable file size
ls -l -h name.txt
```
The default HDFS block size (64 MB in Hadoop 1.x) determines how a file is divided into blocks before being stored.
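To see exactly how HDFS split a file into blocks, `fsck` can list the blocks backing a path (a sketch; the /textfiles directory created earlier is only illustrative):

```
# Show the files under /textfiles, their blocks, and the DataNodes holding them
hadoop fsck /textfiles -files -blocks -locations
```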

## FAQs
Coming up soon, stay tuned :)

## Testing
These commands have also been verified on AWS cloud instances.

## Contributions
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.

### Contribution Guidelines

When contributing to this repository, please first discuss the change you wish to make via an issue,
email, or any other method with the owners of this repository before making the change.

### Contribution Practices
* Write clear and meaningful commit messages.
* If you report a bug, please provide steps to reproduce it.
* If you change any backend routes, please submit updated routes documentation as well.
* If there is a UI-related change, it would be great if you could attach a screenshot
of the resulting changes so it is easier for the maintainers to review.

### Pull Request Process
1. Ensure any install or build dependencies are removed before the end of the layer when doing a
build.
2. Update the README.md with details of changes to the interface; this includes new environment
variables, exposed ports, useful file locations and container parameters.
3. Only send your pull requests to the development branch; once we reach a stable point,
it will be merged into the master branch.
4. Associate each Pull Request with the relevant issue number.

### Branch Policy
* development: If you are making a contribution, make sure to send your Pull Request to this branch. All
development goes in this branch.
* master: After significant features/bug-fixes have accumulated in the development branch, we merge it into the master branch.

## Cool Links to Checkout

- [How Facebook stores so much data and its statistics](https://shirshadatta2000.medium.com/how-facebook-stores-so-much-data-and-its-statistics-bd0911ad39a1)

- [Facebook and Hadoop](https://www.facebook.com/notes/facebook-engineering/hadoop/16121578919/)

- [How Google stores massive amounts of data](https://medium.com/@avantikadasgupta/how-google-stores-massive-amounts-of-data-bigtable-d67f49bfc40e)

- [Apache Hadoop Ecosystem](https://www.cloudera.com/products/open-source/apache-hadoop.html)

## License

Distributed under the MIT License. See `LICENSE` for more information.

## Contact

- My Name - Shirsha Datta

- You can contact me at [email protected]

- Connect with me on [LinkedIn](https://www.linkedin.com/in/shirsha-datta-30335a178/)