https://github.com/fahimahammed/hadoop-and-hdfs
This repository provides comprehensive documentation and a handy cheat sheet for managing Apache Hadoop 3.4.0 on Debian-based systems. Whether you're setting up a new Hadoop cluster, running MapReduce jobs, or handling HDFS operations, this repository aims to be your go-to resource for all things related to Hadoop.
- Host: GitHub
- URL: https://github.com/fahimahammed/hadoop-and-hdfs
- Owner: fahimahammed
- License: mit
- Created: 2024-09-07T19:27:57.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-10-16T06:25:28.000Z (4 months ago)
- Last Synced: 2024-11-24T18:12:27.271Z (2 months ago)
- Topics: ddbms, dfs, hadoop, hdfs, mapreduce
- Size: 57.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Apache Hadoop 3.4.0 Documentation on Debian
## Table of Contents
1. [Prerequisites](#prerequisites)
2. [Lab 23: Creating HDFS](#lab-23-creating-hdfs)
3. [Lab 24: Installing Hadoop Framework](#lab-24-installing-hadoop-framework-yarn)
4. [Lab 25: Query Processing in HDFS](#lab-25-query-processing-in-hdfs-using-mapreduce)
5. [Managing Files and Running Jobs](#managing-files-and-running-jobs)
6. [Cheat Sheet](./cheat-sheet.md)

---
## Prerequisites
1. **Java Installation**
- Ensure Java 8 or higher is installed:
```bash
java --version
```
- Install Java if needed:
```bash
sudo apt update
sudo apt install openjdk-17-jdk -y
```

2. **SSH Configuration**
- Install SSH:
```bash
sudo apt install openssh-server openssh-client -y
```
- Set up passwordless SSH:
```bash
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
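# Optional check: sshd rejects keys if authorized_keys is group/world-writable,
# and `ssh localhost` should now succeed without a password prompt.
chmod 600 ~/.ssh/authorized_keys
ssh localhost exit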
```

---
## Lab 23: Creating HDFS
### Step 1: Download and Install Hadoop 3.4.0
1. **Download Hadoop**:
```bash
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
```

2. **Extract Hadoop**:
```bash
tar -xvzf hadoop-3.4.0.tar.gz
sudo mv hadoop-3.4.0 /usr/local/hadoop
```

3. **Set Hadoop Environment Variables**:
Edit `.bashrc`:
```bash
nano ~/.bashrc
```
Add:
```bash
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```
Apply changes:
```bash
source ~/.bashrc
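# Sanity check: once PATH includes $HADOOP_HOME/bin, this should
# print the Hadoop version and build information
hadoop version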
```

### Step 2: Configure Hadoop
1. **Set JAVA_HOME**:
Find Java path:
```bash
readlink -f $(which java)
```
Update `hadoop-env.sh`:
```bash
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
```
Add:
```bash
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
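# Adjust this path if `readlink -f $(which java)` printed a different JVM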
```

2. **Configure HDFS**:
- **Edit `core-site.xml`**:
```bash
nano $HADOOP_HOME/etc/hadoop/core-site.xml
```
Add:
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

- **Edit `hdfs-site.xml`**:
```bash
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
```
Add:
```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>
```

- **Create HDFS Directories**:
```bash
sudo mkdir -p /usr/local/hadoop_data/hdfs/namenode
sudo mkdir -p /usr/local/hadoop_data/hdfs/datanode
sudo chown -R $USER:$USER /usr/local/hadoop_data
```

- **Format the NameNode**:
```bash
hdfs namenode -format
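# Caution: formatting re-initializes (and erases) the NameNode metadata
# directory; run it only once, on first setup.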
```

### Step 3: Start HDFS
1. **Start HDFS**:
```bash
start-dfs.sh
```

2. **Verify Services**:
Check that NameNode and DataNode are running:
```bash
jps
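# A command-line alternative to the web UI below: report live DataNodes,
# capacity, and remaining space
hdfs dfsadmin -report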
```

3. **Access HDFS Web UI**:
Navigate to:
```
http://localhost:9870/
```

---
## Lab 24: Installing Hadoop Framework (YARN)
### Step 1: Configure YARN
1. **Edit `yarn-site.xml`**:
```bash
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
```
Add:
```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```

2. **Start YARN**:
```bash
start-yarn.sh
```

3. **Verify YARN Services**:
Check that ResourceManager and NodeManager are running:
```bash
jps
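# Optional: list the NodeManagers registered with the ResourceManager
yarn node -list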
```

---
## Lab 25: Query Processing in HDFS (Using MapReduce)
### Step 1: Write the MapReduce Program
1. **Create WordCount Java Program**:
```bash
nano WordCount.java
```
Insert the following code:
```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emit (word, 1) for every whitespace-separated token in the line
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split("\\s+");
            for (String w : words) {
                word.set(w);
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Sum the counts for each word and emit (word, total)
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

### Step 2: Compile and Run the Program
1. **Compile the Program**:
```bash
hadoop com.sun.tools.javac.Main WordCount.java
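# If com.sun.tools.javac.Main is unavailable on newer JDKs, compiling
# directly against the Hadoop classpath also works:
#   javac -classpath "$(hadoop classpath)" WordCount.java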
jar cf wordcount.jar WordCount*.class
```

2. **Create an Input File**:
```bash
echo "Hello Hadoop Hello HDFS" > input.txt
```

3. **Upload File to HDFS**:
```bash
hdfs dfs -mkdir /input
hdfs dfs -put input.txt /input/input.txt
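# Optional: confirm the upload
hdfs dfs -ls /input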
```

4. **Run the MapReduce Job**:
```bash
hadoop jar wordcount.jar WordCount /input/input.txt /output
```

5. **View the Output**:
```bash
hdfs dfs -cat /output/part-r-00000
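# Expected output for the sample input "Hello Hadoop Hello HDFS":
#   HDFS    1
#   Hadoop  1
#   Hello   2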
```

### Changing the Content of an Existing Input File in HDFS
1. **Remove the Existing Input File from HDFS**:
```bash
hdfs dfs -rm /input/input.txt
```

2. **Create a New File with Updated Content Locally**:
```bash
echo "New content for Hadoop processing" > new_input.txt
```

3. **Upload the New Input File to HDFS**:
```bash
hdfs dfs -put new_input.txt /input/input.txt
```

4. **Run the MapReduce Job Again**:
```bash
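# MapReduce will not overwrite an existing output directory, so remove
# the previous run's output first
hdfs dfs -rm -r /output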
hadoop jar wordcount.jar WordCount /input/input.txt /output
```

5. **View the Output**:
```bash
hdfs dfs -cat /output/part-r-00000
```

---
### Stopping Hadoop Services
1. **Stop HDFS**:
```bash
stop-dfs.sh
```

2. **Stop YARN**:
```bash
stop-yarn.sh
```

3. **Verify Services are Stopped**:
```bash
jps
```

No Hadoop-related processes should be listed.
---
This guide covers everything from setting up Hadoop on Debian to running MapReduce jobs and managing input and output files in HDFS. If you need further help, feel free to open an issue.