https://github.com/melinamoraiti/hadoop-text-analytics

An implementation of three text-analytics calculations using the Hadoop MapReduce framework: the number of files a term appears in, the maximum term frequency, and TF-IDF.

hadoop inverted-index mapreduce term-frequency tf-idf

# Hadoop-Text-Analytics-App

## Overview
This project implements text analytics on the Hadoop MapReduce framework. For a set of text documents, it computes the number of files each term appears in, each term's maximum term frequency, and its TF-IDF (Term Frequency-Inverse Document Frequency) score.
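To make the three quantities concrete, here is a minimal plain-Java sketch (no Hadoop dependency) that computes them for a tiny in-memory corpus. The README does not say which TF-IDF variant the project uses, so the weighting `tf * ln(N / m)` is an assumption, and the class name, method names, and sample file names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {
    // Raw term frequency: number of occurrences of `term` in one tokenized document.
    static int termFrequency(List<String> tokens, String term) {
        int count = 0;
        for (String t : tokens) {
            if (t.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Document frequency: number of documents that contain `term` at least once.
    static int documentFrequency(Map<String, List<String>> corpus, String term) {
        int m = 0;
        for (List<String> tokens : corpus.values()) {
            if (tokens.contains(term)) {
                m++;
            }
        }
        return m;
    }

    // Assumed variant: tf * ln(N / m). Returns 0 when no document contains the term.
    static double tfIdf(int tf, int totalDocs, int docsWithTerm) {
        if (docsWithTerm == 0) {
            return 0.0;
        }
        return tf * Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        // Hypothetical two-document corpus, already tokenized.
        Map<String, List<String>> corpus = new LinkedHashMap<>();
        corpus.put("a.txt", Arrays.asList("hadoop", "map", "reduce", "hadoop"));
        corpus.put("b.txt", Arrays.asList("map", "reduce"));

        String term = "hadoop";
        int m = documentFrequency(corpus, term);
        for (Map.Entry<String, List<String>> doc : corpus.entrySet()) {
            int tf = termFrequency(doc.getValue(), term);
            System.out.println(doc.getKey() + " tf=" + tf
                    + " tfidf=" + tfIdf(tf, corpus.size(), m));
        }
    }
}
```

A term that appears in every document gets a score of zero under this variant, because `ln(N / N) = 0`. Other variants smooth or normalize differently.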

## 🛠️ Requirements
- Apache Hadoop
- Java Development Kit (JDK)

## Getting Started

### 📊 Output Format

The output from the final reducer is structured as follows:

`word max_docname max_tf m`

- **`word`**: The analyzed term
- **`max_docname`**: The file in which the term has its maximum term frequency
- **`max_tf`**: The frequency of the term in that file, which is its highest across all files
- **`m`**: The number of files containing the term
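The fields above suggest a reduce step that scans the per-file frequencies of one word, tracks the largest, and counts the files. The following plain-Java sketch shows that logic without the Hadoop API. It is not the project's actual reducer: the class and method names are hypothetical, the tab separator is an assumption, and so is the tie-breaking rule (the first file seen with the highest frequency wins).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MaxTfSketch {
    // perDocTf maps a file name to the term frequency of `word` in that file.
    // Every entry is assumed to have a frequency of at least 1.
    static String reduce(String word, Map<String, Integer> perDocTf) {
        String maxDocname = null;
        int maxTf = 0;
        for (Map.Entry<String, Integer> entry : perDocTf.entrySet()) {
            // Strict comparison: on a tie, the earlier file is kept.
            if (entry.getValue() > maxTf) {
                maxTf = entry.getValue();
                maxDocname = entry.getKey();
            }
        }
        int m = perDocTf.size(); // number of files containing the word
        return word + "\t" + maxDocname + "\t" + maxTf + "\t" + m;
    }

    public static void main(String[] args) {
        // Hypothetical per-file counts for one word.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("a.txt", 2);
        counts.put("b.txt", 5);
        counts.put("c.txt", 1);
        System.out.println(reduce("hadoop", counts));
    }
}
```

In a real Hadoop job the same loop would sit inside a `Reducer.reduce` method that iterates over the values grouped under each word key.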

## ⚙️ How to Run?

Build and run the project with the provided shell scripts:

### Build Script

```bash
./build.sh
```

### Run Script

```bash
./run.sh
```

### Parameters

- ``: Directory for input text files.
- ``: Directory for intermediate output.
- ``: Directory for final output.
- ``: Number of reducers (1, 2, or 4).
- ``: Directory with compiled classes.
- ``: Main class of your application.
- ``: Name of the JAR file to execute.