Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/melinamoraiti/hadoop-text-analytics

📊 An implementation of document frequency (the number of files a term appears in), maximum term frequency, and TF-IDF calculation using the Hadoop MapReduce framework.

hadoop inverted-index mapreduce term-frequency tf-idf


Host: GitHub
URL: https://github.com/melinamoraiti/hadoop-text-analytics
Owner: MelinaMoraiti
Created: 2024-04-22T08:23:11.000Z (10 months ago)
Default Branch: main
Last Pushed: 2024-09-30T09:23:22.000Z (4 months ago)
Last Synced: 2024-11-11T16:29:52.542Z (3 months ago)
Topics: hadoop, inverted-index, mapreduce, term-frequency, tf-idf
Language: Java
Homepage:
Size: 54.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Hadoop-Text-Analytics-App

## Overview
This project implements text analytics on top of the Hadoop MapReduce framework. For each term in a set of text documents, it computes the number of files the term appears in, the term's maximum frequency in any single file, and its TF-IDF (Term Frequency-Inverse Document Frequency) score.
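To make the three quantities concrete, here is a minimal in-memory sketch in plain Java. It is not the repository's MapReduce code: the class and method names are hypothetical, the tokenizer (lowercase, split on non-letters) is an assumption, and the weighting `tf * ln(N / df)` is only one common TF-IDF variant, since the README does not say which one the project uses.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of the statistics; not taken from the repository.
final class TermStatsSketch {

    // Term frequency: raw count of each term in one document.
    // Assumed tokenizer: lowercase the text and split on runs of non-letters.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Document frequency: number of documents that contain the term at least once.
    static int documentFrequency(String term, List<Map<String, Integer>> docs) {
        int m = 0;
        for (Map<String, Integer> tf : docs) {
            if (tf.containsKey(term)) {
                m++;
            }
        }
        return m;
    }

    // Maximum term frequency: the largest count of the term in any single document.
    static int maxTermFrequency(String term, List<Map<String, Integer>> docs) {
        int max = 0;
        for (Map<String, Integer> tf : docs) {
            max = Math.max(max, tf.getOrDefault(term, 0));
        }
        return max;
    }

    // Assumed weighting: tf * ln(N / df). Returns 0 when the term appears nowhere.
    static double tfIdf(int tf, int df, int totalDocs) {
        if (df == 0) {
            return 0.0;
        }
        return tf * Math.log((double) totalDocs / df);
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> docs = List.of(
                termFrequencies("hadoop map reduce hadoop"),
                termFrequencies("map side join"),
                termFrequencies("reduce side join reduce reduce"));
        String term = "reduce";
        int m = documentFrequency(term, docs);
        int maxTf = maxTermFrequency(term, docs);
        System.out.println(term + " m=" + m + " max_tf=" + maxTf
                + " tfidf=" + tfIdf(maxTf, m, docs.size()));
    }
}
```

In a MapReduce setting the same counts would typically be produced by mappers emitting per-file term counts and reducers aggregating them per term; the sketch only shows what each number means.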

## 🛠️ Requirements
- Apache Hadoop
- Java Development Kit (JDK)

## Getting Started

### 📊 Output Format

Each line of output from the final reducer has the following structure:

`word max_docname max_tf m`

- **`word`**: The analyzed term
- **`max_docname`**: The file in which the term occurs most often
- **`max_tf`**: The term's frequency in that file, i.e. its highest frequency in any single file
- **`m`**: The number of files containing the term
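For downstream processing, a line in this format can be read back into a small value object. The sketch below is a hypothetical helper, not part of the repository. It assumes the four fields are separated by whitespace (tabs or spaces) and that file names contain no whitespace, neither of which the README confirms.

```java
// Hypothetical holder for one line of final reducer output: word max_docname max_tf m
final class OutputLineSketch {
    final String word;        // the analyzed term
    final String maxDocName;  // file with the maximum term frequency
    final int maxTf;          // the term's frequency in that file
    final int m;              // number of files containing the term

    OutputLineSketch(String word, String maxDocName, int maxTf, int m) {
        this.word = word;
        this.maxDocName = maxDocName;
        this.maxTf = maxTf;
        this.m = m;
    }

    // Assumption: fields are whitespace-separated and file names contain no whitespace.
    static OutputLineSketch parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 4) {
            throw new IllegalArgumentException(
                    "expected 4 fields, got " + fields.length + ": " + line);
        }
        return new OutputLineSketch(
                fields[0],
                fields[1],
                Integer.parseInt(fields[2]),
                Integer.parseInt(fields[3]));
    }
}
```

For example, `OutputLineSketch.parse("hadoop doc1.txt 7 3")` (an invented sample line) yields a record with `word` set to `hadoop`, `maxDocName` to `doc1.txt`, `maxTf` to 7, and `m` to 3.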

## ⚙️ How to Run?

Build and run the project with the provided shell scripts:

### Build Script

```bash
./build.sh
```

### Run Script

```bash
./run.sh
```

### Parameters

- Directory for input text files.
- Directory for intermediate output.
- Directory for final output.
- Number of reducers (1, 2, or 4).
- Directory with compiled classes.
- Main class of your application.
- Name of the JAR file to execute.