https://github.com/melinamoraiti/hadoop-text-analytics
An implementation of three text-analytics calculations using the Hadoop MapReduce framework: the number of files in which a term appears, the maximum term frequency, and TF-IDF.
hadoop inverted-index mapreduce term-frequency tf-idf
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/melinamoraiti/hadoop-text-analytics
- Owner: MelinaMoraiti
- Created: 2024-04-22T08:23:11.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-06-10T20:06:53.000Z (6 months ago)
- Last Synced: 2024-06-17T08:46:47.182Z (6 months ago)
- Topics: hadoop, inverted-index, mapreduce, term-frequency, tf-idf
- Language: Java
- Homepage:
- Size: 47.9 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Hadoop-Text-Analytics-App
## Overview
This project implements text analytics with the Hadoop MapReduce framework. For a set of text documents, it calculates the number of files each term appears in, the maximum term frequency of each term, and the TF-IDF (Term Frequency-Inverse Document Frequency) score.

## 🛠️ Requirements
- Apache Hadoop
- Java Development Kit (JDK)

## Getting Started
### 📊 Output Format
The output from the final reducer will be structured as follows:
`word max_docname max_tf m`

- **`word`**: The analyzed term
- **`max_docname`**: The file with the maximum term frequency
- **`max_tf`**: The highest frequency of the term in that file
- **`m`**: The number of files containing the term
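
The record layout above can be illustrated with a small in-memory sketch in plain Java. It is not the repository's MapReduce code: the class `TermStatsSketch`, its methods, the tokenization rule (lowercase, split on non-letters), the single-space separator, the tie-breaking rule (first document seen wins), and the TF-IDF variant (raw count times `ln(N / m)`) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the output fields; not the project's Hadoop job.
class TermStatsSketch {

    // One record per term: file with the highest count, that count, and number of files.
    static final class Stats {
        String maxDocname;
        int maxTf;
        int m;
    }

    // docs maps a file name to its text. Tokenization here is an assumption.
    static Map<String, Stats> compute(Map<String, String> docs) {
        Map<String, Stats> result = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Map-like step: count each term inside this one file.
            Map<String, Integer> counts = new HashMap<>();
            for (String token : doc.getValue().toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
            // Reduce-like step: fold the per-file counts into per-term statistics.
            for (Map.Entry<String, Integer> c : counts.entrySet()) {
                Stats s = result.computeIfAbsent(c.getKey(), k -> new Stats());
                s.m++; // one more file contains this term
                if (c.getValue() > s.maxTf) { // strict '>' keeps the first file on ties
                    s.maxTf = c.getValue();
                    s.maxDocname = doc.getKey();
                }
            }
        }
        return result;
    }

    // Assumed TF-IDF variant: raw term count times the natural log of (total files / m).
    static double tfIdf(int tf, int totalDocs, int m) {
        return tf * Math.log((double) totalDocs / m);
    }

    // Formats one record as "word max_docname max_tf m" (separator is an assumption).
    static String format(String word, Stats s) {
        return word + " " + s.maxDocname + " " + s.maxTf + " " + s.m;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "hadoop runs mapreduce jobs on hadoop clusters");
        docs.put("b.txt", "mapreduce splits work into map and reduce");
        docs.put("c.txt", "hadoop stores files in hdfs");
        compute(docs).forEach((word, s) -> System.out.println(format(word, s)));
    }
}
```

In this toy corpus, `hadoop` occurs twice in `a.txt` and once in `c.txt`, so its record has `max_docname` = `a.txt`, `max_tf` = 2, and `m` = 2. The real job distributes the same two steps across mappers and reducers and may tokenize or weight terms differently.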
## ⚙️ How to Run?

Easily build and run your project with the provided shell scripts:
### Build Script
```bash
./build.sh
```

### Run Script
```bash
./run.sh
```

### Parameters
- ``: Directory for input text files.
- ``: Directory for intermediate output.
- ``: Directory for final output.
- ``: Number of reducers (1, 2, or 4).
- ``: Directory with compiled classes.
- ``: Main class of your application.
- ``: Name of the JAR file to execute.
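
To give a feel for what the reducer-count parameter changes, the sketch below shows how terms would be spread across 1, 2, or 4 reducers under hash partitioning. It assumes the job uses Hadoop's default `HashPartitioner`, whose arithmetic is `(hashCode & Integer.MAX_VALUE) % numReduceTasks`; the repository may configure something else. The class `ReducerPartitionSketch` and its methods are hypothetical, and `String.hashCode` stands in for the hash of the Hadoop key type, so the actual assignments in a real job will differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of hash partitioning; runs without Hadoop on the classpath.
class ReducerPartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner, applied to a String stand-in key.
    static int partition(String term, int numReducers) {
        return (term.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Groups terms by the reducer index they would be routed to.
    static Map<Integer, List<String>> assign(List<String> terms, int numReducers) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String term : terms) {
            byReducer.computeIfAbsent(partition(term, numReducers), k -> new ArrayList<>())
                     .add(term);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("hadoop", "mapreduce", "tf", "idf", "index");
        for (int n : new int[] {1, 2, 4}) {
            System.out.println(n + " reducer(s): " + assign(terms, n));
        }
    }
}
```

Every occurrence of a given term lands on the same reducer, which is what lets a single reducer see all per-file counts for that term. With more than one reducer, the final output is normally split across several part files, one per reducer.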