https://github.com/madhans476/spark_nlp_spark_ml_lib
This project combines the power of Spark NLP for natural language processing and Spark MLlib for machine learning, allowing users to efficiently classify text into multiple categories in a distributed computing environment.
- Host: GitHub
- URL: https://github.com/madhans476/spark_nlp_spark_ml_lib
- Owner: madhans476
- Created: 2024-11-15T18:36:30.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-11-16T06:14:35.000Z (6 months ago)
- Last Synced: 2025-01-21T07:11:20.038Z (4 months ago)
- Topics: hadoop, pyspark, python3, spark, spark-mllib, spark-nlp
- Language: Jupyter Notebook
- Homepage:
- Size: 3.62 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Leveraging Spark NLP and MLlib for Big Data Text Analysis
## Group Members
| Name | Email |
|--------------|-----------------------------|
| Madhan S | [email protected] |
| Bharath L | [email protected] |
| Dhrithi K | [email protected] |
| Gnanesh A R | [email protected] |
| Gopal        | [email protected] |

## Dataset
The dataset used for this project can be found on Zenodo: [Food Recall Incidents](https://zenodo.org/records/10891602)

## Prerequisites
- Python 3.x
- Java 8 or 11
- Apache Spark
- Apache Hadoop

## Setting Up the Environment
### 1. Create a Hadoop Spark Cluster
- Follow the instructions to set up a Hadoop-Spark cluster.
- A **3-node cluster** was used to run this project.

### 2. Installing Dependencies
Run the following commands to install the required Python libraries:
```bash
pip install pyspark
pip install spark-nlp
```

### 3. Download Dataset
- Download the **Food Hazard** dataset and upload it to **HDFS**.

## Project Pipeline

## Running the Project
1. Start the Hadoop and Spark services.
2. Open the `main.ipynb` notebook.
3. Change the input file path to match your HDFS file path.
4. Execute the notebook to run the project.

## References
- [Spark NLP](https://nlp.johnsnowlabs.com/)
- [Spark MLlib](https://spark.apache.org/mllib/)

---
## Acknowledgment

We acknowledge the authors for providing the dataset:
**Dataset Citation:**
Randl, Korbinian; Karvounis, Manos; Marinos, George; Pavlopoulos, John; Lindgren, Tony; Henriksson, Aron.
*Food Recall Incidents*.
Publisher: Zenodo, March 2024.
DOI: [10.5281/zenodo.10891602](https://doi.org/10.5281/zenodo.10891602)