https://github.com/madhans476/spark_nlp_spark_ml_lib
This project combines the power of Spark NLP for natural language processing and Spark MLlib for machine learning, allowing users to efficiently classify text into multiple categories in a distributed computing environment.
- Host: GitHub
- URL: https://github.com/madhans476/spark_nlp_spark_ml_lib
- Owner: madhans476
- Created: 2024-11-15T18:36:30.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-11-16T06:14:35.000Z (6 months ago)
- Last Synced: 2025-01-21T07:11:20.038Z (4 months ago)
- Topics: hadoop, pyspark, python3, spark, spark-mllib, spark-nlp
- Language: Jupyter Notebook
- Homepage:
- Size: 3.62 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Leveraging Spark NLP and MLlib for Big Data Text Analysis
## Group Members
| Name | Email |
|--------------|-----------------------------|
| Madhan S | [email protected] |
| Bharath L | [email protected] |
| Dhrithi K | [email protected] |
| Gnanesh A R | [email protected] |
| Gopal        | [email protected] |

## Dataset
The dataset used for this project can be found on Zenodo: [Food Recall Incidents](https://zenodo.org/records/10891602)

## Prerequisites
- Python 3.x
- Java 8 or 11
- Apache Spark
- Apache Hadoop

## Setting Up the Environment
### 1. Create a Hadoop Spark Cluster
- Follow the instructions to set up a Hadoop-Spark cluster.
- A **3-node cluster** was used to run this project.

### 2. Installing Dependencies
Run the following commands to install the required Python libraries:
```bash
pip install pyspark
pip install spark-nlp
```

### 3. Download Dataset
- Download the **Food Hazard** dataset and upload it to **HDFS**.

## Project Pipeline

## Running the Project
1. Start the Hadoop and Spark services.
2. Open the `main.ipynb` notebook.
3. Change the input file path to match your HDFS file path.
4. Execute the notebook to run the project.

## References
- [Spark NLP](https://nlp.johnsnowlabs.com/)
- [Spark MLlib](https://spark.apache.org/mllib/)

---
## Acknowledgment

We acknowledge the authors for providing the dataset:
**Dataset Citation:**
Randl, Korbinian; Karvounis, Manos; Marinos, George; Pavlopoulos, John; Lindgren, Tony; Henriksson, Aron.
*Food Recall Incidents*.
Publisher: Zenodo, March 2024.
DOI: [10.5281/zenodo.10891602](https://doi.org/10.5281/zenodo.10891602)