https://github.com/abdelrhman-ellithy/nlp-distributed-system-mpi
NLP Model using MPI Communicator as a Distributed System Environment.
- Host: GitHub
- URL: https://github.com/abdelrhman-ellithy/nlp-distributed-system-mpi
- Owner: Abdelrhman-Ellithy
- License: mit
- Created: 2024-12-13T13:42:28.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-12-13T14:16:41.000Z (10 months ago)
- Last Synced: 2025-02-09T22:15:48.400Z (8 months ago)
- Language: Python
- Size: 2.02 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# NLP-Distributed-System-MPI
An **NLP model** that uses an **MPI communicator** to build a distributed system environment for scalable and efficient data preprocessing and machine learning model training.
---
## Features
- **Parallel Data Preprocessing**: Uses MPI to distribute and preprocess large datasets across multiple processes, enabling faster execution.
- **Flexible NLP Pipelines**: Includes functions for text cleaning (stopword removal, punctuation stripping, URL elimination, and more).
- **Decision Tree Classification**: Implements a basic machine learning pipeline with sklearn's DecisionTreeClassifier.
- **Efficient Dataset Handling**: Supports large-scale datasets through intelligent splitting and gathering using MPI, as sketched after this list.
- **Cross-platform**: Compatible with any system supporting MPI and Python.
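
A minimal sketch of the split-and-gather pattern with mpi4py. The dataset file name comes from the project structure below, but the variable names and chunking approach are illustrative, not the exact code in `Project.py`:

```python
from mpi4py import MPI
import numpy as np
import pandas as pd

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Root process loads the dataset and splits the row range into
    # one contiguous chunk per process.
    df = pd.read_csv("twitter_training.csv")
    chunks = [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), size)]
else:
    chunks = None

# Each process receives one chunk of rows to preprocess locally.
local_df = comm.scatter(chunks, root=0)

# ... apply the text-cleaning functions to local_df here ...

# Root gathers the cleaned chunks and reassembles the full dataset.
cleaned_chunks = comm.gather(local_df, root=0)
if rank == 0:
    df_clean = pd.concat(cleaned_chunks)
```

---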
## Project Structure
```
NLP-Distributed-System-MPI
|-- Project.py # Main Python script
|-- twitter_training.csv # Example dataset
```
---

## Setup & Installation
1. **Clone the repository**:
```bash
git clone https://github.com/Abdelrhman-Ellithy/NLP-Distributed-System-MPI.git
cd NLP-Distributed-System-MPI
```
2. **Install dependencies**:
- It's recommended to use a virtual environment:
```bash
python3 -m venv env
source env/bin/activate # On Windows, use `env\Scripts\activate`
```
- Install required packages:
```bash
pip install -r requirements.txt
```
3. **Ensure MPI is installed**:
- For Linux:
```bash
sudo apt-get install mpich
```
- For macOS (using Homebrew):
```bash
brew install open-mpi
```
- Verify installation:
```bash
mpiexec --version
```

---
## How to Run
1. **Run the MPI script**:
```bash
mpiexec -n <num_processes> python Project.py
```
Replace `<num_processes>` with the number of parallel processes you want to use.
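For example, with four parallel processes:
```bash
mpiexec -n 4 python Project.py
```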

---

## Preprocessing Functions
- **remove_stopwords**: Eliminates common stopwords to improve data quality.
- **remove_punc**: Strips punctuation marks.
- **remove_digits**: Removes numerical characters.
- **remove_html_tags**: Cleans HTML content.
- **remove_url**: Filters out URLs (see the sketch below).
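
A sketch of how these helpers are commonly implemented with nltk, `re`, and `string`; the bodies below are plausible implementations matching the function names, not necessarily the exact code in `Project.py`:

```python
import re
import string
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))  # requires nltk.download("stopwords") once

def remove_stopwords(text):
    # Drop common English stopwords, keeping the remaining word order.
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def remove_punc(text):
    # Strip every character listed in string.punctuation.
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_digits(text):
    return re.sub(r"\d+", "", text)

def remove_html_tags(text):
    return re.sub(r"<[^>]+>", "", text)

def remove_url(text):
    return re.sub(r"https?://\S+|www\.\S+", "", text)
```

---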
## Example Workflow
1. **Load Dataset**: Reads and prepares the dataset for processing.
2. **Preprocess with MPI**: Distributes preprocessing tasks across multiple processes.
4. **Vectorize Text**: Converts preprocessed text into numerical features using sklearn's CountVectorizer.
4. **Train Model**: Fits a Decision Tree classifier.
5. **Evaluate**: Outputs a confusion matrix and classification report (a condensed sketch follows below).
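
A condensed, hedged sketch of steps 3–5 with scikit-learn. The column names `text` and `label` are assumptions about `twitter_training.csv`, so adjust them to the actual dataset:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Assumes the text has already been cleaned (steps 1-2, see the MPI sketch above)
# and that the dataset exposes "text" and "label" columns.
df = pd.read_csv("twitter_training.csv")

X = CountVectorizer().fit_transform(df["text"].astype(str))  # step 3: vectorize
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)                # step 4: train
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)                                 # step 5: evaluate
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

---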
## Performance
- **Speedup**: Parallel processing significantly reduces preprocessing time for large datasets.
- **Scalability**: Easily scale the system by increasing the number of processes (see the timing sketch below).
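
One way to check the speedup on your own machine is to time the parallel section with `MPI.Wtime()`; this is an illustrative snippet, not part of `Project.py`:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
comm.Barrier()              # make sure every process starts timing together
start = MPI.Wtime()

# ... scatter, preprocess, gather as in the sketch above ...

comm.Barrier()
if comm.Get_rank() == 0:
    elapsed = MPI.Wtime() - start
    print(f"Preprocessing took {elapsed:.2f} s on {comm.Get_size()} processes")
```

---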
## License
This project is licensed under the MIT License. See the LICENSE file for details.
---
## Acknowledgments
- **MPI4Py**: For enabling Python-based MPI implementations.
- **scikit-learn**: For powerful machine learning tools.
- **Pandas**: For efficient data manipulation.
- **nltk**: For NLP-specific preprocessing utilities.

---
## Future Enhancements
- Integration with advanced classifiers (e.g., Random Forest, Gradient Boosting).
- Adding support for GPU-based preprocessing.
- Extending compatibility with cloud-based environments.

---
Happy coding!