https://github.com/nirmaldeepponnada/codeclauseinternshipproject2

Python, NLTK, Scikit-Learn, Pandas, NumPy, Pickle, SciPy, and JSON are used for text preprocessing, feature engineering, multi-label classification, and model persistence.

Host: GitHub
URL: https://github.com/nirmaldeepponnada/codeclauseinternshipproject2
Owner: nirmaldeepponnada
Created: 2024-11-08T09:42:18.000Z
Default Branch: main
Last Pushed: 2024-11-08T09:52:11.000Z
Last Synced: 2025-01-02T06:44:43.140Z
Topics: nltk, numpy, pandas, pickle, python, scikit-learn, scipy
Language: Jupyter Notebook
Homepage:
Size: 9.28 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# CodeClauseInternshipProject2

### Project Description

1. **Objective**:
- Predict the genres of movies based on their plot summaries and additional features (e.g., budget, runtime, vote average).

2. **Dataset**:
- Combines `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv`.
- Key features used: `overview` (plot summary), `genres`, `budget`, `runtime`, `vote_average`.
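A minimal sketch of how the two CSV files might be combined. The README does not say how the join is done; the join keys (`id` in the movies file, `movie_id` in the credits file) and the helper name `combine_tmdb` are assumptions made for illustration.

```python
import pandas as pd

# Columns the README names as key features.
KEY_COLUMNS = ["overview", "genres", "budget", "runtime", "vote_average"]


def combine_tmdb(movies: pd.DataFrame, credits: pd.DataFrame) -> pd.DataFrame:
    """Left-join credits onto movies (the join keys are an assumption)."""
    # Drop the credits copy of `title`, if present, so the merge does not
    # produce `title_x` / `title_y` duplicates.
    credits = credits.drop(columns=["title"], errors="ignore")
    return movies.merge(credits, left_on="id", right_on="movie_id", how="left")
```

With the files on disk, this could be called as `combine_tmdb(pd.read_csv("tmdb_5000_movies.csv"), pd.read_csv("tmdb_5000_credits.csv"))[KEY_COLUMNS]`.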

3. **Technologies Used**:
- Python for implementation.
- NLTK for text preprocessing (lemmatization, stopword removal).
- Scikit-Learn for feature extraction (TF-IDF), multi-label classification, and evaluation.
- Pandas and NumPy for data handling.
- Pickle for model persistence.

4. **Preprocessing**:
- Text data (`overview`) cleaned using tokenization, lemmatization, and stopword removal.
- JSON-formatted `genres` column converted to binary multi-label format.
- Additional numerical features (`budget`, `runtime`, `vote_average`) have missing values filled and are normalized as needed.
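The preprocessing steps above could look roughly like the sketch below. It assumes each `genres` cell is a JSON list of objects with a `name` key, and it swaps in a simple regex tokenizer instead of an NLTK tokenizer. All function names (`parse_genres`, `clean_overview`, `load_nltk_helpers`, `binarize_genres`) are hypothetical and not taken from the notebook.

```python
import json
import re

from sklearn.preprocessing import MultiLabelBinarizer


def parse_genres(raw: str) -> list[str]:
    """Extract genre names from a JSON string of {"id", "name"} objects."""
    return [entry["name"] for entry in json.loads(raw)]


def clean_overview(text, stop_words, lemmatize) -> str:
    """Lowercase, tokenize, drop stopwords, and lemmatize one plot summary."""
    if not isinstance(text, str):  # missing overviews arrive as NaN
        return ""
    # Simple regex tokenizer (an assumption; the notebook may use NLTK's).
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(lemmatize(tok) for tok in tokens if tok not in stop_words)


def load_nltk_helpers():
    """Return (stop_words, lemmatize) backed by NLTK corpora."""
    import nltk  # imported lazily so clean_overview works without NLTK data
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)
    return set(stopwords.words("english")), WordNetLemmatizer().lemmatize


def binarize_genres(raw_genres):
    """Fit a MultiLabelBinarizer on parsed genre lists; return (matrix, binarizer)."""
    binarizer = MultiLabelBinarizer()
    matrix = binarizer.fit_transform([parse_genres(raw) for raw in raw_genres])
    return matrix, binarizer
```

Passing the stopword set and lemmatizer in as arguments keeps `clean_overview` easy to try out with a small hand-written stopword set before downloading the NLTK corpora.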

5. **Feature Engineering**:
- Plot summaries converted to numerical features using TF-IDF vectorization.
- TF-IDF features combined with the numerical features (`budget`, `runtime`, `vote_average`) into a single feature matrix.
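One way to combine sparse TF-IDF features with the numerical columns is to stack them side by side with SciPy. This is a sketch under assumptions: filling missing numbers with 0 and scaling with `MinMaxScaler` are illustrative choices, and `build_features` is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

NUMERIC_COLS = ["budget", "runtime", "vote_average"]


def build_features(df: pd.DataFrame, vectorizer, scaler, fit: bool = False):
    """Return a sparse matrix of [TF-IDF columns | scaled numeric columns]."""
    text = df["overview"].fillna("")
    # Assumption: missing numeric values are replaced with 0 before scaling.
    numeric = df[NUMERIC_COLS].fillna(0).to_numpy(dtype=float)
    if fit:
        # Learn the vocabulary and scaling ranges from the training data.
        tfidf = vectorizer.fit_transform(text)
        scaled = scaler.fit_transform(numeric)
    else:
        # Reuse what was learned during training.
        tfidf = vectorizer.transform(text)
        scaled = scaler.transform(numeric)
    return hstack([tfidf, csr_matrix(scaled)]).tocsr()
```

Calling it with `fit=True` on the training rows and `fit=False` on everything else keeps the vocabulary and scaling ranges from leaking test data into training.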

6. **Model Training**:
- Multi-label classification handled using `OneVsRestClassifier` with a Logistic Regression base estimator, which fits one binary classifier per genre.
- Model trained to predict one or more genres for each movie.
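As a rough illustration of the training step, the sketch below wraps `LogisticRegression` in `OneVsRestClassifier` and fits it on a tiny made-up dataset. The hyperparameter `max_iter=1000`, the helper name `train_genre_model`, and the toy arrays are assumptions; the notebook's actual settings are not stated in the README.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier


def train_genre_model(X, Y):
    """Fit one logistic-regression classifier per genre column of Y."""
    model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    model.fit(X, Y)
    return model


# Toy data: 4 samples, 2 features, 2 genre columns (a binary indicator matrix).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

toy_model = train_genre_model(X_toy, Y_toy)
toy_pred = toy_model.predict(X_toy)  # one row per sample, one 0/1 column per genre
```

Because each genre gets its own binary classifier, a single movie can be assigned several genres or none at all.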

7. **Evaluation**:
- Metrics: Accuracy, precision, recall, and F1-score.
- Evaluation done on a separate test set.
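The evaluation step might be sketched as follows. The README does not say how precision, recall, and F1 are averaged across genres, so micro-averaging here is an assumption, as are the 80/20 split and the helper names `hold_out_split` and `multilabel_report`. Note that for multi-label targets, scikit-learn's `accuracy_score` reports subset accuracy: a prediction counts as correct only if every genre for that movie matches.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def hold_out_split(X, Y, test_size=0.2, seed=42):
    """Split features and labels into train and test parts."""
    return train_test_split(X, Y, test_size=test_size, random_state=seed)


def multilabel_report(y_true, y_pred, average="micro"):
    """Compute the four metrics named in the README for indicator matrices."""
    return {
        "subset_accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "recall": recall_score(y_true, y_pred, average=average, zero_division=0),
        "f1": f1_score(y_true, y_pred, average=average, zero_division=0),
    }
```

Switching `average` to `"macro"` or `"samples"` gives per-genre or per-movie averages instead, which can differ noticeably when some genres are rare.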

8. **Model Persistence**:
- Trained model, TF-IDF vectorizer, and genre label binarizer saved for future predictions.
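A minimal persistence sketch using `pickle`. Whether the notebook writes one file or several is not stated; bundling the three objects into a single dictionary, the dictionary keys, and the helper names `save_artifacts` and `load_artifacts` are all assumptions.

```python
import pickle


def save_artifacts(path, model, vectorizer, binarizer):
    """Write the three fitted objects to one pickle file."""
    bundle = {"model": model, "vectorizer": vectorizer, "binarizer": binarizer}
    with open(path, "wb") as fh:
        pickle.dump(bundle, fh)


def load_artifacts(path):
    """Read the bundle back; only unpickle files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Keeping the vectorizer and binarizer alongside the model matters because predictions need the same vocabulary and the same genre column order that were used during training.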

9. **Prediction**:
- New movie summaries can be preprocessed and passed through the saved model to predict genres.
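Prediction for a new summary could be sketched like this. It assumes the model was trained on TF-IDF columns followed by the numerical columns, that `summary` has already been cleaned the same way as the training text, and that `numeric_row` is already filled and scaled like the training data. The function name `predict_genres` is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack


def predict_genres(summary, numeric_row, model, vectorizer, binarizer):
    """Return the predicted genre names for one preprocessed plot summary."""
    # Vectorize the text with the vocabulary learned during training.
    tfidf = vectorizer.transform([summary])
    # Append the numeric features in the same column order used for training.
    numeric = csr_matrix(np.asarray(numeric_row, dtype=float).reshape(1, -1))
    features = hstack([tfidf, numeric]).tocsr()
    # The model outputs a 0/1 row; the binarizer maps it back to genre names.
    label_matrix = model.predict(features)
    return list(binarizer.inverse_transform(label_matrix)[0])
```

An empty list is a valid result: it means none of the per-genre classifiers predicted a positive label for that summary.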

10. **Outcome**:
- Demonstrates the use of NLP and machine learning techniques for multi-label text classification.
