https://github.com/nirmaldeepponnada/codeclauseinternshipproject2
Python, NLTK, Scikit-Learn, Pandas, NumPy, Pickle, SciPy, and JSON are used for text preprocessing, feature engineering, multi-label classification, and model persistence.
Last synced: 14 days ago
- Host: GitHub
- URL: https://github.com/nirmaldeepponnada/codeclauseinternshipproject2
- Owner: nirmaldeepponnada
- Created: 2024-11-08T09:42:18.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-11-08T09:52:11.000Z (2 months ago)
- Last Synced: 2024-11-08T10:35:40.733Z (2 months ago)
- Topics: nltk, numpy, pandas, pickle, python, scikit-learn, scipy
- Language: Jupyter Notebook
- Homepage:
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# CodeClauseInternshipProject2
### Project Description
1. **Objective**:
- Predict the genres of movies based on their plot summaries and additional features (e.g., budget, runtime, vote average).

2. **Dataset**:
- Combines `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv`.
- Key features used: `overview` (plot summary), `genres`, `budget`, `runtime`, `vote_average`.

3. **Technologies Used**:
- Python for implementation.
- NLTK for text preprocessing (lemmatization, stopword removal).
- Scikit-Learn for feature extraction (TF-IDF), multi-label classification, and evaluation.
- Pandas and NumPy for data handling.
- Pickle for model persistence.

4. **Preprocessing**:
- Text data (`overview`) cleaned using tokenization, lemmatization, and stopword removal.
- JSON-formatted `genres` column converted to binary multi-label format.
- Additional numerical features filled or normalized.

5. **Feature Engineering**:
- Plot summaries converted to numerical features using TF-IDF vectorization.
- Combined TF-IDF features with numerical features (budget, runtime, vote average).

6. **Model Training**:
- Multi-label classification handled using `OneVsRestClassifier` with Logistic Regression.
- Model trained to predict one or more genres for each movie.

7. **Evaluation**:
- Metrics: Accuracy, precision, recall, and F1-score.
- Evaluation done on a separate test set.

8. **Model Persistence**:
- Trained model, TF-IDF vectorizer, and genre label binarizer saved for future predictions.

9. **Prediction**:
- New movie summaries can be preprocessed and passed through the saved model to predict genres.

10. **Outcome**:
- Demonstrates the use of NLP and machine learning techniques for multi-label text classification.
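
The conversion of the JSON-formatted `genres` column to a binary multi-label format (Preprocessing) might look roughly like the sketch below. It uses only the standard library and assumes each cell holds a JSON list of objects with a `name` key; the sample rows and the helper name `parse_genres` are illustrative, not taken from the notebook. Scikit-Learn's `MultiLabelBinarizer` produces the same kind of matrix and is probably closer to what the project actually saves as its label binarizer.

```python
import json

# Illustrative cells in the assumed shape of the `genres` column.
raw_genres = [
    '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]',
    '[{"id": 10749, "name": "Romance"}]',
    '[]',  # a movie with no genres listed
]


def parse_genres(cell):
    """Return the genre names stored in one JSON-encoded cell (hypothetical helper)."""
    return [entry["name"] for entry in json.loads(cell)]


genre_lists = [parse_genres(cell) for cell in raw_genres]

# One column per distinct genre, in alphabetical order.
classes = sorted({name for names in genre_lists for name in names})

# One row per movie: 1 if the movie has that genre, else 0.
binary_rows = [[int(name in names) for name in classes] for names in genre_lists]

print(classes)
print(binary_rows)
```

Each row can carry several 1s, which is what makes the task multi-label rather than multi-class.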
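
The Feature Engineering, Model Training, and Evaluation steps can be sketched end to end on toy data. Everything here is an assumption made for illustration: the eight made-up movies, the fixed 6/2 split, the use of `StandardScaler` for the numeric columns, default TF-IDF settings with English stop words, and `max_iter=1000`. The README only confirms TF-IDF, `OneVsRestClassifier` with Logistic Regression, and the four metric families.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler

# Toy stand-in for the cleaned dataset (all values are made up).
overviews = [
    "a soldier fights an alien invasion",
    "two friends fall in love in paris",
    "a detective hunts a killer in the city",
    "a robot army attacks earth",
    "a couple plans a chaotic wedding",
    "a spy fights a secret war and falls in love",
    "an alien army attacks a space station",
    "a wedding planner falls in love",
]
genres = [
    ["Action", "Science Fiction"], ["Romance"], ["Thriller"],
    ["Action", "Science Fiction"], ["Romance"], ["Action", "Romance"],
    ["Action", "Science Fiction"], ["Romance"],
]
# Columns: budget, runtime, vote_average.
numeric = np.array([
    [150e6, 120, 6.5], [20e6, 105, 7.1], [35e6, 110, 6.8], [180e6, 130, 6.0],
    [15e6, 95, 5.9], [90e6, 125, 7.0], [120e6, 115, 6.2], [25e6, 100, 6.4],
])

n_train = 6  # first six rows train, last two are the held-out test set

# Fit the text and numeric transformers on the training rows only.
tfidf = TfidfVectorizer(stop_words="english")
X_text_train = tfidf.fit_transform(overviews[:n_train])
X_text_test = tfidf.transform(overviews[n_train:])

scaler = StandardScaler()
X_num_train = scaler.fit_transform(numeric[:n_train])
X_num_test = scaler.transform(numeric[n_train:])

# Append the scaled numeric columns to the sparse TF-IDF matrix.
X_train = hstack([X_text_train, csr_matrix(X_num_train)]).tocsr()
X_test = hstack([X_text_test, csr_matrix(X_num_test)]).tocsr()

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(genres[:n_train])
Y_test = mlb.transform(genres[n_train:])

# One binary logistic regression per genre.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)

metrics = {
    "subset_accuracy": accuracy_score(Y_test, Y_pred),
    "precision_micro": precision_score(Y_test, Y_pred, average="micro", zero_division=0),
    "recall_micro": recall_score(Y_test, Y_pred, average="micro", zero_division=0),
    "f1_micro": f1_score(Y_test, Y_pred, average="micro", zero_division=0),
}
print(metrics)
```

For multi-label targets, `accuracy_score` reports subset accuracy (every genre of a movie must match), so it is usually much lower than the micro-averaged scores. The numbers from a toy set this small say nothing about the real model.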
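
The Model Persistence and Prediction steps could be sketched as follows. This is a simplified, text-only version: the file names, the temporary directory, the tiny training set, and the helpers `load` and `predict_genres` are all hypothetical. If the saved model was trained on TF-IDF plus numeric features, the same numeric columns (and any fitted scaler) would also have to be supplied at prediction time.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Tiny made-up training set so the example is self-contained.
texts = [
    "a soldier fights an alien army",
    "two friends fall in love",
    "a spy fights a secret war",
    "a wedding planner finds love",
]
labels = [["Action"], ["Romance"], ["Action"], ["Romance"]]

vectorizer = TfidfVectorizer()
binarizer = MultiLabelBinarizer()
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(vectorizer.fit_transform(texts), binarizer.fit_transform(labels))

# Save the three artifacts the README lists (file names are assumptions).
out_dir = Path(tempfile.mkdtemp())
artifacts = {"model": model, "vectorizer": vectorizer, "binarizer": binarizer}
for name, obj in artifacts.items():
    with open(out_dir / f"{name}.pkl", "wb") as fh:
        pickle.dump(obj, fh)


def load(name):
    """Load one pickled artifact from out_dir."""
    with open(out_dir / f"{name}.pkl", "rb") as fh:
        return pickle.load(fh)


loaded_model = load("model")
loaded_vectorizer = load("vectorizer")
loaded_binarizer = load("binarizer")


def predict_genres(summary):
    """Vectorize a new summary and map the predicted row back to genre names."""
    X = loaded_vectorizer.transform([summary])
    Y = loaded_model.predict(X)
    return list(loaded_binarizer.inverse_transform(Y)[0])


predicted = predict_genres("a soldier fights a secret war")
print(predicted)
```

The vectorizer and binarizer have to be saved alongside the model because the classifier only understands the exact column layout they produced during training. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.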