{"id":19039492,"url":"https://github.com/nirmaldeepponnada/codeclauseinternshipproject2","last_synced_at":"2026-04-07T16:31:49.171Z","repository":{"id":261766891,"uuid":"885280105","full_name":"nirmaldeepponnada/CodeClauseInternshipProject2","owner":"nirmaldeepponnada","description":"Python, NLTK, Scikit-Learn, Pandas, NumPy, Pickle, SciPy, and JSON are used for text preprocessing, feature engineering, multi-label classification, and model persistence.","archived":false,"fork":false,"pushed_at":"2024-11-08T09:52:11.000Z","size":9735,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-02T06:44:43.140Z","etag":null,"topics":["nltk","numpy","pandas","pickle","python","scikit-learn","scipy"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nirmaldeepponnada.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-08T09:42:18.000Z","updated_at":"2024-11-08T09:55:32.000Z","dependencies_parsed_at":"2024-11-08T10:52:01.182Z","dependency_job_id":null,"html_url":"https://github.com/nirmaldeepponnada/CodeClauseInternshipProject2","commit_stats":null,"previous_names":["nirmaldeepponnada/codeclauseinternshipproject2"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nirmaldeepponnada%2FCodeClauseInternshipProject2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nirmaldeepponnada%2FCodeClauseInternshipProject2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nirmaldeepponnada%2FCodeClauseInternshipProject2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nirmaldeepponnada%2FCodeClauseInternshipProject2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nirmaldeepponnada","download_url":"https://codeload.github.com/nirmaldeepponnada/CodeClauseInternshipProject2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240100007,"owners_count":19747610,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nltk","numpy","pandas","pickle","python","scikit-learn","scipy"],"created_at":"2024-11-08T22:17:12.299Z","updated_at":"2026-04-07T16:31:49.137Z","avatar_url":"https://github.com/nirmaldeepponnada.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CodeClauseInternshipProject2\n\n### Project Description\n\n1. **Objective**:\n   - Predict the genres of movies based on their plot summaries and additional features (e.g., budget, runtime, vote average).\n\n2. **Dataset**:\n   - Combines `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv`.\n   - Key features used: `overview` (plot summary), `genres`, `budget`, `runtime`, `vote_average`.\n\n3. **Technologies Used**:\n   - Python for implementation.\n   - NLTK for text preprocessing (lemmatization, stopword removal).\n   - Scikit-Learn for feature extraction (TF-IDF), multi-label classification, and evaluation.\n   - Pandas and NumPy for data handling.\n   - Pickle for model persistence.\n\n4. **Preprocessing**:\n   - Text data (`overview`) cleaned using tokenization, lemmatization, and stopword removal.\n   - JSON-formatted `genres` column converted to binary multi-label format.\n   - Additional numerical features filled or normalized.\n\n5. **Feature Engineering**:\n   - Plot summaries converted to numerical features using TF-IDF vectorization.\n   - Combined TF-IDF features with numerical features (budget, runtime, vote average).\n\n6. **Model Training**:\n   - Multi-label classification handled using `OneVsRestClassifier` with Logistic Regression.\n   - Model trained to predict one or more genres for each movie.\n\n7. **Evaluation**:\n   - Metrics: Accuracy, precision, recall, and F1-score.\n   - Evaluation done on a separate test set.\n\n8. **Model Persistence**:\n   - Trained model, TF-IDF vectorizer, and genre label binarizer saved for future predictions.\n\n9. **Prediction**:\n   - New movie summaries can be preprocessed and passed through the saved model to predict genres.\n\n10. **Outcome**:\n    - Demonstrates the use of NLP and machine learning techniques for multi-label text classification.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnirmaldeepponnada%2Fcodeclauseinternshipproject2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnirmaldeepponnada%2Fcodeclauseinternshipproject2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnirmaldeepponnada%2Fcodeclauseinternshipproject2/lists"}