{"id":25806715,"url":"https://github.com/asrot0/imdb_sentiment_analysis","last_synced_at":"2026-01-27T13:03:53.868Z","repository":{"id":278026598,"uuid":"934281865","full_name":"asRot0/IMDB_Sentiment_Analysis","owner":"asRot0","description":"Sentiment Analysis on IMDB Reviews using ML models with optimized preprocessing, SMOTE balancing, and hyperparameter tuning for peak performance.","archived":false,"fork":false,"pushed_at":"2025-02-20T13:54:38.000Z","size":815,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-30T06:08:33.672Z","etag":null,"topics":["ai","imdbreviews","machinelearning","nlp","sentiment-analysis","textclassification","xgboost"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/asRot0.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-17T15:14:43.000Z","updated_at":"2025-05-23T14:27:06.000Z","dependencies_parsed_at":"2025-02-17T16:28:43.601Z","dependency_job_id":"0af2910b-23fe-423b-a9a7-41519987ca45","html_url":"https://github.com/asRot0/IMDB_Sentiment_Analysis","commit_stats":null,"previous_names":["asrot0/imdb_sentiment_analysis"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/asRot0/IMDB_Sentiment_Analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asRot0%2FIMDB_Sentiment_Analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asRot0%2FIMDB_Sentiment_Analysis/tags"
,"releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asRot0%2FIMDB_Sentiment_Analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asRot0%2FIMDB_Sentiment_Analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/asRot0","download_url":"https://codeload.github.com/asRot0/IMDB_Sentiment_Analysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/asRot0%2FIMDB_Sentiment_Analysis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28813230,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-27T12:25:15.069Z","status":"ssl_error","status_checked_at":"2026-01-27T12:25:05.297Z","response_time":168,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","imdbreviews","machinelearning","nlp","sentiment-analysis","textclassification","xgboost"],"created_at":"2025-02-27T20:28:57.134Z","updated_at":"2026-01-27T13:03:53.848Z","avatar_url":"https://github.com/asRot0.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# IMDB Sentiment Analysis\n\n## Project Overview\nThis project performs sentiment analysis on **IMDB movie reviews** using **K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest, and XGBoost**. 
The goal is to classify reviews as **positive or negative** based on textual content.  \n\n### **Steps Involved**\n1. **Load \u0026 Explore the Dataset**  \n2. **Preprocess Text Data** (Cleaning, Tokenization, Stopword Removal, Stemming)  \n3. **Train-Test Split** (70% Train, 30% Test)  \n4. **Feature Extraction** using **Bag of Words (BoW)**  \n5. **Train KNN Model \u0026 Evaluate**  \n6. **Train SVM, Random Forest, XGBoost on a subset (40%-50%) \u0026 Compare**  \n7. **Hyperparameter Tuning using RandomizedSearchCV**  \n8. **Train the Best Model on the Full Dataset**  \n9. **Evaluate the Final Model (Accuracy, F1 Score, Confusion Matrix, Visualizations)**  \n\n---\n\n## Dataset Details\n- **Dataset Source**: [IMDB Dataset](https://github.com/asRot0/machine-learning/blob/main/datasets/IMDB%20Dataset.csv)  \n- **Size**: 50,000 reviews  \n- **Classes**:  \n  - **Positive (25,000 reviews)**  \n  - **Negative (25,000 reviews)**  \n\nEach review is labeled as **positive** or **negative**, making it a **binary classification problem**.\n\n---\n\n## Technologies Used\n- **Python**  \n- **Pandas, NumPy** (Data Handling)  \n- **Scikit-Learn** (Machine Learning Models)  \n- **XGBoost** (Boosting Algorithm)  \n- **Seaborn, Matplotlib** (Data Visualization)  \n- **NLTK, BeautifulSoup** (Text Processing)  \n\n---\n\n## Model Evaluation \u0026 Insights\n\nTo understand model effectiveness, we analyzed **confusion matrices** and **classification reports** for each model. 
Below are some key insights:\n\n### 🔹 **K-Nearest Neighbors (KNN)**\n- Performed **poorly** due to the high-dimensional sparse nature of text data.\n- Struggled with decision boundaries, leading to **low accuracy**.\n\n### 🔹 **Support Vector Machine (SVM)**\n- Provided **decent performance** with good generalization.\n- However, training was relatively **slow** on a large dataset.\n\n### 🔹 **Random Forest**\n- Showed **strong results**, handling non-linear relationships well.\n- Benefited from **ensemble learning**, but had a slightly **higher training time**.\n\n### 🔹 **XGBoost**\n- Achieved the **best accuracy**, excelling in feature selection \u0026 boosting weak learners.\n- Benefited significantly from **hyperparameter tuning**.\n- Final model was trained on the **full dataset** after parameter optimization.\n\n### **Visualizing Model Results**\nHere’s a heatmap of the **best model’s confusion matrix**:\n\n```python\nimport seaborn as sns\nimport matplotlib.pyplot as plt\nfrom sklearn.metrics import confusion_matrix\n\ncm = confusion_matrix(y_test, y_pred_final)\nsns.heatmap(cm, annot=True, fmt='g', cmap=\"Blues\")\nplt.title(\"Best Model (XGBoost) - Confusion Matrix\")\nplt.xlabel(\"Predicted\")\nplt.ylabel(\"Actual\")\nplt.show()\n```\n\n\u003e **Note**: Hyperparameter tuning was performed on the best-performing model before final training.\n\n---\n\n## How to Run\n### **1. Install Dependencies**\n```bash\npip install pandas numpy scikit-learn xgboost seaborn matplotlib nltk beautifulsoup4 tqdm imbalanced-learn\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasrot0%2Fimdb_sentiment_analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fasrot0%2Fimdb_sentiment_analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fasrot0%2Fimdb_sentiment_analysis/lists"}