https://github.com/gaurav-van/toxic-comment-web_app
Data Science Project to classify a comment into several toxicity categories. This Repository is used for deployment of the project.
- Host: GitHub
- URL: https://github.com/gaurav-van/toxic-comment-web_app
- Owner: Gaurav-Van
- Created: 2022-08-20T13:06:45.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2024-05-10T04:54:24.000Z (about 2 years ago)
- Last Synced: 2025-02-02T15:14:08.969Z (over 1 year ago)
- Topics: classification, data-science, datacleaning, exploratory-data-analysis, machine-learning, nlp, nlp-machine-learning, python, streamlit
- Language: Python
- Homepage:
- Size: 4.05 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Threat model: threat_model.pkl
README
# Toxic-Comment-App
Note: This Repository is required for deployment of this project on Streamlit Cloud.
Web App Link :- https://gaurav-van-toxic-comment-web-app-app-24y37c.streamlitapp.com/
Project Repo: https://github.com/Gaurav-Van/Data_Science__Machine_Learning-Projects
Classifies comments into six toxicity categories, along with the neutral (non-toxic) case of each, using concepts from NLP and ML:
- Toxic
- Severe Toxic
- Threat
- Obscene
- Insult
- Identity Hate
## Concept Used
Instead of multiclass classification, a separate binary classification is performed for each category.
1. Data Collection - From Kaggle: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
2. Data Pre-Processing - Text Pre-Processing Using Regular Expressions
* Removing \n characters
* Removing Alpha-Numeric Characters
* Removing Punctuation
* Removing Non-ASCII Characters
3. EDA - Performing Data Analysis to Discover Issues and Trends in the Data
- Bar charts of each category revealed a problem: class imbalance. Solution: make the frequency of 0s equal to the frequency of 1s by building a separate dataset for each category [ id, comment_text, category ].
- This resolves the class imbalance and sets up the binary classification of each category.
4. Model Building
* VECTORIZATION :- Using TF-IDF and Unigram Approach
* Model Used For Each Category :- KNN, Logistic Regression, SVM, CNB, BNB, DT and RF
* Model Selected - Logistic Regression
* Exporting Trained ML Models as 6 pickle files [ one for each category ]
* Exporting Fitted TF-IDF Vectorizers as 6 pickle files [ one for each category ]
5. Deployment - Building a web app with Streamlit and deploying it on Streamlit Cloud
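
The pre-processing and balancing steps (steps 2 and 3 above) can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code. The function names `clean_comment` and `balance_binary` are hypothetical. "Alpha-numeric" is read here as "tokens containing digits", balancing is assumed to mean random undersampling of the majority class, and the final whitespace collapse is an added convenience.

```python
import random
import re
import string

# Character class matching any ASCII punctuation mark.
_PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")


def clean_comment(text: str) -> str:
    """Apply the four cleaning steps from the list, in order."""
    text = text.replace("\n", " ")             # remove newline characters
    text = re.sub(r"\w*\d\w*", " ", text)      # assumption: drop tokens containing digits
    text = _PUNCT_RE.sub(" ", text)            # remove punctuation
    text = re.sub(r"[^\x00-\x7f]", " ", text)  # remove non-ASCII characters
    # Not in the original list: collapse the runs of spaces left behind.
    return re.sub(r"\s+", " ", text).strip()


def balance_binary(rows, label_key, seed=0):
    """Undersample so that 0s and 1s are equally frequent for one category.

    `rows` is a list of dicts such as
    {"id": ..., "comment_text": ..., "threat": 0 or 1}.
    """
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced
```

Calling `balance_binary` once per label column yields six separate datasets, each with an equal number of positive and negative examples.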
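
The model-building step (step 4 above) can be sketched with scikit-learn as follows. This is an illustration under assumptions rather than the repository's code. The helper names, the `max_iter` setting and the `_vect.pkl` file name are hypothetical; the `_model.pkl` suffix mirrors the `threat_model.pkl` file listed in the metadata. `CATEGORIES` holds the label column names of the Kaggle dataset.

```python
import pickle
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Label columns of the Kaggle dataset: one binary problem per column.
CATEGORIES = ["toxic", "severe_toxic", "threat", "obscene", "insult", "identity_hate"]


def train_category(texts, labels):
    """Fit a unigram TF-IDF vectorizer and a logistic regression for one category."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 1))  # unigrams only
    features = vectorizer.fit_transform(texts)
    model = LogisticRegression(max_iter=1000)  # assumed setting, not from the source
    model.fit(features, labels)
    return vectorizer, model


def export_category(vectorizer, model, category, out_dir="."):
    """Pickle the fitted vectorizer and model (file names are illustrative)."""
    out = Path(out_dir)
    vect_path = out / f"{category}_vect.pkl"
    model_path = out / f"{category}_model.pkl"
    with open(vect_path, "wb") as fh:
        pickle.dump(vectorizer, fh)
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)
    return vect_path, model_path


def load_category(category, out_dir="."):
    """Load the pickled vectorizer and model for one category."""
    out = Path(out_dir)
    with open(out / f"{category}_vect.pkl", "rb") as fh:
        vectorizer = pickle.load(fh)
    with open(out / f"{category}_model.pkl", "rb") as fh:
        model = pickle.load(fh)
    return vectorizer, model


def predict_probability(text, vectorizer, model):
    """Return the estimated probability that `text` belongs to the category."""
    return float(model.predict_proba(vectorizer.transform([text]))[0, 1])
```

A web app built this way would load all six vectorizer/model pairs at startup and call `predict_probability` once per category on the cleaned input comment.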