https://github.com/ritvik19/toxic-comment-classification
- Host: GitHub
- URL: https://github.com/ritvik19/toxic-comment-classification
- Owner: Ritvik19
- Created: 2020-05-10T06:53:40.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2020-07-28T16:48:19.000Z (almost 6 years ago)
- Last Synced: 2025-03-16T19:48:25.314Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 9.7 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
# Toxic-Comment-Classification
Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.
Some characteristics that can signify that a text is toxic:
* Has a non-neutral tone
* Has an exaggerated tone to underscore a point about a group of people
* Is rhetorical and meant to imply a statement about a group of people
* Is disparaging or inflammatory
* Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
* Makes disparaging attacks/insults against a specific person or group of people
* Is based on an outlandish premise about a group of people
* Disparages a characteristic that cannot be changed or measured
* Isn't grounded in reality
* Is based on false information, or contains absurd assumptions
* Uses sexual content (incest, bestiality, pedophilia) for shock value
**Problem Statement:** Build a multi-headed model capable of detecting different types of toxicity, such as threats, obscenity, insults, and identity-based hate.
**Sources:** [Kaggle-Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/) and [Kaggle-Jigsaw Unintended Bias in Toxicity Classification](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/)
**Project Objective:** Build a model that performs advanced sentiment analysis.
___
### Approach Summary
**Performance Measure:** Area Under the Receiver Operating Characteristic curve (AUROC)
**Feature Extraction:** TF-IDF with sublinear term-frequency scaling and smoothed IDF
**Algorithm:** One-vs-Rest (OvR) Logistic Regression
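
The three choices above can be sketched as a single pipeline. This is a minimal illustration, not the notebook's actual code: it assumes scikit-learn, and the toy comments, the two label names, and the `max_iter` setting are hypothetical stand-ins for the Kaggle data and the project's real hyperparameters.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real project uses the Kaggle Jigsaw datasets.
comments = [
    "you are an idiot and a fool",
    "thank you for the helpful edit",
    "i will hurt you, idiot",
    "great article, well sourced",
    "shut up you stupid fool",
    "please see the talk page",
    "i will find you and hurt you",
    "nice work on the references",
]
label_names = ["insult", "threat"]  # assumed subset of the toxicity labels
y = np.array([
    [1, 0],
    [0, 0],
    [1, 1],
    [0, 0],
    [1, 0],
    [0, 0],
    [0, 1],
    [0, 0],
])

model = make_pipeline(
    # sublinear_tf replaces tf with 1 + log(tf); smooth_idf adds one to
    # document frequencies so unseen terms do not cause a division by zero.
    TfidfVectorizer(sublinear_tf=True, smooth_idf=True),
    # One binary logistic regression is fitted per label column.
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(comments, y)

# One probability column per label, in the order of label_names.
probs = model.predict_proba(comments)

# Macro average: AUROC is computed per label, then averaged without weights.
# Scored on the training data here only to keep the sketch short.
mean_auroc = roc_auc_score(y, probs, average="macro")
print(f"mean AUROC on the toy training set: {mean_auroc:.4f}")
```

In practice the score should come from a held-out split or cross-validation rather than from the training comments.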
___
### Performance Summary
Approach | Algorithm | Mean AUROC | Mean Accuracy | Mean F1
:---|:---|---:|---:|---:
Sampled Data | Logistic Regression | 0.9745 | 0.7812 | 0.8926
Sampled Data | Bagging Classifier | 0.9680 | 0.7616 | 0.8793
Complete Data | Logistic Regression | 0.9717 | 0.8687 | 0.9046
Complete Data | Stacking Classifier | 0.9729 | 0.7940 | 0.8903
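
The README does not state how the "Mean" columns are aggregated. The sketch below assumes an unweighted mean of per-label scores and a 0.5 decision threshold for accuracy and F1; both are assumptions, and the arrays are hypothetical values rather than project results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Hypothetical ground truth and predicted probabilities for two labels.
y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1], [0, 0], [1, 0]])
y_prob = np.array([
    [0.9, 0.20],
    [0.1, 0.30],
    [0.8, 0.70],
    [0.4, 0.60],
    [0.3, 0.10],
    [0.6, 0.65],
])

# Assumed threshold: probabilities of 0.5 or more count as positive.
y_pred = (y_prob >= 0.5).astype(int)

per_label = []
for j in range(y_true.shape[1]):
    per_label.append((
        roc_auc_score(y_true[:, j], y_prob[:, j]),  # uses raw probabilities
        accuracy_score(y_true[:, j], y_pred[:, j]),  # uses thresholded labels
        f1_score(y_true[:, j], y_pred[:, j]),
    ))

# Unweighted mean over labels for each metric.
mean_auroc, mean_accuracy, mean_f1 = np.mean(per_label, axis=0)
print(f"AUROC={mean_auroc:.4f}  accuracy={mean_accuracy:.4f}  F1={mean_f1:.4f}")
```

AUROC does not depend on the threshold, while accuracy and F1 do, which is one reason the columns in the table can rank the approaches differently.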