https://github.com/saikat-roy/toxiccomments_kaggle
Solutions to the Toxic Comments on Kaggle
- Host: GitHub
- URL: https://github.com/saikat-roy/toxiccomments_kaggle
- Owner: saikat-roy
- License: apache-2.0
- Created: 2018-09-04T22:33:44.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-09-04T22:56:28.000Z (over 6 years ago)
- Last Synced: 2025-02-15T15:50:26.186Z (3 months ago)
- Topics: kaggle, kaggle-competition, machine-learning, nlp, nlp-machine-learning, python
- Language: Python
- Size: 1.55 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ToxicComments_Kaggle
Solutions to the Kaggle Toxic Comment Classification Challenge (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

## Dependencies
1) Python 3.7
2) NumPy
3) SciPy
4) Pandas
5) scikit-learn

## Data Requirements
Requires "train.csv" and "test.csv" from the Kaggle Toxic Comment Classification Challenge page.

## Basic Methodology
1) Perform basic cleaning: remove the reference text from the comment body and strip non-alphanumeric characters.
2) Vectorize the comments with a bag-of-words (BOW) model or a TF-IDF vectorizer (scikit-learn).
3) Train an individual model for each comment class (toxic, severe_toxic, obscene, threat, insult, identity_hate). Class imbalance is handled during training through weighted updates rather than manual sample balancing.
4) Generate predictions for each class separately, mirroring the training setup.

## Algorithms Implemented with Scores on Kaggle
1) Logistic Regression - CountVectorizer - 0.8917
2) Logistic Regression - TfidfVectorizer - 0.8964
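
As a rough illustration of how the two configurations above could be assembled from the steps under Basic Methodology, here is a minimal sketch. It is not the repository's code. The `clean` regex, the helper names `fit_per_class` and `predict_per_class`, the use of `class_weight="balanced"` as one way to get weighted updates, and the tiny inline data frame standing in for "train.csv" are all assumptions made for the example. Removal of reference text is left out because the README does not specify how it is detected.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# The six label columns named in the methodology.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def clean(text):
    """Assumed cleaning step: collapse non-alphanumeric runs into single spaces."""
    return re.sub(r"[^A-Za-z0-9]+", " ", text).strip()


def fit_per_class(train_df, vectorizer):
    """Fit the vectorizer, then one weighted logistic regression per label."""
    X = vectorizer.fit_transform(train_df["comment_text"].map(clean))
    models = {}
    for label in LABELS:
        # Assumption: class_weight="balanced" reweights the loss by inverse
        # class frequency, standing in for the README's "weighted updates".
        clf = LogisticRegression(class_weight="balanced", max_iter=1000)
        models[label] = clf.fit(X, train_df[label])
    return models


def predict_per_class(models, vectorizer, texts):
    """Return one column of positive-class probabilities per label."""
    X = vectorizer.transform(pd.Series(texts).map(clean))
    return pd.DataFrame({label: m.predict_proba(X)[:, 1] for label, m in models.items()})


# Hypothetical toy data in the layout of train.csv (not real competition data).
toy = pd.DataFrame({
    "comment_text": [
        "you are awful!!", "have a nice day", "I will hurt you",
        "thanks for the edit", "awful awful person", "see the talk page",
    ],
    "toxic":         [1, 0, 1, 0, 1, 0],
    "severe_toxic":  [0, 0, 1, 0, 0, 0],
    "obscene":       [1, 0, 0, 0, 1, 0],
    "threat":        [0, 0, 1, 0, 0, 0],
    "insult":        [1, 0, 0, 0, 1, 0],
    "identity_hate": [0, 0, 0, 0, 1, 0],
})

# Run both vectorizer choices through the same per-class pipeline.
results = {}
for vec in (CountVectorizer(), TfidfVectorizer()):
    models = fit_per_class(toy, vec)
    probs = predict_per_class(models, vec, ["awful comment", "nice edit"])
    results[type(vec).__name__] = probs
    print(type(vec).__name__, probs.shape)
```

With the real data, `toy` would be replaced by `pd.read_csv("train.csv")`, and the probabilities for "test.csv" would be written out as a submission. The scores listed above come from the repository, not from this sketch.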