https://github.com/saikat-roy/toxiccomments_kaggle

Solutions to the Toxic Comments on Kaggle

kaggle kaggle-competition machine-learning nlp nlp-machine-learning python


Host: GitHub
URL: https://github.com/saikat-roy/toxiccomments_kaggle
Owner: saikat-roy
License: apache-2.0
Created: 2018-09-04T22:33:44.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2018-09-04T22:56:28.000Z (over 6 years ago)
Last Synced: 2025-02-15T15:50:26.186Z (3 months ago)
Topics: kaggle, kaggle-competition, machine-learning, nlp, nlp-machine-learning, python
Language: Python
Size: 1.55 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# ToxicComments_Kaggle

Solutions to the Toxic Comment Classification Challenge on Kaggle (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

## Dependencies

1) Python 3.7

2) NumPy

3) SciPy

4) Pandas

5) scikit-learn

## Data Requirements

Requires the "train.csv" and "test.csv" files from the Kaggle Toxic Comment Classification Challenge page.
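
As a minimal sketch of loading the two files with Pandas: the label names come from the methodology below, while the `comment_text` column name and the `load_data` helper are assumptions for illustration and may not match the repository's own scripts.

```python
import pandas as pd

# Label columns as listed in the methodology section.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
# Assumed name of the text column in the competition CSV files.
TEXT_COLUMN = "comment_text"


def load_data(train_path="train.csv", test_path="test.csv"):
    """Read both CSV files and check that the expected columns are present."""
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    missing = [c for c in [TEXT_COLUMN, *LABELS] if c not in train.columns]
    if missing:
        raise ValueError(f"train file is missing columns: {missing}")
    if TEXT_COLUMN not in test.columns:
        raise ValueError(f"test file is missing column: {TEXT_COLUMN}")
    # Replace missing comments with empty strings so vectorizers do not fail.
    train[TEXT_COLUMN] = train[TEXT_COLUMN].fillna("")
    test[TEXT_COLUMN] = test[TEXT_COLUMN].fillna("")
    return train, test
```

Calling `load_data()` with no arguments expects both files in the working directory.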

## Basic Methodology

1) Perform basic cleaning by removing the reference text and non-alphanumeric characters from the comment body.

2) Vectorize the comments using either a bag-of-words (BOW) model or the TF-IDF vectorizer from scikit-learn.

3) Train an individual model for each comment class (toxic, severe_toxic, obscene, threat, insult, identity_hate). Class imbalance is handled during training through weighted updates rather than manual sample balancing.

4) Make predictions for each class separately, as in training.
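
The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: the helper names `clean_comment`, `fit_per_class`, and `predict_per_class` are hypothetical, the removal of reference text is omitted because its exact rule is not described here, and `class_weight="balanced"` is one possible reading of "weighted updates".

```python
import re

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def clean_comment(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def fit_per_class(train_texts, train_labels):
    """Fit one shared vectorizer and one class-weighted model per label.

    train_labels maps each label name to a 0/1 sequence aligned with train_texts.
    """
    # Swap in CountVectorizer here for the bag-of-words variant.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform([clean_comment(t) for t in train_texts])
    models = {}
    for label, y in train_labels.items():
        # Assumption: "balanced" reweights the loss by inverse class frequency,
        # so no manual resampling of the training set is needed.
        model = LogisticRegression(class_weight="balanced", max_iter=1000)
        models[label] = model.fit(X, np.asarray(y))
    return vectorizer, models


def predict_per_class(vectorizer, models, texts):
    """Return, for each label, the predicted probability of the positive class."""
    X = vectorizer.transform([clean_comment(t) for t in texts])
    # With 0/1 targets, column 1 of predict_proba is the positive class.
    return {label: m.predict_proba(X)[:, 1] for label, m in models.items()}
```

Each label gets its own binary classifier over the same feature matrix, so the vectorizer is fitted once and reused for every class at both training and prediction time.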

## Algorithms Implemented with Scores on Kaggle

1) Logistic Regression - CountVectorizer - 0.8917

2) Logistic Regression - TfidfVectorizer - 0.8964
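
The README does not say which metric these scores use. Assuming they are the competition's leaderboard scores, which were computed as the mean column-wise ROC AUC, a local estimate on a held-out split could be sketched as below. The function name `mean_columnwise_auc` is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def mean_columnwise_auc(y_true, y_score):
    """Average the ROC AUC computed separately for each label column.

    Both arguments are 2-D arrays of shape (n_samples, n_labels); every
    column of y_true must contain both classes for the AUC to be defined.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    aucs = [
        roc_auc_score(y_true[:, j], y_score[:, j])
        for j in range(y_true.shape[1])
    ]
    return float(np.mean(aucs))
```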
