Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ranpy13/comment-analysis
A simple natural language processing model that analyzes the toxicity of comments from a database.
Last synced: 6 days ago
- Host: GitHub
- URL: https://github.com/ranpy13/comment-analysis
- Owner: ranpy13
- License: cc0-1.0
- Created: 2024-04-17T10:51:47.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-04-27T19:24:02.000Z (7 months ago)
- Last Synced: 2024-04-28T07:41:20.127Z (7 months ago)
- Language: Python
- Size: 52.6 MB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# comment-analysis
A simple natural language processing model that analyzes the toxicity of comments from a database.

---
## Overview
> The project aims to build a multi-headed model capable of detecting different types of toxicity such as threats, obscenity, insults, and identity-based hate. We use a dataset of comments from Wikipedia's talk page edits, published through Kaggle. Improvements to the current model will hopefully help online discussion become more productive and respectful.

## Installation steps:
* Requirements:
  * Python 3.x
  * basic ML libraries (see *requirements.txt*)
  * dataset: [kaggle](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data?select=train.csv.zip)
* Installation:
```
python -m pip install --upgrade pip
pipenv shell
.\Scripts\activate
pip install -r requirements.txt
```
* Run the base file: `main.py`
```
python main.py
```

---
## Data Preprocessing and Exploratory Data Analysis
* Data Loading
```
import pandas as pd

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
test_y = pd.read_csv("data/test_labels.csv")
```
* Data Analysis

`train.head()`

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |
`test.head()`

| | id | comment_text |
|---|---|---|
| 0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you'll... |
| 1 | 0000247867823ef7 | == From RfC == \n\n The title is fine as it is... |
| 2 | 00013b17ad220c46 | " \n\n == Sources == \n\n * Zawe Ashton on Lap... |
| 3 | 00017563c3f7919a | :If you have a look back at the source, the in... |
| 4 | 00017695ad8997eb | I don't anonymously edit articles at all. |
`test_y.head()`

| | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | -1 | -1 | -1 | -1 | -1 | -1 |
| 1 | 0000247867823ef7 | -1 | -1 | -1 | -1 | -1 | -1 |
| 2 | 00013b17ad220c46 | -1 | -1 | -1 | -1 | -1 | -1 |
| 3 | 00017563c3f7919a | -1 | -1 | -1 | -1 | -1 | -1 |
| 4 | 00017695ad8997eb | -1 | -1 | -1 | -1 | -1 | -1 |
*Notice that the training data contains 159,571 observations with 8 columns and the test data contains 153,164 observations with 2 columns.*
Below is the plot showing comment length frequency. As it shows, most comments are short, with only a few longer than 1,000 words.
![Comment length frequency plot](share/man/man1/image.png)
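A minimal sketch of how such a plot could be produced with pandas and matplotlib (the exact plotting code and bin count are assumptions, not taken from the repository):

```
import matplotlib.pyplot as plt

# Word count per comment (an assumed definition of "comment length")
comment_lengths = train["comment_text"].str.split().str.len()

comment_lengths.plot(kind="hist", bins=50, figsize=(10, 4))
plt.xlabel("Comment length (words)")
plt.ylabel("Frequency")
plt.title("Comment length frequency")
plt.show()
```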
Further exploration shows that the `toxic` label has the most observations in the training dataset, while `threat` has the fewest.
![Training dataset label counts](share/man/man1/image-1.png)
Below is the plot of labeled-data frequency. There is significant class imbalance, since the majority of comments are considered non-toxic.
![Significant class imbalance](share/man/man1/image-2.png)
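Both plots above can be reproduced with a short sketch along these lines (assumed code, reusing the `train` dataframe loaded earlier):

```
import matplotlib.pyplot as plt

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Observations per label in the training set
train[labels].sum().sort_values(ascending=False).plot(kind="bar", figsize=(8, 4))
plt.title("Observations per label")
plt.show()

# Comments with at least one toxicity label vs. completely clean comments
is_toxic = train[labels].sum(axis=1) > 0
is_toxic.value_counts().plot(kind="bar")
plt.xticks([0, 1], ["non-toxic", "toxic"], rotation=0)
plt.title("Class imbalance")
plt.show()
```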
*It is also worth checking which labels are likely to appear together on the same comment.*
![correlation matrix](share/man/man1/image-3.png)
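A sketch of how the label cross-correlation matrix could be computed; seaborn is assumed here purely for illustration:

```
import matplotlib.pyplot as plt
import seaborn as sns

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
corr = train[labels].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap="Blues")
plt.title("Label cross-correlation")
plt.show()
```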
* As seen in the cross-correlation matrix, obscene comments have a high chance of also being insulting.
* To get an idea of which words contribute the most to the different labels, we write a function to generate **word clouds**. The function takes a label parameter (e.g., toxic, insult, threat); a minimal sketch follows the sample image below.
![word-cloud sample](share/man/man1/image-4.png)
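A minimal version of such a word-cloud helper might look like this (the `wordcloud` package and the function name `plot_word_cloud` are assumptions; the repository's own implementation may differ):

```
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_word_cloud(label):
    # Concatenate all comments flagged with the given label
    text = " ".join(train.loc[train[label] == 1, "comment_text"])
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Most frequent words for label: {label}")
    plt.show()

plot_word_cloud("toxic")
```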
---
## Feature Engineering
Before fitting models, we need to break each comment down into *unique words* by *tokenizing* it. In the `tokenize()` function, we remove punctuation and special characters, and we also filter out non-ASCII characters after observing the results of feature engineering. We then *lemmatize* the tokens and drop those with length below 3. Besides lemmatization, we also tried *stemming* but did not get better results.

* Benchmarking different vectorizers
  - We decided to use **TF-IDF** to scale down the impact of tokens that occur very frequently in the corpus and are hence *empirically less informative* than features that occur in a small fraction of the training corpus.
  - Besides TF-IDF, we also tried **CountVectorizer**, but it did not perform as well. The `TfidfVectorizer` is effectively a `CountVectorizer` followed by a `TfidfTransformer`, which *transforms a count matrix into a normalized term-frequency or TF-IDF representation*. Using tf-idf instead of raw token counts down-weights very frequent, uninformative tokens, which is why accuracy improves here.
  - For example: since this corpus consists of Wikipedia talk-page edits, words such as wiki, Wikipedia, edit, and page are very common, but they carry no useful signal for our classification purposes; that is probably why TF-IDF worked better than CountVectorizer.
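A rough sketch of what the tokenizer and vectorizer could look like (NLTK's WordNet lemmatizer and the `max_features` value are assumptions; the actual `tokenize()` in the repository may differ):

```
import re
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()

def tokenize(text):
    # Drop non-ASCII characters, punctuation, and special characters
    text = re.sub(r"[^a-zA-Z]", " ", text.encode("ascii", "ignore").decode())
    # Lemmatize and keep only tokens of length 3 or more
    tokens = [lemmatizer.lemmatize(w) for w in text.lower().split()]
    return [t for t in tokens if len(t) >= 3]

vectorizer = TfidfVectorizer(tokenizer=tokenize, max_features=10000)
X_train = vectorizer.fit_transform(train["comment_text"])
```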
---
## Modeling and Evaluation
### Baseline Model
We choose ***Naive Bayes*** as our baseline model, specifically *Multinomial Naive Bayes*. We also want to *compare different models*, especially models that perform well in text classification, so we compare *Multinomial Naive Bayes with Logistic Regression and Linear Support Vector Machine*.
### Evaluation Metrics
Our main metric for measuring model performance is the **F1-score**; since we have 6 labels, we report the average F1-score across them. We also take other metrics into consideration while evaluating models, e.g., *Hamming loss and recall*.

### Cross Validation
We use **Cross Validation** to compare the baseline model with the other two models we have chosen *(LogisticRegression and LinearSVC)*; a sketch of the setup follows the results table.
| | Model | Label | Recall | F1 |
|---|---|---|---|---|
| 0 | MultinomialNB | toxic | 0.483066 | 0.636650 |
| 1 | MultinomialNB | severe_toxic | 0.021336 | 0.041091 |
| 6 | LogisticRegression | toxic | 0.610500 | 0.731340 |
| 7 | LogisticRegression | severe_toxic | 0.256395 | 0.351711 |
| 12 | LinearSVC | toxic | 0.680528 | 0.759304 |
| 13 | LinearSVC | severe_toxic | 0.267693 | 0.355258 |
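The comparison above could be produced with a setup roughly like the following sketch, reusing `X_train` from the feature-engineering sketch (multilabel handling via `OneVsRestClassifier` and `cv=3` are assumptions, not necessarily what the repository does):

```
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score, recall_score

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y_train = train[labels].values

models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
}

for name, model in models.items():
    # One binary classifier per label, evaluated with cross-validated predictions
    pred = cross_val_predict(OneVsRestClassifier(model), X_train, y_train, cv=3)
    for i, label in enumerate(labels):
        print(name, label,
              "recall:", recall_score(y_train[:, i], pred[:, i]),
              "F1:", f1_score(y_train[:, i], pred[:, i]))
```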
Based on the cross validation above, the *Linear SVC and Logistic Regression models perform better.* As a baseline model, Multinomial Naive Bayes does not perform well, especially for the `threat` and `identity_hate` labels, because these two labels have the fewest observations.
Determining the models' performance on the actual prediction task, the test dataset:
![F1 Score](share/man/man1/image-5.png)
Above are the result table and plot showing a comparison between these models after training.
Notice that Multinomial Naive Bayes does not perform as well as the other two models, while Linear SVC generally outperforms the others based on F1 score.
---
## Visualizing Performance
Visualizing performance so far for each classifier across each category.

### F1 and recall
Multinomial Naive Bayes
![MNB regression](share/man/man1/image-6.png)
Logistic regression
![Logistic](share/man/man1/image-7.png)
Linear SVC
![LSVC](share/man/man1/image-8.png)
### Confusion Matrix
Multinomial Naive Bayes
![MNB regression](share/man/man1/image-9.png)
Logistic regression
![Logistic](share/man/man1/image-10.png)
Linear SVC
![Linear SVC](share/man/man1/image-11.png)
Based on the above comparison, we could say that for these three models with default settings, **LinearSVC performs best for the 'toxic' label.**
### Aggregated Hamming Loss Score
| | Model | Hamming_Loss |
|---|---|---|
| 0 | MultinomialNB | 0.026939 |
| 1 | LogisticRegression | 0.025675 |
| 2 | LinearSVC | 0.028476 |
Across all models, **Logistic Regression** does a great job overall since it has the lowest fraction of incorrect labels.
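For reference, scikit-learn's `hamming_loss` measures the fraction of individual label assignments that are wrong; applied to the cross-validated predictions from the earlier sketch (an assumption, not the repository's exact code), it would look like:

```
from sklearn.metrics import hamming_loss

# pred and y_train come from the cross-validation sketch above
print("Hamming loss:", hamming_loss(y_train, pred))
```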
### Pipelines
Clean up the code with pipelines and use some manually chosen hyperparameters to check how each model behaves.
We manually adjust `class_weight` for the models to determine whether we can achieve better results. Since Logistic Regression and Linear SVM perform better, we focus on these two models. For display purposes, we only include average F1 score, recall, and Hamming loss for comparison; a sketch of such a pipeline follows the results table.
- Notice that after adjusting `class_weight`, we are getting way better results than the basic models. LinearSVC outperforms LogisticRegression by approximately 1%.
| | Model | F1 | Recall | Hamming_Loss | Training_Time |
|---|---|---|---|---|---|
| 0 | LogisticRegression | 0.947921 | 0.934050 | 0.065950 | 2.137849 |
| 1 | LinearSVC | 0.951508 | 0.941634 | 0.058366 | 7.478050 |
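A minimal sketch of the kind of pipeline used here, reusing the `tokenize` function and `labels` list from earlier sketches (the exact hyperparameters and `class_weight` values are assumptions):

```
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

svc_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=tokenize, max_features=10000)),
    # class_weight="balanced" compensates for the heavy class imbalance
    ("clf", OneVsRestClassifier(LinearSVC(class_weight="balanced", C=1.0))),
])

svc_pipeline.fit(train["comment_text"], train[labels])
pred = svc_pipeline.predict(test["comment_text"])
```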
---
## Hyperparameter Tuning with Grid Search
We run a grid search to find the *"optimal"* hyperparameters for the basic models we have chosen, and later compare the best model from each algorithm. Since we have 6 different labels, tuning models for each label would be too time-consuming, so we use the most common label, **"toxic"**, to tune hyperparameters.

### Model Selection
We then compare these two models based on their tuned hyperparameters, also including training time as one of the metrics; a sketch of the grid search follows the results table.
| | Model | F1 | Recall | Hamming_Loss | Training_Time |
|---|---|---|---|---|---|
| 0 | LinearSVC | 0.971706 | 0.971524 | 0.028476 | 5.029654 |
| 1 | LogisticRegression | 0.973227 | 0.974330 | 0.025670 | 13.031119 |
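A rough sketch of the grid search on the "toxic" label, reusing `X_train` from the feature-engineering sketch (the parameter grids shown are assumptions, not the values actually searched in the repository):

```
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

y_toxic = train["toxic"]

svc_search = GridSearchCV(
    LinearSVC(class_weight="balanced"),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="f1",
    cv=3,
)
svc_search.fit(X_train, y_toxic)

lr_search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0], "solver": ["liblinear", "lbfgs"]},
    scoring="f1",
    cv=3,
)
lr_search.fit(X_train, y_toxic)

print(svc_search.best_params_, svc_search.best_score_)
print(lr_search.best_params_, lr_search.best_score_)
```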
## Ensembling
Ensemble learning helps improve machine learning results by combining several models, generally giving better predictive performance than any single model.
> Testing whether ensembling helps us achieve better results.

To ensemble different models, we first tried some tree-boosting models, then used a voting classifier to combine one of the boosting models with the basic models from the previous parts.
### Boosting Models
Comparing 3 popular boosting models:
- Adaboost
- GradientBoost
- XGBoost

Scores after boosting the models:
| | Model | F1 | Recall | Hamming_Loss | Training_Time |
|---|---|---|---|---|---|
| 0 | AdaBoostClassifier | 0.967605 | 0.969771 | 0.030229 | 50.761416 |
| 1 | GradientBoostingClassifier | 0.969075 | 0.971748 | 0.028252 | 204.453572 |
| 2 | XGBClassifier | 0.967563 | 0.971790 | 0.028210 | 68.613414 |
Since *gradient boosting outperforms the other two* boosting models, we decide to go ahead with ***gradient boosting***.
### Voting Classifier
The ensembled model worked very well but still ***could not outperform*** LinearSVC, since we did not tune the hyperparameters of the ensembled model. A sketch of the voting setup follows the results table.
| | Model | F1 | Recall | Hamming_Loss | Training_Time |
|---|---|---|---|---|---|
| 0 | Ensemble | 0.973026 | 0.974119 | 0.025881 | 64.728463 |
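A minimal sketch of such a voting ensemble, trained per label and reusing `vectorizer`, `X_train`, and `labels` from earlier sketches (hard voting and this particular combination of estimators are assumptions, not necessarily the repository's setup):

```
from sklearn.ensemble import VotingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X_test = vectorizer.transform(test["comment_text"])
ensemble_preds = {}

for label in labels:
    voting = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
            ("svc", LinearSVC(class_weight="balanced")),
            ("gb", GradientBoostingClassifier()),
        ],
        # Hard voting: LinearSVC has no predict_proba, so soft voting is not available
        voting="hard",
    )
    voting.fit(X_train, train[label])
    ensemble_preds[label] = voting.predict(X_test)
```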
## Result Interpretation
* What went wrong:
- Analyzing the words misclassified by the Logistic classifier, checking for the 'toxic' label.
- 1,347 comments that were actually toxic were misclassified as non-toxic.
- ![wordcloud 2](share/man/man1/image-12.png)
- We want to analyze why the model couldn't recognize these words. Were they not present in the training set?
- To analyze this, we first pass these raw comment strings through the same tokenizer and check the common tokens (a sketch of this check follows the list below).
* `ucking` is a common token in the test set, and it seems our classifier hasn't learnt to classify it as toxic.
* This token wasn't common in our training set. That explains why our model couldn't learn it.
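One way to perform this check, sketched here under the assumption that `pred` holds the multilabel test-set predictions of the classifier being analyzed and `tokenize` is the function from the feature-engineering sketch:

```
import numpy as np
from collections import Counter

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
toxic_col = labels.index("toxic")

# Rows labeled toxic in the test set but predicted non-toxic
mask = (test_y["toxic"].values == 1) & (pred[:, toxic_col] == 0)
missed = test.loc[mask, "comment_text"]

# Re-tokenize the misclassified comments and count the most common tokens
token_counts = Counter(t for comment in missed for t in tokenize(comment))
print(token_counts.most_common(20))
```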
### Learning Curve Visualization
Plot the learning curve for the estimator, with the training data as X and the labels as y.

![graph23](share/man/man1/image-13.png)
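A sketch using scikit-learn's `learning_curve` utility, reusing `X_train` (the estimator and parameter choices shown are assumptions):

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.svm import LinearSVC

train_sizes, train_scores, val_scores = learning_curve(
    LinearSVC(class_weight="balanced"),
    X_train, train["toxic"],
    cv=3, scoring="f1",
    train_sizes=np.linspace(0.1, 1.0, 5),
)

plt.plot(train_sizes, train_scores.mean(axis=1), "o-", label="Training score")
plt.plot(train_sizes, val_scores.mean(axis=1), "o-", label="Cross-validation score")
plt.xlabel("Training examples")
plt.ylabel("F1 score")
plt.title("Learning curve (LinearSVC, toxic label)")
plt.legend()
plt.show()
```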
---
## Future Improvements
- Try more ways of vectorizing text data.
- Go deeper on feature engineering: spelling correction, sentiment scores, n-grams, etc.
- Advanced models (e.g., lightgbm).
- Advanced Ensemble model (e.g., stacking).
- Deep learning model (e.g., LSTM).
- Advanced hyperparameter tuning techniques (e.g., Bayesian Optimization).

---
## References:
1. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
2. https://github.com/nicknochnack/CommentToxicity
3. https://youtu.be/ZUqB-luawZg?feature=shared