https://github.com/saba-gul/corona-tweet-sentiment-classification

This project aims to classify the sentiment of tweets related to the coronavirus pandemic.
https://github.com/saba-gul/corona-tweet-sentiment-classification

covid-19 covid19-data embeddings-word2vec ensemble-machine-learning nlp-machine-learning nltk-python sentiment-analysis sentiment-classification tweets-classification vectorization

Last synced: 3 months ago
JSON representation

This project aims to classify the sentiment of tweets related to the coronavirus pandemic.

Host: GitHub
URL: https://github.com/saba-gul/corona-tweet-sentiment-classification
Owner: Saba-Gul
License: mit
Created: 2024-07-16T23:46:33.000Z (11 months ago)
Default Branch: main
Last Pushed: 2024-07-17T00:51:58.000Z (11 months ago)
Last Synced: 2025-01-13T16:28:18.822Z (5 months ago)
Topics: covid-19, covid19-data, embeddings-word2vec, ensemble-machine-learning, nlp-machine-learning, nltk-python, sentiment-analysis, sentiment-classification, tweets-classification, vectorization
Language: Jupyter Notebook
Homepage:
Size: 5.34 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # Corona Tweet Sentiment Classification

## Table of Contents

- [Introduction](#introduction)

- [Dataset](#dataset)

- [Preprocessing Steps](#preprocessing-steps)

- [Models and Techniques](#models-and-techniques)

- [Evaluation](#evaluation)

- [Results](#results)

- [Conclusion](#conclusion)



    

  



## Introduction

The Corona Tweet Sentiment Classification project aims to classify tweets related to COVID-19 into five sentiment categories:

- 0: Neutral

- 1: Positive

- 2: Extremely Positive

- -2: Extremely Negative

- -1: Negative

  

## Dataset

The dataset contains tweets related to COVID-19 along with their corresponding sentiment labels. The tweets have been cleaned and preprocessed to remove unwanted characters, URLs, punctuation, and stop words. The dataset is split into training and testing sets for model evaluation. [COVID-19 NLP Text Classification Dataset](https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification/data)

  





    





    

  



## Preprocessing Steps

1. **Text Cleaning:** Removing URLs, punctuations, and other unwanted characters.

2. **Lowercasing:** Converting all text to lowercase.

3. **Tokenization:** Splitting text into tokens (words).

4. **Removing Stop Words:** Eliminating common words that do not add significant meaning.

5. **Stemming/Lemmatization:** Reducing words to their root form.

6. **Vectorization**

## Models and Techniques

### Vectorization Techniques

- **TF-IDF (Term Frequency-Inverse Document Frequency):** Converts text data into numerical representation by considering the frequency of words and their importance in the document.

- **Word2Vec:** A neural network-based embedding technique that represents words in continuous vector space.

### Classifiers

- **Logistic Regression:** A linear model for binary and multiclass classification tasks.

- **Naive Bayes:** A probabilistic classifier based on Bayes' theorem with strong independence assumptions.

- **Support Vector Machines (SVM):** A classifier that finds the hyperplane that best separates different classes.

- **Random Forest:** An ensemble learning method that constructs multiple decision trees.

- **Ensemble Voting Classifier:** Combines predictions from multiple models to improve performance.

## Evaluation

The models are evaluated using accuracy, precision, recall, F1 score, and confusion matrix.

## Results

The performance of the models using different feature extraction techniques is summarized below:

| Model                      | Vectorization | Accuracy | Precision | Recall  | F1 Score |

|----------------------------|---------------|----------|-----------|---------|----------|

| Logistic Regression        | TF-IDF        | 0.5743   | 0.5472    | 0.5391  | 0.5403   |

| Naive Bayes                | TF-IDF        | 0.3746   | 0.6244    | 0.3661  | 0.2732   |

| SVM                        | TF-IDF        | 0.6229   | 0.6309    | 0.6229  | 0.6241   |

| Random Forest              | TF-IDF        | 0.5279   | 0.5509    | 0.5279  | 0.5204   |

| Voting Classifier          | TF-IDF        | 0.6259   | 0.6347    | 0.6259  | 0.6258   |

| Logistic Regression        | N-Grams       | 0.5391   | 0.5472    | 0.5391  | 0.5403   |

| Naive Bayes                | N-Grams       | 0.3661   | 0.6244    | 0.3661  | 0.2732   |

| Voting Classifier          | N-Grams       | 0.5328   | 0.5472    | 0.5391  | 0.5403   |

| Logistic Regression        | Word2Vec      | 0.5328   | 0.5472    | 0.5391  | 0.5403   |

| Naive Bayes                | Word2Vec      | 0.3661   | 0.6244    | 0.3661  | 0.2732   |

| Voting Classifier          | Word2Vec      | 0.5328   | 0.5472    | 0.5391  | 0.5403   |

### Evaluation Metrics for Best Performing Models (TF-IDF):

#### SVM:

- Accuracy: 0.6229

- Precision: Varies across classes

- Recall: Varies across classes

- F1 Score: Varies across classes

#### Ensemble Voting Classifier:

- Accuracy: 0.6259 

- Precision: Varies across classes

- Recall: Varies across classes

- F1 Score: Varies across classes

## Conclusion

The Voting Classifier with TF-IDF vectorization achieved the highest accuracy (0.6259) and F1 Score (0.6258) among the models tested. The SVM with TF-IDF also showed strong performance with an accuracy of 0.6229 and F1 Score of 0.6241. The Random Forest classifier with TF-IDF performed less effectively compared to the other models, with an accuracy of 0.5279 and F1 Score of 0.5204. Combining multiple models in the Voting Classifier proved to be beneficial for improving classification performance.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/saba-gul/corona-tweet-sentiment-classification

Awesome Lists containing this project

README