https://github.com/recker-dev/exploring-nlp
This repo explores various approaches to sentiment analysis using the Amazon Customer Review dataset.
- Host: GitHub
- URL: https://github.com/recker-dev/exploring-nlp
- Owner: Recker-Dev
- Created: 2025-01-01T06:45:51.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-01-01T06:49:03.000Z (9 months ago)
- Last Synced: 2025-02-24T09:19:21.600Z (7 months ago)
- Topics: bagofwords, distilbert, huggingface, machine-learning, nlp-machine-learning, tfidf, word2vec
- Language: Jupyter Notebook
- Homepage:
- Size: 19.2 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.MD
# Sentiment Analysis Project with Text Preprocessing Techniques
## Project Overview
This project explores various **text preprocessing techniques** for sentiment analysis using a dataset of customer reviews. The goal is to evaluate how well different approaches classify the sentiment of text data. The techniques evaluated include:

1. **TF-IDF and Bag-of-Words (BoW)**: Traditional text representation methods that convert text into numerical features. TF-IDF weighs word frequencies by how rare each word is across documents, while BoW represents text as raw word counts.
2. **Word2Vec and Average Word2Vec**: Word2Vec generates dense vector representations of words, capturing their semantic relationships. Average Word2Vec aggregates the word vectors of a review into a single vector representing the whole review.
3. **Fine-Tuning with DistilBERT**: A pre-trained transformer model, DistilBERT, is fine-tuned for sentiment classification to leverage powerful contextual representations of text.

## Findings
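For reference, the two classical featurizations and the Word2Vec averaging step can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the corpus and the hand-made embedding table are toy stand-ins for the Amazon review data and a trained Word2Vec model.

```python
import math

# Toy corpus standing in for the review dataset (illustrative only).
reviews = [
    "great product works great",
    "terrible product broke fast",
    "works well great value",
]

# --- Bag-of-Words: each review becomes a vector of raw word counts. ---
vocab = sorted({w for r in reviews for w in r.split()})

def bow_vector(review):
    words = review.split()
    return [words.count(w) for w in vocab]

# --- TF-IDF: scale term frequency by how rare the word is overall. ---
def idf(word):
    df = sum(1 for r in reviews if word in r.split())
    return math.log(len(reviews) / df)

def tfidf_vector(review):
    words = review.split()
    return [(words.count(w) / len(words)) * idf(w) for w in vocab]

# --- Average Word2Vec: mean of the per-word embedding vectors. ---
# A real pipeline would load trained embeddings (e.g. via gensim);
# this tiny hand-made table keeps the sketch self-contained.
embeddings = {
    "great": [0.9, 0.1], "well": [0.8, 0.2], "value": [0.7, 0.3],
    "terrible": [0.1, 0.9], "broke": [0.2, 0.8],
}

def avg_word2vec(review, dim=2):
    vecs = [embeddings[w] for w in review.split() if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

Any off-the-shelf classifier (e.g. logistic regression) can then be trained on these vectors; the findings below compare how the representations fare.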
### TF-IDF and BoW:
- **Accuracy**: Lower accuracy in sentiment classification compared to the other methods.
- **Analysis**: While simple to implement, these techniques fail to capture the semantic meaning of words, which limits their performance on sentiment analysis tasks.

### Word2Vec and Average Word2Vec:
- **Accuracy**: Significantly better performance than TF-IDF and BoW, demonstrating the benefit of capturing semantic relationships between words.
- **Analysis**: This approach performs better because word embeddings and similarity search provide richer representations of the text.

### Fine-Tuning with DistilBERT:
- **Accuracy**: Achieved the highest accuracy, outperforming TF-IDF, BoW, and Word2Vec alike.
- **Analysis**: This approach showcases the power of pre-trained language models like DistilBERT, leveraging contextual understanding of text to achieve superior performance.

## Recommendations
Based on the findings, we recommend the following:

- **Word2Vec or similar word-embedding techniques**: For sentiment analysis tasks on similar datasets, word embeddings like Word2Vec offer a good balance of performance and simplicity, capturing semantic meaning effectively.
- **Fine-tuning a pre-trained language model**: Fine-tuning models like **DistilBERT** is likely to yield the best results, especially for larger datasets or more complex sentiment analysis tasks, due to their ability to capture contextual relationships in text.

## Notes
- The specific performance of each technique may vary depending on the dataset and machine learning model used.
- Experimentation with different techniques, hyperparameters, and fine-tuning strategies is essential to achieving optimal results in sentiment analysis tasks.
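The fine-tuning recommendation can be sketched with the Hugging Face `transformers` Trainer API. This is a minimal sketch, not the project's actual training code: the `"text"`/`"label"` column names, hyperparameters, and output directory are all assumptions, and the datasets are expected to be Hugging Face `datasets` objects.

```python
def build_distilbert_trainer(train_dataset, eval_dataset,
                             model_name="distilbert-base-uncased",
                             output_dir="./sentiment-model"):
    """Build a Trainer that fine-tunes DistilBERT for binary sentiment
    classification. Imports are local so the sketch can be defined even
    where the heavy transformers/torch dependencies are not installed.
    """
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # num_labels=2 assumes a positive/negative sentiment task.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)

    def tokenize(batch):
        # Assumes the raw review text lives in a "text" column.
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=256)

    train_dataset = train_dataset.map(tokenize, batched=True)
    eval_dataset = eval_dataset.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=2,           # illustrative hyperparameters
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    )
    return Trainer(model=model, args=args,
                   train_dataset=train_dataset,
                   eval_dataset=eval_dataset)
```

Calling `build_distilbert_trainer(...).train()` would then run the fine-tuning loop; exact hyperparameters should be tuned per the note above.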