https://github.com/recker-dev/exploring-nlp
This repo explores various approaches to sentiment analysis using the Amazon Customer Review dataset.
- Host: GitHub
- URL: https://github.com/recker-dev/exploring-nlp
- Owner: Recker-Dev
- Created: 2025-01-01T06:45:51.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-01-01T06:49:03.000Z (9 months ago)
- Last Synced: 2025-02-24T09:19:21.600Z (7 months ago)
- Topics: bagofwords, distilbert, huggingface, machine-learning, nlp-machine-learning, tfidf, word2vec
- Language: Jupyter Notebook
- Homepage:
- Size: 19.2 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.MD
# Sentiment Analysis Project with Text Preprocessing Techniques
## Project Overview
This project explores various **text preprocessing techniques** for sentiment analysis using a dataset of customer reviews. The goal is to evaluate how well different approaches classify the sentiment of text data. The techniques evaluated include:

1. **TF-IDF and Bag-of-Words (BoW)**: Traditional text representation methods that convert text into numerical features. TF-IDF weighs word frequencies by how rare each word is across documents, while BoW represents text as raw word counts.
2. **Word2Vec and Average Word2Vec**: Word2Vec generates dense vector representations of words, capturing their semantic relationships. Average Word2Vec aggregates the word vectors of a review into a single vector representing the whole review.
3. **Fine-Tuning with DistilBERT**: A pre-trained transformer model, DistilBERT, is fine-tuned for sentiment classification to leverage powerful contextual representations of text.

## Findings
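For reference, the two classical featurizations and the Word2Vec averaging step can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the corpus and the hand-made embedding table are toy stand-ins for the Amazon review data and a trained Word2Vec model.

```python
import math

# Toy corpus standing in for the review dataset (illustrative only).
reviews = [
    "great product works great",
    "terrible product broke fast",
    "works well great value",
]

# --- Bag-of-Words: each review becomes a vector of raw word counts. ---
vocab = sorted({w for r in reviews for w in r.split()})

def bow_vector(review):
    words = review.split()
    return [words.count(w) for w in vocab]

# --- TF-IDF: scale term frequency by how rare the word is overall. ---
def idf(word):
    df = sum(1 for r in reviews if word in r.split())
    return math.log(len(reviews) / df)

def tfidf_vector(review):
    words = review.split()
    return [(words.count(w) / len(words)) * idf(w) for w in vocab]

# --- Average Word2Vec: mean of the per-word embedding vectors. ---
# A real pipeline would load trained embeddings (e.g. via gensim);
# this tiny hand-made table keeps the sketch self-contained.
embeddings = {
    "great": [0.9, 0.1], "well": [0.8, 0.2], "value": [0.7, 0.3],
    "terrible": [0.1, 0.9], "broke": [0.2, 0.8],
}

def avg_word2vec(review, dim=2):
    vecs = [embeddings[w] for w in review.split() if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

Any off-the-shelf classifier (e.g. logistic regression) can then be trained on these vectors; the findings below compare how the representations fare.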
### TF-IDF and BoW:
- **Accuracy**: Lower accuracy in sentiment classification compared to the other methods.
- **Analysis**: While simple to implement, these techniques fail to capture the semantic meaning of words, which limits their performance on sentiment analysis tasks.

### Word2Vec and Average Word2Vec:
- **Accuracy**: Significantly better performance than TF-IDF and BoW, demonstrating the benefit of capturing semantic relationships between words.
- **Analysis**: This approach performs better because word embeddings and similarity search provide richer representations of the text.

### Fine-Tuning with DistilBERT:
- **Accuracy**: Achieved the highest accuracy, outperforming TF-IDF, BoW, and Word2Vec alike.
- **Analysis**: This approach showcases the power of pre-trained language models like DistilBERT, leveraging contextual understanding of text to achieve superior performance.

## Recommendations
Based on the findings, we recommend the following:

- **Word2Vec or similar word-embedding techniques**: For sentiment analysis tasks on similar datasets, word embeddings like Word2Vec offer a good balance of performance and simplicity, capturing semantic meaning effectively.
- **Fine-tuning a pre-trained language model**: Fine-tuning models like **DistilBERT** is likely to yield the best results, especially for larger datasets or more complex sentiment analysis tasks, due to their ability to capture contextual relationships in text.

## Notes
- The specific performance of each technique may vary depending on the dataset and machine learning model used.
- Experimentation with different techniques, hyperparameters, and fine-tuning strategies is essential to achieving optimal results in sentiment analysis tasks.
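The fine-tuning recommendation can be sketched with the Hugging Face `transformers` Trainer API. This is a minimal sketch, not the project's actual training code: the `"text"`/`"label"` column names, hyperparameters, and output directory are all assumptions, and the datasets are expected to be Hugging Face `datasets` objects.

```python
def build_distilbert_trainer(train_dataset, eval_dataset,
                             model_name="distilbert-base-uncased",
                             output_dir="./sentiment-model"):
    """Build a Trainer that fine-tunes DistilBERT for binary sentiment
    classification. Imports are local so the sketch can be defined even
    where the heavy transformers/torch dependencies are not installed.
    """
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # num_labels=2 assumes a positive/negative sentiment task.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)

    def tokenize(batch):
        # Assumes the raw review text lives in a "text" column.
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=256)

    train_dataset = train_dataset.map(tokenize, batched=True)
    eval_dataset = eval_dataset.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=2,           # illustrative hyperparameters
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    )
    return Trainer(model=model, args=args,
                   train_dataset=train_dataset,
                   eval_dataset=eval_dataset)
```

Calling `build_distilbert_trainer(...).train()` would then run the fine-tuning loop; exact hyperparameters should be tuned per the note above.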