https://github.com/davidogalo/twitter-sentiment-analysis
Developed a sentiment analysis model to measure tweet positivity across regions using advanced NLP techniques. This project involved data preprocessing, feature engineering with TF-IDF and Doc2Vec, and training supervised machine learning models. Performance was validated using cross-validation and metrics like accuracy and precision.
- Host: GitHub
- URL: https://github.com/davidogalo/twitter-sentiment-analysis
- Owner: DavidOgalo
- Created: 2024-04-02T17:07:08.000Z
- Default Branch: main
- Last Pushed: 2024-06-05T12:28:43.000Z
- Last Synced: 2024-06-05T14:11:13.345Z
- Topics: cross-validation, data-preprocessing, feature-engineering, machine-learning, model-evaluation, model-training-and-tuning, natural-language-processing, performance-metrics
- Language: Jupyter Notebook
- Homepage:
- Size: 1.13 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
## Sentiment Analysis on Social Media Data (Twitter)
Description
Conceptualized and developed a sentiment analysis model to quantify the positivity of tweets across diverse geographic regions. Leveraged advanced Natural Language Processing (NLP) techniques, including count vectorization, TF-IDF, and Doc2Vec, to extract meaningful insights from unstructured text data. This project involved extensive data handling and pre-processing, sophisticated machine learning algorithms, and rigorous model evaluation and validation to ensure robust and reliable performance.
Key Concepts
Data Handling and Pre-processing
> - Data Cleaning: Processed unstructured text data to handle missing values and duplicates, ensuring high-quality input for model training.
> - Feature Engineering: Utilized count vectorization, TF-IDF, and Doc2Vec to create meaningful features from raw text data, enhancing the model's ability to understand sentiment.
> - Data Visualization: Used libraries like Seaborn and Matplotlib to visualize sentiment distribution across regions, helping to identify patterns and trends in the data.
Machine Learning Algorithms
> - Supervised Learning: Trained the sentiment analysis model using supervised learning techniques on labeled tweet data, focusing on accurately classifying sentiment.
> - Unsupervised Learning: Applied clustering methods to explore patterns in the sentiment data, providing additional insight into its structure.
Natural Language Processing (NLP)
> - Text Pre-processing: Implemented tokenization, stemming, and lemmatization using NLTK to standardize and clean the text data, making it suitable for analysis.
> - NLP Models: Leveraged advanced models like Doc2Vec for feature extraction, capturing semantic meaning from the text data.
> - Libraries: Utilized NLTK and Gensim for various NLP tasks, ensuring robust and efficient text processing.
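As a rough illustration of the text pre-processing described above, the sketch below tokenizes and stems a tweet with NLTK. The `preprocess` helper and the sample tweet are hypothetical. Lemmatization is left out on purpose, because NLTK's `WordNetLemmatizer` needs the WordNet corpus to be downloaded first, whereas `TweetTokenizer` and `PorterStemmer` need no extra data.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# TweetTokenizer is aware of Twitter conventions: it can lowercase text,
# drop @handles, and shorten exaggerated character runs ("sooooo" -> "sooo").
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
stemmer = PorterStemmer()


def preprocess(text):
    """Tokenize a tweet, keep alphabetic tokens only, and stem each token."""
    tokens = tokenizer.tokenize(text)
    return [stemmer.stem(tok) for tok in tokens if tok.isalpha()]


tokens = preprocess("@alice I am loving these sunny days")
print(tokens)
```

Stemming maps inflected forms such as "loving" and "days" to shorter stems ("love", "day"), which reduces the vocabulary the vectorizers have to handle.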
Model Evaluation and Validation
> - Metrics: Assessed model performance using metrics such as accuracy, precision, recall, and F1 score to ensure a comprehensive evaluation.
> - Cross-Validation: Conducted k-fold cross-validation to validate model stability and robustness, ensuring the model generalizes well to unseen data.
> - A/B Testing: Performed A/B testing to evaluate model changes and improvements, ensuring continuous enhancement of model performance.
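To make the metrics above concrete, here is a small worked example using scikit-learn's metric functions. The label vectors are made up for illustration and are not results from this project.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = positive tweet, 0 = negative tweet.
# Confusion counts for this data: TP = 2, FN = 2, FP = 1, TN = 3.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 5 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)   = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)   = 2 / 4
f1 = f1_score(y_true, y_pred)                # 2PR / (P + R)    = 4 / 7

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# prints: 0.625 0.667 0.5 0.571
```

The four numbers differ noticeably on the same predictions, which is why reporting accuracy alone can hide a model that misses many positive tweets.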
Technologies (Tools and Libraries)
- Python==3.6: Primary programming language used for the project.
- NLTK==3.4.5: Used for text preprocessing tasks such as tokenization, stemming, and lemmatization.
- Gensim==3.8.3: Employed for advanced NLP tasks including the implementation of Doc2Vec.
- Matplotlib==3.2.1: Utilized for data visualization to explore and understand sentiment distributions.
- Seaborn==0.10.1: Enhanced data visualization capabilities for better presentation of sentiment analysis results.
- scikit-learn==0.21.3: Used for machine learning model training and evaluation.
Project Breakdown
Part 1: Data Collection and Pre-processing
> - Data Collection: Gathered tweets using the Twitter API, ensuring a diverse dataset across various geographic regions. Also used a sample dataset from Kaggle containing tweets extracted using the Twitter API.
> - Data Cleaning: Processed the raw tweet data to handle missing values, duplicates, and irrelevant content.
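The cleaning step described above can be sketched with the Python standard library alone. The `clean_tweets` helper, the regular expressions, and the sample tweets are assumptions made for illustration; the notebook's actual cleaning rules may differ (for example, in what counts as irrelevant content).

```python
import re

URL_RE = re.compile(r"https?://\S+")   # links
MENTION_RE = re.compile(r"@\w+")       # @handles
WS_RE = re.compile(r"\s+")             # runs of whitespace


def clean_tweets(tweets):
    """Drop missing or empty tweets, strip URLs and @mentions, remove duplicates."""
    seen = set()
    cleaned = []
    for text in tweets:
        if text is None:
            continue  # missing value
        text = URL_RE.sub(" ", text)
        text = MENTION_RE.sub(" ", text)
        text = WS_RE.sub(" ", text).strip().lower()
        if not text or text in seen:
            continue  # empty after cleaning, or a duplicate
        seen.add(text)
        cleaned.append(text)
    return cleaned


raw = [
    "Loving the weather today! http://t.co/abc",
    None,
    "@bob Loving the weather today!",
    "Worst commute ever",
    "   ",
]
result = clean_tweets(raw)
print(result)
# prints: ['loving the weather today!', 'worst commute ever']
```

Duplicates are detected after URLs and mentions are removed, so two tweets that differ only in a link or a handle count as the same text.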
Part 2: Feature Engineering
> - Count Vectorization: Transformed text data into numerical vectors using count vectorization.
> - TF-IDF: Applied Term Frequency-Inverse Document Frequency to weigh the importance of words in the dataset.
> - Doc2Vec: Used Doc2Vec to capture the semantic meaning of tweets, enhancing feature representation.
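For the first two techniques above, a minimal sketch with scikit-learn's `CountVectorizer` and `TfidfVectorizer` is shown below. The three sample tweets are made up, and default vectorizer settings are assumed since the project's actual settings are not stated here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up sample tweets.
tweets = [
    "good morning sunshine",
    "bad traffic this morning",
    "good coffee good day",
]

# Count vectorization: one column per vocabulary term, raw term counts as values.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(tweets)

# TF-IDF: the same layout, but terms that occur in many tweets are down-weighted
# and each row is L2-normalized (scikit-learn's default).
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(tweets)

good_col = count_vec.vocabulary_["good"]
print(counts.shape)           # (3, 8): 3 tweets, 8 distinct terms
print(counts[2, good_col])    # 2: "good" occurs twice in the third tweet

# In the first tweet, "sunshine" (1 tweet) outweighs "morning" (2 tweets).
vocab = tfidf_vec.vocabulary_
print(tfidf[0, vocab["sunshine"]] > tfidf[0, vocab["morning"]])  # True
```

Both vectorizers return sparse matrices, which keeps memory use manageable when the vocabulary grows to tens of thousands of terms.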
Part 3: Model Training and Tuning
> - Supervised Learning: Trained a sentiment analysis model using labeled data, employing algorithms like logistic regression and support vector machines.
> - Hyperparameter Tuning: Optimized model parameters to improve performance using techniques like grid search.
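Training and tuning as described above might look like the following sketch, which wraps TF-IDF features and logistic regression in a scikit-learn `Pipeline` and tunes the regularization strength `C` with `GridSearchCV`. The tiny labeled dataset, the parameter grid, and the choice of 3 folds are illustrative assumptions, not the project's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up labeled tweets: 1 = positive, 0 = negative.
texts = [
    "love this great day", "what a wonderful happy morning",
    "great food and great friends", "happy with the lovely weather",
    "really love this wonderful song", "such a great happy team",
    "hate this awful day", "what a terrible sad morning",
    "awful food and terrible service", "sad about the horrible weather",
    "really hate this horrible song", "such an awful sad result",
]
labels = [1] * 6 + [0] * 6

# Putting the vectorizer inside the pipeline means it is re-fitted on each
# training fold, so no vocabulary information leaks from validation folds.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Grid search tries every value of C and keeps the one with the best
# mean cross-validated accuracy.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

predictions = search.predict(["what a great day", "what a terrible day"])
print(search.best_params_)
print(predictions)
```

Swapping `LogisticRegression` for a support vector machine such as `sklearn.svm.LinearSVC` only requires changing the `clf` step, since both accept a `C` parameter.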
Part 4: Model Evaluation and Validation
> - Metrics: Evaluated model performance using accuracy, precision, recall, and F1 score.
> - Cross-Validation: Conducted k-fold cross-validation to ensure model robustness and generalizability.
> - A/B Testing: Implemented A/B testing to compare different model versions and select the best-performing model.
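A hedged sketch of k-fold cross-validation used to compare two candidate models on the same folds follows. It is an offline comparison on made-up data, not a live A/B test, and the candidate names, dataset, and fold count are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up labeled tweets: 1 = positive, 0 = negative.
texts = [
    "love this great day", "what a wonderful happy morning",
    "great food and great friends", "happy with the lovely weather",
    "really love this wonderful song", "such a great happy team",
    "hate this awful day", "what a terrible sad morning",
    "awful food and terrible service", "sad about the horrible weather",
    "really hate this horrible song", "such an awful sad result",
]
labels = [1] * 6 + [0] * 6

# Two hypothetical model versions to compare.
candidates = {
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(solver="liblinear")
    ),
    "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

# Stratified folds keep the positive/negative ratio equal in every fold;
# a fixed random_state gives both models exactly the same splits.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, texts, labels, cv=folds, scoring="accuracy")
    results[name] = scores
    print(name, scores.mean())
```

Reporting the per-fold scores alongside the mean shows how much performance varies between splits, which is the stability that k-fold cross-validation is meant to expose.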
Getting Started
- Clone the Repository
- Install Dependencies: Manually install the tools and libraries listed in the Technologies section, using the versions specified there.
- Dataset: Collect tweets using the Twitter API, or download the Sentiment140 sample dataset from Kaggle (https://www.kaggle.com/datasets/kazanova/sentiment140), and place the data in the designated directory.
- Run the Preprocessing Script: Preprocess the tweets using the provided scripts to clean and standardize the data.
- Feature Engineering: Execute the feature engineering scripts to transform the text data into numerical features.
- Train the Model: Use the training scripts to build and optimize the sentiment analysis model.
- Evaluate the Model: Run the evaluation scripts to assess the model performance using various metrics and validation techniques.
Maintainers and Contributors
Maintainer: David Ogalo
Contributors: Contributions are welcome. Please reach out for the contribution guidelines for this project.