https://github.com/davidogalo/twitter-sentiment-analysis

Developed a sentiment analysis model to measure tweet positivity across regions using advanced NLP techniques. This project involved data preprocessing, feature engineering with TF-IDF and Doc2Vec, and training supervised machine learning models. Performance was validated using cross-validation and metrics like accuracy and precision.

cross-validation data-preprocessing feature-engineering machine-learning model-evaluation model-training-and-tuning natural-language-processing performance-metrics


Host: GitHub
URL: https://github.com/davidogalo/twitter-sentiment-analysis
Owner: DavidOgalo
Created: 2024-04-02T17:07:08.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-06-05T12:28:43.000Z (about 2 years ago)
Last Synced: 2024-06-05T14:11:13.345Z (about 2 years ago)
Topics: cross-validation, data-preprocessing, feature-engineering, machine-learning, model-evaluation, model-training-and-tuning, natural-language-processing, performance-metrics
Language: Jupyter Notebook
Homepage:
Size: 1.13 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

## Sentiment Analysis on Social Media Data (Twitter)

Description

Conceptualized and developed a sentiment analysis model to quantify the positivity of tweets across diverse geographic regions. Leveraged advanced Natural Language Processing (NLP) techniques, including count vectorization, TF-IDF, and Doc2Vec, to extract meaningful insights from unstructured text data. This project involved extensive data handling and pre-processing, sophisticated machine learning algorithms, and rigorous model evaluation and validation to ensure robust and reliable performance.

Key Concepts

Data Handling and Pre-processing

> - Data Cleaning: Processed unstructured text data to handle missing values and duplicates, ensuring high-quality input for model training.
> - Feature Engineering: Utilized count vectorization, TF-IDF, and Doc2Vec to create meaningful features from raw text data, enhancing the model's ability to understand sentiment.
> - Data Visualization: Used libraries like Seaborn and Matplotlib to visualize sentiment distribution across regions, helping to identify patterns and trends in the data.

Machine Learning Algorithms

> - Supervised Learning: Trained the sentiment analysis model using supervised learning techniques on labeled tweet data, focusing on accurately classifying sentiment.
> - Unsupervised Learning: Applied clustering methods to explore patterns in sentiment data, providing additional insights into the data's structure.

Natural Language Processing (NLP)

> - Text Pre-processing: Implemented tokenization, stemming, and lemmatization using NLTK to standardize and clean the text data, making it suitable for analysis.
> - NLP Models: Leveraged advanced models like Doc2Vec for feature extraction, capturing semantic meaning from the text data.
> - Libraries: Utilized NLTK and Gensim for various NLP tasks, ensuring robust and efficient text processing.
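
The tokenization and stemming steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the choice of `TweetTokenizer` and `PorterStemmer`, the `preprocess` helper, and the sample tweet are all assumptions. Lemmatization is left out because NLTK's `WordNetLemmatizer` needs the WordNet corpus to be downloaded first.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# Tweet-aware tokenizer: lowercases, drops @handles, shortens "sooooo" to "sooo".
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
stemmer = PorterStemmer()


def preprocess(tweet):
    """Tokenize a tweet, keep alphabetic tokens only, and stem each one."""
    tokens = tokenizer.tokenize(tweet)
    return [stemmer.stem(tok) for tok in tokens if tok.isalpha()]


# Hypothetical sample tweet, used only for illustration.
tokens = preprocess("@airline The flights were delayed again!!!")
```

Filtering with `isalpha()` also discards emoticons and hashtags, which can carry sentiment, so a real pipeline may want a gentler filter.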

Model Evaluation and Validation

> - Metrics: Assessed model performance using metrics such as accuracy, precision, recall, and F1 score to ensure a comprehensive evaluation.
> - Cross-Validation: Conducted k-fold cross-validation to validate model stability and robustness, ensuring the model generalizes well to unseen data.
> - A/B Testing: Performed A/B testing to evaluate model changes and improvements, ensuring continuous enhancement of model performance.
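
To make the metrics above concrete, here is a small sketch that computes accuracy, precision, recall, and F1 by hand for a binary task. The function name `classification_metrics` and the label lists are made up for illustration; in practice the equivalent functions in `sklearn.metrics` would normally be used.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = positive tweet, 0 = negative tweet.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

With three true positives, one false positive, and one false negative, all four metrics come out to 0.75 for this toy input.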

Technologies (Tools and Libraries)

Python==3.6: Primary programming language used for the project.

NLTK==3.4.5: Used for text preprocessing tasks such as tokenization, stemming, and lemmatization.

Gensim==3.8.3: Employed for advanced NLP tasks including the implementation of Doc2Vec.

Matplotlib==3.2.1: Utilized for data visualization to explore and understand sentiment distributions.

Seaborn==0.10.1: Enhanced data visualization capabilities for better presentation of sentiment analysis results.

scikit-learn==0.21.3: Used for machine learning model training and evaluation.

Project Breakdown

Part 1: Data Collection and Pre-processing

> - Data Collection: Gathered tweets using the Twitter API, ensuring a diverse dataset across various geographic regions. Also used a sample dataset from Kaggle containing tweets extracted with the Twitter API.
> - Data Cleaning: Processed the raw tweet data to handle missing values, duplicates, and irrelevant content.
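
A cleaning pass of this kind might look like the sketch below, which uses only the standard library. The rules chosen here (strip URLs and @mentions, lowercase, drop rows with missing text, drop duplicates after normalization) are assumptions about what "irrelevant content" means, and the sample rows are invented.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Strip URLs and @mentions, collapse whitespace, and lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip().lower()


def clean_dataset(rows):
    """Drop rows with missing text and rows whose cleaned text repeats."""
    seen = set()
    cleaned = []
    for label, text in rows:
        if text is None or not text.strip():
            continue  # missing value
        normalized = clean_tweet(text)
        if not normalized or normalized in seen:
            continue  # empty after cleaning, or a duplicate
        seen.add(normalized)
        cleaned.append((label, normalized))
    return cleaned


# Hypothetical (label, text) rows; 1 = positive, 0 = negative.
raw_rows = [
    (1, "Loving the new update! https://example.com"),
    (1, "Loving the new update!"),  # duplicate once the URL is removed
    (0, None),                      # missing text
    (0, "@support my flight was delayed again"),
]
cleaned_rows = clean_dataset(raw_rows)
```

Deduplicating on the normalized text rather than the raw text catches retweets and reposts that differ only by a link or a mention.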

Part 2: Feature Engineering

> - Count Vectorization: Transformed text data into numerical vectors using count vectorization.
> - TF-IDF: Applied Term Frequency-Inverse Document Frequency to weigh the importance of words in the dataset.
> - Doc2Vec: Used Doc2Vec to capture the semantic meaning of tweets, enhancing feature representation.
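
As a worked example of the first two techniques, the sketch below builds count vectors and TF-IDF weights by hand on three toy documents. It uses the textbook weighting `tf * log(N / df)`. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ. The documents and variable names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, already cleaned and lowercased.
docs = [
    "good service good food",
    "bad service",
    "good vibes",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Count vectorization: one row per document, one column per vocabulary term.
count_vectors = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: how many documents contain each term.
n_docs = len(tokenized)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}


def tfidf(doc):
    """Textbook weighting: raw term count times log(N / df)."""
    counts = Counter(doc)
    return [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab]


tfidf_vectors = [tfidf(doc) for doc in tokenized]
```

In the first document, "good" appears twice but also occurs in two of the three documents, so its weight (2 × ln 1.5 ≈ 0.81) ends up below that of "food" (1 × ln 3 ≈ 1.10), which appears once but only in that document.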

Part 3: Model Training and Tuning

> - Supervised Learning: Trained a sentiment analysis model using labeled data, employing algorithms like logistic regression and support vector machines.
> - Hyperparameter Tuning: Optimized model parameters to improve performance using techniques like grid search.
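
One way to combine supervised training with grid search is a scikit-learn pipeline, sketched below. The toy tweets, the parameter grid, and the choice of logistic regression over TF-IDF features are assumptions for illustration. The notebook's actual models and search space may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical labeled tweets: 1 = positive, 0 = negative.
texts = [
    "love this phone", "great day today", "what a wonderful match",
    "best coffee ever", "so happy with the service", "amazing view tonight",
    "hate this traffic", "worst flight ever", "terrible customer support",
    "so sad about the news", "awful weather again", "this update is horrible",
]
labels = [1] * 6 + [0] * 6

# Vectorizer and classifier are tuned together, so each CV fold
# fits TF-IDF on its own training split only.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Step-name prefixes route each parameter to the right pipeline stage.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

best_params = search.best_params_
predictions = search.predict(["love the service", "worst day ever"])
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary statistics from the validation folds into training. A support vector machine such as `LinearSVC` could be swapped in as the `clf` step with its own `C` grid.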

Part 4: Model Evaluation and Validation

> - Metrics: Evaluated model performance using accuracy, precision, recall, and F1 score.
> - Cross-Validation: Conducted k-fold cross-validation to ensure model robustness and generalizability.
> - A/B Testing: Implemented A/B testing to compare different model versions and select the best-performing model.
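
The k-fold procedure can be illustrated with a small index splitter. This is a sketch of the general technique rather than the project's code: the helper name `k_fold_indices` is hypothetical, and it assumes the samples were shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Every sample lands in exactly one test fold across the k rounds.
folds = list(k_fold_indices(n_samples=10, k=3))
```

Each round trains on k-1 folds and scores on the held-out fold. Averaging the k scores gives a steadier estimate than a single train/test split. For imbalanced sentiment labels, a stratified splitter such as scikit-learn's `StratifiedKFold` keeps class proportions similar in every fold.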

Getting Started

Clone the Repository

Install Dependencies: Manually install the tools and libraries listed in the Technologies section, using the versions specified there.

Dataset: Download the dataset using the Twitter API or a sample dataset from Kaggle (https://www.kaggle.com/datasets/kazanova/sentiment140) and place it in the designated directory.

Run the Preprocessing Script: Preprocess the tweets using the provided scripts to clean and standardize the data.

Feature Engineering: Execute the feature engineering scripts to transform the text data into numerical features.

Train the Model: Use the training scripts to build and optimize the sentiment analysis model.

Evaluate the Model: Run the evaluation scripts to assess the model performance using various metrics and validation techniques.

Maintainers and Contributors

Maintainer: David Ogalo

Contributors: Contributions are welcome. Please reach out for more information on contribution guidelines on this project.
