{"id":28180918,"url":"https://github.com/davidogalo/twitter-sentiment-analysis","last_synced_at":"2025-05-16T03:11:49.850Z","repository":{"id":231184538,"uuid":"781057453","full_name":"DavidOgalo/Twitter-Sentiment-Analysis","owner":"DavidOgalo","description":"Developed a sentiment analysis model to measure tweet positivity across regions using advanced NLP techniques. This project involved data preprocessing, feature engineering with TF-IDF and Doc2Vec, and training supervised machine learning models. Performance was validated using cross-validation and metrics like accuracy and precision","archived":false,"fork":false,"pushed_at":"2024-06-05T12:28:43.000Z","size":1181,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-06-05T14:11:13.345Z","etag":null,"topics":["cross-validation","data-preprocessing","feature-engineering","machine-learning","model-evaluation","model-training-and-tuning","natural-language-processing","performance-metrics"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DavidOgalo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-02T17:07:08.000Z","updated_at":"2024-06-05T13:05:14.000Z","dependencies_parsed_at":"2024-04-02T22:41:38.855Z","dependency_job_id":"36b5b5bc-8653-4968-8857-4183d666fcf4","html_url":"https://github.com/DavidOgalo/Twitter-Sentiment-Analysis","commit_stats":null,"previous_names":["davidogalo/twitter-sentiment-analysis"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidOgalo%2FTwitter-Sentiment-Analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidOgalo%2FTwitter-Sentiment-Analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidOgalo%2FTwitter-Sentiment-Analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidOgalo%2FTwitter-Sentiment-Analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DavidOgalo","download_url":"https://codeload.github.com/DavidOgalo/Twitter-Sentiment-Analysis/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254459110,"owners_count":22074606,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cross-validation","data-preprocessing","feature-engineering","machine-learning","model-evaluation","model-training-and-tuning","natural-language-processing","performance-metrics"],"created_at":"2025-05-16T03:11:42.950Z","updated_at":"2025-05-16T03:11:49.830Z","avatar_url":"https://github.com/DavidOgalo.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Sentiment Analysis on Social Media Data (Twitter)\n\n\u003ch3\u003e\u003cstrong\u003eDescription\u003c/strong\u003e\u003c/h3\u003e\nConceptualized and developed a sentiment analysis model to quantify the positivity of tweets across diverse geographic regions. Leveraged advanced Natural Language Processing (NLP) techniques, including count vectorization, TF-IDF, and Doc2Vec, to extract meaningful insights from unstructured text data. This project involved extensive data handling and pre-processing, sophisticated machine learning algorithms, and rigorous model evaluation and validation to ensure robust and reliable performance.\n\n\u003ch3\u003e\u003cstrong\u003eKey Concepts\u003c/strong\u003e\u003c/h3\u003e\n\nData Handling and Pre-processing\u003cbr\u003e\n\u003e - \u003cstrong\u003eData Cleaning\u003c/strong\u003e: Processed unstructured text data to handle missing values and duplicates, ensuring high-quality input for model training.  
\n\u003e - \u003cstrong\u003eFeature Engineering\u003c/strong\u003e: Utilized count vectorization, TF-IDF, and Doc2Vec to create meaningful features from raw text data, enhancing the model's ability to understand sentiment.\n\u003e - \u003cstrong\u003eData Visualization\u003c/strong\u003e: Used libraries like Seaborn and Matplotlib to visualize sentiment distribution across regions, helping to identify patterns and trends in the data.\n\nMachine Learning Algorithms\u003cbr\u003e\n\u003e - \u003cstrong\u003eSupervised Learning\u003c/strong\u003e: Trained the sentiment analysis model using supervised learning techniques on labeled tweet data, focusing on accurately classifying sentiment.\n\u003e - \u003cstrong\u003eUnsupervised Learning\u003c/strong\u003e: Applied clustering methods to explore patterns in sentiment data, providing additional insights into the data's structure.\n\nNatural Language Processing (NLP)\u003cbr\u003e\n\u003e - \u003cstrong\u003eText Pre-processing\u003c/strong\u003e: Implemented tokenization, stemming, and lemmatization using NLTK to standardize and clean the text data, making it suitable for analysis.\n\u003e - \u003cstrong\u003eNLP Models\u003c/strong\u003e: Leveraged advanced models like Doc2Vec for feature extraction, capturing semantic meaning from the text data.\n\u003e - \u003cstrong\u003eLibraries\u003c/strong\u003e: Utilized NLTK and Gensim for various NLP tasks, ensuring robust and efficient text processing.\n\nModel Evaluation and Validation\u003cbr\u003e\n\u003e - \u003cstrong\u003eMetrics\u003c/strong\u003e: Assessed model performance using metrics such as accuracy, precision, recall, and F1 score to ensure a comprehensive evaluation.\n\u003e - \u003cstrong\u003eCross-Validation\u003c/strong\u003e: Conducted k-fold cross-validation to check model stability and how well the model generalizes to unseen data.\n\u003e - \u003cstrong\u003eA/B Testing\u003c/strong\u003e: Performed A/B testing to evaluate model 
changes and improvements, ensuring continuous enhancement of model performance.\n\n\u003ch3\u003e\u003cstrong\u003eTechnologies (Tools and Libraries)\u003c/strong\u003e\u003c/h3\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003ePython==3.6\u003c/strong\u003e: Primary programming language used for the project.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eNLTK==3.4.5\u003c/strong\u003e: Used for text preprocessing tasks such as tokenization, stemming, and lemmatization.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eGensim==3.8.3\u003c/strong\u003e: Employed for advanced NLP tasks including the implementation of Doc2Vec.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMatplotlib==3.2.1\u003c/strong\u003e: Utilized for data visualization to explore and understand sentiment distributions.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSeaborn==0.10.1\u003c/strong\u003e: Enhanced data visualization capabilities for better presentation of sentiment analysis results.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003escikit-learn==0.21.3\u003c/strong\u003e: Used for machine learning model training and evaluation.\u003c/li\u003e\n\u003c/ul\u003e\n\n\u003ch3\u003e\u003cstrong\u003eProject Breakdown\u003c/strong\u003e\u003c/h3\u003e\n\nPart 1: Data Collection and Pre-processing\u003cbr\u003e\n\u003e - \u003cstrong\u003eData Collection\u003c/strong\u003e: Gathered tweets using the Twitter API, ensuring a diverse dataset across various geographic regions. 
Also used a sample dataset from Kaggle containing tweets extracted using the Twitter API.\n\u003e - \u003cstrong\u003eData Cleaning\u003c/strong\u003e: Processed the raw tweet data to handle missing values, duplicates, and irrelevant content.\n\nPart 2: Feature Engineering\u003cbr\u003e\n\u003e - \u003cstrong\u003eCount Vectorization\u003c/strong\u003e: Transformed text data into numerical vectors using count vectorization.\n\u003e - \u003cstrong\u003eTF-IDF\u003c/strong\u003e: Applied Term Frequency-Inverse Document Frequency to weigh the importance of words in the dataset.\n\u003e - \u003cstrong\u003eDoc2Vec\u003c/strong\u003e: Used Doc2Vec to capture the semantic meaning of tweets, enhancing feature representation.\n\nPart 3: Model Training and Tuning\u003cbr\u003e\n\u003e - \u003cstrong\u003eSupervised Learning\u003c/strong\u003e: Trained a sentiment analysis model using labeled data, employing algorithms like logistic regression and support vector machines.\n\u003e - \u003cstrong\u003eHyperparameter Tuning\u003c/strong\u003e: Optimized model parameters to improve performance using techniques like grid search.\n\nPart 4: Model Evaluation and Validation\u003cbr\u003e\n\u003e - \u003cstrong\u003eMetrics\u003c/strong\u003e: Evaluated model performance using accuracy, precision, recall, and F1 score.\n\u003e - \u003cstrong\u003eCross-Validation\u003c/strong\u003e: Conducted k-fold cross-validation to assess model robustness and generalizability.\n\u003e - \u003cstrong\u003eA/B Testing\u003c/strong\u003e: Implemented A/B testing to compare different model versions and select the best-performing model.\n\n\u003ch3\u003e\u003cstrong\u003eGetting Started\u003c/strong\u003e\u003c/h3\u003e\n\u003col\u003e\n\u003cli\u003eClone the Repository\u003c/li\u003e\n\u003cli\u003eInstall Dependencies: Manually install the tools and libraries listed in the Technologies section, at the versions specified there.\u003c/li\u003e\n\u003cli\u003eDataset: Download the dataset using the 
Twitter API or a sample dataset from Kaggle (https://www.kaggle.com/datasets/kazanova/sentiment140) and place it in the designated directory.\u003c/li\u003e\n\u003cli\u003eRun the Preprocessing Script: Preprocess the tweets using the provided scripts to clean and standardize the data.\u003c/li\u003e\n\u003cli\u003eFeature Engineering: Execute the feature engineering scripts to transform the text data into numerical features.\u003c/li\u003e\n\u003cli\u003eTrain the Model: Use the training scripts to build and optimize the sentiment analysis model.\u003c/li\u003e\n\u003cli\u003eEvaluate the Model: Run the evaluation scripts to assess the model performance using various metrics and validation techniques.\u003c/li\u003e\n\u003c/ol\u003e\n\n\u003ch3\u003e\u003cstrong\u003eMaintainers and Contributors\u003c/strong\u003e\u003c/h3\u003e\n\u003cstrong\u003eMaintainer\u003c/strong\u003e: David Ogalo \u003cbr\u003e\n\u003cstrong\u003eContributors\u003c/strong\u003e: Contributions are welcome. Please reach out for more information on the contribution guidelines for this project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidogalo%2Ftwitter-sentiment-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidogalo%2Ftwitter-sentiment-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidogalo%2Ftwitter-sentiment-analysis/lists"}