{"id":22357219,"url":"https://github.com/mdh266/twittersentimentanalysis","last_synced_at":"2025-07-25T22:08:43.915Z","repository":{"id":37591828,"uuid":"183344016","full_name":"mdh266/TwitterSentimentAnalysis","owner":"mdh266","description":"Twitter Sentiment Analysis using Spark, MongoDB, and Google Cloud","archived":false,"fork":false,"pushed_at":"2022-06-21T21:55:43.000Z","size":4690,"stargazers_count":5,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-06T06:35:22.095Z","etag":null,"topics":["data-science","etl","google-cloud","machine-learning","mongodb","natural-language-processing","nlp","pyspark","sentiment-analysis","spark","sparkml","twitter","twitter-sentiment-analysis"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mdh266.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-25T02:48:02.000Z","updated_at":"2025-02-20T05:56:14.000Z","dependencies_parsed_at":"2022-08-25T19:12:06.627Z","dependency_job_id":null,"html_url":"https://github.com/mdh266/TwitterSentimentAnalysis","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mdh266/TwitterSentimentAnalysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdh266%2FTwitterSentimentAnalysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdh266%2FTwitterSentimentAnalysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdh266%2FTwitterSentimentAnalysis/releases","manifests_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdh266%2FTwitterSentimentAnalysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mdh266","download_url":"https://codeload.github.com/mdh266/TwitterSentimentAnalysis/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdh266%2FTwitterSentimentAnalysis/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267070366,"owners_count":24030979,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-25T02:00:09.625Z","response_time":70,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","etl","google-cloud","machine-learning","mongodb","natural-language-processing","nlp","pyspark","sentiment-analysis","spark","sparkml","twitter","twitter-sentiment-analysis"],"created_at":"2024-12-04T14:13:36.912Z","updated_at":"2025-07-25T22:08:43.831Z","avatar_url":"https://github.com/mdh266.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Twitter Sentiment Analysis With Spark, MongoDB and Google Cloud\n \nIn this two part blog post I go over the classic problem of Twitter sentiment analysis. I found labeled Twitter data with 1.6 million tweets on the Kaggle website \u003ca href=\"https://www.kaggle.com/kazanova/sentiment140\"\u003ehere\u003c/a\u003e.  
Through this analysis I'll touch on a few different topics related to natural language processing and big data more generally.  While 1.6 million tweets is not a substantial amount of data and does not require working with Spark, I wanted to use Spark for both ETL and machine learning since I haven't seen many examples of how to do so in the context of sentiment analysis. \n\n\n## Part 1: ETL With PySpark and MongoDB\n\nIn the first part I go over Extract-Transform-Load (ETL) operations on text data using \u003ca href=\"https://spark.apache.org/\"\u003ePySpark\u003c/a\u003e and \u003ca href=\"https://www.mongodb.com/\"\u003eMongoDB\u003c/a\u003e, expanding on some details of Spark along the way. I then show how one can explore the data in the Mongo database using \u003ca href=\"https://www.mongodb.com/products/compass\"\u003eCompass\u003c/a\u003e and \u003ca href=\"https://api.mongodb.com/python/current/\"\u003ePyMongo\u003c/a\u003e. Spark is a great platform for performing batch ETL work on both structured and unstructured data. MongoDB is a document-based NoSQL database that is fast, easy to use, allows for flexible schemas, and is well suited to working with text data. PySpark and MongoDB work well together, allowing for fast, flexible ETL pipelines on large semi-structured data like that coming from Twitter.  While Part 1 is presented as a Jupyter notebook, the ETL job was submitted as a script `BasicETL.py` in the directory `ETL`.\n\n\n## Part 2: Machine Learning With Spark On Google Cloud\n\nIn this second part I will go over the actual machine learning aspect of sentiment analysis using \u003ca href=\"https://spark.apache.org/docs/latest/ml-guide.html\"\u003eSparkML\u003c/a\u003e and \u003ca href=\"https://spark.apache.org/docs/latest/ml-pipeline.html\"\u003eML Pipelines\u003c/a\u003e to build a basic linear classifier. After building a basic model for sentiment analysis, I'll introduce techniques to improve performance like removing stop words and using N-grams. 
I also introduce a custom Spark \u003ca href=\"https://spark.apache.org/docs/1.6.2/ml-guide.html#transformers\"\u003eTransformer\u003c/a\u003e class that uses \u003ca href=\"https://www.nltk.org/\"\u003eNLTK\u003c/a\u003e to perform stemming.  Lastly, I'll review \u003ca href=\"https://spark.apache.org/docs/latest/ml-tuning.html\"\u003ehyperparameter tuning\u003c/a\u003e with cross-validation to optimize our model.  Using PySpark on this dataset was a little too much for my personal laptop, so I used Spark on a \u003ca href=\"https://hadoop.apache.org/\"\u003eHadoop\u003c/a\u003e cluster with Google Cloud's \u003ca href=\"https://cloud.google.com/dataproc/\"\u003edataproc\u003c/a\u003e and \u003ca href=\"https://cloud.google.com/datalab/\"\u003edatalab\u003c/a\u003e. I'll touch on a few of the details of working on Hadoop and Google Cloud as well.\n\n\n## Requirements\n\n### Part 1 \nPart 1 was completed on my laptop, so all the dependencies were installed using \u003ca href=\"https://docs.conda.io/en/latest/miniconda.html\"\u003eminiconda\u003c/a\u003e.  The required dependencies can be installed using the command,\n\n\tconda env create -n sparketl -f environment.yml\n\n### Part 2\nPart 2 was completed on Google Cloud on the dataproc image 1.3. The commands to recreate this environment are in the `GCP` directory, and the Python dependencies to be loaded onto the Hadoop cluster are in the `requirements.txt` file.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdh266%2Ftwittersentimentanalysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmdh266%2Ftwittersentimentanalysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdh266%2Ftwittersentimentanalysis/lists"}