{"id":20425870,"url":"https://github.com/mohamedhmini/tweetsolaping","last_synced_at":"2025-06-10T15:31:20.301Z","repository":{"id":37060202,"uuid":"271080551","full_name":"MohamedHmini/tweetsOLAPing","owner":"MohamedHmini","description":"implementing an end-to-end tweets ETL/Analysis pipeline.","archived":false,"fork":false,"pushed_at":"2022-12-08T04:22:01.000Z","size":6281,"stargazers_count":57,"open_issues_count":11,"forks_count":6,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-19T20:45:03.895Z","etag":null,"topics":["analysis","api-client","cube-analysis","datawarehouse","datawarehousing","etl-pipeline","google-api-client","multi-dimensional-analysis","multithreading","powerbi-report","ssas-multidimensional","ssis","tweets","tweets-classification","tweets-scraper","twitter-api","web-crawling"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MohamedHmini.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-06-09T18:31:01.000Z","updated_at":"2025-01-31T22:07:34.000Z","dependencies_parsed_at":"2023-01-24T15:45:26.932Z","dependency_job_id":null,"html_url":"https://github.com/MohamedHmini/tweetsOLAPing","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohamedHmini%2FtweetsOLAPing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohamedHmini%2FtweetsOLAPing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohamedHmini%2FtweetsOLAPing/releases","manifests_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohamedHmini%2FtweetsOLAPing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MohamedHmini","download_url":"https://codeload.github.com/MohamedHmini/tweetsOLAPing/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MohamedHmini%2FtweetsOLAPing/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259101276,"owners_count":22805244,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","api-client","cube-analysis","datawarehouse","datawarehousing","etl-pipeline","google-api-client","multi-dimensional-analysis","multithreading","powerbi-report","ssas-multidimensional","ssis","tweets","tweets-classification","tweets-scraper","twitter-api","web-crawling"],"created_at":"2024-11-15T07:14:29.819Z","updated_at":"2025-06-10T15:31:20.273Z","avatar_url":"https://github.com/MohamedHmini.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tweetsOLAPing : an end-to-end social-media data-warehousing project :\r\n\r\n[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)\r\n[![Ask Me Anything !](https://img.shields.io/badge/Ask%20me-anything-1abc9c.svg)](https://GitHub.com/Naereen/ama)\r\n\r\ni'll walk you through an execution example using light-weight (very) data to show you the results.\r\n\r\n## Table of Contents :\r\n- [Twitter DataTypes :](#twitter-datatypes--)\r\n  * [TweetDataType :](#tweetdatatype--)\r\n  * [UserDataType 
:](#userdatatype--)\r\n- [ENV set-up :](#0--env-set-up-)\r\n- [ETL pipeline :](#1--etl-pipeline--)\r\n  * [a) Extraction :](#a--extraction--)\r\n  * [b) Transformation :](#b--transformation--)\r\n  * [c) Loading :](#c--loading--)\r\n    + [SSIS modeling :](#ssis-modeling--)\r\n    + [SSAS cube modeling :](#ssas-cube-modeling--)\r\n- [Analysis :](#2-analysis--)\r\n  * [MDX queries :](#mdx-queries--)\r\n  * [powerBI report :](#powerbi-report--)\r\n - [References :](#3-references--)\r\n\r\n## Twitter DataTypes :\r\n\r\n### TweetDataType :\r\n\r\n```json\r\n{\r\n   \"created_at\":\"Sat Jul 01 23:47:16 +0000 2017\",\r\n   \"id\":881298189072072708,\r\n   \"id_str\":\"881298189072072708\",\r\n   \"text\":\"seu perfil foi visto por 5 pessoas nas \\u00faltimas 4 horas https:\\/\\/t.co\\/cKb35CahC7\",\r\n   \"source\":\"\\u003ca href=\\\"http:\\/\\/www.twitcom.com.br\\\" rel=\\\"nofollow\\\"\\u003eTwitcom - Comunidades \\u003c\\/a\\u003e\",\r\n   \"truncated\":false,\r\n   \"in_reply_to_status_id\":null,\r\n   \"in_reply_to_status_id_str\":null,\r\n   \"in_reply_to_user_id\":null,\r\n   \"in_reply_to_user_id_str\":null,\r\n   \"in_reply_to_screen_name\":null,\r\n   \"user\": UserDataType,\r\n   \"geo\":null,\r\n   \"coordinates\":null,\r\n   \"place\":null,\r\n   \"contributors\":null,\r\n   \"is_quote_status\":false,\r\n   \"retweet_count\":0,\r\n   \"favorite_count\":0,\r\n   \"entities\":{\r\n      \"hashtags\":[\r\n\r\n      ],\r\n      \"urls\":[\r\n         {\r\n            \"url\":\"https:\\/\\/t.co\\/cKb35CahC7\",\r\n            \"expanded_url\":\"http:\\/\\/twcm.me\\/5bHmW\",\r\n            \"display_url\":\"twcm.me\\/5bHmW\",\r\n            \"indices\":[\r\n               55,\r\n               78\r\n            ]\r\n         }\r\n      ],\r\n      \"user_mentions\":[\r\n\r\n      ],\r\n      \"symbols\":[\r\n\r\n      ]\r\n   },\r\n   \"favorited\":false,\r\n   \"retweeted\":false,\r\n   \"possibly_sensitive\":false,\r\n   \"filter_level\":\"low\",\r\n   
\"lang\":\"pt\",\r\n   \"timestamp_ms\":\"1498952836660\"\r\n}\r\n```\r\n\r\n### UserDataType :\r\n\r\n```json\r\n{\r\n   \"id\":2696402179,\r\n   \"id_str\":\"2696402179\",\r\n   \"name\":\"$AVAGE\",\r\n   \"screen_name\":\"SavageHumor\",\r\n   \"location\":null,\r\n   \"url\":null,\r\n   \"description\":\"SAVAGE TWEETS \\nWARNING: 18+ Content\",\r\n   \"protected\":false,\r\n   \"verified\":false,\r\n   \"followers_count\":150201,\r\n   \"friends_count\":0,\r\n   \"listed_count\":94,\r\n   \"favourites_count\":85,\r\n   \"statuses_count\":10696,\r\n   \"created_at\":\"Thu Jul 31 18:52:37 +0000 2014\",\r\n   \"utc_offset\":-18000,\r\n   \"time_zone\":\"Central Time (US \u0026 Canada)\",\r\n   \"geo_enabled\":false,\r\n   \"lang\":\"en\",\r\n   \"contributors_enabled\":false,\r\n   \"is_translator\":false,\r\n   \"profile_background_color\":\"000000\",\r\n   \"profile_background_image_url\":\"http:\\/\\/abs.twimg.com\\/images\\/themes\\/theme1\\/bg.png\",\r\n   \"profile_background_image_url_https\":\"https:\\/\\/abs.twimg.com\\/images\\/themes\\/theme1\\/bg.png\",\r\n   \"profile_background_tile\":false,\r\n   \"profile_link_color\":\"DD2E44\",\r\n   \"profile_sidebar_border_color\":\"000000\",\r\n   \"profile_sidebar_fill_color\":\"000000\",\r\n   \"profile_text_color\":\"000000\",\r\n   \"profile_use_background_image\":false,\r\n   \"profile_image_url\":\"http:\\/\\/pbs.twimg.com\\/profile_images\\/875059551204249601\\/J_XlKaiO_normal.jpg\",\r\n   \"profile_image_url_https\":\"https:\\/\\/pbs.twimg.com\\/profile_images\\/875059551204249601\\/J_XlKaiO_normal.jpg\",\r\n   \"profile_banner_url\":\"https:\\/\\/pbs.twimg.com\\/profile_banners\\/2696402179\\/1416368695\",\r\n   \"default_profile\":false,\r\n   \"default_profile_image\":false,\r\n   \"following\":null,\r\n   \"follow_request_sent\":null,\r\n   \"notifications\":null\r\n}\r\n```\r\n\r\n## 0- ENV set-up:\r\n\r\nthe extraction/transformation steps of the pipeline will need the following environment set-up 
:\r\n```shell\r\npip install virtualenv\r\n```\r\n```shell\r\nvirtualenv tweetsOLAPingENV\r\n```\r\n```shell\r\nsource tweetsOLAPingENV/bin/activate\r\n```\r\n```shell\r\npip install -r requirements.txt\r\n```\r\n\r\nas for the Loading ETL step and the final analysis, make sure you have the following :\r\n1. SSMS (Microsoft SQL Server Management Studio).\r\n2. SSDT (SQL Server Data Tools).\r\n3. SSIS (SQL Server Integration Services).\r\n4. SSAS (SQL Server Analysis Services).\r\n5. Power BI.\r\n\r\n## 1- ETL pipeline : \r\n### a) Extraction :\r\n\r\nthe very first step is to prepare the \u003cb\u003etweetsPOOLs.csv\u003c/b\u003e file as in https://github.com/MohamedHmini/tweetsOLAPing/blob/master/extraction/archivedTweetsCrawler/tweetsPOOLs.csv.\r\n\r\nthen we shall execute the \u003cb\u003eScrapy Spider\u003c/b\u003e to crawl the needed pages of the archive website as follows : \r\n\r\n```shell\r\n  cd extraction/archivedTweetsCrawler\r\n  scrapy crawl -o tweetsSTREAMs.csv tweets\r\n```\r\nafter that you will get a file like this one : https://github.com/MohamedHmini/tweetsOLAPing/blob/master/sample-data/tweetsSTREAMs.csv\r\n\r\nthe next step is to structure that CSV file into a tree-like structure composed of directories and files :\r\n```shell\r\n  cd extraction/\r\n  python tweetsPOOLsParser.py tweetsSTREAMs.csv ../root_urls/\r\n```\r\nthe output will look somewhat like this (light-weight) example : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/sample-data/urls_root\r\n\r\nnext we have to perform a random selection to keep only a subset of the URLs; note that each URL will bring you up to 5000 tweets :\r\n\r\n```shell\r\n  cd extraction/\r\n  python urlsRandomSelector.py ../root_urls/ ../chosen_urls.txt 700\r\n```\r\n\r\nagain check this link for an output example : https://github.com/MohamedHmini/tweetsOLAPing/blob/master/sample-data/chosen_urls.txt\r\n\r\nnow that we have all the needed URLs in a single file, we can start downloading 
:\r\n\r\n```shell\r\n  cd extraction/\r\n  python tweetsDownloader.py ../chosen_urls.txt ../downloaded_pools/ ../download_error.txt\r\n```\r\n\r\nagain check this link for an output example : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/sample-data/downloaded-pools\r\n\r\nafter downloading the files you will notice that they are compressed (.bz2 file extension), so you have to decompress them yourself; i don't provide a script for this stage.\r\n\r\nagain check this link for an output example : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/sample-data/decompressed-pools\r\n\r\nnote that i provide a script to look up tweets from the twitter API directly using the downloaded tweet IDs. because the tweets were pulled from the stream by the collector, you will find that most of them have zero metrics; i work around this problem using a context-aware random generator.\r\n\r\n### b) Transformation :\r\n\r\nthe transformation is composed of two parts. first we transform our data from JSON to CSV and create all the needed derived attributes; we also remove twitto duplicates using the cleanUsersCSV.py script. beware that this script uses multi-threading to speed up the I/O operations and stores its results in a directory; you can then merge them on your own.\r\n\r\n```shell\r\n  cd transformation/\r\n  python prepareTweets.py ../decompressed_pools/ ../tweets.csv ../twittos.csv ../trans-err.txt\r\n  python cleanUsersCSV.py ../twittos.csv ../twittos\r\n```\r\n\r\nthe second part consists of performing NLP analysis on the tweets to generate the sentiment-score and the content-classification; you have to provide the projectkey.json file from the google NLP API in the same directory.\r\n\r\n```shell\r\n  cd transformation/\r\n  python performNLPanalysis.py ../decompressed_pools/ ../tweets_sentiments.csv ../sent-err.txt\r\n```\r\n\r\nagain check this link for an output example : 
https://github.com/MohamedHmini/tweetsOLAPing/tree/master/sample-data/processed\r\n\r\n### c) Loading :\r\n\r\nbefore starting the SSIS process you have to provide the normalized data in the right path (shall be fixed) :\r\n\r\n```shell\r\n  cd loading/\r\n  python dataNormalization.py ../twittos.csv ../tweets.csv ../data/normalized\r\n```\r\n\r\nagain check this link for an output example : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/sample-data/normalized-data\r\n\r\nfor the SSIS logic i provide the full model : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/loading/tweetsOLAPing_loading\r\n\r\nthe SSAS logic is fully provided as well : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/analysis/tweetsOLAPing_analysis\r\n\r\n#### SSIS modeling : \r\n\r\nafter setting up all the connections (the normalized data as well as the OLE DB destinations), we arrive at the integration step, i.e. the data loading : \r\n\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"./imgs/SSIS-global-process.jpg\" /\u003e\r\n\u003c/p\u003e\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"./imgs/dates-data-flow.jpg\" /\u003e\r\n\u003c/p\u003e\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"./imgs/ts-data-flow.jpg\" /\u003e\r\n\u003c/p\u003e\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"./imgs/loc-data-flow.jpg\" /\u003e\r\n\u003c/p\u003e\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"./imgs/twmd-data-flow.jpg\" /\u003e\r\n\u003c/p\u003e\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"./imgs/usrmd-data-flow.jpg\" /\u003e\r\n\u003c/p\u003e\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"./imgs/tw-data-flow.jpg\" /\u003e\r\n\u003c/p\u003e\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"./imgs/usr-data-flow.jpg\" /\u003e\r\n\u003c/p\u003e\r\n\r\n#### SSAS cube modeling : \r\n\r\n\u003cp align=\"center\"\u003e\r\n  \u003cimg src=\"./imgs/model.jpg\" /\u003e\r\n\u003c/p\u003e\r\n\r\n## 2. 
Analysis :\r\n\r\n### MDX queries : \r\n\r\n```sql\r\nSELECT \r\n  NON EMPTY\r\n  (\r\n    [Twitto Meta Data].[User Category].children, \r\n    [Measures].[Retweet Count]\r\n  ) ON COLUMNS,\r\n  NON EMPTY \r\n  (\r\n    [Twitto Meta Data].[User Activity].children\r\n  ) ON ROWS\r\nFROM\r\n  [TweetsOLAPing_cube]\r\n```\r\n\r\n```sql\r\nSELECT \r\n  NON EMPTY\r\n  (\r\n    [Twitto Meta Data].[User Category].children, \r\n    [Measures].[Retweet Count]\r\n  ) ON COLUMNS,\r\n  NON EMPTY \r\n  (\r\n    [Twitto Meta Data].[User Activity].children, \r\n    [Tweet Meta Data].[Sentiment Tag].children\r\n  ) ON ROWS\r\nFROM\r\n  [TweetsOLAPing_cube]\r\n```\r\n\r\n```sql\r\nSELECT \r\n  NON EMPTY\r\n  (\r\n    [Tweet Meta Data].[Media Type].children, \r\n    [Measures].[Retweet Count]\r\n  ) ON COLUMNS,\r\n  NON EMPTY \r\n  (\r\n    [Tweet Meta Data].[Has Hashtags].children, \r\n    [Date].[Weekday].children\r\n  ) ON ROWS\r\nFROM\r\n  [TweetsOLAPing_cube]\r\n```\r\n\r\n### powerBI report :\r\n\r\nthe final report is provided in : https://github.com/MohamedHmini/tweetsOLAPing/tree/master/analysis/powerBI\r\n\r\nhere are some examples :\r\n\r\n![alt text](analysis-example/analysis1.gif)\r\n![alt text](analysis-example/analysis2.gif)\r\n![alt text](analysis-example/analysis3.gif)\r\n![alt text](analysis-example/analysis4.gif)\r\n![alt text](analysis-example/analysis5.gif)\r\n\r\n\r\n## references : \r\n\r\n[1] Maha Ben Kraiem, Jamel Feki, Kaïs Khrouf, Franck Ravat, Olivier Teste. OLAP of the tweets: From modeling to exploitation. IEEE International Conference on Research Challenges in Information Science, Marrakesh, Morocco, May 2014.\r\n\r\n[2] Maha Ben Kraiem, Jamel Feki, Kaïs Khrouf, Franck Ravat, Olivier Teste. OLAP4Tweets: Multidimensional Modeling of tweets. 19th East-European Conference on Advances in Databases and Information Systems, Poitiers, France, September 2015.\r\n\r\n[3] Nafees Ur Rehman, Svetlana Mansmann, Andreas Weiler, Marc H. Scholl. 
Building a DataWarehouse for Twitter Stream Exploration. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Istanbul, Turkey, August 2012.\r\n\r\n\r\n\r\n\u003cb\u003e MOHAMED-HMINI 2020\u003c/b\u003e\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohamedhmini%2Ftweetsolaping","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmohamedhmini%2Ftweetsolaping","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmohamedhmini%2Ftweetsolaping/lists"}