{"id":21864723,"url":"https://github.com/fdmsantos/aws-twitter-data-analytics","last_synced_at":"2026-04-09T16:40:01.181Z","repository":{"id":54608136,"uuid":"521755671","full_name":"fdmsantos/aws-twitter-data-analytics","owner":"fdmsantos","description":"Project to Learn Data analytics in AWS using twitter data","archived":false,"fork":false,"pushed_at":"2022-09-12T11:38:03.000Z","size":38960,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-26T15:33:39.339Z","etag":null,"topics":["aws","data-analytics","data-engineering","data-science","data-visualization","flink","spark","terraform"],"latest_commit_sha":null,"homepage":"","language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fdmsantos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-08-05T19:32:41.000Z","updated_at":"2023-02-08T10:49:31.000Z","dependencies_parsed_at":"2023-01-18T04:45:26.604Z","dependency_job_id":null,"html_url":"https://github.com/fdmsantos/aws-twitter-data-analytics","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fdmsantos%2Faws-twitter-data-analytics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fdmsantos%2Faws-twitter-data-analytics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fdmsantos%2Faws-twitter-data-analytics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fdmsantos%2Faws-twitter-data-analytics/manifests","owner_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fdmsantos","download_url":"https://codeload.github.com/fdmsantos/aws-twitter-data-analytics/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244868762,"owners_count":20523590,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","data-analytics","data-engineering","data-science","data-visualization","flink","spark","terraform"],"created_at":"2024-11-28T04:11:22.207Z","updated_at":"2026-04-09T16:40:01.145Z","avatar_url":"https://github.com/fdmsantos.png","language":"HCL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AWS Twitter Data Analytics\n\nProject to learn Data Analytics in AWS using Twitter data.\n\n## Important Notes\n\n* This project/code isn't optimized for production.\n* Some architectural decisions make sense only in a learning context, not in production.\n* The configurations aren't cost-optimized, since the project is meant for learning.\n* This architecture doesn't follow security best practices.\n* Most of the AWS services used in this project don't have a free tier. 
Deploying this project will incur costs.\n\n## About The Project\n\n![Diagram](Diagram.png)\n\nThe main goal of this project is to learn, test, and play with Data Analytics in AWS using data from Twitter.\n\n### Built With\n\n**Data Source**\n\n* [Twitter](https://twitter.com/)\n\n**Deployment**\n\n* [Terraform](https://www.terraform.io/)\n\n**Programming**\n\n* [Go](https://go.dev/)\n* [Python](https://www.python.org/)\n* [PySpark](https://spark.apache.org/docs/latest/api/python/)\n* [Flink](https://flink.apache.org/)\n* [Java](https://www.java.com/)\n* [Hive](https://hive.apache.org/)\n* [GNU Makefile](https://www.gnu.org/software/make/manual/make.html)\n\n**AWS Services**\n\n* [S3](https://aws.amazon.com/s3/)\n* [Lambda](https://aws.amazon.com/lambda/)\n* [Kinesis Firehose](https://aws.amazon.com/kinesis/data-firehose/)\n* [Kinesis Data Analytics](https://aws.amazon.com/kinesis/data-analytics/)\n* [Kinesis Data Streams](https://aws.amazon.com/kinesis/data-streams/)\n* [SNS](https://aws.amazon.com/sns/)\n* [Glue - Catalog, Crawler, Job, Workflow](https://aws.amazon.com/glue/)\n* [Elastic MapReduce](https://aws.amazon.com/emr/)\n* [Step Functions](https://aws.amazon.com/step-functions/)\n* [Redshift](https://aws.amazon.com/redshift/)\n* [Data Pipeline](https://aws.amazon.com/datapipeline/)\n* [DynamoDB](https://aws.amazon.com/dynamodb/)\n\n### Components:\n\n**Data Collection**\n\nData collection consists of an application written in Go that listens to the Twitter stream for tweets.\nThe Go app configures the Twitter stream to receive only tweets related to the NBA.\nThe app sends the tweets to Kinesis Firehose.\nKinesis Firehose runs a Lambda to transform each Twitter record and then stores the transformed tweets in S3.\nFirehose also stores the original records in S3.\n\n**Glue ETL**\n\nThe Glue ETL consists of two steps. 
The first is a Glue Job that runs a Python Spark script to remove duplicated tweets; the second is a Glue Crawler that reads the tweet records in S3 and creates the table in the Glue Data Catalog.\nThis project has a Glue Workflow that runs these two steps: it first removes duplicates and then runs the crawler.\n\n**EMR Cluster**\n\nAn EMR Cluster is created for testing. The cluster is created with Hive, and Hive is configured to use the data catalog.\n\n**Step Functions**\n\n![Diagram](Stepfunction.png)\n\nThe state machine creates an EMR Cluster to run Hive scripts. The Hive script result is stored in S3 via an external table.\n\n**Redshift**\n\nThis component creates a Redshift cluster and an AWS Data Pipeline.\nThe Data Pipeline loads the Hive query results stored in S3 into a Redshift table.\n\n* Data Pipeline diagram\n\n![DataPipeline](DataPipeline.png)\n\n**Quicksight**\n\nAWS data visualisation tool.\n\n![NBA Players Related Tweets](Quicksight.png)\n\n**Kinesis Data Analytics**\n\n![Kinesis Data Analytics](KinesisDataAnalytics.png)\n\nThe Kinesis Data Analytics application, developed in Flink, detects whether a player from one team sent two or more tweets to a player on another team within a specific time window (Tumbling Window).\nThe result is sent to a Kinesis Data Stream consumed by a Lambda, which sends a notification via SNS.\nThis application uses another stream to control whether a team is allowed to do tampering. 
The source of this stream is a DynamoDB Kinesis stream.\nThis application also uses a DynamoDB table for reference data.\nLate data is sent to an S3 bucket, and all tweets are also sent to another S3 bucket for archiving.\n\n![Notification](NbaTamperingEmail.png)\n\nFlink Features in This App:\n\n* Two Connected Streams\n* Keyed Streams\n* Stateful Stream Processing\n* Timely Stream Processing With Watermarks and Event Time\n* Windowing Processing\n* Side Outputs To Late Events and All Tweets\n\n## Getting Started\n\n### Deploy\n\n***Pre Requisites***\n\n* AWS CLI Configured\n* Terraform\n* EC2 Key Pair Created to deploy EMR Cluster\n* VPC Created with at least one Subnet\n* Twitter Keys\n\n```bash\nmake deploy\n```\n\n### Data Collection\n\n**Pre Requisites**\n\n* Golang Installed\n* Copy .env.example to .env and add your variable values\n  * When the app runs locally, you need an AWS profile configured in your AWS credentials file that is allowed to assume a role with the permissions needed to send records to Kinesis Firehose\n\nYou can disable the data collection components by changing the Terraform variable `enable_data_collection` to false.\n\nExecute the following command to run the Go app:\n\n```shell\nmake run-collection\n```\n\n### Glue ETL\n\n**Pre Requisites:**\n\n* AWS CLI Configured\n\nYou can disable the Glue components by changing the Terraform variable `enable_glue_etl` to false.\n\nExecute the following command to run the Glue Job that drops duplicates:\n\n```shell\nmake run-drop-duplicates\n```\n\nExecute the following command to run the Crawler:\n\n```shell\nmake run-crawler\n```\n\nExecute the following command to run the Glue Workflow:\n\n```shell\nmake run-glue-workflow\n```\n\n### EMR Cluster\n\nYou can disable the EMR cluster creation by changing the Terraform variable `enable_emr_cluster` to false.\n\nTo SSH into the EMR cluster, run the following command:\n\n```shell\nmake EMR_KEY=\u003ckey_location\u003e ssh-emr\n```\n\n### Step Functions\n\nYou can disable step function 
creation by changing the Terraform variable `enable_step_functions` to false.\n\nTo run the state machine, run the following command:\n\n```shell\nmake STATE_MACHINE_RUN_YEAR=2022 STATE_MACHINE_RUN_MONTH=08 STATE_MACHINE_RUN_DAY=09 run-step-function\n```\n\n### Redshift\n\nYou can disable Redshift and AWS Data Pipeline creation by changing the Terraform variable `enable_redshift` to false.\n\nTo run the data pipeline, run the following command:\n\n```shell\nmake run-data-pipeline\n```\n\n* Run the process manually (without AWS Data Pipeline)\n  * To get `s3_input_dir`, run: `terraform output -json | jq -r .redshift_pipeline_input_s3.value`\n  * To get `redshift_role`, run: `terraform output -json | jq -r .redshift_s3_role_arn.value`\n\n```sql\ncreate table playerstotaltweets(\nyear integer not null,\nmonth integer not null,\nday integer not null,\nplayer varchar(255) not null,\ntotal integer not null);\n\nCOPY twitter.public.playerstotaltweets\n  FROM '\u003cs3_input_dir\u003e'\n  IAM_ROLE '\u003credshift_role\u003e'\n  FORMAT AS JSON 'auto'\n  REGION AS '\u003cregion\u003e';\n```\n\n### Quicksight\n\nYou can disable Quicksight creation by changing the Terraform variable `enable_quicksight` to false.\n\n**Pre Requisites:**\n\n* Quicksight Account and User\n  * To get your user ARN, run: `aws quicksight list-users --region \u003cregion\u003e --aws-account-id \u003caccount_id\u003e --namespace default`\n\n### Kinesis Data Analytics\n\nYou can disable Kinesis Data Analytics application creation by changing the Terraform variable `enable_kinesis_data_analytics` to false.\n\nTo generate player tweets, run:\n\n```shell\nmake data-gen\n```\n\n## Work in Progress\n\n* Glue\n  * Glue DataBrew\n* Kinesis Data Analytics\n  * Integrate with Glue Schema Registry [Link](https://docs.aws.amazon.com/glue/latest/dg/schema-registry-integrations.html)\n* Firehose\n  * Enable Compression\n  * Enable File Format Conversion to Parquet/ORC\n* Redshift Spectrum\n* Quicksight\n  * Percentile Graph\n  * 
Regional Graph\n* Amazon Rekognition\n  * Analyze Athletes' Photos and Identify the Objects\n* Amazon Translate\n  * Translate Tweets\n* Amazon Comprehend\n  * Tweets Sentiment Analysis\n* Data Profiling Solution [Link](https://aws.amazon.com/blogs/big-data/build-an-automatic-data-profiling-and-reporting-solution-with-amazon-emr-aws-glue-and-amazon-quicksight/)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffdmsantos%2Faws-twitter-data-analytics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffdmsantos%2Faws-twitter-data-analytics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffdmsantos%2Faws-twitter-data-analytics/lists"}