{"id":18464633,"url":"https://github.com/rogaha/data-processing-pipeline","last_synced_at":"2025-04-08T08:31:11.579Z","repository":{"id":35240507,"uuid":"39500007","full_name":"rogaha/data-processing-pipeline","owner":"rogaha","description":"Real-Time Data Processing Pipeline \u0026 Visualization with Docker, Spark, Kafka and Cassandra","archived":false,"fork":false,"pushed_at":"2017-05-04T14:42:42.000Z","size":72700,"stargazers_count":84,"open_issues_count":1,"forks_count":29,"subscribers_count":16,"default_branch":"master","last_synced_at":"2024-04-15T12:24:08.059Z","etag":null,"topics":["cassandra","digital-ocean","docker-machine","kafka","spark","twitter","twitter-streaming-api","visualization"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rogaha.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-07-22T10:25:08.000Z","updated_at":"2024-02-17T16:44:43.000Z","dependencies_parsed_at":"2022-08-08T19:15:15.943Z","dependency_job_id":null,"html_url":"https://github.com/rogaha/data-processing-pipeline","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rogaha%2Fdata-processing-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rogaha%2Fdata-processing-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rogaha%2Fdata-processing-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rogaha%2Fdata-processing-pipeline/manifests","owner_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rogaha","download_url":"https://codeload.github.com/rogaha/data-processing-pipeline/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223310572,"owners_count":17124246,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cassandra","digital-ocean","docker-machine","kafka","spark","twitter","twitter-streaming-api","visualization"],"created_at":"2024-11-06T09:10:34.940Z","updated_at":"2024-11-06T09:10:35.629Z","avatar_url":"https://github.com/rogaha.png","language":"HTML","readme":"DATA-PROCESSING-PIPELINE\n==============\n\n## Description\n\nBuild a powerful *Real-Time Data Processing Pipeline \u0026 Visualization* solution using Docker Machine and Compose, Kafka, Cassandra and Spark in 5 steps.\n\nSee the project's architecture below: \n\n![Docker Architecture](images/project-architecture.png \"Project Architecture\")\n\n## What's happening under the hood? \nWe connect to the Twitter Streaming API (https://dev.twitter.com/streaming/overview) and listen for events based on a list of keywords; these events are forwarded directly to Kafka (no parsing). 
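The events arrive in Kafka as unparsed tweet JSON; the downstream extraction of *user.location*, *text* and *user.profile_image_url* can be sketched in Python (the payload below is a trimmed, hypothetical example, and `fields_of_interest` is our own illustrative helper, not code from this repo):

```python
import json

# A trimmed, hypothetical tweet event as forwarded to Kafka (no parsing).
raw_event = json.dumps({
    "text": "Learning scala and spark!",
    "user": {
        "location": "San Francisco, CA",
        "profile_image_url": "http://example.com/avatar.png",
    },
})

def fields_of_interest(message: str) -> dict:
    """Keep only the fields the Spark job persists from a raw tweet."""
    tweet = json.loads(message)
    user = tweet.get("user") or {}
    return {
        "location": user.get("location"),
        "text": tweet.get("text"),
        "profile_image_url": user.get("profile_image_url"),
    }

print(fields_of_interest(raw_event))
```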
In the middle, a Spark job collects those events and converts them into a Spark SQL context (http://spark.apache.org/sql/), which filters each Kafka message and extracts only the fields of interest, in this case *user.location, text and user.profile_image_url*. Once we have those, we convert the *location* into coordinates (lat,lng) using the Google Geocoding API (https://developers.google.com/maps/documentation/geocoding/intro) and persist the data into Cassandra. \n\nFinally, a web application fetches the data from Cassandra and renders the tweets of interest on a world map.\n\n![Project Screenshot](images/screenshot.png \"Docker Hackday Project\")\n\n### Some Interesting Project Stats\n##### Number of Containers: 8\n##### Number of Open Source Projects Used: 8\n##### Number of Programming Languages Used: 4 (Python, Bash, Scala, Java)\n\n## Pre-requisites\n\n### Docker (https://docs.docker.com/installation/)\n```\n$ wget -qO- https://get.docker.com/ | sh\n```\n\n### Docker Machine (https://docs.docker.com/machine/install-machine/)\n```\n$ curl -L https://github.com/docker/machine/releases/download/v0.8.2/docker-machine_`uname -s`-amd64 \u003e /usr/local/bin/docker-machine\n$ chmod +x /usr/local/bin/docker-machine\n```\n\n### Docker Compose (https://docs.docker.com/compose/install/)\n```\n$ curl -L https://github.com/docker/compose/releases/download/1.8.1/docker-compose-`uname -s`-`uname -m` \u003e /usr/local/bin/docker-compose\n$ chmod +x /usr/local/bin/docker-compose\n```\n\n## Project Installation / Usage\n### Step 1: Create a VM with Docker\nIf you already have a VM running or if you are on Linux, you can skip this step. 
Otherwise, the steps are the following:\n\n#### On Digital Ocean\n##### a) Create a Digital Ocean Token\nYou need to create a personal access token under “Apps \u0026 API” in the Digital Ocean Control Panel.\n\n##### b) Grab your access token, then run docker-machine create with these details:\n```\n$ docker-machine create --driver digitalocean --digitalocean-access-token=\u003caccess token\u003e Docker-VM\n```\n#### On VirtualBox\nYou just need to run: \n```\n$ docker-machine create -d virtualbox --virtualbox-memory 2048 Docker-VM\n```\n#### On Microsoft Azure\n##### a) Create a certificate\n```\n$ openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem\n$ openssl pkcs12 -export -out mycert.pfx -in mycert.pem -name \"My Certificate\"\n$ openssl x509 -inform pem -in mycert.pem -outform der -out mycert.cer\n```\n##### b) Upload the certificate to Microsoft Azure\nIn the Azure portal, go to the “Settings” page (the link is at the bottom of the left sidebar; you need to scroll), then “Management Certificates”, and upload mycert.cer.\n##### c) Grab your subscription ID from the portal (SUBSCRIPTIONS tab), then run docker-machine create with these details:\n```\n$ docker-machine create -d azure --azure-subscription-id=\"SUB_ID\" --azure-subscription-cert=\"mycert.pem\" --azure-size=\"Medium\" Docker-VM\n```\n##### d) Expose Port 80\nWhen viewing your VM in the resource group you've created, scroll down and click Endpoints to view the endpoints on the VM. Add a new *endpoint* that exposes port 80 and give it a name.\n\n#### Access the VM\nBy default, *docker-machine* spins up an Ubuntu 14.04 instance on all cloud providers. Since we are running multiple Java-based applications that consume a lot of memory, I added an extra parameter to the *docker-machine create* commands above to reserve at least 2GB of memory. 
The command below will ssh into the VM using your *ssh public key*: \n```\n$ docker-machine ssh Docker-VM\n```\n\n### Step 2: Getting Twitter API keys\n\nIn order to access the Twitter Streaming API, we need to get 4 pieces of information from Twitter: API key, API secret, Access token and Access token secret. Follow the steps below to get all 4 elements:\n\u003cpre\u003e\nCreate a Twitter account if you do not already have one.\nGo to https://apps.twitter.com/ and log in with your Twitter credentials.\nClick \"Create New App\".\nFill out the form, agree to the terms, and click \"Create your Twitter application\".\nOn the next page, click the \"API keys\" tab, and copy your \"API key\" and \"API secret\".\nScroll down and click \"Create my access token\", and copy your \"Access token\" and \"Access token secret\".\n\u003c/pre\u003e\n\n### Step 3: Clone this repo and update the docker-compose.yml file (https://docs.docker.com/compose/yml/)\nFirst you need to clone this repo:\n```\n$ git clone git@github.com:rogaha/data-processing-pipeline.git\n```\nThen we need to update the Kafka advertised host name, the Twitter API credentials and the keywords you want to track. 
Below are the environment variables that need to be updated:\n```\nKAFKA_ADVERTISED_HOST_NAME: \"\" (public IP or the IP of your local VM)\nACCESS_TOKEN: \"\"\nACCESS_TOKEN_SECRET: \"\"\nCONSUMER_KEY: \"\"\nCONSUMER_SECRET: \"\"\nKEYWORDS_LIST: \"\"\nGOOGLE_GEOCODING_API_KEY: \".\" (use \".\" to ignore it)\n```\nIn order to get the *public IP* of your Digital Ocean droplet, you can run this from the VM:\n```\n$ /sbin/ifconfig eth0 | grep 'inet addr:' | cut -d: -f2 | awk '{ print $1}'\n```\n\nThe *KEYWORDS_LIST* should be a comma-separated string, such as: \"python, scala, golang\"\n\n### Step 4: Start All the Containers\nWith docker-compose you can just run:\n```\n$ docker-compose up -d\n```\nThe output should be: \n```\nCreating dataprocessingpipeline_zookeeper_1...\nCreating dataprocessingpipeline_sparkmaster_1...\nCreating dataprocessingpipeline_kafka_1...\nCreating dataprocessingpipeline_twitterkafkaproducer_1...\nCreating dataprocessingpipeline_cassandra_1...\nCreating dataprocessingpipeline_sparkjob_1...\nCreating dataprocessingpipeline_webserver_1...\nCreating dataprocessingpipeline_sparkworker_1...\n```\n\nAfter that you should wait a few seconds; I've added a 15-second delay before starting the spark-job, kafka producer and web server containers, in order to make sure all the dependencies are up and running.\n### Step 5: Access the IP/Hostname of your machine from your browser\nI've cloned this repo, updated the environment variables and started the containers on Azure. 
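As a side note on the Step 3 configuration, the comma-separated *KEYWORDS_LIST* value presumably gets split into individual track terms before being handed to the streaming client; a minimal Python sketch of that parsing (the helper name is ours, not from this repo):

```python
def parse_keywords(keywords_list: str) -> list:
    """Split a comma-separated KEYWORDS_LIST value into track terms,
    trimming whitespace and dropping empty entries."""
    return [kw.strip() for kw in keywords_list.split(",") if kw.strip()]

print(parse_keywords("python, scala, golang"))  # ['python', 'scala', 'golang']
```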
\n\n## Open Source Projects Used\n\n#### Docker (https://github.com/docker/docker)\nAn open platform for distributed applications for developers and sysadmins\n#### Docker Machine (https://github.com/docker/machine)\nLets you create Docker hosts on your computer, on cloud providers, and inside your own data center\n#### Docker Compose (https://github.com/docker/compose)\nA tool for defining and running multi-container applications with Docker\n#### Apache Spark / Spark SQL (https://github.com/apache/spark)\nA fast, in-memory data processing engine. Spark SQL lets you query structured data as a resilient distributed dataset (RDD)\n#### Apache Kafka (https://github.com/apache/kafka)\nA fast and scalable pub-sub messaging service\n#### Apache Zookeeper (https://github.com/apache/zookeeper)\nA distributed configuration service, synchronization service, and naming registry for large distributed systems\n#### Apache Cassandra (https://github.com/apache/cassandra)\nA scalable, highly available, distributed columnar NoSQL database\n#### D3 (https://github.com/mbostock/d3)\nA JavaScript visualization library for HTML and SVG. \n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frogaha%2Fdata-processing-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frogaha%2Fdata-processing-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frogaha%2Fdata-processing-pipeline/lists"}