{"id":20252403,"url":"https://github.com/code-rider/spark-multiple-job-examples","last_synced_at":"2025-10-19T09:24:53.039Z","repository":{"id":88548346,"uuid":"42175464","full_name":"code-rider/Spark-multiple-job-Examples","owner":"code-rider","description":"Spark Mlib clustering and Spark Twitter Steaming tutorial","archived":false,"fork":false,"pushed_at":"2015-09-10T10:04:21.000Z","size":6376,"stargazers_count":4,"open_issues_count":1,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-10T23:16:21.585Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/code-rider.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-09-09T11:53:41.000Z","updated_at":"2018-07-30T09:00:31.000Z","dependencies_parsed_at":"2023-03-01T03:15:10.844Z","dependency_job_id":null,"html_url":"https://github.com/code-rider/Spark-multiple-job-Examples","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code-rider%2FSpark-multiple-job-Examples","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code-rider%2FSpark-multiple-job-Examples/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code-rider%2FSpark-multiple-job-Examples/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/code-rider%2FSpark-multiple-job-Examples/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/code-rider","download_url":"https://codeload.gi
thub.com/code-rider/Spark-multiple-job-Examples/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248312133,"owners_count":21082638,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T10:16:32.530Z","updated_at":"2025-10-19T09:24:52.911Z","avatar_url":"https://github.com/code-rider.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"Spark multiple job Examples\n============================\n\u003cp\u003eDownload Spark-multiple-job-Examples\u003c/p\u003e \n\u003ch2\u003eDependencies\u003c/h2\u003e\n\n\u003ca href=\"http://sourceforge.net/projects/simplevoronoi/\"\u003esimplevoronoi\u003c/a\u003e\u003cbr\u003e\n\u003cp\u003eDownload simplevoronoi and place simplevoronoi-version-SNAPSHOT.jar in the Spark-multiple-job-Examples/lib directory\u003c/p\u003e\n\n\u003ch2\u003eCreate jar file\u003c/h2\u003e\n\u003cp\u003ecd Spark-multiple-job-Examples and run\u003c/p\u003e\n\u003cblockquote\u003esbt\u003c/blockquote\u003e\n\u003cblockquote\u003e\u003e package\u003c/blockquote\u003e\n\n\u003cp\u003eAfter a successful build, target/scala-2.10/spark_multiple_job_examples_2.10-SNAPSHOT-0.1.jar should be created.\nWe will now run classes from this jar as Spark jobs.\u003c/p\u003e\n\n\u003ch2\u003eSpark dependencies\u003c/h2\u003e\n\n\u003ca href=\"http://mvnrepository.com/artifact/org.twitter4j/twitter4j-core/3.0.3\"\u003etwitter4j-core\u003c/a\u003e\u003cbr\u003e\n\u003ca 
href=\"http://mvnrepository.com/artifact/org.twitter4j/twitter4j-stream/3.0.3\"\u003eTwitter4j Stream\u003c/a\u003e\u003cbr\u003e\n\u003ca href=\"http://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.10/1.3.0\"\u003espark-streaming\u003c/a\u003e\u003cbr\u003e\n\u003ca href=\"http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-twitter_2.10/1.4.1\"\u003espark-streaming-twitter\u003c/a\u003e\u003cbr\u003e\n\u003ca href=\"http://sourceforge.net/projects/simplevoronoi/\"\u003esimplevoronoi\u003c/a\u003e\u003cbr\u003e\n\n\u003cp\u003eDownload the listed libraries somewhere on disk\u003c/p\u003e\n\u003cp\u003eand add them to SPARK_CLASSPATH in your user .bashrc file or in spark-env.sh\u003c/p\u003e\n\u003cp\u003eIn this tutorial we export them in spark-env.sh, so add these lines to your SparkHome/conf/spark-env.sh\u003c/p\u003e\n\u003cbr\u003e\n\u003cblockquote\u003e\nexport SPARK_CLASSPATH=PathToFile/twitter4j-core-3.0.3.jar:$SPARK_CLASSPATH\u003cbr\u003e\nexport SPARK_CLASSPATH=PathToFile/twitter4j-stream-3.0.3.jar:$SPARK_CLASSPATH\u003cbr\u003e\nexport SPARK_CLASSPATH=PathToFile/spark-streaming-twitter_2.10-1.4.1.jar:$SPARK_CLASSPATH\u003cbr\u003e\nexport SPARK_CLASSPATH=PathToFile/simplevoronoi-0.2-SNAPSHOT.jar:$SPARK_CLASSPATH\u003cbr\u003e\n\u003c/blockquote\u003e\n\u003ch2\u003eExamples\u003c/h2\u003e\n\n\u003ch2\u003eWrite tweets from the Twitter live stream to a CSV file\u003c/h2\u003e\n\u003cbr\u003e\n\u003cp\u003e\u003cstrong\u003e1.1:\u003c/strong\u003e You need Twitter API credentials; if you don't have them, create them first \u003ca href=\"https://apps.twitter.com/\"\u003ehere\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e1.2:\u003c/strong\u003e Create a file for the Twitter API credentials, for example twitter-credentials.txt\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e1.3:\u003c/strong\u003e Enter the credentials\u003c/p\u003e\n\u003cblockquote\u003e    
\nTWITTER_API_KEY=ApiKey\u003cbr\u003e\nTWITTER_API_SECRET=ApiSecret\u003cbr\u003e\nTWITTER_ACCESS_TOKEN=AccessToken\u003cbr\u003e\nTWITTER_ACCESS_TOKEN_SECRET=AccessTokenSecret\u003cbr\u003e\n\u003c/blockquote\u003e\n\u003cp\u003e\u003cstrong\u003e1.4:\u003c/strong\u003e Run the sbt job to write the live Twitter stream to a CSV file\u003c/p\u003e\n\u003cblockquote\u003esbt \"run-main xulu.FetchTweets twitter-credentials.txt tweets.csv\"\u003c/blockquote\u003e\n\u003cp\u003eThis job writes tweets in the format \"Longitude,Latitude,Text\".\u003cbr\u003eTo change the format, modify FetchTweets.scala and run again\u003c/p\u003e\n\n\u003cp\u003eShell output\u003c/p\u003e\n\u003cblockquote\u003e \n\u003cp\u003e-56.544541,-29.089541,Por que ni estamos jugando,\u003c/p\u003e\n\u003cp\u003e-69.922686,18.462675,Aprenda hablar amigo\u003c/p\u003e\n\u003cp\u003e-118.565107,34.280215,today a boy told me I'm pretty\u003c/p\u003e\n\u003cp\u003e121.039399,14.72272,@Kringgelss labuyoo. Hahaha\u003c/p\u003e\n\u003cp\u003e-34.875339,-7.158832@keithmeneses_ oi td bem? 
sdds 😔💚\u003c/p\u003e\n\u003cp\u003e103.766123,1.380696,Xian Lim on iShine 3 2\u003c/p\u003e\n\u003cp\u003e......\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eWhen you have enough tweets, stop the program by pressing CTRL + C\u003c/p\u003e\n\u003ch2\u003eSpark live Twitter stream\u003c/h2\u003e\n\n\u003cp\u003e\u003cstrong\u003e2.1:\u003c/strong\u003e \nEdit the file Spark-multiple-job-Examples/src/main/scala/xulu/TwitterLiveStreaming.scala \u003cbr\u003e\nFind the line ssc.checkpoint(\"hdfs://HDFS_IP:9000/checkpoint\") at the end of the file.\u003cbr\u003e\nReplace HDFS_IP with the Hadoop master IP and rebuild the jar file\u003c/p\u003e\n\u003cp\u003ecd Spark-multiple-job-Examples and run\u003c/p\u003e\n\u003cblockquote\u003esbt\u003c/blockquote\u003e\n\u003cblockquote\u003e\u003e package\u003c/blockquote\u003e\n\n\u003cp\u003ecd to the Spark home directory\u003c/p\u003e\n\u003cblockquote\u003ebin/spark-submit --class xulu.TwitterLiveStreaming --master spark://sparkMasterIP:7077 PathTo/spark_multiple_job_examples_2.10-SNAPSHOT-0.1.jar (optional keywords separated by spaces)\n\u003c/blockquote\u003e\n\n\u003cp\u003eThis example only prints the tweet text from the Twitter live stream to the shell\u003c/p\u003e \n\u003cp\u003eYou can write the output anywhere or change the query: \u003c/p\u003e\n\u003cp\u003emodify TwitterLiveStreaming.scala, rebuild your jar, and run again\u003c/p\u003e\n\n\u003ch2\u003eSegmenting Audience with KMeans and Voronoi Diagram using Spark and MLlib\u003c/h2\u003e\n\u003cp\u003eIn this example, we will be using the \u003ca href=\"http://en.wikipedia.org/wiki/K-means_clustering\"\u003ek-means clustering\u003c/a\u003e algorithm implemented in \u003ca href=\"https://spark.apache.org/mllib/\"\u003eSpark Machine Learning Library\u003c/a\u003e (MLlib) to segment the dataset by geolocation.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.1:\u003c/strong\u003e We need Twitter data to run this example.\u003c/p\u003e\n\u003cp\u003eExample one shows how to write 
Twitter data to a file, so run example one to fetch some data from Twitter.\u003c/p\u003e\n\u003cp\u003eFor your convenience, we provide the file tweets_drink.csv\u003c/p\u003e\n\t\t \n\u003cp\u003e\u003cstrong\u003e3.2:\u003c/strong\u003e Upload the Twitter data file tweets_drink.csv to Hadoop\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003e3.3:\u003c/strong\u003e Run the job\u003c/p\u003e\n\u003cblockquote\u003ebin/spark-submit --class xulu.KMeansApp --master spark://SparkMasterIP:7077 PathTo/spark_multiple_job_examples_2.10-SNAPSHOT-0.1.jar hdfs://HDFS_IP:9000/UploadedLocation/tweets_drink.csv results/KmeanApp.png \n\u003c/blockquote\u003e\n\u003ch4\u003eResult\u003c/h4\u003e\n\u003cp\u003e\u003ca href=\"https://raw.githubusercontent.com/code-rider/Spark-multiple-job-Examples/master/results/KmeanApp.png\" target=\"_blank\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/code-rider/Spark-multiple-job-Examples/master/results/KmeanApp.png\" alt=\"Kmean App result\" /\u003e\u003c/a\u003e\u003c/p\u003e\n\n\u003ch2\u003eAnalyzing your audience location with Twitter Streams and Heat Maps\u003c/h2\u003e\n\n\u003cp\u003e\u003cstrong\u003e4.1:\u003c/strong\u003e Download Twitter data about drinks \u003c/p\u003e\n\u003cp\u003eYou can run example one with drink-related keywords to collect tweets that mention a drink:\u003c/p\u003e\n\n\u003cblockquote\u003esbt \"run-main xulu.FetchTweets twitter-credentials.txt tweets_drink.csv \\\nredbull schweppes coke cola pepsi fanta orangina soda \\\ncoffee cafe expresso latte tea \\\nalcohol booze alcoholic whiskey tequila vodka booze cognac baccardi \\\ndrink beer rhum liquor gin ouzo brandy mescal alcoholic wine drink\"\t\n\u003c/blockquote\u003e\t\t\n\u003cp\u003eWhen you have enough tweets, stop the program by pressing CTRL + C\u003c/p\u003e\n\t\t\n\u003cp\u003eUnfortunately, only a fraction of all tweets have geolocation information (the publisher has to tweet from a phone and has to opt in to sharing their position). 
So you might need to wait several hours (even days if the words are not popular) to get enough tweets to draw on a map. For your convenience, we provide the file tweets_drink.csv, which already contains the tweets that we collected using those keywords.\u003c/p\u003e\n\n\u003cp\u003eEach line of the tweet file contains the tweet's longitude, latitude, and message.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003e4.2:\u003c/strong\u003e Upload the Twitter data file tweets_drink.csv to Hadoop\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e4.3:\u003c/strong\u003e Run the job\u003c/p\u003e\n\u003cblockquote\u003ebin/spark-submit --class xulu.HeatMap --master spark://SparkMasterIP:7077 PathTo/spark_multiple_job_examples_2.10-SNAPSHOT-0.1.jar hdfs://HDFS_IP:9000/UploadedLocation/tweets_drink.csv results/heatmapApp.png 0.5 coke pepsi\t\n\u003c/blockquote\u003e\n\u003ch4\u003eResult\u003c/h4\u003e\n\u003cp\u003e\n  \u003ca href=\"https://raw.githubusercontent.com/code-rider/Spark-multiple-job-Examples/master/results/heatmapApp.png\" target=\"_blank\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/code-rider/Spark-multiple-job-Examples/master/results/heatmapApp.png\" alt=\"HeatMap App result\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e \n\t\t \n\u003cp\u003eOn this map, we can clearly see that the word 'coke' (in green) is used much more than the word 'pepsi' (in red) in the tweets. Some places are yellow, which means that there are tweets about both coke and pepsi (yellow = red + green). 
Interestingly enough, we can see that coke is not used much in South America unlike pepsi which is used in Brazil and in Argentina.\u003c/p\u003e\n\t\t \n\u003ch2\u003eFollowing\u003c/h2\u003e\nhttps://chimpler.wordpress.com/2014/07/11/segmenting-audience-with-kmeans-and-voronoi-diagram-using-spark-and-mllib/ \u003cbr\u003e\nhttps://chimpler.wordpress.com/2014/06/26/analyzing-your-audience-location-with-twitter-streams-and-heat-maps/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcode-rider%2Fspark-multiple-job-examples","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcode-rider%2Fspark-multiple-job-examples","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcode-rider%2Fspark-multiple-job-examples/lists"}