https://github.com/mariussoutier/spark-intro
Companion code to "Intro to Apache Spark" talk
- Host: GitHub
- URL: https://github.com/mariussoutier/spark-intro
- Owner: mariussoutier
- Created: 2015-03-28T19:41:09.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2015-03-29T11:01:29.000Z (over 9 years ago)
- Last Synced: 2024-04-16T10:21:19.635Z (7 months ago)
- Language: Scala
- Size: 92.8 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Intro to Spark
Intro to exploring data and extracting useful information using [Apache Spark](http://spark.apache.org).
The code was presented at the Scala User Group Cologne; the slides are [here](http://www.slideshare.net/marius-soutier/spark-intro-scala-ug).

There are several jobs provided that you can run for each topic. Build the JAR using `sbt ";clean;assembly"`.
Run it on Spark using `spark-submit --master <master-url> --class <main-class> target/scala-2.10/spark-demo-assembly-1.0.jar`.

For interactive exploration, you can use `spark-shell`. When you run it, it already provides a SparkContext called `sc`.
If you are using an IDE like IntelliJ or Eclipse, you should try out their worksheets or Scala console.
When you are starting a SparkContext, pass the master explicitly, e.g.:
`val sc = new SparkContext(master = "<master-url>", appName = "Demo")`.

*Hint*: You can easily start any job or shell by using the Spark local mode, e.g. `spark-shell --master local[*]`.
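
To make the two points above concrete, here is a minimal sketch of a standalone program that passes the master explicitly and uses local mode. The object name `LocalDemo` and the even-number count are purely illustrative and not part of this repository; the sketch assumes Spark is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not one of the jobs shipped in this repo.
object LocalDemo {
  /** Counts the even numbers in 1 to 1000 on the given context. */
  def evenCount(sc: SparkContext): Long =
    sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM, using all available cores.
    val conf = new SparkConf().setMaster("local[*]").setAppName("Demo")
    val sc = new SparkContext(conf)
    try println(s"Even numbers: ${evenCount(sc)}")
    finally sc.stop() // always release the context when done
  }
}
```

Inside `spark-shell` you would skip the setup and call `LocalDemo.evenCount(sc)` with the `sc` that the shell already provides.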
## Google Web Graph
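
The SNAP data set used in this section is a plain-text edge list: header lines start with `#`, and every other line holds a source and a target node id separated by whitespace. The following is a minimal sketch of parsing that format in plain Scala. It is not the code in `GoogleWebGraph.scala`, and the names `WebGraphParsing`, `parseEdge`, and `outDegrees` are made up for illustration.

```scala
import scala.util.Try

// Hypothetical helper, assuming the "# comment" / "from<TAB>to" layout described above.
object WebGraphParsing {
  /** Parses one line of the edge list; comments and malformed lines yield None. */
  def parseEdge(line: String): Option[(Int, Int)] = {
    val trimmed = line.trim
    if (trimmed.isEmpty || trimmed.startsWith("#")) None
    else trimmed.split("\\s+") match {
      case Array(from, to) => Try((from.toInt, to.toInt)).toOption
      case _               => None
    }
  }

  /** Counts outgoing links per source node. */
  def outDegrees(lines: Seq[String]): Map[Int, Int] =
    lines
      .flatMap(line => parseEdge(line).toList)
      .groupBy { case (from, _) => from }
      .map { case (node, edges) => (node, edges.size) }
}
```

With a SparkContext `sc`, the same parser could be applied to the real file via `sc.textFile("src/main/resources/web-Google.txt.gz").flatMap(line => WebGraphParsing.parseEdge(line).toList)`; Spark reads gzip-compressed text files transparently.
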
* Download `web-Google.txt.gz` from https://snap.stanford.edu/data/web-Google.html and put it in `src/main/resources`.
* Check out the `GoogleWebGraph.scala` job.

## GitHub Commits
1. Obtain an OAuth key on GitHub (Settings > Applications > Personal access tokens > Generate)
2. Execute `download_github.sh `
3. Execute `ProcessGitHubData.scala`
   -> The files should end up in `src/main/resources/github`, one commit per line.

Now play around with `GitHubSql.scala`.
## Audioscrobbler
* Download Audioscrobbler profile data from http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html and put it in `src/main/resources`.
* Check out the `Audioscrobbler.scala` job.
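
As a starting point for your own experiments, here is a minimal sketch of parsing play counts in plain Scala. It assumes the commonly distributed layout of the data set, where `user_artist_data.txt` holds one `userId artistId playCount` triple per line, separated by spaces. Check your download before relying on this. The names `AudioscrobblerParsing`, `Play`, `parsePlay`, and `playsPerArtist` are hypothetical and not taken from `Audioscrobbler.scala`.

```scala
import scala.util.Try

// Hypothetical helper; the line layout is an assumption about the data set.
object AudioscrobblerParsing {
  final case class Play(userId: Int, artistId: Int, count: Int)

  /** Parses "userId artistId playCount"; malformed lines yield None. */
  def parsePlay(line: String): Option[Play] =
    line.trim.split("\\s+") match {
      case Array(user, artist, count) =>
        Try(Play(user.toInt, artist.toInt, count.toInt)).toOption
      case _ => None
    }

  /** Sums play counts per artist id. */
  def playsPerArtist(plays: Seq[Play]): Map[Int, Long] =
    plays
      .groupBy(_.artistId)
      .map { case (artist, ps) => (artist, ps.map(_.count.toLong).sum) }
}
```

On a cluster, the same parser could feed an RDD, e.g. by mapping the parsed plays to `(artistId, count)` pairs and combining them with `reduceByKey(_ + _)` instead of the in-memory `groupBy`.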