https://github.com/chyh1990/jruby-spark
https://github.com/chyh1990/jruby-spark
Last synced: about 2 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/chyh1990/jruby-spark
- Owner: chyh1990
- Created: 2016-04-10T15:04:32.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2016-04-27T11:37:34.000Z (about 9 years ago)
- Last Synced: 2024-11-14T08:35:16.176Z (6 months ago)
- Language: Java
- Size: 712 KB
- Stars: 8
- Watchers: 3
- Forks: 2
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- data-science-with-ruby - jruby-spark
README
# JRuby on Spark
Yuheng Chen
Combining Spark with elegant Ruby closure on JVM.
## Setting up
Download Spark 1.6 distribution.
Note that if you use `rvm` or similiar tools, use
```
rvm use system
```
to avoid loading installed gems.Download JRuby 9.1.0.0
[snapshot](http://ci.jruby.org/snapshots/master/jruby-complete-9.1.0.0-SNAPSHOT.jar)
to `./libs`, then build the jruby spark:```
rake package`
```The release package is available in `pkg`.
## REPL
Run JRuby Spark REPL shell:
```
export SPARK_HOME=$YOUR_SPARK_HOME
./bin/jruby-spark-repl.sh
```By default, `jruby-spark-repl.sh` use `spark-submit` to submit the REPL job
locally, similiar to `pyspark`.Run the word count example:
```
$sc.text_file('SOME_TXT_FILE.txt').flat_map{|x| x.downcase.split(/[^\w']+/)}
.map_to_pair{|x| tuple(x, 1)}.reduce_by_key(:+).collect_as_map
```Note how well Ruby fits into Spark framework:
* Block and lambda
* symbol to `proc`
* use powerful utilities in Ruby's standard library## One script job
## Job with depedencies
## Architecture
### Closure marhshalling
### Java interoperation
### Size estimator
### Conclusion