https://github.com/newfront/odsc-east2019-warmup
Warmup Presentation for The 2019 Open Data Science Conference in Boston
- Host: GitHub
- URL: https://github.com/newfront/odsc-east2019-warmup
- Owner: newfront
- License: gpl-3.0
- Created: 2019-02-17T23:16:07.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2019-07-24T18:17:35.000Z (almost 7 years ago)
- Last Synced: 2025-03-24T18:49:58.098Z (about 1 year ago)
- Size: 48.6 MB
- Stars: 1
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
### Real-ish Time Predictive Analytics Warm Up
This is the essential prerequisite for the May 1st workshop in Boston at the ODSC East Conference.
#### WarmUp DataSet
[Wine Reviews](https://www.kaggle.com/zynicide/wine-reviews) - Thanks to zynicide and kaggle.com for the data set.
#### Technologies
1. [Spark 2.4.0](https://www.apache.org/dyn/closer.lua/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz)
2. [Zeppelin 0.8.1](http://www.apache.org/dyn/closer.cgi/zeppelin/zeppelin-0.8.1/zeppelin-0.8.1-bin-netinst.tgz)
3. [Hadoop 2.7.7](https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz)
#### Spark Local Setup
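Download, untar, and move Spark. I usually drop it into `/usr/local`. The `-C` flag makes `tar` extract next to the archive rather than into the current directory, so the `mv` source path exists.
`tar -xvzf /path/to/spark-2.4.0-bin-hadoop2.7.tgz -C /path/to/ && mv /path/to/spark-2.4.0-bin-hadoop2.7/ /usr/local/spark-2.4.0/`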
#### Zeppelin Setup
[Spark Setup](https://zeppelin.apache.org/docs/0.8.1/interpreter/spark.html)
#### Hadoop Single Node Setup
[Setup Documentation](http://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-common/SingleCluster.html)
#### Local Aliases in .bashrc or .bash_profile
~~~
export SPARK_HOME=/usr/local/spark-2.4.0
export ZEPPELIN_HOME=/usr/local/zeppelin-0.8.1
export HADOOP_HOME=/usr/local/hadoop-2.7.7
# Zeppelin
alias zeppelin_start="$ZEPPELIN_HOME/bin/zeppelin-daemon.sh --config $ZEPPELIN_HOME/conf/ start"
alias zeppelin_stop="$ZEPPELIN_HOME/bin/zeppelin-daemon.sh --config $ZEPPELIN_HOME/conf/ stop"
# Hadoop
alias start_hdfs="$HADOOP_HOME/sbin/start-dfs.sh"
alias stop_hdfs="$HADOOP_HOME/sbin/stop-dfs.sh"
alias hdfs="$HADOOP_HOME/bin/hdfs"
~~~
```
source ~/.bash_profile
```
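After sourcing the profile, it can help to confirm that the three home directories actually exist. The following is a minimal sketch, not part of the original setup; `missing_dirs` is a hypothetical helper name, and the check assumes only that each exported path should be a directory.

```shell
# Hypothetical helper (not from the original README): print every argument
# that is not an existing directory and return non-zero if any are missing.
missing_dirs() {
  rc=0
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "missing: ${dir:-<unset>}"
      rc=1
    fi
  done
  return $rc
}

# Check the three exports from the profile; unset variables show as <unset>.
missing_dirs "${SPARK_HOME:-}" "${ZEPPELIN_HOME:-}" "${HADOOP_HOME:-}" \
  || echo "fix the exports before continuing"
```

If everything is installed where the exports say, the check prints nothing.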
#### Zeppelin Config (zeppelin-env.sh)
You will need to set up some basic options in `zeppelin-env.sh`.
`vim /usr/local/zeppelin-0.8.1/conf/zeppelin-env.sh`
~~~bash
# /usr/local/zeppelin-0.8.1/conf/zeppelin-env.sh
#### Spark interpreter configuration ####
## Use provided spark installation ##
## defining SPARK_HOME makes Zeppelin run spark interpreter process using spark-submit
##
export SPARK_HOME=/usr/local/spark-2.4.0 # (required) When it is defined, load it instead of Zeppelin embedded Spark libraries
# export SPARK_SUBMIT_OPTIONS # (optional) extra options to pass to spark submit. eg) "--driver-memory 512M --executor-memory 1G".
# export SPARK_APP_NAME # (optional) The name of spark application.
## Spark interpreter options ##
##
# export ZEPPELIN_SPARK_USEHIVECONTEXT # Use HiveContext instead of SQLContext if set true. true by default.
# export ZEPPELIN_SPARK_CONCURRENTSQL # Execute multiple SQL concurrently if set true. false by default.
# export ZEPPELIN_SPARK_IMPORTIMPLICIT # Import implicits, UDF collection, and sql if set true. true by default.
# export ZEPPELIN_SPARK_MAXRESULT # Max number of Spark SQL result to display. 1000 by default.
export ZEPPELIN_WEBSOCKET_MAX_TEXT_MESSAGE_SIZE=2048000 # Size in characters of the maximum text message to be received by websocket. Defaults to 1024000
export ZEPPELIN_INTERPRETER_OUTPUT_LIMIT=2048000
~~~
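Since most lines in `zeppelin-env.sh` ship commented out, it is easy to leave `SPARK_HOME` commented by accident. The sketch below checks for an uncommented `export` line; `has_active_export` is a hypothetical helper name, and the fallback path assumes the install location used in this README.

```shell
# Hypothetical helper (not from the original README): succeed if FILE has an
# uncommented line of the form `export VAR=...`.
has_active_export() {
  file=$1
  var=$2
  grep -Eq "^[[:space:]]*export[[:space:]]+${var}=" "$file"
}

# Assumed location, taken from the paths used in this README.
conf="${ZEPPELIN_HOME:-/usr/local/zeppelin-0.8.1}/conf/zeppelin-env.sh"
if [ -f "$conf" ] && has_active_export "$conf" SPARK_HOME; then
  echo "SPARK_HOME is exported in $conf"
else
  echo "SPARK_HOME is not exported in $conf (or the file is missing)"
fi
```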
#### Zeppelin Interpreter Spark Settings
1. In the terminal, run `zeppelin_start` (this assumes you have added the aliases to your bash profile). It should emit a green `[OK]` once Zeppelin is running.
2. Go to http://localhost:8080/#/interpreter
3. Under the Spark section, click the edit icon and add `spark.executor.memory: 6g` and `zeppelin.spark.maxResult: 50000`. Note that with `spark.cores.max: 4` you will need `24g` of RAM to run Zeppelin, since `spark.cores.max * spark.executor.memory = runtime RAM dependency`.
4. Click `save` and the interpreter will restart with your updated settings.
[Zeppelin Spark Doc](https://zeppelin.apache.org/docs/0.8.1/interpreter/spark.html)
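The sizing rule from step 3 can be sketched as plain shell arithmetic. `runtime_ram_gb` is a hypothetical helper name; it simply applies the README's formula to whole-gigabyte values and does not query Spark or Zeppelin.

```shell
# Hypothetical helper (not from the original README): apply the rule
# spark.cores.max * spark.executor.memory, with memory given in whole GB.
runtime_ram_gb() {
  cores=$1
  executor_mem_gb=$2
  echo $(( cores * executor_mem_gb ))
}

# Settings from step 3: 4 cores, 6g per executor.
runtime_ram_gb 4 6   # prints 24
```

If your machine has less RAM than the result, lowering either setting reduces the requirement proportionally.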
#### Note to Self
The `localhost` alias can break on your computer because of firewall issues. If `/etc/hosts` points `localhost` to `127.0.0.1`, you are golden. A VPN can also interfere; if the localhost bindings are somehow hosed, use `127.0.0.1:8080` for the Zeppelin UI.
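One way to check the hosts mapping without opening the file by hand is sketched below. `localhost_is_loopback` is a hypothetical helper name; it only inspects the hosts file and does not test firewall or VPN behavior.

```shell
# Hypothetical helper (not from the original README): succeed if the hosts
# file has an uncommented line mapping 127.0.0.1 to the name "localhost".
localhost_is_loopback() {
  hosts_file=${1:-/etc/hosts}
  awk '
    /^[[:space:]]*#/ { next }            # skip comment lines
    $1 == "127.0.0.1" {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^#/) break             # stop at a trailing comment
        if ($i == "localhost") found = 1
      }
    }
    END { exit !found }
  ' "$hosts_file"
}

if localhost_is_loopback /etc/hosts; then
  echo "localhost maps to 127.0.0.1: use http://localhost:8080"
else
  echo "localhost mapping not found: try http://127.0.0.1:8080"
fi
```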