Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gaussalgo/big_data_hackathon_-_slovak_telekom
- Host: GitHub
- URL: https://github.com/gaussalgo/big_data_hackathon_-_slovak_telekom
- Owner: gaussalgo
- Created: 2017-05-11T08:34:42.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-05-12T17:18:02.000Z (over 7 years ago)
- Last Synced: 2024-11-08T12:34:35.748Z (2 months ago)
- Language: Jupyter Notebook
- Size: 30.3 KB
- Stars: 0
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Big Data Hackathon by Slovak Telekom
## Environment setup
The hackathon setup consists of these elements:
* Team nodes
* Cloudera CDH Enterprise cluster

## Environment Patterns
Each team has a private virtual server. All servers are in the same subnet and have access to the CDH cluster. Access to these servers is possible via a public address, with a unique private/public key pair available for each team.
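For example, connecting to a team node might look like the sketch below; the key file name and address are placeholders, so use the credentials handed to your team.

```bash
# connect to the team node with the team's private key
# (key file and address below are placeholders)
ssh -i teamX_private_key teamX@<team-node-public-address>
```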
## Team nodes
Each team node is equipped with a set of tools and configuration to access the environment. These tools are pre-installed on a private node for each team. Participating teams may use the server however they wish. The teamX user will have sudo privileges.
### Jupyter Notebooks
* launching Jupyter Notebooks with a Python kernel and PySpark connectivity:
```bash
# use Jupyter Notebook as the PySpark driver, listening on all interfaces
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip=0.0.0.0 --allow-root"
pyspark2
```
* optionally you can also switch to an R kernel
> it's recommended to run these commands in screen or tmux in case of loss of connectivity
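A minimal tmux workflow, assuming tmux is installed on the team node, might look like this (the session name is arbitrary):

```bash
# start a named session and run the export/pyspark2 commands above inside it
tmux new -s pyspark
# detach with Ctrl-b d; later, reattach to the running session
tmux attach -t pyspark
```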
### HDFS File Access
* basic Hadoop clients are configured on each team node
* to list your home directory on HDFS you can run the following command:
```bash
hdfs dfs -ls /user/team
```
* for more details check the [Hadoop documentation](https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html)
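* a few other commonly used commands follow the same pattern; a short sketch (the local file name is a placeholder):
```bash
# create a subdirectory in your HDFS home and upload a local file into it
hdfs dfs -mkdir /user/team/data
hdfs dfs -put local_file.csv /user/team/data/
# copy it back to the local filesystem
hdfs dfs -get /user/team/data/local_file.csv .
```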
> An alternative option, which could be more attractive for web development, is using WebHDFS.
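As a sketch, listing the same home directory over the WebHDFS REST API could look like this; the NameNode host below is a placeholder, and 50070 is the default HTTP port in Hadoop 2.x:

```bash
# list /user/team via the WebHDFS REST API
curl -i "http://<namenode-host>:50070/webhdfs/v1/user/team?op=LISTSTATUS"
```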
### Using Spark to access data in Hive Tables
* assuming you have a pyspark2 shell running with a SparkContext available, the code below reads a table's content into a Spark DataFrame:
```python
from pyspark.sql import HiveContext

# sc (the SparkContext) is already available in the pyspark2 shell
sqlContext = HiveContext(sc)
df = sqlContext.sql("select * from database.table")
```
* for more details check the Spark [SQL documentation](http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables)
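* from there, the DataFrame can be explored in the usual way; a short sketch (the view name is arbitrary):
```python
# inspect the schema and a sample of the rows
df.printSchema()
df.show(10)

# register the DataFrame as a temporary view for further SQL queries
df.createOrReplaceTempView("my_table")
sqlContext.sql("select count(*) from my_table").show()
```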
### Apache Kafka
* Apache Kafka will run using the default Cloudera setup
* brokers will be listening on port 9092
* ZooKeeper will be available on port 2181
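* a quick way to test connectivity is with the console producer and consumer shipped with Kafka; a sketch assuming the default Cloudera setup, with the broker host and topic name as placeholders:
```bash
# produce a few test messages (type lines, Ctrl-C to stop)
kafka-console-producer --broker-list <broker-host>:9092 --topic test-topic

# consume them from the beginning in another terminal
kafka-console-consumer --bootstrap-server <broker-host>:9092 --topic test-topic --from-beginning
```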