Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gaussalgo/big_data_hackathon_-_slovak_telekom
- Host: GitHub
- URL: https://github.com/gaussalgo/big_data_hackathon_-_slovak_telekom
- Owner: gaussalgo
- Created: 2017-05-11T08:34:42.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-05-12T17:18:02.000Z (over 7 years ago)
- Last Synced: 2024-11-08T12:34:35.748Z (2 months ago)
- Language: Jupyter Notebook
- Size: 30.3 KB
- Stars: 0
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Big Data Hackathon by Slovak Telekom
## Environment setup
The hackathon setup consists of these elements:
* Team nodes
* Cloudera CDH Enterprise cluster

## Environment Patterns
Each team has a private virtual server. All servers are in the same subnet and have access to the CDH cluster. Access to these servers is possible via a public address, with a unique private/public key pair available for each team.
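For example, connecting to a team node might look like the sketch below; the key file name and address are placeholders, so use the credentials handed to your team.

```bash
# connect to the team node with the team's private key
# (key file and address below are placeholders)
ssh -i teamX_private_key teamX@<team-node-public-address>
```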
## Team nodes
Each team node is equipped with a set of tools and configuration to access the environment. These tools are pre-installed on a private node for each team. Participating teams may use the server however they wish. The teamX user will have sudo privileges.
### Jupyter Notebooks
* launching Jupyter Notebooks with a Python kernel and PySpark connectivity:
```bash
# use Jupyter Notebook as the PySpark driver, listening on all interfaces
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip=0.0.0.0 --allow-root"
pyspark2
```
* optionally you can also switch to an R kernel
> it's recommended to run these commands in screen or tmux in case of loss of connectivity
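A minimal tmux workflow, assuming tmux is installed on the team node, might look like this (the session name is arbitrary):

```bash
# start a named session and run the export/pyspark2 commands above inside it
tmux new -s pyspark
# detach with Ctrl-b d; later, reattach to the running session
tmux attach -t pyspark
```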
### HDFS File Access
* basic Hadoop clients are configured on each team node
* to list your home directory on HDFS you can run the following command:
```bash
hdfs dfs -ls /user/team
```
* for more details check the [Hadoop documentation](https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html)
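* a few other commonly used commands follow the same pattern; a short sketch (the local file name is a placeholder):
```bash
# create a subdirectory in your HDFS home and upload a local file into it
hdfs dfs -mkdir /user/team/data
hdfs dfs -put local_file.csv /user/team/data/
# copy it back to the local filesystem
hdfs dfs -get /user/team/data/local_file.csv .
```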
> An alternative option, which could be more attractive for web development, is using WebHDFS.
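As a sketch, listing the same home directory over the WebHDFS REST API could look like this; the NameNode host below is a placeholder, and 50070 is the default HTTP port in Hadoop 2.x:

```bash
# list /user/team via the WebHDFS REST API
curl -i "http://<namenode-host>:50070/webhdfs/v1/user/team?op=LISTSTATUS"
```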
### Using Spark to access data in Hive Tables
* assuming you have a pyspark2 shell running with a SparkContext available, the code below reads a table's content into a Spark DataFrame:
```python
from pyspark.sql import HiveContext

# sc (the SparkContext) is already available in the pyspark2 shell
sqlContext = HiveContext(sc)
df = sqlContext.sql("select * from database.table")
```
* for more details check the Spark [SQL documentation](http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables)
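* from there, the DataFrame can be explored in the usual way; a short sketch (the view name is arbitrary):
```python
# inspect the schema and a sample of the rows
df.printSchema()
df.show(10)

# register the DataFrame as a temporary view for further SQL queries
df.createOrReplaceTempView("my_table")
sqlContext.sql("select count(*) from my_table").show()
```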
### Apache Kafka
* Apache Kafka will run using the default Cloudera setup
* brokers will be listening on port 9092
* ZooKeeper will be available on port 2181
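* a quick way to test connectivity is with the console producer and consumer shipped with Kafka; a sketch assuming the default Cloudera setup, with the broker host and topic name as placeholders:
```bash
# produce a few test messages (type lines, Ctrl-C to stop)
kafka-console-producer --broker-list <broker-host>:9092 --topic test-topic

# consume them from the beginning in another terminal
kafka-console-consumer --bootstrap-server <broker-host>:9092 --topic test-topic --from-beginning
```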