https://github.com/kavimaluskam/spark-exercise

Spark exercise on page view data

Host: GitHub
URL: https://github.com/kavimaluskam/spark-exercise
Owner: kavimaluskam
Created: 2018-07-14T19:06:13.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2018-07-16T14:51:48.000Z (almost 7 years ago)
Last Synced: 2025-02-13T16:47:05.357Z (4 months ago)
Language: Jupyter Notebook
Size: 12.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Spark Exercise
Spark "exercise" on page view data

## Running the script

### Prerequisite
- Python 2.7
- Spark (of course!)
- Jupyter
- Hadoop (optional; only needed if you read the data from AWS S3)

### Page view data
For simplicity, this project assumes the data have already been downloaded and are read from the local filesystem.

Alternatively, you can set up AWS access and read the data from AWS S3. In that case, remember to update the script manually.
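
The switch between local files and S3 could be sketched as below. This is a minimal illustration, not the notebook's actual code: the function names, the `data` directory, the bucket, the prefix, and the file format are all hypothetical placeholders. The `s3a://` scheme is the one used by Hadoop's S3A connector.

```python
import os


def page_view_path(source, local_dir="data", bucket=None, prefix=""):
    """Return the input path for local files or for an S3 bucket.

    All argument values are hypothetical placeholders, not names from the repo.
    """
    if source == "local":
        return os.path.abspath(local_dir)
    if source == "s3":
        if not bucket:
            raise ValueError("bucket is required when source is 's3'")
        # s3a:// is the URI scheme of Hadoop's S3A connector.
        return "s3a://{}/{}".format(bucket, prefix.lstrip("/"))
    raise ValueError("unknown source: {!r}".format(source))


def read_page_views(spark, path, fmt):
    """Load the data with an existing SparkSession.

    The format (for example "csv" or "json") is an assumption; use whatever
    matches the downloaded files.
    """
    return spark.read.format(fmt).load(path)
```

Keeping the path logic in one helper means only one call needs to change when moving from local data to S3.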

### Suggested Installation

#### 1. Use virtualenv
1. Ensure Python 2.7 and virtualenv are installed locally
2. `$ virtualenv --python=python2 venv`
3. `$ source venv/bin/activate`
4. `$ pip install pyspark jupyter`

#### 2. Use spark directly
1. Ensure Python 2.7 and Spark are installed locally
2. `$ pip2 install jupyter`
3. Set up the environment variables (you can also add these lines to `~/.bashrc` or `~/.zshrc`)
```bash
$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```
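
To confirm the two variables are set before launching `pyspark`, a small check along these lines may help. It is only a sketch: `check_pyspark_env` is a hypothetical helper, not part of the repo, and the expected values are simply the ones exported above.

```python
import os

# Expected values, taken from the export lines above.
EXPECTED = {
    "PYSPARK_DRIVER_PYTHON": "jupyter",
    "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",
}


def check_pyspark_env(environ=None):
    """Return {name: (expected, actual)} for every variable that differs.

    An empty dict means both variables are set as expected.
    """
    environ = os.environ if environ is None else environ
    problems = {}
    for name, expected in EXPECTED.items():
        actual = environ.get(name)  # None when the variable is unset
        if actual != expected:
            problems[name] = (expected, actual)
    return problems
```

Calling `check_pyspark_env()` with no argument inspects the current shell environment; passing a dict makes the check easy to try without touching real settings.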

## Assumptions / Remarks
- This repo assumes the data have already been downloaded.
- Records with an empty `article_id` or `user_id` were found (as shown in the notebook), but they are not removed from the analysis.
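
The empty-ID check could look roughly like the sketch below. It is not the notebook's code: the helper names are hypothetical, and treating both null and blank strings as "empty" is an assumption. The plain-Python version works on any iterable of dict-like rows; the Spark version assumes string columns and needs `pyspark` installed.

```python
def is_empty_id(value):
    """Treat None and blank strings as empty (an assumption about 'empty')."""
    return value is None or str(value).strip() == ""


def count_empty_ids(rows, fields=("article_id", "user_id")):
    """Count, per field, the rows whose value is missing or blank."""
    counts = dict((field, 0) for field in fields)
    for row in rows:
        for field in fields:
            if is_empty_id(row.get(field)):
                counts[field] += 1
    return counts


def count_empty_ids_spark(df, fields=("article_id", "user_id")):
    """Same check on a Spark DataFrame with string ID columns."""
    # Imported lazily so the helpers above run without Spark installed.
    from pyspark.sql import functions as F

    return dict(
        (field, df.filter(F.col(field).isNull() | (F.trim(F.col(field)) == "")).count())
        for field in fields
    )
```

Counting rather than dropping matches the remark above: the affected records are reported but stay in the analysis.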

### Answers to the exercise
See [answer.md](./answer.md)
