https://github.com/kavimaluskam/spark-exercise
Spark exercise on page view data
- Host: GitHub
- URL: https://github.com/kavimaluskam/spark-exercise
- Owner: kavimaluskam
- Created: 2018-07-14T19:06:13.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2018-07-16T14:51:48.000Z (almost 7 years ago)
- Last Synced: 2025-02-13T16:47:05.357Z (4 months ago)
- Language: Jupyter Notebook
- Size: 12.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Spark Exercise
Spark "exercise" on page view data

## Running the script
### Prerequisite
- Python2.7
- Spark (of course!)
- Jupyter
- Hadoop (optional; needed only if you read the data from AWS S3)

### Page view data
In this project the data were downloaded and imported locally beforehand, for simplicity's sake. You can also set up AWS access and read the data from AWS S3, but remember to update the script manually.
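A minimal sketch of how the two data sources could be switched in the notebook. The directory name `data`, the `s3_bucket` / `s3_prefix` arguments, and the CSV-with-header format are all assumptions made for illustration; the repo does not state where the files live or what format they use.

```python
import os


def page_view_path(local_dir="data", s3_bucket=None, s3_prefix="pageviews"):
    """Pick where to read the page view data from.

    local_dir, s3_bucket and s3_prefix are hypothetical names,
    not taken from the repo.
    """
    if s3_bucket:
        # s3a:// is the Hadoop S3 connector scheme, so this branch
        # needs Hadoop's AWS support and configured AWS credentials.
        return "s3a://{0}/{1}".format(s3_bucket, s3_prefix.strip("/"))
    # Default: data already downloaded to a local directory.
    return os.path.abspath(local_dir)


def load_page_views(spark, path, fmt="csv"):
    """Read the data into a DataFrame; the CSV-with-header format is a guess."""
    return spark.read.format(fmt).option("header", "true").load(path)
```

In a notebook with an active `spark` session, `load_page_views(spark, page_view_path())` would read the local copy, while `page_view_path(s3_bucket="my-bucket")` would point the same call at S3.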
### Suggested Installation
#### 1. Use virtualenv
1. Ensure python2.7 and virtualenv are installed locally
2. `$ virtualenv --python=python2 venv`
3. `$ source venv/bin/activate`
4. `$ pip install pyspark jupyter`

#### 2. Use spark directly
1. Ensure python2.7 and spark are installed locally
2. `$ pip2 install jupyter`
3. Set up the environment (you can also add these lines to `~/.bashrc` or `~/.zshrc`)
```bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
```

## Assumptions / Remarks
- In this repo the data are assumed to be downloaded already.
- Rows with an empty `article_id` / `user_id` were found in the data (as shown in the notebook), but they are not removed from the analysis.

### Answer to the exercise
[Is here](./answer.md)
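
As a companion to the remark about empty `article_id` / `user_id` values under Assumptions / Remarks, here is a hedged sketch of how such rows could be counted. It is not the notebook's code: the helper names are hypothetical, and treating `None`, empty, and whitespace-only values as "empty" is an assumption. The plain-Python version works on a list of dicts for quick checks; the Spark version applies the same rule to a DataFrame.

```python
def is_blank(value):
    """True for None, empty, or whitespace-only values (assumed definition)."""
    return value is None or str(value).strip() == ""


def count_blank_ids(rows, columns=("article_id", "user_id")):
    """Count blank values per column in an iterable of dict-like rows."""
    counts = dict((c, 0) for c in columns)
    for row in rows:
        for c in columns:
            if is_blank(row.get(c)):
                counts[c] += 1
    return counts


def count_blank_ids_spark(df, columns=("article_id", "user_id")):
    """Same count on a Spark DataFrame; pyspark is imported lazily."""
    from pyspark.sql import functions as F

    exprs = []
    for c in columns:
        blank = F.col(c).isNull() | (F.trim(F.col(c).cast("string")) == "")
        # Sum a 1 for each blank row, 0 otherwise.
        exprs.append(F.sum(F.when(blank, 1).otherwise(0)).alias(c))
    return df.agg(*exprs).first().asDict()
```

Counting rather than filtering matches the remark above: the rows stay in the analysis, and the counts only show how many are affected.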