# spark-exercises

This repository contains the skeleton for solving basic Spark exercises.

## Exercises

### Exercise 1

1. Load the data found in _data/web_events.log_ into an RDD. Take into account that each line
contains one entry and that _;_ is used as the field separator. Each line contains:

```
sourceHost, timestamp, method, URL, HTTPCode
```

2. Obtain the number of events per host.
3. Obtain the number of events per HTTPCode.
4. Determine the number of different hosts.

**Notes**

* Use a case class WebEvent to represent each line.
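
A minimal sketch of one possible solution, assuming all fields are kept as plain strings, that malformed lines are simply dropped, and that the object name and the `local[*]` master are only illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Field names follow the layout described above; all columns kept as strings for simplicity.
case class WebEvent(sourceHost: String, timestamp: String, method: String, url: String, httpCode: String)

object Exercise1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("exercise1").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // 1. Load the log and parse each ';'-separated line into a WebEvent.
    val events = sc.textFile("data/web_events.log")
      .map(_.split(";"))
      .filter(_.length == 5)                 // drop malformed lines (assumption)
      .map(f => WebEvent(f(0).trim, f(1).trim, f(2).trim, f(3).trim, f(4).trim))

    // 2. Number of events per host.
    events.map(e => (e.sourceHost, 1)).reduceByKey(_ + _).collect().foreach(println)

    // 3. Number of events per HTTPCode.
    events.map(e => (e.httpCode, 1)).reduceByKey(_ + _).collect().foreach(println)

    // 4. Number of different hosts.
    println(s"Distinct hosts: ${events.map(_.sourceHost).distinct().count()}")

    spark.stop()
  }
}
```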

### Exercise 2

1. Load the data found in _data/auth_events.log_ into an RDD. Each line contains the following
elements.

```
timestamp, sourceHost, Process, Message
```

2. Obtain the number of events per host.
3. Obtain the number of events per process.
4. Filter those hosts that have at least one failed authentication and at least one failed web request.
5. Obtain the percentage of successful web requests per host.
6. Obtain the percentage of successful authentications per host.

**Notes**

* Use a case class AuthEvent to represent each line.
* Step 4 requires a join operation.
* Consider computing steps 5 and 6 using common transformations.
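
A possible sketch, again with string-typed fields. What counts as a "failed" authentication or web request is an assumption here (a message containing "fail", respectively an HTTP code other than 200); adjust the predicates to the actual log contents:

```scala
import org.apache.spark.sql.SparkSession

// Field names follow the layouts described in Exercises 1 and 2.
case class AuthEvent(timestamp: String, sourceHost: String, process: String, message: String)
case class WebEvent(sourceHost: String, timestamp: String, method: String, url: String, httpCode: String)

object Exercise2 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("exercise2").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // 1. Load and parse the auth log (';' separator, as in Exercise 1).
    val authEvents = sc.textFile("data/auth_events.log")
      .map(_.split(";")).filter(_.length == 4)
      .map(f => AuthEvent(f(0).trim, f(1).trim, f(2).trim, f(3).trim))

    // 2. Number of events per host.
    val authPerHost = authEvents.map(e => (e.sourceHost, 1)).reduceByKey(_ + _)

    // 3. Number of events per process.
    authEvents.map(e => (e.process, 1)).reduceByKey(_ + _).collect().foreach(println)

    // Web events are needed again for the join in step 4 and the ratio in step 5.
    val webEvents = sc.textFile("data/web_events.log")
      .map(_.split(";")).filter(_.length == 5)
      .map(f => WebEvent(f(0).trim, f(1).trim, f(2).trim, f(3).trim, f(4).trim))

    // 4. Hosts with at least one failed authentication AND one failed web request (join).
    val failedAuthHosts = authEvents.filter(_.message.toLowerCase.contains("fail"))
      .map(e => (e.sourceHost, 1)).reduceByKey(_ + _)
    val failedWebHosts = webEvents.filter(_.httpCode != "200")
      .map(e => (e.sourceHost, 1)).reduceByKey(_ + _)
    failedAuthHosts.join(failedWebHosts).keys.collect().foreach(println)

    // 5. Percentage of successful web requests per host (successful = HTTP 200, an assumption).
    val webTotals = webEvents.map(e => (e.sourceHost, 1)).reduceByKey(_ + _)
    val webOk = webEvents.filter(_.httpCode == "200").map(e => (e.sourceHost, 1)).reduceByKey(_ + _)
    webTotals.leftOuterJoin(webOk)
      .mapValues { case (total, ok) => 100.0 * ok.getOrElse(0) / total }
      .collect().foreach(println)

    // 6. Percentage of successful authentications per host (same pattern as step 5).
    val authOk = authEvents.filter(!_.message.toLowerCase.contains("fail"))
      .map(e => (e.sourceHost, 1)).reduceByKey(_ + _)
    authPerHost.leftOuterJoin(authOk)
      .mapValues { case (total, ok) => 100.0 * ok.getOrElse(0) / total }
      .collect().foreach(println)

    spark.stop()
  }
}
```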

### Exercise 3

1. Load the file _data/web_events.csv_ using the Spark-CSV library provided by Databricks.
2. Solve the scenarios presented in Exercise 1 using Spark SQL when possible.

**Notes**

* The file contains a header.
* Each column is separated by _;_
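
A sketch using the built-in CSV data source of Spark 2.x and later, which absorbed the Databricks spark-csv package; on Spark 1.x, replace the `csv(...)` call with `.format("com.databricks.spark.csv").load(...)`. The column names in the SQL queries are assumed to match the header described in Exercise 1:

```scala
import org.apache.spark.sql.SparkSession

object Exercise3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("exercise3").master("local[*]").getOrCreate()

    // 1. Load the CSV: the file has a header and uses ';' as the column separator.
    val webEvents = spark.read
      .option("header", "true")
      .option("delimiter", ";")
      .option("inferSchema", "true")
      .csv("data/web_events.csv")

    // 2. Register a temporary view and re-solve the Exercise 1 questions with Spark SQL.
    webEvents.createOrReplaceTempView("web_events")

    spark.sql("SELECT sourceHost, COUNT(*) AS events FROM web_events GROUP BY sourceHost").show()
    spark.sql("SELECT HTTPCode, COUNT(*) AS events FROM web_events GROUP BY HTTPCode").show()
    spark.sql("SELECT COUNT(DISTINCT sourceHost) AS hosts FROM web_events").show()

    spark.stop()
  }
}
```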