# Project 2: Song Play Analysis With NoSQL

[![Project passed](https://img.shields.io/badge/project-passed-success.svg)](https://img.shields.io/badge/project-passed-success.svg)

## Summary
* [Preamble](#preamble)
* [ETL process](#etl-process)
* [How to run](#how-to-run)
* [Project structure](#project-structure)
* [CQL queries](#cql-queries)

-------------------------------------------

#### Preamble

This project is not focused on the ETL process, but on data modeling
in Cassandra and how it differs from relational database data modeling.

Let's start with the ``PRIMARY KEY``. In Cassandra, primary keys work slightly differently
from their RDBMS counterparts: the ``PRIMARY KEY`` is made up of either just the ``PARTITION KEY``
or the ``PARTITION KEY`` plus one or more ``CLUSTERING COLUMNS``.
Each ``PRIMARY KEY`` maps to exactly one record:
if we insert data with the same ``PRIMARY KEY``, the row is overwritten with the latest record state,
as the sketch below demonstrates.
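
A minimal sketch of this upsert behaviour using the Python cassandra-driver (the ``demo`` keyspace and ``upsert_demo`` table are made up for the illustration, not part of this project):

```python
from cassandra.cluster import Cluster

# Assumes a local single-node Cassandra and an existing keyspace "demo".
session = Cluster(['127.0.0.1']).connect('demo')

session.execute("""
    CREATE TABLE IF NOT EXISTS upsert_demo (
        id INT,
        value TEXT,
        PRIMARY KEY (id)
    )
""")

# Two INSERTs with the same PRIMARY KEY: the second one silently
# overwrites the first, so only one row survives.
session.execute("INSERT INTO upsert_demo (id, value) VALUES (1, 'first')")
session.execute("INSERT INTO upsert_demo (id, value) VALUES (1, 'second')")

print(session.execute("SELECT value FROM upsert_demo WHERE id = 1").one().value)
# -> second
```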

In ``WHERE`` conditions the ``PARTITION KEY`` must be included, and you can use your ``CLUSTERING COLUMNS``
to order your data, but specifying an order is not necessary: by default, rows within a partition
are sorted ``ASC`` by the clustering columns, and you can override this with ``CLUSTERING ORDER BY``, as shown below.
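
Continuing the sketch above, this shows both rules: the partition key must appear in ``WHERE``, and the clustering sort order can be chosen at table-creation time (the ``events`` table is again made up for the demo):

```python
# Hypothetical table: rows inside each user_id partition are kept
# sorted by event_time, newest first.
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        user_id INT,
        event_time TIMESTAMP,
        action TEXT,
        PRIMARY KEY ((user_id), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")

# Valid: the partition key is fully specified.
session.execute("SELECT * FROM events WHERE user_id = 1")

# Rejected by Cassandra (unless ALLOW FILTERING is added):
# the partition key is missing from the WHERE clause.
# session.execute("SELECT * FROM events WHERE action = 'play'")
```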

In Cassandra, ``denormalization`` is a must: you cannot normalize your schema as you would in an RDBMS,
because you cannot use JOINs. The more thoroughly the denormalization is done, the faster the queries will
run; in fact, Cassandra has been optimized for fast writes, not for fast reads.

To reach complete denormalization you have to follow the pattern ``1 Query - 1 Table``,
as the three queries below do. This leads to data duplication, but that does not matter:
the denormalization process by its very nature produces data duplication.

Following the ``CAP theorem``, Cassandra embraces the ``AP`` guarantees.
It provides only AP because of its architecture: data is replicated across nodes, so if
a node goes down another one can satisfy the client request; but with a high number of nodes,
every replica may not be up to date at any given moment, which is why
Cassandra offers only ``Eventual Consistency``.
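
The driver makes this trade-off tunable per statement; a small sketch, reusing the session and table from the earlier examples:

```python
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# ONE favours availability (any single replica may answer);
# QUORUM trades some availability for stronger consistency.
stmt = SimpleStatement(
    "SELECT value FROM upsert_demo WHERE id = 1",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(stmt)
```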

--------------------------------------------

#### ETL process

Although this is not an ETL-focused project, we must care about how data is
ingested into our database and how to read our input properly.
Our ETL process consists of reading every file in /event_data; this data has to be
aggregated into a single file called event_datafile_new.csv, and after the aggregation
the file is parsed and persisted to the database, as sketched below.
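
A minimal sketch of the aggregation step in Python (assumptions flagged in the comments; verify them against your copy of /event_data):

```python
import csv
import glob

# Gather every data row from every CSV under /event_data
# (the files are partitioned by date, hence the recursive glob).
rows = []
for path in glob.glob('event_data/**/*.csv', recursive=True):
    with open(path, newline='', encoding='utf8') as f:
        reader = csv.reader(f)
        header = next(reader)  # each daily file repeats the header
        # Skip rows whose first field is empty (assumed to be the artist).
        rows.extend(r for r in reader if r and r[0])

# Write the single aggregated file that is then parsed and persisted.
with open('event_datafile_new.csv', 'w', newline='', encoding='utf8') as f:
    writer = csv.writer(f)
    writer.writerow(header)  # assumes all daily files share one schema
    writer.writerows(rows)
```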

--------------------------------------------

#### How to run
First of all, you need a Cassandra instance up and running

Here you can find the [Binary packages](http://cassandra.apache.org/download/) for your preferred operating system

After downloading the package, if you do not know how to move on, just follow this [Documentation](http://cassandra.apache.org/doc/latest/getting_started/index.html)

You also have to install [Python](https://www.python.org/downloads/) and [Jupyter Notebook](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/install.html)

Note:

In this example we will not use any authentication or authorization mechanism

After installing your database, Python, and Jupyter on your local machine,
open your terminal and type

`jupyter notebook`

This will start the service; once it is running, just drag and drop the notebook into the Jupyter file browser.
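
Before opening the notebook, it can be worth running a one-off connectivity check (a sketch assuming a local, auth-free node on the default port):

```python
from cassandra.cluster import Cluster

# Connect to the local node started above; no credentials needed
# because we skipped authentication.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()
print(session.execute("SELECT release_version FROM system.local").one())
cluster.shutdown()
```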

--------------------------------------------

#### Project structure
This is the project structure; if a bullet starts with ``/``, the resource is a folder:

* /event_data - The directory of CSV files, partitioned by date
* /images - A folder with the images used in the Project_1B_ Project_Template notebook
* Project_1B_ Project_Template.ipynb - A notebook that illustrates the project step by step
* event_datafile_new.csv - The aggregated CSV composed of all the event_data files

--------------------------------------------

#### CQL queries

Query 1: Give me the artist, song title and song's length in the music app history that was heard
during sessionId = 338 and itemInSession = 4

``` SQL
CREATE TABLE IF NOT EXISTS song_data_by_session (
    session_id INT,
    session_item_number INT,
    artist_name TEXT,
    song_title TEXT,
    song_length DOUBLE,
    PRIMARY KEY ((session_id, session_item_number))
)
```
In this case session_id and session_item_number are enough to
make a record unique for our request: the double parentheses make them a
``COMPOSITE PARTITION KEY``, and our complete ``PRIMARY KEY`` is composed of
session_id and session_item_number.

``` SQL
SELECT artist_name, song_title, song_length
FROM song_data_by_session
WHERE session_id = 338 AND session_item_number = 4
```

Query 2: Give me only the following: name of artist, song (sorted by itemInSession)
and user (first and last name) for userid = 10, sessionid = 182

``` SQL
CREATE TABLE IF NOT EXISTS song_user_data_by_user_and_session_data (
    user_id INT,
    session_id INT,
    session_item_number INT,
    artist_name TEXT,
    song_title TEXT,
    user_first_name TEXT,
    user_last_name TEXT,
    PRIMARY KEY ((user_id, session_id), session_item_number)
)
```
In this case user_id and session_id form the ``COMPOSITE PARTITION KEY``,
which gives us a unique ``PRIMARY KEY`` for our query. This request asks us to
order by session_item_number but not to query on it, so we declare session_item_number as the ``CLUSTERING KEY``.
Our complete ``PRIMARY KEY`` is composed of user_id, session_id, and session_item_number.
``` SQL
SELECT artist_name, song_title, user_first_name, user_last_name
FROM song_user_data_by_user_and_session_data
WHERE user_id = 10 AND session_id = 182
```

Query 3: Give me every user name (first and last) in my music app history who listened
to the song 'All Hands Against His Own'

``` SQL
CREATE TABLE IF NOT EXISTS user_data_by_song_title (
    song_title TEXT,
    user_id INT,
    user_first_name TEXT,
    user_last_name TEXT,
    PRIMARY KEY ((song_title), user_id)
)
```
In this case song_title is the ``PARTITION KEY`` and user_id
is the ``CLUSTERING KEY``. The request asks to retrieve user names
by song title, so song_title has to be the ``PARTITION KEY``; but since
many users can listen to the same song, we may issue many ``INSERT``s with the
same partition key, and Cassandra overwrites rows that share a full primary key.
We therefore add user_id as a ``CLUSTERING KEY``: it keeps each record unique
even though we never filter on it.
Our complete ``PRIMARY KEY`` is composed of song_title and user_id.
``` SQL
SELECT user_first_name, user_last_name
FROM user_data_by_song_title
WHERE song_title = 'All Hands Against His Own'
```

----------------------------