https://github.com/alexandrabaturina/data-modeling-with-apache-cassandra

ETL pipeline to transfer data from csv files into Apache Cassandra database using Python and CQL
https://github.com/alexandrabaturina/data-modeling-with-apache-cassandra

apache-cassandra cql-queries data-modeling python-etl-pipeline

Last synced: 4 months ago
JSON representation

ETL pipeline to transfer data from csv files into Apache Cassandra database using Python and CQL

Host: GitHub
URL: https://github.com/alexandrabaturina/data-modeling-with-apache-cassandra
Owner: alexandrabaturina
Created: 2022-07-23T15:16:02.000Z (almost 3 years ago)
Default Branch: master
Last Pushed: 2022-08-11T19:17:41.000Z (almost 3 years ago)
Last Synced: 2025-01-21T02:25:50.043Z (6 months ago)
Topics: apache-cassandra, cql-queries, data-modeling, python-etl-pipeline
Language: Jupyter Notebook
Homepage:
Size: 620 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        # Data Engineering ND #2: Data Modeling With Apache Cassandra

## Overview

Data Modeling With Apache Cassandra is the second project of Udacity [Data Engineering Nanodegree](https://d20vrrgs8k4bvw.cloudfront.net/documents/en-US/Data+Engineering+Nanodegree+Program+Syllabus.pdf). It requires to create an [Apache Cassandra](https://cassandra.apache.org/_/index.html) database for a music streaming app and write an ETL pipeline that transfers data from directory of CSV files into this database using Python and [CQL](https://cassandra.apache.org/doc/latest/cassandra/cql/index.html).

The project has the following goals:

 * Model a NoSQL database with [Apache Cassandra](https://cassandra.apache.org/_/index.html)

 * Perform insert and select queries using [CQL](https://cassandra.apache.org/doc/latest/cassandra/cql/index.html)

 * Create ETL pipeline with Python

## Dependencies

* [Apache Cassandra](https://cassandra.apache.org/_/index.html)

## Repo Contents

The repo contains the ```event_data``` directory of CSV files on user activity on the app, the ```images``` directory with ```image_event_datafile_new.jpg``` reference file showing columns that will be used for Apache Cassandra tables, as well as the following files:

 * ```cql_queries.py```: contains all CQL queries.

 * ```create_tables.py```: drops existing tables and creates new ones.

 * ```data_modeling_with_cassandra.ipynb```: reads and processes CSV files from ```event_data```, loads the data into tables, and runs ```SELECT``` queries.

 * ```etl.py```:  creates ```event_datafile_full.csv``` file from CSV files in ```event_data```, loads the data from this file into tables, and runs ```SELECT``` queries.

## Apache Cassandra Database

### Database Purpose

A startup called *Sparkify* wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. 

The **music_app_data** database is designed to answer the following questions:

 1. Find the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4.

 2. Find only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182.

 3.  Find every user name (first and last) in my music app history who listened to the song *'All Hands Against His Own'*.

 

### Data Description

The ```event_data``` directory of CSV files partitioned by date. Here are examples of filepaths to two files in the dataset:

```

event_data/2018-11-08-events.csv

event_data/2018-11-09-events.csv

```

Each CSV file in ```event_data``` directory contains data within the following columns:

  * artist

  * auth

  * firstName

  * gender

  * itemInSession

  * lastName

  * length

  * level

  * location

  * method

  * page

  * registration

  * sessionId

  * song

  * status

  * ts

  * userId

The ```event_datafile_new.csv``` file that will be used for Apache Cassandra tables has the following columns:

  * artist

  * firstName

  * gender

  * itemInSession

  * lastName

  * length

  * level

  * location

  * sessionId

  * song

  * userId

Below is an example of how the data in ```event_datafile_new.csv``` looks like.

![image](https://user-images.githubusercontent.com/53233637/182494081-365123f5-2d5b-4a8a-9b44-185708f9cf3b.png)

## Database Design

The **music_app_data** database is designed so that each table answers one question. It contains the following tables.

### ```song_in_session```

The ```song_in_session``` table has the following fields:

  * sessionId: int

  * itemInSession: int

  * artist: text

  * song: text

  * length: float

  * **PRIMARY KEY (sessionId, itemInSession)**

### ```user_songs```

The ```user_songs``` table has the following fields:

 * userId: int

 * sessionId: int

 * itemInSession: int

 * song: text

 * artist: text

 * firstName: text

 * lastName: text

 * **PRIMARY KEY (userId, sessionId, itemInSession))**

### ```song_listeners```

The ```song_listeners``` table has the following fields:

 * song: text

 * userId: int

 * firstName: text

 * lastName: text

 * **PRIMARY KEY (song, userId))**

## Getting Started

To run ETL pipeline locally,

1. Clone this repo.

2. ```cd``` into project directory.

3. Run ```create_tables.py``` to reset tables:

```

root@8fa691392031:/home/workspace# python create_tables.py

Tables are dropped.

Tables are created.

```

> Remember to run ```create_tables.py``` every time before running ```etl.py``` to reset tables.

4. Run ```etl.py```. Once tables are creates, query results are displayed.

```

root@8fa691392031:/home/workspace# python etl.py

Total number of rows: 8056

Inserting data in song_in_session table...

Completed: [==============================] 100%

Inserting data in user_songs table...

Completed: [==============================] 100%

Inserting data in song_listeners table...

Completed: [==============================] 100%

Query 1:

Find the artist, song title and song's length in the music app history

that was heard during sessionId = 338, and itemInSession = 4.

Faithless Music Matters (Mark Knight Dub) 495.30731201171875

Query 2:

Find only the following: name of artist, song (sorted by itemInSession) and

user (first and last name) for userid = 10, sessionid = 182.

0 Down To The Bone Sylvie Cruz

1 Three Drives Sylvie Cruz

2 Sebastien Tellier Sylvie Cruz

3 Lonnie Gordon Sylvie Cruz

Query 3:

Find every user name (first and last) in my music app history who listened

to the song 'All Hands Against His Own'.

Jacqueline Lynch

Tegan Levine

Sara Johnson

```

## Authors

[Alexandra Baturina](https://www.linkedin.com/in/alexandrabaturina/)

## Acknowledgments

Implementation of python progress bar is based on [this solution](https://gist.github.com/sibosutd/c1d9ef01d38630750a1d1fe05c367eb8).

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/alexandrabaturina/data-modeling-with-apache-cassandra

Awesome Lists containing this project

README