https://github.com/kenhanscombe/project-cassandra
Udacity data engineering nanodegree project
https://github.com/kenhanscombe/project-cassandra
apache-cassandra data-engineering python3 udacity-nanodegree
Last synced: about 1 year ago
JSON representation
Udacity data engineering nanodegree project
- Host: GitHub
- URL: https://github.com/kenhanscombe/project-cassandra
- Owner: kenhanscombe
- Created: 2019-11-11T11:22:38.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-11-11T14:11:26.000Z (over 6 years ago)
- Last Synced: 2025-04-04T03:51:07.676Z (over 1 year ago)
- Topics: apache-cassandra, data-engineering, python3, udacity-nanodegree
- Language: Jupyter Notebook
- Size: 270 KB
- Stars: 4
- Watchers: 0
- Forks: 12
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Project 2: Data Modeling with Apache Cassandra
- [project rubric](https://review.udacity.com/#!/rubrics/2475/view)
> **Note:** The whole exercise can be run in a docker container. See instruction below.
This **Udacity Data Engineering nanodegree** project creates an Apache Cassandra database `sparkifyks` for a music app, *Sparkify*. The purpose of the NoSQL database is to answer queries on song play data. The data model includes a table for each of the following queries:
1. Give me the artist, song title and song's length in the music app history that was heard during sessionId = 338, and itemInSession = 4
2. Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182
3. Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'
## Data pre-processing, ETL pipeline, and data modeling
The data are stored as a collection of csv files partitioned by date. The ETL pipeline and data modeling are written in a single jupyter notebook, **Project_1B_Project_Template.ipynb**.
ETL copies data from the date-partitioned csv files to a single csv file **event_datafile_new.csv** which is used to populate the denormalized Cassandra tables optimised for the 3 queries above. The 3 tables in the model are named after the song play query they are created to solve:
1. **`songinfo_by_session_by_item`** includes artist, song title and song length information for a given `sessionId` and `itemInSessionId`.
2. **`songinfo_by_user_by_session`** includes artist, song and user for a given `userId` and `sessionId`.
3. **`userinfo_by_song`** includes user names for a given song.
The example queries are returned as pandas dataframes to facilitate further data manipulation.
## Run in a Docker container
With docker installed, pull the latest Apache Cassandra image and run a container as follows:
```{bash}
docker pull cassandra
docker run --name cassandra-container -p 9042:9042 -d cassandra:latest
```
This will allow you to develop the data model (i.e., work through the jupyter notebook), without altering the provided connection code which connects to the localhost with default port 9042.
```{python}
from cassandra.cluster import Cassandra
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()
```
To stop and remove the container after the exercise
```{bash}
docker stop cassandra-container
docker rm cassandra-container
```