https://github.com/dina-hosny/sparkify---data-modeling-with-cassandra

Sparkify - Data Modeling with Cassandra - Udacity Data Engineering Expert Track.
https://github.com/dina-hosny/sparkify---data-modeling-with-cassandra

cassandra cql data-analysis data-engineering data-modeling data-warehousing etl python

Last synced: 7 months ago
JSON representation

Sparkify - Data Modeling with Cassandra - Udacity Data Engineering Expert Track.

Host: GitHub
URL: https://github.com/dina-hosny/sparkify---data-modeling-with-cassandra
Owner: Dina-Hosny
Created: 2022-10-09T04:52:47.000Z (almost 3 years ago)
Default Branch: master
Last Pushed: 2022-10-09T06:41:11.000Z (almost 3 years ago)
Last Synced: 2025-01-13T22:26:50.038Z (9 months ago)
Topics: cassandra, cql, data-analysis, data-engineering, data-modeling, data-warehousing, etl, python
Language: Jupyter Notebook
Homepage:
Size: 8.79 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Sparkify - Data Modeling with Cassandra
Sparkify - Data Modeling with Cassandra - Udacity Data Engineering Expert Track.

In this project, I built an ETL pipeline using Python, and created Apache Cassandra tables using CQL for a particular analytic focus which is analyze Sparkify's collected data on songs and user activity on their new music streaming app. Then, wrote an ETL pipeline that transfers data from CSV files within a directory to create a streamlined CSV file to model and insert data into Apache Cassandra tables.
## Project Datasets:

- event_data:

CSV files which partitioned by date

## Project Details:

The project purpose is analyze the Sparkify collected data on songs and user activity on their new music streaming app, to understand what songs users are listening to and answer given questions, by analyze the their activities data.

For this analysis; I processed the ```event_datafile_new.csv``` dataset to create a denormalized dataset, then modeled the data tables and the needed queries, finally loaded the data into Apache Cassandra tables and run the queries

## Tools and Technologies:
- Python 3.
- Pandas, NumPy, Psycopg2 Python Libraries.
- ETL: Extract, Transform, Load Data
- Big Data and NoSQL concepts.
- CQL.
- Jupyter Notebook.

## Project Steps:
- 1- Design tables to answer the given queries.
- 2- Write Apache Cassandra CREATE KEYSPACE and SET KEYSPACE statements.
- 3- Develop the CREATE statement for each of the tables to address each query.
- 4- Load the data with INSERT statement for each of the tables.
- 5- Test by running the proper select statements with the correct WHERE clause.

## How To Run The Project?

- 1- Install Python 3.
- 2- Install Apache Cassandra.
- 3- Download the scripts and the datasets.
- 4- Run the Jupyter Notebook project.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/dina-hosny/sparkify---data-modeling-with-cassandra

Awesome Lists containing this project

README