{"id":15056852,"url":"https://github.com/kenhanscombe/project-cassandra","last_synced_at":"2025-05-16T16:30:48.686Z","repository":{"id":201940756,"uuid":"220965285","full_name":"kenhanscombe/project-cassandra","owner":"kenhanscombe","description":"Udacity data engineering nanodegree project","archived":false,"fork":false,"pushed_at":"2019-11-11T14:11:26.000Z","size":276,"stargazers_count":4,"open_issues_count":0,"forks_count":12,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-04-04T03:51:07.676Z","etag":null,"topics":["apache-cassandra","data-engineering","python3","udacity-nanodegree"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kenhanscombe.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-11-11T11:22:38.000Z","updated_at":"2024-04-27T04:31:55.000Z","dependencies_parsed_at":null,"dependency_job_id":"f2219174-ff15-41b3-af44-d3831874343a","html_url":"https://github.com/kenhanscombe/project-cassandra","commit_stats":null,"previous_names":["kenhanscombe/project-cassandra"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kenhanscombe%2Fproject-cassandra","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kenhanscombe%2Fproject-cassandra/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kenhanscombe%2Fproject-cassandra/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kenhanscombe%2Fproject-cassandra/manifests","owner_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/owners/kenhanscombe","download_url":"https://codeload.github.com/kenhanscombe/project-cassandra/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254567261,"owners_count":22092740,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-cassandra","data-engineering","python3","udacity-nanodegree"],"created_at":"2024-09-24T21:57:09.438Z","updated_at":"2025-05-16T16:30:48.297Z","avatar_url":"https://github.com/kenhanscombe.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Project 2: Data Modeling with Apache Cassandra\n\n- [project rubric](https://review.udacity.com/#!/rubrics/2475/view)\n\n\n\u003e **Note:** The whole exercise can be run in a Docker container. See the instructions below.\n\nThis **Udacity Data Engineering nanodegree** project creates an Apache Cassandra database `sparkifyks` for a music app, *Sparkify*. The purpose of the NoSQL database is to answer queries on song play data. The data model includes a table for each of the following queries:\n\n1. Give me the artist, song title and song's length in the music app history that was heard during  sessionId = 338, and itemInSession  = 4\n\n2. Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182\n    \n3. 
Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'\n\n\n## Data pre-processing, ETL pipeline, and data modeling\n\nThe data are stored as a collection of CSV files partitioned by date. The ETL pipeline and data modeling are written in a single Jupyter notebook, **Project_1B_Project_Template.ipynb**.\n\nThe ETL step copies data from the date-partitioned CSV files to a single CSV file, **event_datafile_new.csv**, which is used to populate the denormalized Cassandra tables optimised for the 3 queries above. Each of the 3 tables in the model is named after the song play query it is designed to answer:\n\n1. **`songinfo_by_session_by_item`** includes artist, song title and song length information for a given `sessionId` and `itemInSession`.\n\n2. **`songinfo_by_user_by_session`** includes artist, song and user for a given `userId` and `sessionId`.\n\n3. **`userinfo_by_song`** includes user names for a given song.\n\nThe results of the example queries are returned as pandas DataFrames to facilitate further data manipulation.\n\n\u003cbr\u003e\n\n## Run in a Docker container\n\nWith Docker installed, pull the latest Apache Cassandra image and run a container as follows:\n\n```{bash}\ndocker pull cassandra\n\ndocker run --name cassandra-container -p 9042:9042 -d cassandra:latest\n```\n\nThis will allow you to develop the data model (i.e., work through the Jupyter notebook) without altering the provided connection code, which connects to localhost on the default port 9042.\n\n```{python}\nfrom cassandra.cluster import Cluster\n\ncluster = Cluster(['127.0.0.1'])\nsession = cluster.connect()\n```\n\nTo stop and remove the container after the exercise:\n\n```{bash}\ndocker stop cassandra-container\ndocker rm 
cassandra-container\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkenhanscombe%2Fproject-cassandra","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkenhanscombe%2Fproject-cassandra","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkenhanscombe%2Fproject-cassandra/lists"}