{"id":15056798,"url":"https://github.com/collumbus/data-modeling-with-cassandra","last_synced_at":"2026-04-02T18:44:44.449Z","repository":{"id":130255239,"uuid":"438057560","full_name":"Collumbus/Data-Modeling-With-Cassandra","owner":"Collumbus","description":"It is a project where I applied concepts data modelling with Apache Cassandra and built an ETL pipeline using Python. To complete the project has been defined a data model by creating tables in Apache Cassandra to run queries. I am provided with part of the ETL pipeline that transfers data from a set of CSV files within a directory to create a streamlined CSV file to model and insert data into Apache Cassandra tables.","archived":false,"fork":false,"pushed_at":"2021-12-15T18:26:23.000Z","size":910,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-14T07:45:52.832Z","etag":null,"topics":["apache-cassandra","data-engineering","etl","pipeline","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Collumbus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-13T23:39:11.000Z","updated_at":"2021-12-15T18:53:05.000Z","dependencies_parsed_at":"2024-03-08T10:04:06.405Z","dependency_job_id":null,"html_url":"https://github.com/Collumbus/Data-Modeling-With-Cassandra","commit_stats":{"total_commits":8,"total_committers":1,"mean_commits":8.0,"dds":0.0,"last_synced_commit":"099f43eeadfc2e55cc336fdd3509392317ab9779"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Collumbus%2FData-Modeling-With-Cassandra","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Collumbus%2FData-Modeling-With-Cassandra/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Collumbus%2FData-Modeling-With-Cassandra/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Collumbus%2FData-Modeling-With-Cassandra/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Collumbus","download_url":"https://codeload.github.com/Collumbus/Data-Modeling-With-Cassandra/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243544665,"owners_count":20308168,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositorie
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-cassandra","data-engineering","etl","pipeline","python"],"created_at":"2024-09-24T21:56:33.011Z","updated_at":"2025-12-29T15:41:21.332Z","avatar_url":"https://github.com/Collumbus.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Project: Data Modeling with Apache Cassandra\nIn this project I applied data modeling concepts with Apache Cassandra and built an ETL pipeline using Python. To complete the project, I defined a data model by creating tables in Apache Cassandra to run queries. I was provided with part of the ETL pipeline, which transfers data from a set of CSV files within a directory into a single streamlined CSV file used to model and insert data into Apache Cassandra tables.\n\nThis project models user activity data for a music streaming app called Sparkify to optimize queries for understanding what songs users are listening to by using **Apache Cassandra**.\n\n1. Building an ETL to process the raw events dataset and generate a new dataset\n2. Creating appropriate Apache Cassandra tables to answer 3 specific questions\n3. Inserting data from the new dataset into Apache Cassandra tables\n4. 
Testing the results by select statements\n\n\n## Project Structure\n\n```\nData Modeling with Cassandra\n|____data                        # Dataset\n| |____event_data                # Raw dataset  (csv files)\n| |____event_datafile_new.csv    # New dataset by iterating event_data\n|   |____...events.csv\n|\n|____jupyter_notebooks\t\t # Notebooks for developing and testing ETL\n| |____etl_cassandra.ipynb       # Notebook for Apache Cassandra queries\n|\n|____scripts        \t\t # Python codes\n| |____etl_cassandra.py\t\t # ETL builder\n|\n|____images                      # Referenced image for new dataset\n| |____image_event_datafile_new\n```\n\n### Example of query and results for song play analysis\n##### 1. Give me the artist, song title and song's length in the music app history that was heard during  sessionId = 338, and itemInSession  = 4\n```\nSELECT \n        artist_name,\n        song_title, \n        song_lengh \nFROM artist_songs \nWHERE session_id = 338 AND item_in_session = 4\n```\n\n### Result\n```\n  | artist_name\t      | song_title                           | song_lengh\n--------------------------------------------------------------------------\n0 |\tFaithless     | Music Matters (Mark Knight Dub)      | 495.307312\n```\n\n##### 2. 
Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182\n```\nSELECT \n        artist_name, \n        song_title, \n        first_name, \n        last_name \nFROM user_songs \nWHERE user_id = 10 AND session_id = 182\n```\n\n### Result\n```\n  | artist_name       | song_title                | first_name | last_name\n--------------------------------------------------------------------------------\n0 | Down To The Bone  | Keep On Keepin' On        | Sylvie     | Cruz\n1 | Three Drives      | Greece 2000               | Sylvie     | Cruz\n2 | Sebastien Tellier | Kilometer                 | Sylvie     | Cruz\n3 | Lonnie Gordon     | Catch You Baby (Steve ... | Sylvie     | Cruz\n```\n##### 3. Give me every user name (first and last) in my music app history who listened to the song 'All Hands Against His Own'\n```\nSELECT \n        first_name, \n        last_name \nFROM listened_songs \nWHERE song_title = 'All Hands Against His Own'\n```\n\n### Result\n```\n  | first_name \t | last_name\n------------------------------\n0 | Jacqueline \t | Lynch\n1 | Tegan \t | Levine\n2 | Sara \t | Johnson\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcollumbus%2Fdata-modeling-with-cassandra","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcollumbus%2Fdata-modeling-with-cassandra","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcollumbus%2Fdata-modeling-with-cassandra/lists"}