https://github.com/cintia0528/data_science-unsupervised_machine_learning
I aim to automate playlist creation for Moosic, a startup known for manual curation, using Machine Learning, while addressing skepticism about the ability of audio features to capture playlist "mood."
- Host: GitHub
- URL: https://github.com/cintia0528/data_science-unsupervised_machine_learning
- Owner: Cintia0528
- Created: 2023-09-26T15:28:05.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-09-26T15:58:21.000Z (over 1 year ago)
- Last Synced: 2025-03-31T05:35:18.383Z (3 months ago)
- Topics: data, data-preprocessing, data-scaling, data-science, data-visualization, datacleaning, elbow-method, kclustering, machine-learning, pandas, python, silhouette-score, unsupervised-machine-learning
- Language: Jupyter Notebook
- Homepage:
- Size: 2.18 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Unsupervised Machine Learning
## Goal
To evaluate whether Machine Learning can be used to automatise playlist creation.

## Overview
Moosic is a small startup that creates playlists curated manually by music experts. Their listeners love the personal touch, which the experts achieve by capturing the "mood" or "vibe".

_Board_: Believes that they need at least a **degree of automatisation**, as music experts are not able to keep up with the demand. **Currently** the whole creation process is **done manually**.

_Music Experts_: Are **skeptical** that audio features on their own are enough to capture the "mood", which is so subjective that **only a human can judge** it.

## Context
Moosic wants the data science team to use a dataset that has been collected from the Spotify API and contains the audio features (tempo, energy, danceability…) for a few thousand songs. After using a basic **clustering algorithm** such as K-Means to divide the dataset into a few clusters, the data team shall answer the following two questions:

1. *Are Spotify’s audio features able to identify “similar songs”, as defined by humanly detectable criteria?*
2. *Is K-Means a good method to create playlists?*

### Task:
* Import a list of 5000 songs collected from the Spotify API
* Use a basic clustering algorithm, e.g. K-Means, to divide the dataset into clusters
* Validate the clusters, export them (as playlists) to Spotify and listen to some of the songs

#### Challenges:
* Difficult to evaluate the results without listening to each playlist
* No tangible way to measure accuracy
* Unevenly sized clusters
* Subjective - what is a good playlist?

#### Solutions:
* Clusters must be visualized, so we can see the overlaps and the outliers
* Limit the number of features to 3 (or multiples of 3) so they can be visualized in a 3D scatterplot
* Find a balance between K-score and the business objectives
* Instead of replacing music experts, ML does the "heavy lifting" and they fine-tune the results

## Approach
1. Evaluate the database; basic cleaning, e.g. handling missing or corrupted values and correcting data types
2. Exploration of audio features
3. Decide which features to drop, and which features to use
4. K-Means clustering
5. Evaluation of clusters
6. Sub-clustering
7. Evaluation of final clusters

## Deliverables
A 5-minute **PowerPoint presentation** to the Board of Directors, found [here](https://drive.google.com/file/d/1vUTZUToQtD97X_53d7Ht7nSnJrvjJ5G_/view?usp=sharing), that summarizes the findings and suggests a course of action.

**Python code** is found [here](4_0_5000_songs_FINAL_NOTEBOOK.ipynb).

## Skills & Tools
1. Data Cleaning & Quality Assurance
2. Data Preprocessing: Scaling
3. K-Means Clustering
4. Elbow Method and Silhouette Score
5. Data Visualization (3D Scatterplot)
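
As a rough illustration of how the scaling, K-Means, elbow and silhouette steps listed above fit together, here is a minimal sketch using scikit-learn. It is not the notebook's actual code: the data is synthetic, and the column names, the choice of `MinMaxScaler`, the range of `k` values and the helper `evaluate_k` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the song table (assumption: the real project loads
# about 5000 tracks with Spotify audio features; these values are made up).
rng = np.random.default_rng(42)
centers = np.array([[0.2, 0.3, 80.0], [0.8, 0.7, 125.0], [0.5, 0.9, 170.0]])
rows = np.vstack(
    [c + rng.normal(0, [0.05, 0.05, 5.0], size=(100, 3)) for c in centers]
)
songs = pd.DataFrame(rows, columns=["danceability", "energy", "tempo"])

# Scale every feature to [0, 1] so that tempo (in BPM) does not dominate
# the distance calculation over features that already live in [0, 1].
scaled = MinMaxScaler().fit_transform(songs)


def evaluate_k(data, k_values, seed=0):
    """Fit K-Means for each k and collect inertia and silhouette score.

    Hypothetical helper for this sketch: inertia feeds the elbow plot,
    the silhouette score measures how well separated the clusters are.
    """
    results = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        results.append(
            {
                "k": k,
                "inertia": model.inertia_,
                "silhouette": silhouette_score(data, model.labels_),
            }
        )
    return pd.DataFrame(results).set_index("k")


scores = evaluate_k(scaled, range(2, 8))

# Pick the k with the highest silhouette score as a starting point;
# the business objectives may justify a different choice.
best_k = int(scores["silhouette"].idxmax())
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(scaled)
songs["cluster"] = final_model.labels_

print(scores.round(3))
print(songs["cluster"].value_counts())
```

Plotting `scores["inertia"]` against `k` gives the elbow curve, and the cluster sizes printed at the end make unevenly sized clusters easy to spot before any playlist is exported.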