{"id":23182023,"url":"https://github.com/javorraca/unsupervised-ml","last_synced_at":"2025-04-05T03:16:54.867Z","repository":{"id":189893802,"uuid":"149189648","full_name":"JavOrraca/Unsupervised-ML","owner":"JavOrraca","description":"A short exercise using R to perform unsupervised machine learning (clustering) on a sample data set.","archived":false,"fork":false,"pushed_at":"2018-09-19T16:25:09.000Z","size":1965,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-10T11:23:11.123Z","etag":null,"topics":["ade4","clustering","clustering-algorithm","clustering-analysis","data-analysis","data-analytics","data-science","dplyr","jupyter","k-means-clustering","machine-learning","machinelearning","ml","r","r-programming","sse","unsupervised-machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JavOrraca.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-09-17T21:22:07.000Z","updated_at":"2018-09-19T16:25:10.000Z","dependencies_parsed_at":"2023-08-22T09:08:13.114Z","dependency_job_id":null,"html_url":"https://github.com/JavOrraca/Unsupervised-ML","commit_stats":null,"previous_names":["javorraca/unsupervised-ml"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JavOrraca%2FUnsupervised-ML","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JavOrraca%2FUnsupervised-ML/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JavOrra
ca%2FUnsupervised-ML/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JavOrraca%2FUnsupervised-ML/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JavOrraca","download_url":"https://codeload.github.com/JavOrraca/Unsupervised-ML/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247280255,"owners_count":20912967,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ade4","clustering","clustering-algorithm","clustering-analysis","data-analysis","data-analytics","data-science","dplyr","jupyter","k-means-clustering","machine-learning","machinelearning","ml","r","r-programming","sse","unsupervised-machine-learning"],"created_at":"2024-12-18T08:19:10.315Z","updated_at":"2025-04-05T03:16:54.851Z","avatar_url":"https://github.com/JavOrraca.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Unsupervised Machine Learning\n\nThis exercise relies on the k-means algorithm to perform unsupervised machine learning for clustering a company's customers via the R programming language.\n\n## Prerequisites\n\n* [Anaconda Distribution](https://www.anaconda.com/distribution/): Allows for R environment installation, including Jupyter setup\n* [Jupyter](http://jupyter.org/): Web-based R notebook that can be launched via the Anaconda Navigator and natively reads the .ipynb document included in the Files folder\n* [dplyr](https://dplyr.tidyverse.org/): This R package is for data manipulation and is part of the tidyverse R 
toolset\n* [ade4](https://cran.r-project.org/web/packages/ade4/index.html): This R package can be used for converting categorical data into numerical dummy data and for multivariate data analysis\n\n## Data\n\nThe data used in this analysis reflects a historical snapshot of Sun Country Airlines customer data used for educational purposes. No private or personally identifying information was included as part of this sample data set. This repository will not cover the full data cleanup and sanitization process. We'll start with a pre-processed data set ready for clustering via R. \n\n## Running the code\n\nThis clustering exercise can be summarized into four major parts:\n  1. Data preparation\n  2. Converting categorical data into numerical data\n  3. Analyzing k-means clusters via sum of squared errors (SSE) comparison \n  4. Cluster assignment in data set\n\nTo begin, install the dplyr and ade4 R packages for data manipulation and dummy coding (if not already installed).\n\n```R\ninstall.packages(\"dplyr\")\nlibrary(dplyr)\n\ninstall.packages(\"ade4\")\nlibrary(ade4)\n```\n\nThen, download the file SC_data_CleanedUp.csv, read it into an object named \"data\" and then aggregate to the customer-trip level. 
Ensure aggregation was successful by 1) confirming the \"customer_data\" object is a data frame and 2) displaying the dimensions of the data frame.\n\n\n```R\ndata \u003c- read.csv(\"SC_data_CleanedUp.csv\")\n\ncustomer_data \u003c- data %\u003e% \n  group_by(PNRLocatorID,CustID) %\u003e%\n  summarise(PaxName = first(PaxName),\n            BookingChannel = first(BookingChannel), \n            amt = max(TotalDocAmt), \n            UFlyRewards = first(UFlyRewardsNumber), \n            UflyMemberStatus = first(UflyMemberStatus), \n            age_group = first(age_group), \n            true_origin = first(true_origin), \n            true_destination = first(true_destination), \n            round_trip = first(round_trip), \n            group_size = first(group_size), \n            group = first(group), \n            seasonality = first(Seasonality), \n            days_pre_booked = max(days_pre_booked))\n\nis.data.frame(customer_data)\n\nTRUE\n\ndim(customer_data)\n\n17946  15\n```\n\n\nRemove unnecessary variables (encrypted names, customer IDs, etc.). Normalize the amt, days_pre_booked, and group_size variables.\n\n\n```R\nclustering_data \u003c- subset(customer_data,\n  select = -c(PNRLocatorID,CustID,PaxName,UFlyRewards))\n\nnormalize \u003c- function(x){\n  return ((x - min(x))/(max(x) - min(x)))\n  }\n\nclustering_data = mutate(clustering_data,\n  amt = normalize(amt),\n  days_pre_booked = normalize(days_pre_booked), \n  group_size = normalize(group_size))\n```\n\nSince the k-means clustering algorithm works only with numerical data, we need to convert each of the categorical factor levels into numerical dummy variables (\"0\" or \"1\"). 
The ade4 package will be used to convert the categorical data into these numerical dummy variables.\n\n\n```R\nclustering_data \u003c- as.data.frame(clustering_data)\nclustering_data \u003c- clustering_data %\u003e% \n  cbind(acm.disjonctif(clustering_data[,\n    c(\"BookingChannel\",\"age_group\",\n    \"true_origin\",\"true_destination\",\n    \"UflyMemberStatus\",\"seasonality\")])) %\u003e%\n  ungroup()\n```\n\nFor cleanliness, remove the original, non-dummy-coded variables.\n\n```R\nclustering_data \u003c- clustering_data %\u003e%\n  select(-BookingChannel,-age_group,-true_origin,\n         -true_destination,-UflyMemberStatus,-seasonality)\n```\n\nWe'll now run k-means for 1 to 15 clusters to understand the within-cluster SSE curve, then plot the curve to see how SSE changes as clusters are added.\n\n_Note: The sum of squared errors (or \"SSE\") is the sum of the squared differences between each observation and its cluster's mean. In the context of this clustering analysis, SSE is used as a measure of variation. If all observations within a cluster are identical, the SSE would be equal to 0. Modeling for \"the lowest SSE possible\" is not ideal, as this results in model overfitting._\n\n\n```R\n# Total within-cluster SSE for each candidate number of clusters\nSSE_curve \u003c- c()\nfor (n in 1:15) {\n  kcluster \u003c- kmeans(clustering_data, n)\n  # Sum the per-cluster SSE values to get the total for n clusters\n  sse \u003c- sum(kcluster$withinss)\n  SSE_curve[n] \u003c- sse\n}\n\nplot(1:15, SSE_curve, type=\"b\", main=\"SSE Curve for Ideal k-Value\",\n  xlab=\"Number of Clusters\", ylab=\"Sum of Squared Errors (SSE)\")\n```\n\n\n![png](output_9_0.png)\n\n\nGiven the plot above, the reduction in SSE from each additional cluster shrinks noticeably after ~5 clusters. 
Let's select 5 clusters for the purpose of this customer segmentation.\n\n\n```R\nkcluster \u003c- kmeans(clustering_data, 5)\n\n# Print the size of each cluster:\n\nkcluster$size\n\n3909  3384  2325  4656  3672\n```\n\nLastly, add a new column named \"Segment\" holding the cluster assignment for each observation in customer_data, and write the result to a CSV file. Running the code below completes the analysis.\n\n\n```R\nsegment \u003c- as.data.frame(kcluster$cluster)\ncolnames(segment) \u003c- \"Segment\" \ncustomer_segment_data \u003c- cbind.data.frame(customer_data, segment)\nwrite.csv(customer_segment_data, \"SC_customer_segment_data.csv\")\n```\n\n## Acknowledgments\n\n* [UC Irvine, Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)\n* [UC Irvine, MS in Business Analytics curriculum](https://merage.uci.edu/programs/masters/master-science-business-analytics/curriculum.html)\n* [University of Minnesota, Carlson Analytics Lab](https://carlsonschool.umn.edu/news/sun-country-airlines-engages-business-analytics-students-decode-data)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjavorraca%2Funsupervised-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjavorraca%2Funsupervised-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjavorraca%2Funsupervised-ml/lists"}