{"id":21009550,"url":"https://github.com/memgonzales/meta-learning-clustering","last_synced_at":"2026-02-06T17:03:07.938Z","repository":{"id":112399008,"uuid":"510684335","full_name":"memgonzales/meta-learning-clustering","owner":"memgonzales","description":"Presented at the 2022 IEEE Region 10 Conference (TENCON 2022). Our main contribution is twofold: (1) the construction of a meta-learning model for recommending a distance metric for k-means clustering and (2) a fine-grained analysis of the importance and effects of the meta-features on the model's output","archived":false,"fork":false,"pushed_at":"2024-05-05T09:59:21.000Z","size":102615,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-07-03T18:58:38.170Z","etag":null,"topics":["clustering","distance-metric","k-means","k-means-clustering","machine-learning","meta-features","meta-learning","random-forest"],"latest_commit_sha":null,"homepage":"https://doi.org/10.1109/TENCON55691.2022.9978037","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/memgonzales.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-07-05T10:14:01.000Z","updated_at":"2024-05-05T09:59:29.000Z","dependencies_parsed_at":"2025-03-02T04:53:15.896Z","dependency_job_id":null,"html_url":"https://github.com/memgonzales/meta-learning-clustering","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/memgonzales/meta-learning-clustering","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/memgonzales%2Fmeta-learning-clustering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/memgonzales%2Fmeta-learning-clustering/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/memgonzales%2Fmeta-learning-clustering/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/memgonzales%2Fmeta-learning-clustering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/memgonzales","download_url":"https://codeload.github.com/memgonzales/meta-learning-clustering/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/memgonzales%2Fmeta-learning-clustering/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29169384,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T16:33:35.550Z","status":"ssl_error","status_checked_at":"2026-02-06T16:33:30.716Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","distance-metric","k-means","k-means-clustering","machine-learning","meta-features","meta-learning","random-forest"],"created_at":"2024-11-19T09:17:11.518Z","updated_at":"2026-02-06T17:03:07.918Z","avatar_url":"https://github.com/memgonzales.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Distance Metric Recommendation for $k$-Means Clustering: A Meta-Learning Approach\n\n![badge][badge-jupyter]\n![badge][badge-python]\n![R](https://img.shields.io/badge/r-%23276DC3.svg?style=flat\u0026logo=r\u0026logoColor=white)\n![badge][badge-pandas]\n![badge][badge-numpy]\n![badge][badge-scipy]\n![scikit-learn](https://img.shields.io/badge/scikit--learn-%23F7931E.svg?style=flat\u0026logo=scikit-learn\u0026logoColor=white)\n\n**This work was accepted for paper presentation at the 2022 IEEE Region 10 Conference ([TENCON 2022](https://www.ieeer10.org/wp-content/uploads/2021/03/2022-TENCON-Hong-Kong-Section-updated2.pdf)), held virtually and in-person in Hong Kong:**\n\n- The final version of our paper (as published in the conference proceedings of TENCON 2022) can be accessed via this [link](https://ieeexplore.ieee.org/abstract/document/9978037).\n  - Our preprint can be accessed via this [link](https://github.com/memgonzales/meta-learning-clustering/blob/master/Distance%20Metric%20Recommendation%20for%20k-Means%20Clustering%20A%20Meta-Learning%20Approach.pdf).\n  - Our TENCON 2022 presentation slides can be accessed via this [link](https://github.com/memgonzales/meta-learning-clustering/blob/master/Presentation%20Slides.pdf).\n- Our [dataset of datasets](https://github.com/memgonzales/meta-learning-clustering/tree/master/dataset_of_datasets) is publicly released for future researchers. \n- Kindly refer to [`0. Directory.ipynb`](https://github.com/memgonzales/meta-learning-clustering/blob/master/0.%20Directory.ipynb) for a guide on navigating through this repository.\n\nIf you find our work useful, please consider citing:\n```\n@INPROCEEDINGS{9978037,\n  author={Gonzales, Mark Edward M. and Uy, Lorene C. and Sy, Jacob Adrianne L. and Cordel, Macario O.},\n  booktitle={TENCON 2022 - 2022 IEEE Region 10 Conference (TENCON)}, \n  title={Distance Metric Recommendation for k-Means Clustering: A Meta-Learning Approach}, \n  year={2022},\n  pages={1-6},\n  doi={10.1109/TENCON55691.2022.9978037}}\n```\n\nThis repository is also archived on [Zenodo](https://doi.org/10.5281/zenodo.7880146).\n\n## Description\n\n**ABSTRACT:** The choice of distance metric impacts the clustering quality of centroid-based algorithms, such as $k$-means. Theoretical attempts to select the optimal metric entail deep domain knowledge, while experimental approaches are resource-intensive. This paper presents a meta-learning approach to automatically recommend a distance metric for $k$-means clustering that optimizes the Davies-Bouldin score. Three distance measures were considered: Chebyshev, Euclidean, and Manhattan. General, statistical, information-theoretic, structural, and complexity meta-features were extracted, and random forest was used to construct the meta-learning model; borderline SMOTE was applied to address class imbalance. The model registered an accuracy of 70.59%. Employing Shapley additive explanations, it was found that the mean of the sparsity of the attributes has the highest meta-feature importance. Feeding only the top 25 most important meta-features increased the accuracy to 71.57%. The main contribution of this paper is twofold: the construction of a meta-learning model for distance metric recommendation and a fine-grained analysis of the importance and effects of the meta-features on the model’s output.\n\n**INDEX TERMS:** meta-learning, meta-features, $k$-means, clustering, distance metric, random forest\n\n\u003cimg src=\"https://github.com/memgonzales/meta-learning-clustering/blob/master/figures/fig.PNG?raw=True\" alt=\"App Screenshots\" width = 750\u003e \n\n\n## Authors\n\n- \u003cb\u003eMark Edward M. Gonzales\u003c/b\u003e \u003cbr/\u003e\n  mark_gonzales@dlsu.edu.ph \u003cbr/\u003e\n  \n- \u003cb\u003eLorene C. Uy\u003c/b\u003e \u003cbr/\u003e\n  lorene_c_uy@dlsu.edu.ph \u003cbr/\u003e\n\n- \u003cb\u003eJacob Adrianne L. Sy\u003c/b\u003e \u003cbr/\u003e\n  jacob_adrianne_l_sy@dlsu.edu.ph \u003cbr/\u003e\n\n- \u003cb\u003eDr. Macario O. Cordel, II\u003c/b\u003e\u003cbr/\u003e\n  macario.cordel@dlsu.edu.ph\n  \nThis is the major course output in a machine learning class for master's students under Dr. Macario O. Cordel, II of the Department of Computer Technology, De La Salle University. The task is to create a ten-week investigatory project that applies machine learning to a particular research area or offers a substantial theoretical or algorithmic contribution to existing machine learning techniques.\n\n[badge-jupyter]: https://img.shields.io/badge/Jupyter-F37626.svg?\u0026style=flat\u0026logo=Jupyter\u0026logoColor=white\n[badge-python]: https://img.shields.io/badge/python-3670A0?style=flat\u0026logo=python\u0026logoColor=white\n[badge-pandas]: https://img.shields.io/badge/Pandas-2C2D72?style=flat\u0026logo=pandas\u0026logoColor=white\n[badge-numpy]: https://img.shields.io/badge/Numpy-777BB4?style=flat\u0026logo=numpy\u0026logoColor=white\n[badge-scipy]: https://img.shields.io/badge/SciPy-654FF0?style=flat\u0026logo=SciPy\u0026logoColor=white\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmemgonzales%2Fmeta-learning-clustering","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmemgonzales%2Fmeta-learning-clustering","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmemgonzales%2Fmeta-learning-clustering/lists"}