{"id":13710304,"url":"https://github.com/awslabs/amazon-denseclus","last_synced_at":"2025-05-06T19:30:45.513Z","repository":{"id":40279867,"uuid":"392481310","full_name":"awslabs/amazon-denseclus","owner":"awslabs","description":"Clustering for mixed-type data","archived":false,"fork":false,"pushed_at":"2024-07-29T22:54:49.000Z","size":4752,"stargazers_count":99,"open_issues_count":11,"forks_count":21,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-18T06:10:55.303Z","etag":null,"topics":["clustering","embedding","machinelearning-python","python"],"latest_commit_sha":null,"homepage":"https://aws.amazon.com/blogs/opensource/introducing-denseclus-an-open-source-clustering-package-for-mixed-type-data/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit-0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/awslabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-08-03T23:11:08.000Z","updated_at":"2025-04-04T02:06:20.000Z","dependencies_parsed_at":"2023-12-17T14:23:31.704Z","dependency_job_id":"2b7ffe36-d0b0-40c7-a71d-5eec530a6bc4","html_url":"https://github.com/awslabs/amazon-denseclus","commit_stats":{"total_commits":31,"total_committers":9,"mean_commits":"3.4444444444444446","dds":0.7419354838709677,"last_synced_commit":"91a0b876cd0e17cb3e708662a648ea76c63156a8"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Famazon-denseclus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/awslabs%2Famazon-denseclus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Famazon-denseclus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Famazon-denseclus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/awslabs","download_url":"https://codeload.github.com/awslabs/amazon-denseclus/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252753175,"owners_count":21798927,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","embedding","machinelearning-python","python"],"created_at":"2024-08-02T23:00:54.229Z","updated_at":"2025-05-06T19:30:44.906Z","avatar_url":"https://github.com/awslabs.png","language":"Jupyter Notebook","funding_links":[],"categories":["Personalisation / Segmentation"],"sub_categories":[],"readme":"\n# Amazon DenseClus\n\n\u003cp align=\"left\"\u003e\n\u003ca href=\"https://github.com/awslabs/amazon-denseclus/actions/workflows/tests.yml\"\u003e\u003cimg alt=\"build\" src=\"https://github.com/awslabs/amazon-denseclus/actions/workflows/cd.yml/badge.svg\"\u003e\u003c/a\u003e\n\u003ca\u003e\u003cimg alt=\"total download\" src=\"https://static.pepy.tech/personalized-badge/amazon-denseclus?period=total\u0026units=international_system\u0026left_color=black\u0026right_color=green\u0026left_text=Total Downloads\"\u003e\u003c/a\u003e\n\u003ca\u003e\u003cimg alt=\"month download\" 
src=\"https://static.pepy.tech/personalized-badge/amazon-denseclus?period=month\u0026units=international_system\u0026left_color=black\u0026right_color=green\u0026left_text=Monthly Downloads\"\u003e\u003c/a\u003e\n\u003ca\u003e\u003cimg alt=\"weekly download\" src=\"https://static.pepy.tech/personalized-badge/amazon-denseclus?period=week\u0026units=international_system\u0026left_color=black\u0026right_color=green\u0026left_text=Weekly Downloads\"\u003e\u003c/a\u003e\n\u003ca href=\"https://badge.fury.io/py/Amazon-DenseClus\"\u003e\u003cimg alt=\"PyPI version\" src=\"https://badge.fury.io/py/Amazon-DenseClus.svg\"\u003e\u003c/a\u003e\n\u003ca\u003e\u003cimg alt=\"PyPI - Python Version\" src=\"https://img.shields.io/pypi/pyversions/Amazon-DenseClus\"\u003e\u003c/a\u003e\n\u003ca\u003e\u003cimg alt=\"PyPI - Wheel\" src=\"https://img.shields.io/pypi/wheel/Amazon-DenseClus\"\u003e\u003c/a\u003e\n\u003ca\u003e\u003cimg alt=\"PyPI - License\" src=\"https://img.shields.io/pypi/l/Amazon-DenseClus\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/psf/black\"\u003e\u003cimg alt=\"Code style: black\" src=\"https://img.shields.io/badge/code%20style-black-000000.svg\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/marketplace/actions/super-linter\"\u003e\u003cimg alt=\"Github Super-Linter\" src=\"https://github.com/awslabs/amazon-denseclus/workflows/Lint%20Code%20Base/badge.svg\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\n\nDenseClus is a Python module for clustering mixed type data using [UMAP](https://github.com/lmcinnes/umap) and [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan). 
Allowing for both categorical and numerical data, DenseClus makes it possible to incorporate all features in clustering.\n\n## Installation\n\n```bash\npython3 -m pip install amazon-denseclus\n```\n\n## Quick Start\n\nDenseClus requires a pandas DataFrame as input with both numerical and categorical columns.\nAll preprocessing and extraction are done under the hood; just call `fit` and then retrieve the clusters!\n\n```python\nfrom denseclus import DenseClus\nfrom denseclus.utils import make_dataframe\n\n\ndf = make_dataframe()\nclf = DenseClus()\nclf.fit(df)\n\nscores = clf.evaluate()\nprint(scores[0:10])\n```\n\n\n## Usage\n\n### Prediction\n\nDenseClus provides a `predict` method when `umap_combine_method` is set to `ensemble`.\nResults are returned in a 2D array, with the first part being the labels and the second part the probabilities.\n\n```python\nfrom denseclus import DenseClus\nfrom denseclus.utils import make_dataframe\n\nRANDOM_STATE = 10\n\ndf = make_dataframe(random_state=RANDOM_STATE)\ntrain = df.sample(frac=0.8, random_state=RANDOM_STATE)\ntest = df.drop(train.index)\nclf = DenseClus(random_state=RANDOM_STATE, umap_combine_method='ensemble')\nclf.fit(train)\n\npredictions = clf.predict(test)\nprint(predictions) # labels, probabilities\n```\n\n\n### On Combination Method\n\nFor slower but more **stable** results, select `intersection_union_mapper` to combine embedding layers via a third UMAP, which gives equal weight to both numerical and categorical columns. 
By default, the random seed is set, which prevents UMAP from running in parallel but helps circumvent some of [the randomness](https://umap-learn.readthedocs.io/en/latest/reproducibility.html) of the algorithm.\n\n```python\nclf = DenseClus(\n    umap_combine_method=\"intersection_union_mapper\",\n)\n```\n\n### To Use with GPU with Ensemble\n\nTo use a GPU, first have [RAPIDS installed](https://docs.rapids.ai/install#selector).\nYou can do this at setup by providing the CUDA version:\n`pip install amazon-denseclus[gpu-cu12]`\n\nThen to run:\n\n```python\nclf = DenseClus(\n    umap_combine_method=\"ensemble\",\n    use_gpu=True\n)\n```\n\n\n### Advanced Usage\n\nFor advanced users, it's possible to exert more fine-grained control over the underlying algorithms by passing\ndictionaries into the `DenseClus` class for either UMAP or HDBSCAN.\n\nFor example:\n```python\nfrom denseclus import DenseClus\nfrom denseclus.utils import make_dataframe\n\numap_params = {\n    \"categorical\": {\"n_neighbors\": 15, \"min_dist\": 0.1},\n    \"numerical\": {\"n_neighbors\": 20, \"min_dist\": 0.1},\n}\nhdbscan_params = {\"min_cluster_size\": 10}\n\ndf = make_dataframe()\n\nclf = DenseClus(\n    umap_combine_method=\"union\",\n    umap_params=umap_params,\n    hdbscan_params=hdbscan_params,\n    random_state=None,  # no fixed seed, so UMAP can run in parallel\n)\n\nclf.fit(df)\n```\n\n\n## Examples\n\n### Notebooks\n\nA hands-on example with an overview of how to use DenseClus is currently available as an [Example Jupyter Notebook](/notebooks/01_DenseClusExampleNB.ipynb).\n\nShould you need to tune HDBSCAN, here is an optional approach: [Tuning with HDBSCAN Notebook](/notebooks/02_TuningwithHDBSCAN.ipynb)\n\nShould you need to validate UMAP embeddings, there is an approach to do so in the [Validation for UMAP Notebook](/notebooks/03_ValidationForUMAP.ipynb)\n\n### Blogs\n\n\n[AWS Blog: Introducing DenseClus, an open source clustering package for mixed-type 
data](https://aws.amazon.com/blogs/opensource/introducing-denseclus-an-open-source-clustering-package-for-mixed-type-data/)\n\n[TDS Blog: How To Tune HDBSCAN](https://towardsdatascience.com/tuning-with-hdbscan-149865ac2970)\n\n[TDS Blog: On the Validation of UMAP](https://towardsdatascience.com/on-the-validating-umap-embeddings-2c8907588175)\n\n\n\n## References\n\n```bibtex\n@article{mcinnes2018umap-software,\n  title={UMAP: Uniform Manifold Approximation and Projection},\n  author={McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},\n  journal={The Journal of Open Source Software},\n  volume={3},\n  number={29},\n  pages={861},\n  year={2018}\n}\n```\n\n```bibtex\n@article{mcinnes2017hdbscan,\n  title={hdbscan: Hierarchical density based clustering},\n  author={McInnes, Leland and Healy, John and Astels, Steve},\n  journal={The Journal of Open Source Software},\n  volume={2},\n  number={11},\n  pages={205},\n  year={2017}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fawslabs%2Famazon-denseclus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fawslabs%2Famazon-denseclus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fawslabs%2Famazon-denseclus/lists"}