{"id":27424377,"url":"https://github.com/xhan97/streakhc","last_synced_at":"2025-04-14T11:49:24.543Z","repository":{"id":49524027,"uuid":"251657265","full_name":"xhan97/StreaKHC","owner":"xhan97","description":"A novel incremental hierarchical clustering algorithm (KDD 22)","archived":false,"fork":false,"pushed_at":"2024-12-18T06:06:42.000Z","size":113550,"stargazers_count":4,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-12-18T07:22:47.659Z","etag":null,"topics":["heirarchical-clustering","kernel-methods","stream-clustering"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xhan97.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-03-31T16:05:21.000Z","updated_at":"2024-06-20T08:59:53.000Z","dependencies_parsed_at":"2023-10-27T09:26:33.950Z","dependency_job_id":null,"html_url":"https://github.com/xhan97/StreaKHC","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xhan97%2FStreaKHC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xhan97%2FStreaKHC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xhan97%2FStreaKHC/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xhan97%2FStreaKHC/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xhan97","download_url":"https://codeload.github.com/xhan97/StreaKHC/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"ht
tps://github.com","kind":"github","repositories_count":248877515,"owners_count":21176235,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["heirarchical-clustering","kernel-methods","stream-clustering"],"created_at":"2025-04-14T11:49:23.984Z","updated_at":"2025-04-14T11:49:24.496Z","avatar_url":"https://github.com/xhan97.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# StreaKHC #\n\n**StreaKHC** is a novel incremental hierarchical clustering algorithm for efficiently mining massive streaming data. It uses a scalable point-set kernel to measure the similarity between an existing cluster in the cluster tree and a new point in a stream. It also has an efficient hierarchical structure updating mechanism to continuously maintain a high-quality cluster tree in real time. Technical details and analysis of the algorithm can be found in the paper cited below.\n\n## Setup ##\n\nDownload and install Anaconda's Python 3:\n\n```\nhttps://docs.continuum.io/anaconda/install\n```\n\nInstall numba:\n\n```\nconda install numba\n```\n\nSet environment variables:\n\n```\nsource bin/setup.sh\n```\n\nIf you want to visualize the built tree, install Graphviz:\n\n```\nsudo apt install graphviz\n```\n\n## Run test ##\n\nRun the test on a data set:\n```\n ./bin/run_grid_evaluation.sh\n```\n\nBy default, the evaluation results are written to /exp_out/. 
For each randomly shuffled copy of the specified data set, the dendrogram purity result and a figure of the built tree are written to score.tsv and tree.png, respectively.\n\n## Notes ##\n\n  - If you do not need to visualize the generated tree, you can comment out the corresponding code in /bin/run_evaluation.sh.\n  - Perl is used to shuffle the data. You'll need Perl installed on your system to run the experiment shell scripts. If you can't run Perl, you can change this to another shuffling method of your choice.\n  - The scripts in this project use environment variables set in the setup script. You'll need to source this setup script in each shell session that runs this project.\n  - Most of the program's running time is spent calculating dendrogram purity.\n\n## Citing ##\nIf you have used this codebase in a scientific publication and wish to\ncite it, please use the following publication (BibTeX format):\n\n```bibtex\n@inproceedings{HZTZL22Streaming,\n     author = {Han, Xin and Zhu, Ye and Ting, Kai Ming and Zhan, De-Chuan and Li, Gang},\n     title = {Streaming Hierarchical Clustering Based on Point-Set Kernel},\n     year = {2022},\n     isbn = {9781450393850},\n     publisher = {Association for Computing Machinery},\n     address = {New York, NY, USA},\n     url = {https://doi.org/10.1145/3534678.3539323},\n     doi = {10.1145/3534678.3539323},\n     pages = {525–533},\n     numpages = {9},\n     keywords = {streaming data, hierarchical clustering, isolation kernel},\n     location = {Washington DC, USA},\n     series = {KDD '22}\n}\n ```\n\n## License ##\n\nApache License, Version 2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxhan97%2Fstreakhc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxhan97%2Fstreakhc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxhan97%2Fstreakhc/lists"}