{"id":21432462,"url":"https://github.com/gagolews/clustering-data-v0","last_synced_at":"2025-09-15T03:48:17.408Z","repository":{"id":151189510,"uuid":"414444146","full_name":"gagolews/clustering-data-v0","owner":"gagolews","description":"Datasets for Clustering [DEPRECATED – A NEW VERSION IS AVAILABLE]","archived":false,"fork":false,"pushed_at":"2022-09-10T11:02:39.000Z","size":39942,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-07T09:04:22.218Z","etag":null,"topics":["clustering","data","dataset","machine-learning"],"latest_commit_sha":null,"homepage":"https://clustering-benchmarks.gagolewski.com/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gagolews.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-07T03:06:35.000Z","updated_at":"2022-09-10T11:04:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"0501ed5c-23e3-446f-b868-67a3f367c3dd","html_url":"https://github.com/gagolews/clustering-data-v0","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gagolews/clustering-data-v0","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gagolews%2Fclustering-data-v0","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gagolews%2Fclustering-data-v0/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gagolews%2Fclustering-data-v0/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/gagolews%2Fclustering-data-v0/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gagolews","download_url":"https://codeload.github.com/gagolews/clustering-data-v0/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gagolews%2Fclustering-data-v0/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275202444,"owners_count":25423009,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-15T02:00:09.272Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","data","dataset","machine-learning"],"created_at":"2024-11-22T23:18:38.461Z","updated_at":"2025-09-15T03:48:17.397Z","avatar_url":"https://github.com/gagolews.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A Benchmark Suite for Clustering Algorithms - Version 0 [DEPRECATED]\n\n## Important Note\n\nThis list has been superseded by the\n[Framework for Benchmarking Clustering Algorithms](https://clustering-benchmarks.gagolewski.com/)!\n\n## General Remarks\n\nIf used in publications (as a whole), please cite this dataset battery as: Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, *Information Sciences* **363**, 2016, pp. 
8-23, doi:[10.1016/j.ins.2016.05.003](https://dx.doi.org/10.1016/j.ins.2016.05.003).\n\nEach dataset consists of a text data file storing an n * d matrix (n observations in a d-dimensional space) and a corresponding labels file with n integer labels from the set 1,…,k, where k is the number of underlying clusters.\n\n## Datasets\n\n### MNIST Handwritten Digits (images)\n\nDownload files:\n\n* digits70k_pixels.data.gz (15 MB), digits70k_pixels.labels.gz (37 kB), n=70000, d=784, k=10,\n* digits2k_pixels.data.gz (440 kB), digits2k_pixels.labels.gz (1 kB), n=2000, d=784, k=10.\n\nThese data come from [The MNIST database](http://yann.lecun.com/exdb/mnist/)\nof handwritten digits by Yann LeCun, Corinna Cortes,\nand Christopher J.C. Burges. The dataset was originally released\nin the form of binary files.\n\n`digits70k_pixels` consists of 70000 images of 28x28 pixels from the MNIST database, in order of appearance: 30000 SD-3 training patterns, 30000 SD-1 training patterns, 5000 SD-3 test patterns, and 5000 SD-1 test patterns. 
Moreover, `digits2k_pixels` contains the first 2000 objects from `digits70k_pixels`.\n\nTo import the dataset in Python, execute:\n\n```python\nimport numpy as np\ndata = np.loadtxt(\"digits2k_pixels.data.gz\", ndmin=2)/255.0\ndata.shape = (data.shape[0], int(np.sqrt(data.shape[1])), int(np.sqrt(data.shape[1])))\nlabels = np.loadtxt(\"digits2k_pixels.labels.gz\", dtype='int')\n# display:\nimport matplotlib.pyplot as plt\ni = 122\nprint(labels[i])\nplt.imshow(data[i,:,:], cmap=plt.get_cmap(\"gray\"))\nplt.show()\n```\n\nTo do the same in R, write:\n\n```r\ndata \u003c- as.matrix(read.table(gzfile(\"digits2k_pixels.data.gz\")))/255\ndim(data) \u003c- c(nrow(data), 28, 28)\nlabels \u003c- scan(gzfile(\"digits2k_pixels.labels.gz\"), quiet=TRUE)\n# draw:\ni \u003c- 123\npar(mar=rep(0,4))\nimage(data[i,,], asp=1, col=gray.colors(256), ylim=c(1,0), axes=FALSE)\n```\n\nDistribution of labels:\n\n```\n##                     0    1    2    3    4    5    6    7    8    9\n## digits2k_pixels   191  220  198  191  214  180  200  224  172  210\n## digits70k_pixels 6903 7877 6990 7141 6824 6313 6876 7293 6825 6958\n```\n\n### MNIST Handwritten Digits (point sets)\n\nDownload files:\n\n\n* digits70k_points.data.gz (18 MB), digits70k_points.labels.gz (37 kB), n=70000, d=2, k=10,\n* digits2k_points.data.gz (555 kB), digits2k_points.labels.gz (1 kB), n=2000, d=2, k=10.\n\nBased on the aforementioned dataset, we can represent each digit as a set of\npoints in R². A brightness cutoff of 0.75 was used to generate the data.\nEach digit was shifted and scaled.\n\nWarning. Each data file has 3 columns. The 1st column identifies the digit\n(one of 70000 or 2000) to which a point belongs; the 2nd and 3rd columns\ngive the point's x and y coordinates, respectively. 
Therefore, the dataset must be preprocessed before use.\n\nTo do so in R, execute:\n\n```r\ndata \u003c- as.matrix(read.table(gzfile(\"digits2k_points.data.gz\")))\ndata \u003c- lapply(split(data[,-1], data[,1]), function(digit) matrix(digit, ncol=2))\n# now data is a list of 2-column matrices\nlabels \u003c- scan(gzfile(\"digits2k_points.labels.gz\"), quiet=TRUE)\n# draw:\ni \u003c- 123\npar(mar=rep(0,4))\nplot(data[[i]][,1], data[[i]][,2], asp=1, axes=FALSE, ann=FALSE, pch=16)\n```\n\nEquivalent Python code:\n\n```python\nimport numpy as np\ndata = np.loadtxt(\"digits2k_points.data.gz\", ndmin=2)\nlabels = np.loadtxt(\"digits2k_points.labels.gz\", dtype='int')\n# split the rows wherever the digit id in the 1st column changes:\nbrk, = np.nonzero(np.diff(data[:,0]))\ndata = np.array_split(data[:,1:], brk+1, 0)\n# draw:\nimport matplotlib.pyplot as plt\ni = 122\nfig = plt.figure()\nfig.add_subplot(111, aspect='equal')\nplt.scatter(data[i][:,0], data[i][:,1])\nplt.show()\n```\n\n\nLabel distribution:\n\n\n```\n##                     0    1    2    3    4    5    6    7    8    9\n## digits2k_points   191  220  198  191  214  180  200  224  172  210\n## digits70k_points 6903 7877 6990 7141 6824 6313 6876 7293 6825 6958\n```\n\nIn this case, consider using the Hausdorff (e.g., Euclidean-based)\ndistance; see `hausdorff.cpp` for a few auxiliary Rcpp routines.\n\n\n### Iris(es)\n\n\nDownload files:\n\n* iris.data.gz (681 B), iris.labels.gz (31 B), n=150, d=4, k=3,\n* iris5.data.gz (520 B), iris5.labels.gz (30 B), n=105, d=4, k=3.\n\n\nThis is the famous Fisher’s *iris* dataset, available in the R `datasets`\npackage. `iris5` is an imbalanced version in which we keep\nonly the last 5 observations from the 1st group (*Iris setosa*).\n\n\nDistribution of labels:\n\n```\n##        1  2  3\n## iris  50 50 50\n## iris5  5 50 50\n```\n\n### SIPU Benchmark Data\n\nProf. P. Fränti and his colleagues from the University of\nEastern Finland prepared a list of example benchmarks, which is available\n[here](http://cs.joensuu.fi/sipu/datasets/). 
As some of the datasets\ncome with no labels, we make them available here in a concise format.\nWe include only datasets with n \u003c= 10000 for which some\nhierarchical clustering algorithms had trouble recovering the\nreference labels.\n\n\n#### S-sets\n\nDownload files:\n\n* s1.data.gz (34 kB), s1.labels.gz (83 B), n=5000, d=2, k=15\n* s2.data.gz (35 kB), s2.labels.gz (83 B), n=5000, d=2, k=15\n* s3.data.gz (35 kB), s3.labels.gz (83 B), n=5000, d=2, k=15\n* s4.data.gz (35 kB), s4.labels.gz (83 B), n=5000, d=2, k=15\n\n\nSource: P. Fränti, O. Virmajoki, Iterative shrinking method for clustering problems, *Pattern Recognition*, **39**(5), 2006, pp. 761-765.\n\nDistribution of labels:\n\n```\n##      1   2   3   4   5   6   7   8   9  10  11  12  13  14  15\n## s1 300 316 314 318 325 326 334 338 341 342 347 349 350 350 350\n## s2 300 317 315 320 321 329 334 333 340 345 346 350 350 350 350\n## s3 300 321 316 323 322 331 333 337 334 337 346 350 350 350 350\n## s4 300 316 327 320 323 324 327 336 337 344 347 350 349 350 350\n```\n\n#### A-sets\n\nDownload files:\n\n* a1.data.gz (17 kB), a1.labels.gz (82 B), n=3000, d=2, k=20\n* a2.data.gz (29 kB), a2.labels.gz (112 B), n=5250, d=2, k=35\n* a3.data.gz (41 kB), a3.labels.gz (144 B), n=7500, d=2, k=50\n\n\nSource: I. Kärkkäinen, P. 
Fränti, *Dynamic local search algorithm for the clustering problem*, Research Report A-2002-6.\n\nDistribution of labels: Classes are fully balanced.\n\n\n#### G2-sets\n\nDownload files:\n\n* g2-2-100.data.gz (7 kB), g2-2-100.labels.gz (43 B), n=2048, d=2, k=2\n* g2-16-100.data.gz (52 kB), g2-16-100.labels.gz (43 B), n=2048, d=16, k=2\n* g2-64-100.data.gz (200 kB), g2-64-100.labels.gz (43 B), n=2048, d=64, k=2\n\nGaussian clusters of varying dimensions, high variance.\n\nDistribution of labels: Classes are fully balanced.\n\n\n#### Other\n\nDownload files:\n\n* unbalance.data.gz (37 kB), unbalance.labels.gz (65 B), n=6500, d=2, k=8\n* Aggregation.data.gz (3 kB), Aggregation.labels.gz (48 B), n=788, d=2, k=7\n* Compound.data.gz (1 kB), Compound.labels.gz (43 B), n=399, d=2, k=6\n* pathbased.data.gz (1 kB), pathbased.labels.gz (36 B), n=300, d=2, k=3\n* spiral.data.gz (1 kB), spiral.labels.gz (31 B), n=312, d=2, k=3\n* D31.data.gz (20 kB), D31.labels.gz (97 B), n=3100, d=2, k=31\n* R15.data.gz (3 kB), R15.labels.gz (63 B), n=600, d=2, k=15\n* flame.data.gz (878 B), flame.labels.gz (35 B), n=240, d=2, k=2\n* jain.data.gz (1 kB), jain.labels.gz (31 B), n=373, d=2, k=2\n\n\nSources:\n\n* A. Gionis, H. Mannila, P. Tsaparas, Clustering aggregation, *ACM\n    Transactions on Knowledge Discovery from Data (TKDD)*, 2007, pp. 1-30.\n* C.T. Zahn, Graph-theoretical methods for detecting and describing gestalt\n    clusters, *IEEE Transactions on Computers* **C-20**(1), 1971, pp. 68-86.\n* H. Chang, D.Y. Yeung, Robust path-based spectral clustering, *Pattern\n    Recognition* **41**(1), 2008, pp. 191-203.\n* C.J. Veenman, M.J.T. Reinders, E. Backer, A maximum variance cluster\n    algorithm, *IEEE Transactions on Pattern Analysis and Machine\n    Intelligence* **24**(9), 2002, pp. 1273-1280.\n* A. Jain, M. Law, Data clustering: A user’s dilemma, *Lecture Notes\n    in Computer Science* **3776**, 2005, pp. 1-10.\n* L. Fu, E. 
Medico, FLAME, a novel fuzzy clustering method for the analysis\n    of DNA microarray data, *BMC bioinformatics* **8**, 2007, p. 3.\n\nLabel distributions:\n\n```\n##                     1    2    3   4   5   6   7   8\n## unbalance        2000 2000 2000 100 100 100 100 100\n##\n##                   1   2   3   4  5   6  7\n## Aggregation      45 170 102 273 34 130 34\n##\n##                   1  2  3  4   5  6\n## Compound         50 92 38 45 158 16\n##\n##                    1  2  3\n## pathbased        110 97 93\n##\n##                    1   2   3\n## spiral           101 105 106\n##\n## D31              balanced\n##\n## R15              balanced\n##\n##                   1   2\n## flame            87 153\n##\n##                    1  2\n## jain             276 97\n```\n\n\n### Character Strings\n\n#### ACTG Sequences\n\nDownload files:\n\n\n* actg1.data.gz (77 kB), actg1.labels.gz (2 kB), n=2500, mean d=99.9, k=20\n* actg2.data.gz (149 kB), actg2.labels.gz (1 kB), n=2500, mean d=199.9, k=5\n* actg3.data.gz (187 kB), actg3.labels.gz (1 kB), n=2500, mean d=250.2, k=10\n\n\nThe datasets consist of character strings (of varying lengths) over the\n{a,c,t,g} alphabet. First, *k* random strings (of identical lengths)\nwere generated for the purpose of being cluster centres. 
Each string in\nthe dataset was created by selecting a random cluster centre and then\nperforming many Levenshtein edit operations (character insertions,\ndeletions, substitutions) at randomly chosen positions.\n\nPreferably for use with the Levenshtein distance.\n\n```r\nlibrary(\"stringi\")\ndata \u003c- readLines(gzfile(\"actg1.data.gz\"))\nlabels \u003c- scan(gzfile(\"actg1.labels.gz\"), quiet=TRUE)\n# five observations in the 1st group:\ncat(data[labels==1][1:5], sep='\\n')\n## ctttctgtgctcgcgagctaaacgtgtgtaggcccttgtactacaaccaactgctagaatagtgacgcccctttgcctggcgcgccgctacttttagcgggcatgacg\n## ctttgatgtgctgaataatctcagggctgtgtactacatcaagtccaccactactagttggcgaccgctttcctagagacagcgcaagcattcacatacg\n## ccaccttatgctgcatgaacgggcggattggatctacaaccgcaattgctagaattcgcctcctttggacaattacgtgctacttaaagcgcctcg\n## cacttcatgaacggataccgatgtggggcatttgtactactccgaacactagcgattcgaccgcgttttctggacaacgccaagactgttttaacgtcaga\n## cctagtgcacgtgacacactggtgtggctgggtaacgtcccacaacacctgctagaatcgacccgcacttaggaacagcaagtactgttaagcgcattct\n```\n\n\nLabel distributions:\n\n```\n##                    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20\n## actg1            137 121 133 132 123 124 131 111 118 120 122 139 142 123 124 116 122 124 124 114\n##\n##                   1   2   3   4   5\n## actg2            50 246 571 783 850\n##\n##                   1   2   3   4   5   6   7   8  9 10\n## actg3            50 181 390 487 501 384 267 132 65 43\n```\n\n\n#### Binary Sequences\n\n\nDownload files:\n\n* binstr1.data.gz (44 kB), binstr1.labels.gz (2 kB), n=2500, d=100, k=25\n* binstr2.data.gz (85 kB), binstr2.labels.gz (1 kB), n=2500, d=200, k=5\n* binstr3.data.gz (105 kB), binstr3.labels.gz (1 kB), n=2500, d=250, k=10\n\nDatasets consist of character strings (each of the same length *d*) over the\n{0,1} alphabet. First, *k* random strings were generated for the purpose\nof being cluster centres. 
Each string in the dataset was created by selecting\na random cluster centre and then modifying digits at randomly chosen positions.\n\nPreferably for use with the Hamming distance.\n\n```r\nlibrary(\"stringi\")\ndata \u003c- readLines(gzfile(\"binstr1.data.gz\"))\nlabels \u003c- scan(gzfile(\"binstr1.labels.gz\"), quiet=TRUE)\n# 1st cluster median (w.r.t. the Hamming distance)\nmode \u003c- function(x) { t \u003c- table(x); names(t)[which.max(t)] }\ncat(stri_flatten(apply(do.call(rbind, stri_split_boundaries(data[labels==1],\ntype=\"character\")), 2, mode)))\n## 0101101110101101000111111111001111001000000000000100101001101000101110111000010001010011100101001001\n# five observations in the 1st group:\ncat(\"\\n\", data[labels==1][1:5], sep='\\n')\n## 0101001000111001001011111110001111101100100000101100101000100000111110111011000001111010000101101011\n## 0011101010111001000011100001101111010000000111001100100001111001110110101000000000010001110001001100\n## 0010100100100101000111001110011111001000110001000110011001101011100110111100010001110111100101001001\n## 0101001001000001000011001001001111000011000010010101111100101110101110111010000001000011000101001001\n## 1101001001001100010111011111011001111000001100000100001001101000000010111000110001010011110110000001\n```\n\nLabel distributions:\n\n```\n##                   1   2   3   4   5  6   7  8   9  10 11 12  13  14 15  16  17  18 19 20 21  22 23  24  25\n## binstr1          97 112 112 101 104 91 106 88 105 104 86 95 113 107 76 101 110 105 98 90 76 108 91 111 113\n##\n##                   1   2   3   4   5\n## binstr2          51 267 540 756 886\n##\n##                   1  2   3   4   5   6   7   8   9  10\n## binstr3          12 90 220 332 467 446 381 277 175 100\n```\n\n\n## Other\n\nFor more benchmark data, see:\n\n* [A Framework for Benchmarking Clustering Algorithms](https://clustering-benchmarks.gagolewski.com/)\n\n* [A Benchmark Suite for Clustering Algorithms - Version 
1](https://github.com/gagolews/clustering-data-v1)\n\n* [SIPU datasets](http://cs.joensuu.fi/sipu/datasets/) – by P. Fränti (et al.)\n\n* [The Fundamental Clustering Problems Suite (FCPS)](https://www.uni-marburg.de/fb12/arbeitsgruppen/datenbionik/data?language_sync=1) – by A. Ultsch\n\n* [CLUTO Datasets](http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download) by G. Karypis (et al.)\n\n* Graves D., Pedrycz W., Kernel-based fuzzy clustering and fuzzy clustering:\n    A comparative experimental study, *Fuzzy Sets and Systems* **161**(4), 2010, pp. 522-543.\n    \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgagolews%2Fclustering-data-v0","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgagolews%2Fclustering-data-v0","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgagolews%2Fclustering-data-v0/lists"}