https://github.com/gagolews/clustering-data-v1
A framework for benchmarking clustering algorithms – Benchmark suite, version 1
https://github.com/gagolews/clustering-data-v1
benchmark benchmark-datasets clustering data dataset datasets machine-learning
Last synced: 30 days ago
JSON representation
A framework for benchmarking clustering algorithms – Benchmark suite, version 1
- Host: GitHub
- URL: https://github.com/gagolews/clustering-data-v1
- Owner: gagolews
- License: other
- Created: 2022-08-17T06:00:06.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2023-10-22T03:24:06.000Z (over 1 year ago)
- Last Synced: 2025-04-13T01:46:23.373Z (about 1 month ago)
- Topics: benchmark, benchmark-datasets, clustering, data, dataset, datasets, machine-learning
- Language: Jupyter Notebook
- Homepage: https://clustering-benchmarks.gagolewski.com/
- Size: 173 MB
- Stars: 8
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# [A Framework for Benchmarking Clustering Algorithms](https://clustering-benchmarks.gagolewski.com)
## Benchmark Suite Version 1 (with updates)The aim of this project is to **aggregate, polish, and standardise the
existing clustering benchmark batteries** referred to across the machine
learning and data mining literature, and to introduce **new datasets**
of different dimensionalities, sizes, and cluster types.This repository is part of the
[**Framework for Benchmarking Clustering Algorithms**](https://clustering-benchmarks.gagolewski.com).
It hosts the datasets from version 1 of the benchmark suite.Refer to
for a detailed description, file format specification,
example Python/R/MATLAB code, datasets explorer,
and literature references.**Editor/Maintainer**:
[Marek Gagolewski](https://www.gagolewski.com).**How to Cite**: Please cite the following paper which describes
the overall benchmarking methodology:> Gagolewski M., A framework for benchmarking clustering algorithms,
*SoftwareX* **20**, 2022, 101270, ,
DOI: [10.1016/j.softx.2022.101270](https://doi.org/10.1016/j.softx.2022.101270).Additionally, mention the exact version of this benchmark suite
(see *Changelog* below for version information):> Gagolewski M. et al. (Eds.), *A benchmark suite for clustering algorithms:
Version 1.1.0*, 2022,
,
DOI: [10.5281/zenodo.7088171](https://doi.org/10.5281/zenodo.7088171).The datasets are provided **solely for research purposes**,
unless stated otherwise. Please cite the literature references mentioned
in the corresponding dataset description files in any publications
that make use of these.## Changelog
The datasets and the reference labels included in this suite
are versioned. This ensures reproducibility.See for
downloadable snapshots.### 1.1.0 (2022-09-17)
- Each battery is now equipped with a README.txt file.
- New label vectors:
wut/x2.labels1,
wut/x3.labels1.- Prettified (slightly) label vectors:
graves/fuzzyx.labels[1-4],
graves/parabolic.labels1.- Deleted now redundant label vectors:
graves/fuzzyx.labels5.- The historical snapshot of this release is available at
DOI: [10.5281/zenodo.7088171](https://doi.org/10.5281/zenodo.7088171).### 1.0.1 (2022-09-10)
- Updated dataset description files, e.g., fixed broken links.
- The code and the data repositories were separated; see
and
.- The project's homepage has been created. It is available at
.- The historical snapshot of this release is available at
DOI: [10.5281/zenodo.7066690](https://doi.org/10.5281/zenodo.7066690).### 1.0.0 (2020-05-08)
- Datasets in the 1st (v1.0.0) version of the benchmark
battery are now frozen.- The historical snapshot of this release is available at
DOI: [10.5281/zenodo.3815066](https://doi.org/10.5281/zenodo.3815066).### 0.0.0 (2015-12-29)
- Version 0 of the benchmark suite consists of the datasets
studied in: Gagolewski M., Bartoszuk M., Cena A.,
Genie: A new, fast, and outlier-resistant hierarchical
clustering algorithm, *Information Sciences* **363**, 2016, pp. 8–23,
DOI: [10.1016/j.ins.2016.05.003](https://doi.org/10.1016/j.ins.2016.05.003).
The datasets have been archived at
.## See Also
gives a detailed description
of the whole framework for benchmarking clustering algorithms.It also mentions where to find raw and aggregated results generated
by many clustering methods when run on the datasets from this repository.