anticlust <a href='https://m-py.github.io/anticlust/'><img src='man/figures/anticlustStickerV1-0.svg' style="float:right; height:160px" /></a>
==============================================================================================================================================

Anticlustering partitions a pool of elements into clusters (or
*anticlusters*) with the goal of achieving high between-cluster
similarity and high within-cluster heterogeneity. This is accomplished
by maximizing instead of minimizing a clustering objective function,
such as the intra-cluster variance (used in k-means clustering) or the
sum of pairwise distances within clusters. The package `anticlust`
implements anticlustering methods as described in Papenberg and Klau
(2021; https://doi.org/10.1037/met0000301), Brusco et al.
(2020; https://doi.org/10.1111/bmsp.12186), Papenberg (2024;
https://doi.org/10.1111/bmsp.12315), and Papenberg et al. (2025;
https://doi.org/10.1101/2025.03.03.641320).

Installation
------------

The stable release of `anticlust` is available from
[CRAN](https://CRAN.R-project.org/package=anticlust) and can be
installed via:

    install.packages("anticlust")

A (potentially more recent) version of `anticlust` can also be installed
via [R Universe](https://m-py.r-universe.dev/anticlust):

    install.packages('anticlust', repos = c('https://m-py.r-universe.dev', 'https://cloud.r-project.org'))

or directly via Github:

    library("remotes") # if not available: install.packages("remotes")
    install_github("m-Py/anticlust")

Citation
--------

If you use `anticlust` in your research, we would appreciate it if you
cited the following reference:

-   Papenberg, M., & Klau, G. W. (2021). Using anticlustering to
    partition data sets into equivalent parts. *Psychological Methods,
    26*(2), 161–174. https://doi.org/10.1037/met0000301

Depending on which `anticlust` functions you are using, citing other
references may also be appropriate. [Here you can find out in detail
how to cite
`anticlust`](https://github.com/m-Py/anticlust/blob/main/inst/HOW_TO_CITE_ANTICLUST.md).

Another great way of showing your appreciation of `anticlust` is to
leave a star on this Github repository.

How do I learn about `anticlust`?
---------------------------------

This README contains some basic information on the `R` package
`anticlust`.
More information is available via the following sources:

-   So far, we have published three papers describing the theoretical
    background of `anticlust`:
    -   The initial presentation of the `anticlust` package is given in
        Papenberg and Klau (2021)
        (https://doi.org/10.1037/met0000301;
        [Preprint](https://doi.org/10.31234/osf.io/7jw6v)).
    -   The k-plus anticlustering method is described in
        Papenberg (2024)
        (https://doi.org/10.1111/bmsp.12315;
        [Preprint](https://doi.org/10.31234/osf.io/dhzrc)).
    -   A new paper describes the must-link feature and provides
        additional comparisons to alternative methods, focusing on
        categorical variables (Papenberg et al., 2025;
        https://doi.org/10.1101/2025.03.03.641320).
-   The R documentation of the main functions is quite rich and up to
    date, so you should definitely check it out when using the
    `anticlust` package. The most important background is provided in
    `?anticlustering`.
-   A [video](https://youtu.be/YGrhSmi1oA8) is available in German in
    which I illustrate the main functionalities of the
    `anticlustering()` function. My plan is to make a similar video in
    English in the future.
-   The [package website](https://m-py.github.io/anticlust/) contains
    all documentation as a convenient website.
    At the current time, the website also has four package vignettes;
    additional vignettes are planned.

A quick start
-------------

In this initial example, I use the main function `anticlustering()` to
create five similar sets of plants using the classical iris data set.

First, load the package via

    library("anticlust")

Then, call the `anticlustering()` function:

    anticlusters <- anticlustering(
      iris[, -5],
      K = 5,
      objective = "kplus",
      method = "local-maximum",
      repetitions = 10
    )

The output is a vector that assigns a group (i.e., a number between 1
and `K`) to each input element:

    anticlusters
    #>   [1] 1 2 4 5 3 4 2 3 2 2 1 5 1 2 4 1 2 3 2 5 1 5 4 5 1 1 3 4 5 5 5 4 5 2 1 1 3
    #>  [38] 4 3 3 4 2 3 5 2 5 3 4 3 1 2 2 5 1 2 3 3 4 4 1 5 1 2 3 3 1 2 4 4 4 4 1 3 4
    #>  [75] 2 4 5 2 5 2 3 3 1 5 4 1 5 3 2 1 2 5 3 4 1 4 1 2 4 5 2 2 3 1 4 1 3 4 4 5 3
    #> [112] 2 3 1 5 2 5 3 1 5 4 1 2 5 1 2 3 1 3 3 5 1 2 5 5 4 3 5 4 3 5 5 1 4 4 1 3 4
    #> [149] 2 2

By default, each group has the same number of elements (but the argument
`K` can be adjusted to request different group sizes):

    table(anticlusters)
    #> anticlusters
    #>  1  2  3  4  5 
    #> 30 30 30 30 30

Last, let’s compare the features’ means and standard deviations across
groups to find out if the five groups are similar to each other:

    knitr::kable(mean_sd_tab(iris[, -5], anticlusters), row.names = TRUE)

|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|:----|:-------------|:------------|:-------------|:------------|
| 1   | 5.84 (0.84)  | 3.06 (0.44) | 3.76 (1.79)  | 1.20 (0.77) |
| 2   | 5.84 (0.84)  | 3.06 (0.45) | 3.76 (1.79)  | 1.20 (0.77) |
| 3   | 5.84 (0.84)  | 3.06 (0.44) | 3.75 (1.79)  | 1.20 (0.77) |
| 4   | 5.85 (0.84)  | 3.05 (0.45) | 3.76 (1.79)  | 1.21 (0.77) |
| 5   | 5.84 (0.84)  | 3.06 (0.44) | 3.76 (1.79)  | 1.19 (0.78) |

As illustrated in the example, we can use the function
`anticlustering()` to create similar groups of plants. In this case,
“similar” primarily means that the means and standard deviations (in
parentheses) of the variables are pretty much the same across the five
groups. The function `anticlustering()` takes as input a data table
describing the elements that should be assigned to sets. In the data
table, each row represents an element (here a plant, but it could be
anything: a person, a word, or a photo). Each column is a numeric
variable describing one of the elements’ features. The number of groups
is specified through the argument `K`. The argument `objective`
specifies how between-group similarity is quantified; the argument
`method` specifies the algorithm by which this measure is optimized. See
the documentation `?anticlustering` for more details.

Five anticlustering objectives are natively supported in
`anticlustering()`:

-   the “diversity” objective, setting `objective = "diversity"`
    (default)
-   the “average diversity”, setting `objective = "average-diversity"`,
    which normalizes the diversity by cluster size
-   the k-means objective (i.e., the “variance”), setting
    `objective = "variance"`
-   the “k-plus” objective, an extension of the k-means variance
    criterion, setting `objective = "kplus"`
-   the “dispersion” objective (the minimum distance between any two
    elements within the same cluster), setting
    `objective = "dispersion"`

The anticlustering objectives are described in detail in the
documentation (`?anticlustering`, `?diversity_objective`,
`?variance_objective`, `?kplus_anticlustering`, `?dispersion_objective`)
and the references therein.
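To make the objectives concrete, here is a small base-R sketch that
computes the diversity (the sum of within-cluster pairwise distances)
and the dispersion (the minimum within-cluster distance) for a given
assignment; these are the quantities that `anticlustering()` maximizes.
The helper names `diversity_by_hand` and `dispersion_by_hand` are
invented for this illustration; in practice, use the package functions
`diversity_objective()` and `dispersion_objective()`.

```r
# Hand-rolled versions of two anticlustering objectives, for
# illustration only (function names invented for this sketch).
diversity_by_hand <- function(data, clusters) {
  d <- as.matrix(dist(data))  # Euclidean distance matrix
  sum(sapply(unique(clusters), function(k) {
    idx <- which(clusters == k)
    sum(d[idx, idx]) / 2      # each within-cluster pair counted once
  }))
}

dispersion_by_hand <- function(data, clusters) {
  d <- as.matrix(dist(data))
  diag(d) <- Inf              # ignore zero self-distances
  min(sapply(unique(clusters), function(k) {
    idx <- which(clusters == k)
    min(d[idx, idx])
  }))
}

# Four points on a line: a clustering groups neighbors, while an
# anticlustering mixes distant points and attains a higher diversity.
x <- data.frame(var = c(0, 1, 10, 11))
similar_sets    <- c(1, 1, 2, 2)  # a "clustering"
dissimilar_sets <- c(1, 2, 1, 2)  # an "anticlustering"
diversity_by_hand(x, similar_sets)     # 2
diversity_by_hand(x, dissimilar_sets)  # 20
dispersion_by_hand(x, dissimilar_sets) # 10
```

The `method` argument of `anticlustering()` controls how this
maximization is carried out.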
It is also possible to optimize user-defined objectives, which is
likewise described in the documentation (`?anticlustering`).

Categorical variables
---------------------

Sometimes, sets should not only be similar with regard to some numeric
variables; we may also want each set to contain an equal number of
elements of a certain category. Coming back to the initial iris data
set, we may want each set to include a balanced number of plants of the
three iris species. To this end, we can use the argument `categories`
as follows:

    anticlusters <- anticlustering(
      iris[, -5],
      K = 3,
      categories = iris$Species
    )

    ## The species are as balanced as possible across anticlusters:
    table(anticlusters, iris$Species)
    #>             
    #> anticlusters setosa versicolor virginica
    #>            1     17         17        16
    #>            2     17         16        17
    #>            3     16         17        17

Matching and clustering
-----------------------

Anticlustering creates sets of dissimilar elements; the heterogeneity
within anticlusters is maximized. This is the opposite of clustering
problems, which strive for high within-cluster similarity and good
separation between clusters. The `anticlust` package also provides
functions for “classical” clustering applications:
`balanced_clustering()` creates sets of elements that are similar while
ensuring that clusters are of equal size.
This is an example:

    # Generate random data, cluster the data set and visualize results
    N <- 1400
    lds <- data.frame(var1 = rnorm(N), var2 = rnorm(N))
    cl <- balanced_clustering(lds, K = 7)
    plot_clusters(lds, clusters = cl, show_axes = TRUE)

<img src="man/figures/clustering-1.png" style="display: block; margin: auto;" />

The function `matching()` is very similar, but it is usually used to
find small groups of similar elements, e.g., triplets as in this
example:

    # Generate random data and find triplets of similar elements:
    N <- 120
    lds <- data.frame(var1 = rnorm(N), var2 = rnorm(N))
    triplets <- matching(lds, p = 3)
    plot_clusters(
      lds,
      clusters = triplets,
      within_connection = TRUE,
      show_axes = TRUE
    )

<img src="man/figures/matching-1.png" style="display: block; margin: auto;" />

Questions and suggestions
-------------------------

If you have any questions about the `anticlust` package or find any
bugs, I encourage you to open an [issue on the Github
repository](https://github.com/m-Py/anticlust/issues).