{"id":16274150,"url":"https://github.com/trinker/kmeanstext","last_synced_at":"2025-06-23T03:35:39.537Z","repository":{"id":146663694,"uuid":"54314551","full_name":"trinker/kmeanstext","owner":"trinker","description":null,"archived":false,"fork":false,"pushed_at":"2016-03-27T02:58:18.000Z","size":277,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-08T16:33:58.647Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/trinker.png","metadata":{"files":{"readme":"README.Rmd","changelog":"NEWS","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-03-20T11:43:23.000Z","updated_at":"2016-03-21T00:37:18.000Z","dependencies_parsed_at":null,"dependency_job_id":"feea4e73-ff5c-44bd-b5f7-ad71fadb16eb","html_url":"https://github.com/trinker/kmeanstext","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/trinker/kmeanstext","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trinker%2Fkmeanstext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trinker%2Fkmeanstext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trinker%2Fkmeanstext/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trinker%2Fkmeanstext/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/trinker","download_url":"https://codeload.github.com/trinker/kmeanstext/tar.gz/refs/heads/master","sbom
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trinker%2Fkmeanstext/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261405520,"owners_count":23153562,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-10T18:27:21.760Z","updated_at":"2025-06-23T03:35:34.525Z","avatar_url":"https://github.com/trinker.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\ntitle: \"kmeanstext\"\ndate: \"`r format(Sys.time(), '%d %B, %Y')`\"\noutput:\n  md_document:\n    toc: true      \n---\n\n```{r, echo=FALSE}\nlibrary(knitr)\ndesc \u003c- suppressWarnings(readLines(\"DESCRIPTION\"))\nregex \u003c- \"(^Version:\\\\s+)(\\\\d+\\\\.\\\\d+\\\\.\\\\d+)\"\nloc \u003c- grep(regex, desc)\nver \u003c- gsub(regex, \"\\\\2\", desc[loc])\nverbadge \u003c- sprintf('\u003ca href=\"https://img.shields.io/badge/Version-%s-orange.svg\"\u003e\u003cimg src=\"https://img.shields.io/badge/Version-%s-orange.svg\" alt=\"Version\"/\u003e\u003c/a\u003e\u003c/p\u003e', ver, ver)\n````\n\n\n```{r, echo=FALSE}\nknit_hooks$set(htmlcap = function(before, options, envir) {\n  if(!before) {\n    paste('\u003cp class=\"caption\"\u003e\u003cb\u003e\u003cem\u003e',options$htmlcap,\"\u003c/em\u003e\u003c/b\u003e\u003c/p\u003e\",sep=\"\")\n    }\n    })\nknitr::opts_knit$set(self.contained = TRUE, cache = FALSE)\nknitr::opts_chunk$set(fig.path = \"inst/figure/\")\n```\n\n[![Project Status: Active - The project has reached a stable, usable state and is being actively 
developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)\n[![Build Status](https://travis-ci.org/trinker/kmeanstext.svg?branch=master)](https://travis-ci.org/trinker/kmeanstext)\n[![Coverage Status](https://coveralls.io/repos/trinker/kmeanstext/badge.svg?branch=master)](https://coveralls.io/r/trinker/kmeanstext?branch=master)\n`r verbadge`\n\n\u003cimg src=\"inst/kmeanstext_logo/r_kmeanstext.png\" width=\"150\" alt=\"kmeanstext Logo\"\u003e\n\n**kmeanstext** is a collection of optimized tools for clustering text data via kmeans clustering.  There are many great R [clustering tools](https://cran.r-project.org/web/views/Cluster.html) to locate topics within documents.  Kmeans clustering is a popular method for topic extraction.  This package builds upon my [hclustext](https://github.com/trinker/hclustext) package to extend the **hclustext** package framework to kmeans.  One major difference between the two techniques is that with hierarchical clustering the number of topics is specified after the model has been fitted, whereas kmeans requires the k topics to be specified before the model is fit.  Additionally, because kmeans uses a random start seed, the results may vary each time a model is fit.  Finally, Euclidean distance is typically used in a kmeans algorithm, whereas any distance metric may be passed to a hierarchical clustering fit.\n\nThe general idea is that we turn the documents into a matrix of words.  After this we weight the terms by importance using [tf-idf](http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html).  This helps the more salient words to rise to the top.  The user then selects k clusters (topics) and runs the model.  The model iteratively shuffles centers and assigns documents to the clusters based on the minimal distance of a document to a center.  Each run uses the recalculated mean centroid of the prior clusters as a starting point for the current iteration's centroids.  
Once the centroids have stabilized, the model has converged at k topics.  The user then may extract the clusters from the fit, providing a grouping for documents with similar important text features.  \n\n\n# Functions\n\nThe main functions, task categories, \u0026 descriptions are summarized in the table below:\n\n| Function               |  Category      | Description                                                              |\n|------------------------|----------------|--------------------------------------------------------------------------|\n| `data_store`           | data structure | **kmeanstext**'s data structure (list of dtm + text)                     |\n| `kmeans_cluster`       | cluster fit    | Fits a kmeans cluster model                                              |\n| `assign_cluster`       | assignment     | Extract clusters for document/text element                               |\n| `get_text`             | extraction     | Get text from various **kmeanstext** objects                             |\n| `get_dtm`              | extraction     | Get `tm::DocumentTermMatrix` from various **kmeanstext** objects         |\n| `get_removed`          | extraction     | Get removed text elements from various **kmeanstext** objects            |\n| `get_terms`            | extraction     | Get clustered weighted important terms from an **assign_cluster** object |\n| `get_documents`        | extraction     | Get clustered documents from an **assign_cluster** object                |\n\n\n# Installation\n\nTo download the development version of **kmeanstext**:\n\nDownload the [zip ball](https://github.com/trinker/kmeanstext/zipball/master) or [tar ball](https://github.com/trinker/kmeanstext/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:\n\n```r\nif (!require(\"pacman\")) install.packages(\"pacman\")\npacman::p_load_gh(\n    \"trinker/textshape\", \n    \"trinker/gofastr\", \n    
\"trinker/termco\",    \n    \"trinker/hclustext\",    \n    \"trinker/kmeanstext\"\n)\n```\n\n# Contact\n\nYou are welcome to:    \n* submit suggestions and bug-reports at: \u003chttps://github.com/trinker/kmeanstext/issues\u003e    \n* send a pull request on: \u003chttps://github.com/trinker/kmeanstext/\u003e      \n* compose a friendly e-mail to: \u003ctyler.rinker@gmail.com\u003e     \n\n# Demonstration\n\n## Load Packages and Data\n\n```{r}\nif (!require(\"pacman\")) install.packages(\"pacman\")\npacman::p_load(kmeanstext, dplyr, textshape, ggplot2, tidyr)\n\ndata(presidential_debates_2012)\n```\n\n\n## Data Structure\n\nThe data structure for **kmeanstext** is very specific.  The `data_store` function produces a `DocumentTermMatrix` which maps to the original text.  The empty/removed documents are tracked within this data structure, making subsequent calls to cluster the original documents and produce weighted important terms more robust.  Making the `data_store` object is the first step of the analysis.\n\nWe can give the `DocumentTermMatrix` rownames via the `doc.names` argument.  If these names are not unique they will be combined into a single document as seen below.  Also, if you want stemming, a minimum character length, stopword removal, or similar preprocessing, this is where it is done.\n\n\n```{r}\nds \u003c- with(\n    presidential_debates_2012,\n    data_store(dialogue, doc.names = paste(person, time, sep = \"_\"))\n)\n\nds\n```\n\n\n## Fit the Model: kmeans Cluster\n\nNext we can fit a kmeans cluster model to the `data_store` object via `kmeans_cluster`.  Note that, unlike **hclustext**'s `hierarchical_cluster`, we must provide the `k` (number of topics) to the model.  \n\nBy default `kmeans_cluster` uses an approximation of `k` based on Can \u0026 Ozkarahan's (1990) formula $(m * n)/t$ where $m$ and $n$ are the dimensions of the matrix $A$ and $t$ is the number of non-zero elements in $A$. \n\n- Can, F., Ozkarahan, E. A. (1990). 
Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. *ACM Transactions on Database Systems 15* (4): 483. doi:10.1145/99935.99938\n\nThere are other means of determining `k` as well.  See Ben Marwick's [StackOverflow post](http://stackoverflow.com/a/15376462/1000343) for a detailed exploration.\n\n```{r}\nset.seed(100)\nmyfit \u003c- kmeans_cluster(ds, k=6)\n\nstr(myfit)\n```\n\n\n## Assigning Clusters\n\nThe `assign_cluster` function allows the user to extract the clusters and the documents they are assigned to.  Unlike **hclustext**'s `assign_cluster`, the **kmeanstext** version has no `k` argument and merely extracts the cluster assignments from the model.  \n\n\n```{r}\nca \u003c- assign_cluster(myfit)\n\nca\n```\n\n\n### Cluster Loading\n\nTo check the number of documents loading on a cluster there is a `summary` method for `assign_cluster` which provides a descending data frame of clusters and counts.  Additionally, a horizontal bar plot shows the document loadings on each cluster.\n\n```{r}\nsummary(ca)\n```\n\n\n### Cluster Text \n\nThe user can grab the texts from the original documents grouped by cluster using the `get_text` function.  Here I demo a 40 character substring of the document texts.\n\n```{r}\nget_text(ca) %\u003e%\n    lapply(substring, 1, 40)\n```\n\n### Cluster Frequent Terms\n\nAs with many topic clustering techniques, it is useful to get the top salient terms from the model.  The `get_terms` function uses the `centers` from the `kmeans` output.   Notice the absence of clusters 1 \u0026 2.  This is a result of lower weights (more diverse term use) across these clusters.  \n\n```{r}\nget_terms(ca, .008)\n```\n\n\nThe `min.weight` hyperparameter sets the lower bound on the `centers` value to accept.  If you don't get any terms you may want to lower this.  
Likewise, this parameter can be raised (and `nrow` lowered) to eliminate noise.\n\n\n```{r}\nget_terms(ca, .002, nrow=10) \n```\n\n\n### Clusters, Terms, and Docs Plot\n\nHere I plot the clusters, terms, and documents (grouping variables) together as a combined heatmap.  This can be useful for viewing \u0026 comparing which documents are clustering together in the context of the cluster's salient terms. This example also shows how to use the cluster terms as a lookup key to extract probable salient terms for a given document.\n\n```{r, fig.width=11}\nkey \u003c- data_frame(\n    cluster = 1:6,\n    labs = get_terms(ca, .002) %\u003e%\n        bind_list(\"cluster\") %\u003e%\n        select(-weight) %\u003e%\n        group_by(cluster) %\u003e%\n        slice(1:10) %\u003e%\n        na.omit() %\u003e%\n        group_by(cluster) %\u003e%\n        summarize(term=paste(term, collapse=\", \")) %\u003e%\n        apply(., 1, paste, collapse=\": \") \n)\n\nca %\u003e%\n    bind_vector(\"id\", \"cluster\") %\u003e%\n    separate(id, c(\"person\", \"time\"), sep=\"_\") %\u003e%\n    tbl_df() %\u003e%\n    left_join(key) %\u003e%\n    mutate(n = 1) %\u003e%\n    mutate(labs = factor(labs, levels=rev(key[[\"labs\"]]))) %\u003e%\n    unite(\"time_person\", time, person, sep=\"\\n\") %\u003e%\n    select(-cluster) %\u003e%\n    complete(time_person, labs) %\u003e%  \n    mutate(n = factor(ifelse(is.na(n), FALSE, TRUE))) %\u003e%\n    ggplot(aes(time_person, labs, fill = n)) +\n        geom_tile() +\n        scale_fill_manual(values=c(\"grey90\", \"red\"), guide=FALSE) +\n        labs(x=NULL, y=NULL) \n```        \n\n\n### Cluster Documents\n\nThe `get_documents` function grabs the documents associated with a particular cluster.  This is most useful in cases where the number of documents is small and they have been given names.\n\n```{r}\nget_documents(ca)\n```\n\n\n## Putting it Together\n\nI like working in a chain.  
In the setup below we work within a **magrittr** pipeline to fit a model, select clusters, and examine the results.  In this example I do not condense the 2012 Presidential Debates data by speaker and time, rather leaving every sentence as a separate document.  On my machine the initial `data_store` and model fit take ~35 seconds to run.  Note that I do restrict the number of clusters (for texts and terms) to a random 5 clusters for the sake of space.\n\n\n```{r, fig.height = 10}\n.tic \u003c- Sys.time()\n\nmyfit2 \u003c- presidential_debates_2012 %\u003e%\n    with(data_store(dialogue)) %\u003e%\n    kmeans_cluster(k=100)\n\ndifftime(Sys.time(), .tic)\n\n## View Document Loadings\nca2 \u003c- assign_cluster(myfit2)\nsummary(ca2) %\u003e% \n    head(12)\n\n## Split Text into Clusters\nset.seed(3); inds \u003c- sort(sample.int(100, 5))\n\nget_text(ca2)[inds] %\u003e%\n    lapply(head, 10)\n\n## Get Associated Terms\nget_terms(ca2, term.cutoff = .07)[inds]\n```\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrinker%2Fkmeanstext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftrinker%2Fkmeanstext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrinker%2Fkmeanstext/lists"}