https://github.com/juba/rainette
R implementation of the Reinert text clustering method
- Host: GitHub
- URL: https://github.com/juba/rainette
- Owner: juba
- Created: 2018-10-18T09:02:45.000Z (about 6 years ago)
- Default Branch: main
- Last Pushed: 2023-07-21T10:56:50.000Z (over 1 year ago)
- Last Synced: 2023-08-27T23:09:33.798Z (about 1 year ago)
- Topics: r, text-analysis, text-classification
- Language: R
- Homepage: https://juba.github.io/rainette/
- Size: 12.4 MB
- Stars: 48
- Watchers: 5
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Rainette
[![CRAN status](https://www.r-pkg.org/badges/version-ago/rainette)](https://cran.r-project.org/package=rainette)
[![rainette status badge](https://juba.r-universe.dev/badges/rainette)](https://juba.r-universe.dev)
[![DOI](https://zenodo.org/badge/153594739.svg)](https://zenodo.org/badge/latestdoi/153594739)
![CRAN Downloads](https://cranlogs.r-pkg.org/badges/last-month/rainette)
[![R build status](https://github.com/juba/rainette/workflows/R-CMD-check/badge.svg)](https://github.com/juba/rainette/actions?query=workflow%3AR-CMD-check)

Rainette is an R package which implements a variant of the Reinert textual clustering method. This method is also available in other software such as [Iramuteq](http://www.iramuteq.org/) (free software) or [Alceste](https://www.image-zafar.com/Logiciel.html) (commercial, closed source).
## Features
- Simple and double clustering algorithms
- Plot functions and Shiny interfaces to visualise and explore clustering results
- Utility functions to split a corpus into segments or import a corpus in Iramuteq format

## Installation
The package can be installed from CRAN.
```r
install.packages("rainette")
```

The development version is installable from [R-universe](https://r-universe.dev).
```r
install.packages("rainette", repos = "https://juba.r-universe.dev")
```

## Usage
Let's start with an example corpus provided by the excellent [quanteda](https://quanteda.io) package.
```r
library(quanteda)
data_corpus_inaugural
```

First, we'll use `split_segments()` to split each document into segments of about 40 words (punctuation is taken into account).
```r
corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
```

Next, we'll apply some preprocessing and compute a document-term matrix with `quanteda` functions.
```r
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 10)
```

We can then apply a simple clustering to this matrix with the `rainette()` function. We specify the number of clusters (`k`) and the minimum number of forms in each segment (`min_segment_size`). Segments that do not include enough forms are merged with the following or previous segment when possible.
```r
res <- rainette(dtm, k = 6, min_segment_size = 15)
```

We can use the `rainette_explor()` Shiny interface to visualise and explore the different clusterings at each `k`.
```r
rainette_explor(res, dtm, corpus)
```

![rainette_explor() interface](man/figures/rainette_explor_plot_en.png)
The *Cluster documents* tab allows browsing and filtering the documents in each cluster.
![rainette_explor() documents tab](man/figures/rainette_explor_docs_en.png)
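The segment-merging rule behind `min_segment_size` can be illustrated with a small sketch. This is a hypothetical simplification written for this README, not rainette's internal code: short segments are accumulated into the following one until the minimum size is reached, and a trailing remainder is merged backwards.

```r
# Illustrative sketch only (NOT rainette's implementation).
# `sizes` gives the number of forms per segment; segments below
# `min_segment_size` are merged forward, a short tail merged backward.
merge_short_segments <- function(sizes, min_segment_size) {
  groups <- integer(length(sizes))
  g <- 1L
  acc <- 0L
  for (i in seq_along(sizes)) {
    groups[i] <- g
    acc <- acc + sizes[i]
    if (acc >= min_segment_size) {
      g <- g + 1L
      acc <- 0L
    }
  }
  # a trailing segment that is still too short joins the previous group
  if (acc > 0L && g > 1L) groups[groups == g] <- g - 1L
  groups
}

merge_short_segments(c(20, 5, 4, 18, 3), min_segment_size = 15)
#> [1] 1 2 2 2 2
```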
We can also directly generate the clusters description plot for a given `k` with `rainette_plot()`.
```r
rainette_plot(res, dtm, k = 5)
```

Or cut the tree at a chosen `k` and add a group membership variable to our corpus metadata.
```r
docvars(corpus)$cluster <- cutree(res, k = 5)
```

In addition, we can also perform a double clustering, i.e. two simple clusterings computed with different `min_segment_size` values which are then "crossed" to generate more robust clusters. To do this, we use `rainette2()` on two `rainette()` results:
```r
res1 <- rainette(dtm, k = 5, min_segment_size = 10)
res2 <- rainette(dtm, k = 5, min_segment_size = 15)
res <- rainette2(res1, res2, max_k = 5)
```

We can then use `rainette2_explor()` to explore and visualise the results.
```r
rainette2_explor(res, dtm, corpus)
```

![rainette2_explor() interface](man/figures/rainette2_explor_en.png)
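To give an intuition of what "crossing" two clusterings means, the following toy sketch (a simplified illustration with made-up memberships, not `rainette2()`'s actual algorithm) crosses two partitions with `table()`: each cell of the cross-table is an intersection class, and the largest intersections are the most stable candidate clusters.

```r
# Toy cluster memberships of the same 8 segments from two hypothetical
# clusterings (NOT real rainette output)
part1 <- c(1, 1, 1, 2, 2, 3, 3, 3)
part2 <- c(1, 1, 2, 2, 2, 3, 3, 1)

# Cross-table: cell [i, j] counts segments assigned to cluster i in the
# first partition AND cluster j in the second
crossing <- table(part1, part2)

# Candidate crossed classes, largest (most stable) first
candidates <- as.data.frame(crossing, responseName = "n")
candidates[order(-candidates$n), ]
```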
## Tell me more
Two vignettes are available:
- Introduction and usage vignette: [English](https://juba.github.io/rainette/articles/introduction_en.html), [French](https://juba.github.io/rainette/articles/introduction_usage.html)
- Algorithms description vignette: [English](https://juba.github.io/rainette/articles/algorithms_en.html), [French](https://juba.github.io/rainette/articles/algorithmes.html)

## Credits
This clustering method was created by Max Reinert and is described in several articles, notably:
- Reinert M., "Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte", *Cahiers de l'analyse des données*, Volume 8, Numéro 2, 1983.
- Reinert M., "Alceste une méthodologie d'analyse des données textuelles et une application: Aurelia De Gerard De Nerval", *Bulletin de Méthodologie Sociologique*, Volume 26, Numéro 1, 1990.
- Reinert M., "Une méthode de classification des énoncés d'un corpus présentée à l'aide d'une application", *Les cahiers de l'analyse des données*, Tome 15, Numéro 1, 1990.

Thanks to Pierre Ratinaud, the author of [Iramuteq](http://www.iramuteq.org/), for providing it as free and open source software. Even though the R code here has been almost entirely rewritten, it was a precious resource for understanding the algorithms.
Many thanks to [Sébastien Rochette](https://github.com/statnmap) for the creation of the hex logo.
Many thanks to [Florian Privé](https://github.com/privefl/) for his work on rewriting and optimizing the Rcpp code.