https://github.com/scollinselliott/lakhesis

Consensus Seriation for Binary Data

---
bibliography: "inst/REFERENCES.bib"
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```

# lakhesis: Consensus Seriation for Binary Data

[![CRAN status](https://www.r-pkg.org/badges/version/lakhesis)](https://CRAN.R-project.org/package=lakhesis)

The `R` package `lakhesis` provides an interactive platform and critical measures for seriating binary data matrices through the exploration, selection, and consensus of partially seriated sequences.

Seriation (sequencing, ordination) involves putting a set of things in an optimal order. In archaeology, seriation can be used to establish a chronological order of contexts and find-types on the basis of their similarity, i.e., that things come into and go out of fashion with a peak moment of popularity [@ihm_contribution_2005]. In ecology, the distribution of a species may occur according to a preferred environmental condition that diminishes as that environment changes [@ter_braak_weighted_1986]. There are a number of R functions and packages, especially [`seriation`](https://github.com/mhahsler/seriation) [@hahsler_getting_2008] and [`vegan`](https://CRAN.R-project.org/package=vegan) [@oksanen_vegan_2024], that provide the means to seriate or ordinate matrices, especially for frequency or count data. While binary (presence/absence) data are often viewed as a reductive case of frequency data, they can also present their own challenges. Moreover, not all incidence matrices (the matrix of 0/1s that record the joint incidence or occurrence for a row-column pairing) will be well seriated. The selection of which row and column elements to include in the input is accordingly an intrinsic part of the task of seriation. In this respect, `lakhesis` seeks to complement existing methods in `R`, focusing on binary data, by providing an interactive, graphical means of selecting seriated sequences. It relies on correspondence analysis (CA), a mainstay technique for seriation, and offers a method of Procrustes-fit CA to align scores with an ideal reference curve. Multiple seriations can be rerun on partial subsets, called "strands," of the initial incidence matrix, which are then recompiled into a single consensus seriation using an optimality criterion. The process of harmonizing different strands of sequential elements via iterative linear regression is called a lakhesis technique, after the fate from ancient Greek mythology who measured the strand of one's life.
The package relies on `Rcpp` and `RcppArmadillo` [@eddelbuettel_rcpparmadillo_2014;@eddelbuettel_extending_2018].
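
As a rough illustration of why CA is useful for seriation, the following base `R` sketch orders a small scrambled incidence matrix by its first CA axis. It uses plain CA computed with `svd()`, not the Procrustes-fit CA of `lakhesis`, and the matrix and its `ctx_` / `type_` names are made up for the example.

``` r
# Hypothetical incidence matrix with rows and columns in scrambled order.
x <- matrix(
  c(1, 0, 0, 1,
    1, 0, 1, 0,
    0, 1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ctx_B", "ctx_C", "ctx_A"),
                  c("type_3", "type_1", "type_4", "type_2"))
)

# Plain correspondence analysis: SVD of the standardized residuals.
P  <- x / sum(x)        # correspondence matrix
r  <- rowSums(P)        # row masses
cm <- colSums(P)        # column masses
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
dec <- svd(S)

# Standard coordinates on the first CA axis.
row_scores <- dec$u[, 1] / sqrt(r)
col_scores <- dec$v[, 1] / sqrt(cm)

# Reorder rows and columns by their scores; the 1s gather along the diagonal.
x_ser <- x[order(row_scores), order(col_scores)]
print(x_ser)
```

The sign of a CA axis is arbitrary, so the recovered order may run in either direction.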

While command line functions can be run in `R`, the functionality of `lakhesis` is primarily achieved via the Lakhesis Calculator, a graphical interface in `shiny` [@chang_shiny_2024] that enables investigators to explore datasets, select strands, and harmonize them into a single consensus seriation. Panels in the calculator include:

* **Seriation Explorer** displays the correspondence analysis plot of a dataset, including Procrustes-fit CA (both as they have been fit to the curve and their orthogonal projection along the curve). Selections can be made on any plot.
    * Map options. Users can choose either a **symmetric** or an **asymmetric** CA plot.
    * **Save Strand** records the displayed plot as a partial seriation, or "strand" (i.e., partial with respect to the initial data matrix).
    * Strands can be sequenced according to different projections:
        * **CA1** / **CA2** Projection of scores along the first or second principal CA axis.
        * **Procrustes1** / **Procrustes2** Projection of scores along the first or second axis after Procrustes fitting.
        * **Curve** Projection along the reference curve of an ideal seriation.
    * **Lakhesize** produces a consensus seriation of the selected strands (at least two strands must have been saved), using an iterative process of linear regression of partial rankings in an agglomerative fashion. The matrix plot displays the incidence matrix of the resulting consensus seriation, with optimality criteria. The agreement of the seriation in each strand with that of the consensus seriation, as well as its criterion, is displayed in the Diagnostics panel. The function `lakhesize()` performs this task.
    * **Run Deviance Test** performs a goodness-of-fit test using deviance, treating the distribution of the row and column incidences with a quadratic-logistic model. The largest $p$ values of the row and column elements are shown in the Diagnostics panel. The function `element_eval()` performs this task.
* **Consensus Seriation** displays the results of harmonizing selected partial seriations, which have been identified as "strands." Deriving a consensus seriation entails iterative regressions on partially seriated sequences, optimized using the concentration measure. The seriated incidence matrix is also displayed in this panel.
* **Diagnostics** shows critical coefficients to determine whether discordant strands should be removed and/or row or column elements should be suppressed from consideration.
    * **Agreement** expresses whether a strand agrees with the consensus seriation.
    * **Criteria** expresses how well seriated the strand is. Options for optimality criteria are those which are used in Lakhesis, comprising:
        * **Squared correlation coefficient** (`cor_sq`).
        * **Weighted row-column concentration** (`conc_wrc`).
    * Tabs marked **Deviance** report on the goodness-of-fit of row and column elements in the consensus seriation using deviance with a quadratic-logistic model. Higher $p$ values will indicate poorer fit for a particular row or column element.
* **Modify** temporarily suppresses row or column elements from correspondence analysis. Strands which have low agreement or high concentration may also be deleted in this panel.
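
The idea behind a quadratic-logistic model can be sketched with base `R`: the presence of an element is modeled as a single-peaked function of its position in the sequence, and the deviance summarizes how far the data depart from that shape. This is only an illustration with `glm()` on made-up data; it is not the implementation of `element_eval()`, and it does not reproduce how the package derives its $p$ values.

``` r
# Hypothetical presence/absence of one find-type across ten contexts,
# listed in an assumed seriated order.
incidence <- c(0, 1, 0, 1, 1, 1, 0, 1, 0, 0)
position  <- seq_along(incidence)

# Quadratic-logistic model: the log-odds of presence are a quadratic
# function of position (a negative quadratic term gives a single peak).
fit <- glm(incidence ~ position + I(position^2), family = binomial)

# Intercept-only model for comparison.
fit_null <- glm(incidence ~ 1, family = binomial)

# Residual deviance of each model; lower deviance means closer fit.
deviance(fit)
deviance(fit_null)

# Estimated coefficient of the quadratic term.
coef(fit)[["I(position^2)"]]
```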

The sidebar contains the following commands:

* **Choose CSV** -- data must be without a header in a two-column "long" format of occurring pairs of row and column elements, where the first column contains a row element and the second column contains a column element of the incidence matrix.
* **Reinitialize** resets the plots to their original, starting condition.
* **Replot with Selection** -- upon the selection of row and column points from the Seriation Explorer panel, this command will perform and fit CA only on the selection. To return to the initial dataset, press the Reinitialize button. The function `ca_procrustes_ser()` performs this task.
* **Export Data** will download results in a single `.rds` file, which is a `list` class object containing the following:
    * `consensus` The results of `lakhesize()`, a `lakhesis` class object containing row and column consensus seriations, coefficients of agreement and concentration, and the seriated incidence matrix.
    * `strands` The strands selected to produce `consensus`.
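
An exported file can be read back into `R` with `readRDS()`. The sketch below first writes a stand-in list with the two documented element names to a temporary file so that it runs on its own; the placeholder contents are assumptions, and a real export would hold the `lakhesis` class object and the saved strands instead.

``` r
# Stand-in for a file produced by Export Data (placeholder contents only).
export_path <- tempfile(fileext = ".rds")
saveRDS(list(consensus = list(), strands = list()), export_path)

# Read the export back in and pull out its two elements.
res <- readRDS(export_path)
names(res)

consensus <- res$consensus  # result of lakhesize() in a real export
strands   <- res$strands    # strands used to produce the consensus
```

With a real export, replace `export_path` with the path of the downloaded `.rds` file.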

## Installation

To install the current development version of `lakhesis` from GitHub, run the following in the `R` command line:

``` r
library(devtools)
install_github("scollinselliott/lakhesis", dependencies = TRUE, build_vignettes = TRUE)
```

## Usage

To start the Lakhesis Calculator, execute the function `LC()`:

``` r
library(lakhesis)
LC()
```

In uploading a `csv` file for analysis inside the Lakhesis Calculator, the incidence matrix should be in "long" format. That is, the file should consist of just two columns without headers, in which each row represents the incidence of a row-column pair. For example, an incidence matrix of

$$\begin{array}{cccc} & C_1 & C_2 & C_3 \\\ R_1 & 1 & 0 & 0 \\\ R_2 & 0 & 1 & 1 \\\ R_3 & 0 & 0 & 1 \end{array}$$

will have a corresponding long format of

```
R1, C1
R2, C2
R2, C3
R3, C3
```
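
To check that a long-format file encodes the intended matrix, the pairs can be tabulated back into an incidence matrix with base `R`. This is a sketch independent of the package; `strip.white = TRUE` is used here to drop the space after each comma, and whether the calculator trims whitespace in the same way is not assumed.

``` r
# The long-format pairs from above, read without a header.
csv_text <- "R1, C1\nR2, C2\nR2, C3\nR3, C3"
long <- read.csv(text = csv_text, header = FALSE, strip.white = TRUE,
                 col.names = c("row", "col"))

# Cross-tabulate the pairs and reduce counts to presence/absence (0/1).
im <- (unclass(table(long$row, long$col)) > 0) * 1
print(im)
```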

If characters are not displaying properly in the plot, make sure to check font encoding (UTF-8 is recommended).

Row and column elements must be unique (a row element cannot have the same name as a column element).

The Lakhesis Calculator enables the temporary suppression of row or column elements from the plots, with zero rows/columns automatically removed. Suppressing key elements may therefore produce unexpected results. All elements can easily be re-added and the starting incidence matrix re-initialized.

### Incidence Matrices

If data are already in incidence matrix format, the `im_long()` function in `lakhesis` converts the matrix into the necessary long format, which can then be exported with `write.table()` (see documentation on `im_long()`):

``` r
# x is a matrix of 0/1 values with unique row/column names
y <- im_long(x)
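# omit row and column names so the file has no header, as the calculator expects
write.table(y, file = "im.csv", sep = ",", row.names = FALSE, col.names = FALSE)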
```

The file `im.csv` can then be loaded into the Lakhesis Calculator.
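
The shape of the exported file can be checked without the package. In the sketch below, `which(..., arr.ind = TRUE)` stands in for the conversion to long format (it is not the implementation of `im_long()`), and the file is written to a temporary path rather than `im.csv`.

``` r
# The example incidence matrix, with unique row and column names.
x <- matrix(
  c(1, 0, 0,
    0, 1, 1,
    0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("R1", "R2", "R3"), c("C1", "C2", "C3"))
)

# One line per occurring row-column pair (base R stand-in for im_long()).
idx  <- which(x == 1, arr.ind = TRUE)
long <- data.frame(row = rownames(x)[idx[, "row"]],
                   col = colnames(x)[idx[, "col"]])
long <- long[order(long$row, long$col), ]

# Write two columns with no header and no row names, then read them back.
out <- tempfile(fileext = ".csv")
write.table(long, file = out, sep = ",", row.names = FALSE, col.names = FALSE)
check <- read.csv(out, header = FALSE)
print(check)
```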

## Consensus Seriations

A consensus seriation via a lakhesis technique can be established in the calculator. If seriations are already available, whether derived by Procrustes-fit CA or by another method, a consensus seriation can also be performed in the console by creating a `strands` object and then executing the `lakhesize()` function.

For example, using the built-in selection of three strands in the data object `qf_strands`:

``` r
x <- lakhesize(qf_strands)
summary(x)
```

The vignette "A Guide to Lakhesis" contains more information on usage.

## References