https://github.com/adrianrasoongit/langstats

langstats is an R package designed to help language researchers explore the statistical properties of natural language in an accessible and agile way.
data-science information-theory nlp powerlaw quantitative-linguistics r text-mining text-mining-analysis zipfs-law
Host: GitHub
URL: https://github.com/adrianrasoongit/langstats
Owner: AdrianRasoOnGit
License: other
Created: 2025-06-24T10:16:31.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-07-03T08:16:50.000Z (11 months ago)
Last Synced: 2025-07-03T09:25:39.483Z (11 months ago)
Topics: data-science, information-theory, nlp, powerlaw, quantitative-linguistics, r, text-mining, text-mining-analysis, zipfs-law
Language: R
Homepage:
Size: 291 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
README

---

title: "langstats: An out-of-the-box environment for statistical language analysis"

output: github_document

---

[![Pre-Release](https://img.shields.io/github/v/release/adrianrasoongit/langstats?include_prereleases)](https://github.com/AdrianRasoOnGit/langstats/releases)

[![R-CMD-check](https://github.com/AdrianRasoOnGit/langstats/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/AdrianRasoOnGit/langstats/actions)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

## Brief description

`langstats` is an R package designed to help language researchers
explore the statistical properties of natural language in an accessible
and agile way. It provides a set of functions tailored for extracting
relevant information in quantitative linguistic analysis, such as Zipf’s
law, Heaps’ law, the Zipf-Mandelbrot adjustment, and key
information-theoretic measures like Shannon entropy, cross entropy, and
information gain.

This package aims to bring these tools, which are often complex to
interpret, closer to language specialists. It also offers a toolkit that
lets researchers apply the foundational concepts of this tradition
directly, within a single environment, without implementing the required
models by hand. Its functions are designed to adapt, through their
parameters and general structure, to a wide range of projects.

In the current version *(remember to check the
[official repository](https://github.com/AdrianRasoOnGit/langstats) to
confirm whether you are using the latest version!)*, the package includes
the following functions:

------------------------------------------------------------------------

- **Zipf’s law** (`zipf(input, level)`):  
  Expresses a power-law relationship between the frequency of a word and
  its rank in the corpus. According to Zipf’s law, the most frequent
  word appears about twice as often as the second most frequent word,
  three times as often as the third, and so on.

- **Heaps’ law** (`heaps(input, level)`):  
  Describes how the vocabulary (the number of unique words) grows as
  more of the corpus is read. The relationship is typically expressed as
  $V(N) = K N^\beta$, where $V$ is the vocabulary size, $N$ is the
  number of tokens, and $K$ and $\beta$ are constants fitted to the
  corpus.

- **Zipf-Mandelbrot’s law** (`zipf_mandelbrot(input)`):  
  A refinement of Zipf’s law proposed by Mandelbrot (1955) that better
  models the frequency of low-rank (often grammatical) terms. It adds
  two parameters: a rank offset $\beta$, and a scaling constant $C$,
  resulting in a more flexible power-law model.

- **Shannon’s entropy** (`shannon_entropy(input, level)`):  
  Quantifies the average uncertainty or information contained in a
  linguistic unit (word or letter), based on its distribution. This
  metric is computed using:  
  $$
  H(X) = -\sum p(x) \log_2 p(x)
  $$  
  The function allows analysis at either the word or letter level via
  the `level` parameter.

- **Cross entropy** (`cross_entropy(text_p, text_q, level)`):  
  Measures the discrepancy between two probability distributions, one
  empirical and one modeled. It gives the average amount of information
  needed to encode data from one distribution using a code built for
  the other, and is commonly used to evaluate language models.

- **Information gain** (`gain(text_p, text_q, level)`):  
  Defined as the difference between the actual entropy and the cross
  entropy: $IG = H(P) - H(P, Q)$. It quantifies how much more
  informative a model is relative to a reference distribution.

- **Bag of Words generator** (`bow(text, level)`):  
  A bag of words is a standard NLP object that tallies the frequency of
  tokens in a text. This function returns a `data.frame` with each
  token, its absolute count, and its relative proportion.

- **N-grams** (`ngrams(input, level)`): Produces n-grams, that is, sequences of n adjacent elements of the text, so that different degrees of grouping can be examined. The parameter `n` is an integer giving the number of elements grouped in each n-gram.

- **$G^2$ lexical co-occurrence test** (`collocation_g2(input, word1(optional), word2(optional))`): Performs a log-likelihood ratio test (that is, a $G^2$ test) to uncover significant collocation patterns in lexical distributions. When `word1` and `word2` are given, the $G^2$ test is run on that pair of words through a contingency table (which can be accessed!). If they are not provided, `g2()` will show the most significant collocation structures in the input.

- **$G^2$ test and contingency tables** (`g2(input)`): Performs a log-likelihood ratio test (that is, a $G^2$ test) and is called from other functions. It is a fairly simple internal function; its real use is in `collocation_g2()`, which is built on it.

- **Term frequency and inverse document frequency** (`tf(input)`, `idf(input)`, `tf_idf(input)`):  
  With these three functions, users can gain insight into lexical distributions through a standard and informative measure: TF-IDF (not to be confused with other IDFs, whose systematic violence and repression bear no resemblance to Inverse Document Frequencies, luckily!). `tf()` computes the frequency of each token within individual documents. `idf()` captures how rare a token is across all documents in the corpus, using the formula  
  $$
  \text{idf}(t) = \log\left(\frac{N}{n_t}\right)
  $$  
  where $N$ is the total number of documents and $n_t$ is the number of documents that contain term $t$. Finally, `tf_idf()` combines the two measures to emphasize terms that are both frequent in a specific document and relatively rare in the corpus:  
  $$
  \text{tf\_idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)
  $$
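
The TF-IDF formulas above can be traced by hand on a tiny corpus. The following is a minimal base R sketch, not the package's implementation: the three documents are made up, the tokenisation is a plain whitespace split, `tf` is taken as a relative frequency, and the logarithm is natural. All of these are assumptions made for illustration.

```r
# Toy corpus of three short documents (made-up data, not shipped with langstats)
docs <- c(
  d1 = "the river flows to the sea",
  d2 = "the jungle hides the river",
  d3 = "the sea is calm"
)

# Naive whitespace tokenisation; the package's own tokeniser may behave differently
tokens <- strsplit(tolower(docs), "\\s+")
names(tokens) <- names(docs)
vocab <- sort(unique(unlist(tokens)))

# tf(t, d): relative frequency of term t in document d (terms in rows, documents in columns)
tf <- vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))) / length(tok),
  numeric(length(vocab))
)
rownames(tf) <- vocab

# idf(t) = log(N / n_t), assuming the natural logarithm
N <- ncol(tf)
n_t <- rowSums(tf > 0)
idf <- log(N / n_t)

# tf_idf(t, d) = tf(t, d) * idf(t); idf is recycled down each column, i.e. per term
tf_idf <- tf * idf

round(tf_idf, 3)
```

In this toy corpus, `the` occurs in every document, so its `idf` and therefore its TF-IDF weight is zero, while terms that occur in a single document, such as `jungle`, receive the highest weights there.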

Since `langstats v0.2.0`, the package comes with an internal function, `input_handling()`, that enforces a unified treatment of inputs. Users will rarely need to call it directly, but I mention it here for the sake of transparency.

- **Input handling** (`input_handling(input)`): Given a text or a data frame, it detects or computes token frequencies and probability values. The functions of the package rely on these data, which are obtained and stored in a uniform way so that the code is less error-prone and easier to extend. It returns a list, which is faster to manipulate than a data frame. The loss in readability is minor, since the list is only used internally and the user always gets a data frame output to assess the data and results. Since version `v0.4.0`, the function offers a BERT-powered tokeniser, selected through the argument `token = c("regex", "transformers")`. If `token` is not set, `input_handling()` uses the regex-based tokeniser of previous versions, which is usually faster and still satisfactory most of the time.

Also, since `langstats v0.5.0`, the package allows its users to automatically plot results with a smart plot function powered by `ggplot2`. We introduce it below:

- **Plotting with automated configuration** (`plot_ls(input)`): This function automatically plots results regardless of the origin of the data, as long as they have been returned by a native function of `langstats`.

Besides, the package comes with some example data for trying out its
features. Remember to run `data("name_of_the_data_here")` before using
a dataset! The datasets provided are:

- **Piantadosi Selected Papers** (named: `selected_piantadosi`): A brief collection of five
  papers by psychologist Steven T. Piantadosi, gathered from his personal
  website.

  

- **Heart of Darkness** (named: `heart_of_darkness`): The full text of Joseph Conrad’s novel. Ideal for trying out the functions of the package.

In the following sections of this README, we will show how to install
the package, and how to use it. For further details, you can always
run `?function_name` in your R console, or visit the repository on
[GitHub](https://github.com/AdrianRasoOnGit/langstats).
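
As a side note on the $G^2$ test listed among the functions above: the statistic compares observed counts $O_{ij}$ in a contingency table with the counts $E_{ij}$ expected under independence, $G^2 = 2 \sum_{ij} O_{ij} \ln(O_{ij} / E_{ij})$. The sketch below illustrates that formula in base R. The helper `g2_sketch()` is hypothetical and is not the package's `g2()`, and the counts in the table are invented for the example.

```r
# Hypothetical helper (not the package's g2()): G^2 for a table of observed counts
g2_sketch <- function(observed) {
  # Expected counts under independence: row total * column total / grand total
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # Cells with a zero observed count contribute nothing (0 * log 0 is taken as 0)
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  2 * sum(terms)
}

# Made-up 2x2 contingency table for a word pair (word1, word2)
observed <- matrix(
  c(30,   70,     # word1 followed by word2 | word1 followed by another word
    120, 9780),   # another word followed by word2 | neither word involved
  nrow = 2, byrow = TRUE,
  dimnames = list(first = c("word1", "other"), second = c("word2", "other"))
)

g2_value <- g2_sketch(observed)

# Approximate p-value from the chi-squared distribution with 1 degree of freedom
p_value <- pchisq(g2_value, df = 1, lower.tail = FALSE)

g2_value
p_value
```

A large $G^2$ value (and a small p-value) indicates that the two words co-occur more or less often than independence would predict; a table whose cells match the expected counts exactly gives $G^2 = 0$.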

------------------------------------------------------------------------

## Installation

You can install the current version of langstats like so:

``` r

# Install the development version from GitHub:

# install.packages("devtools")

devtools::install_github("AdrianRasoOnGit/langstats")

```

## Examples

Here we share a couple of examples of how a project can benefit from the functions `langstats` provides. Remember to check the file `A Brief Overview of langstats` if you'd like to dig a bit deeper into use cases of the package.

### Example 1. Report on power laws in a text

We can study the main statistical power laws covered by `langstats` using `zipf()` and `heaps()`, and produce reports that examine a particular text, as shown below:

```{r zipf_heaps_report_example, echo = TRUE, eval = TRUE, results = 'hide', message = FALSE}

# Load langstats package and load the dataset, here heart_of_darkness, included in langstats

library(langstats)

data("heart_of_darkness")

# Preparation of both Zipf and Heaps report

zipf_report <- zipf(heart_of_darkness)

heaps_report <- heaps(heart_of_darkness)

# Show some results of the function in data frame form. For Heaps, we look at a narrow window later in text, to showcase more clearly the effect of the law

head(zipf_report, 10)

subset(heaps_report, tokens >= 1000 & tokens <= 1010)

```

The output should look like this:

```{r zipf_heaps_report_example_actual, echo = FALSE, eval = TRUE, message = FALSE}

head(zipf_report, 10)

subset(heaps_report, tokens >= 1000 & tokens <= 1010)

```

We can (and should) summarise these results in a graph of the following form, through `ggplot2`:

```{r zipf_heaps_report_example_plot, echo = FALSE, eval = TRUE, message = FALSE}

library(ggplot2)

# Plot Zipf distribution (log-log)

ggplot(zipf_report[1:100,], aes(x = rank, y = frequency)) +

  geom_point(color = "skyblue2") +

  scale_x_log10() +

  scale_y_log10() +

  labs(

    title = "Zipf's law in Heart of Darkness",

    x = "Rank",

    y = "Frequency"

  ) +

  theme_minimal()

# Plot vocabulary growth curve

ggplot(heaps_report, aes(x = tokens, y = vocabulary)) +

  geom_line(color = "darkred", linewidth = 1) +

  labs(

    title = "Heaps' law in Heart of Darkness",

    x = "Total tokens",

    y = "Vocabulary size"

  ) +

  theme_minimal()

```

In this way, `zipf()` and `heaps()` can be combined into a short report of the kind described here.
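
Beyond plotting, the exponents of both laws can be estimated with a linear regression on log-transformed values, since a power law is a straight line in log-log space. The sketch below uses synthetic data frames as stand-ins for the reports, so that it runs without the package. It assumes, as the plotting code above suggests, that the Zipf report has columns `rank` and `frequency` and the Heaps report has columns `tokens` and `vocabulary`. The names `zipf_toy` and `heaps_toy` are illustrative only.

```r
# Stand-in for a Zipf report: an exact power law, frequency = 1000 * rank^(-1)
zipf_toy <- data.frame(rank = 1:50)
zipf_toy$frequency <- 1000 * zipf_toy$rank^(-1)

# log(frequency) = log(C) - alpha * log(rank), so the slope estimates -alpha
zipf_fit <- lm(log(frequency) ~ log(rank), data = zipf_toy)
zipf_exponent <- -unname(coef(zipf_fit)["log(rank)"])

# Stand-in for a Heaps report: V(N) = K * N^beta with K = 5 and beta = 0.6
heaps_toy <- data.frame(tokens = seq(100, 10000, by = 100))
heaps_toy$vocabulary <- 5 * heaps_toy$tokens^0.6

# log(V) = log(K) + beta * log(N), so the slope is beta and exp(intercept) is K
heaps_fit <- lm(log(vocabulary) ~ log(tokens), data = heaps_toy)
heaps_beta <- unname(coef(heaps_fit)["log(tokens)"])
heaps_K <- exp(unname(coef(heaps_fit)["(Intercept)"]))

c(zipf_exponent = zipf_exponent, heaps_beta = heaps_beta, heaps_K = heaps_K)
```

On real text the points will not fall exactly on a line, and least squares on log-log data is best read as a rough first check rather than a rigorous estimator.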

### Example 2. Information Theory exploratory study in a corpus

We will now examine the `selected_piantadosi` dataset, a corpus made up of five papers by Steven T. Piantadosi. We plan to obtain its main information-theoretic measures and compare them with those of `heart_of_darkness`:

```{r information_theory_piantadosi, echo = TRUE, eval = TRUE, message = FALSE, results = 'hide'}

library(langstats)

data("selected_piantadosi")

data("heart_of_darkness")

# We calculate Shannon's entropy at word level

shannon_entropy(selected_piantadosi, level = "word")

# We calculate cross entropy at word level between selected_piantadosi and heart_of_darkness

cross_entropy(selected_piantadosi, heart_of_darkness, level = "word")

# We finally study the information gain between these two sources, again, at word level

gain(selected_piantadosi, heart_of_darkness, level = "word")

```

Now, let's see the results:

```{r information_theory_piantadosi_actual, echo = FALSE, eval = TRUE, message = FALSE}

library(langstats)

data("selected_piantadosi")

data("heart_of_darkness")

# We calculate Shannon's entropy at word level

shannon_entropy_piantadosi_report <- shannon_entropy(selected_piantadosi, level = "word")

# We calculate cross entropy at word level between selected_piantadosi and heart_of_darkness

cross_entropy_piantadosi_report <- cross_entropy(selected_piantadosi, heart_of_darkness, level = "word")

# We finally study the information gain between these two sources, again, at word level

gain_piantadosi_report <- gain(selected_piantadosi, heart_of_darkness, level = "word")

cat("Shannon entropy is ", shannon_entropy_piantadosi_report, "\n")

cat("Cross entropy is ", cross_entropy_piantadosi_report, "\n")

cat("Information gain is ", gain_piantadosi_report, "\n")

```

That's how we can use these functions in an information-theoretic project aimed at linguistic data.
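
To make the three measures concrete, here is a small base R sketch that follows the formulas given earlier in this README, $H(X) = -\sum p(x) \log_2 p(x)$ and $IG = H(P) - H(P, Q)$. It is an illustration under stated assumptions, not the package's code: the word counts are made up, and both distributions share the same vocabulary, so no probability in `q` is zero.

```r
# Made-up word counts over a shared vocabulary (illustrative only)
counts_p <- c(the = 4, river = 2, sea = 1, jungle = 1)
counts_q <- c(the = 2, river = 2, sea = 2, jungle = 2)

# Relative frequencies as probability estimates
p <- counts_p / sum(counts_p)
q <- counts_q / sum(counts_q)

entropy_p <- -sum(p * log2(p))      # Shannon entropy H(P), in bits
cross_pq  <- -sum(p * log2(q))      # cross entropy H(P, Q), in bits
gain_pq   <- entropy_p - cross_pq   # information gain as defined in this README
kl_pq     <- sum(p * log2(p / q))   # Kullback-Leibler divergence D(P || Q)

c(entropy = entropy_p, cross_entropy = cross_pq, gain = gain_pq, kl = kl_pq)
```

With this definition the gain equals the negative of the Kullback-Leibler divergence, so it is never positive, and it is zero only when the two distributions coincide. If a word of `P` never appears in `Q`, the cross entropy becomes infinite unless some smoothing is applied; how `langstats` treats that case is not covered here.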

------------------------------------------------------------------------

Any advice or suggestions are more than welcome. This is an ongoing project, and I'd like to add many other functions and features to the package. To get in touch, please write to my institutional email address (adrraso@ucm.es).

Thanks for using `langstats`. Have a wonderful analysis!