An open API service indexing awesome lists of open source software.

https://github.com/mpadge/texttimetravel

Tools for analysing temporally structured text collections
https://github.com/mpadge/texttimetravel

r text-mining time-series topic-modeling

Last synced: about 2 months ago
JSON representation

Tools for analysing temporally structured text collections

Awesome Lists containing this project

README

        

[![Build Status](https://travis-ci.org/mpadge/texttimetravel.svg?branch=master)](https://travis-ci.org/mpadge/texttimetravel)
[![Project Status: Active](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)
[![codecov](https://codecov.io/gh/mpadge/texttimetravel/branch/master/graph/badge.svg)](https://codecov.io/gh/mpadge/texttimetravel)

# texttimetravel

Tools for analysing temporally structured text collections, including tools for
reading large sets of texts in (via
[`pdftools`](https://github.com/ropensci/pdftools)), and for time series
analysis of qualitative statistics such as word associations and topic models
(primarily via [`quanteda`](https://quanteda.io) and
[`topicmodels`](https://cran.r-project.org/package=topicmodels).

## Installation

System requirements (for linux):

- libpoppler-cpp-dev
- libgsl-dev

For other systems, see respective documentation for
[`pdftools`](https://github.com/ropensci/pdftools)) and
[`topicmodels`](https://cran.r-project.org/package=topicmodels).

```{r, eval=FALSE}
devtools::install_github ('mpadge/texttimetravel')
```

---------------

## Usage

Load packages and a temporally-structured corpus to work with:

```{r load-real, echo = FALSE, message = FALSE}
devtools::load_all (".", export_all = FALSE)
library (quanteda)
dat <- data_corpus_inaugural
```
```{r load-print, eval = FALSE}
library (texttimetravel)
library (quanteda)
dat <- data_corpus_inaugural
#dat <- corpus_reshape (dat, to = "sentences") # if desired
```
([`data_corpus_inaugural`](https://quanteda.io/reference/data_corpus_inaugural.html)
is a sample corpus from [`quanteda`](https://quanteda.io) of inaugural speeches
of US presidents.) Then use [`quanteda`](https://quanteda.io) functions to
convert to desired tokenized form:
```{r tokenize}
tok <- tokens (dat,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_separators = TRUE)
tok <- tokens_remove (tok, stopwords("english"))
```

## keywords

Keyword associations can be extracted with the `ttt_keyness` function, which
relies on the `quanteda::keyness` function, yet simplifies the interface by
allowing keyness statistics to be extracted with a single function call.

```{r keywords}
x <- ttt_keyness (tok, "politic*")
head (x, n = 10) %>% knitr::kable()
x <- ttt_keyness (tok, "school*")
head (x, n = 10) %>% knitr::kable()
```

## topics

The function `ttt_fit_topics` provides a convenient wrapper around the functions
provided by the
[`topicmodels`](https://cran.r-project.org/package=topicmodels) package, and
extends functionality via two additional parameters:

1. `years`, allowing topic models to be fitted only to those portions of a
corpus corresponding to the specified years;
2. `topic`, allowing models to be fitted around a specified topic phrase.

```{r topics1}
x <- ttt_fit_topics (tok, ntopics = 5)
topicmodels::get_terms(x, 10) %>% knitr::kable()
```
```{r topics2}
x <- ttt_fit_topics (tok, years = 1789:1900, ntopics = 5)
topicmodels::get_terms(x, 10) %>% knitr::kable()
```
```{r topics3}
x <- ttt_fit_topics (tok, topic = "nation", ntopics = 5)
topicmodels::get_terms(x, 10) %>% knitr::kable()
```