Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/trinker/textshape

Tools for reshaping text data
https://github.com/trinker/textshape

data-reshaping manipulation r sentence-boundary-detection text-data text-formating tidy

Last synced: 3 months ago
JSON representation

Tools for reshaping text data

Awesome Lists containing this project

README

        

---
title: "textshape"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
md_document:
toc: true
---

```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(knitr)
knitr::opts_chunk$set(fig.path = "tools/figure/")
combine <- textshape::combine

library(tidyverse)
library(magrittr)
library(ggstance)
library(textshape)
library(gridExtra)
library(viridis)
library(quanteda)
library(gofastr)

## desc <- suppressWarnings(readLines("DESCRIPTION"))
## regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
## loc <- grep(regex, desc)
## ver <- gsub(regex, "\\2", desc[loc])
## verbadge <- sprintf('Version', ver, ver)
## verbadge <- ''
```

```{r, echo=FALSE}
knit_hooks$set(htmlcap = function(before, options, envir) {
if(!before) {
paste('

',options$htmlcap,"

",sep="")
}
})
knitr::opts_knit$set(self.contained = TRUE, cache = FALSE)
```

[![Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.](http://www.repostatus.org/badges/latest/inactive.svg)](https://www.repostatus.org/)
[![](http://cranlogs.r-pkg.org/badges/textshape)](https://cran.r-project.org/package=textshape)

![](tools/textshape_logo/r_textshape.png)

**textshape** is small suite of text reshaping and restructuring functions. Many of these functions are descended from tools in the [**qdapTools**](https://github.com/trinker/qdapTools) package. This brings reshaping tools under one roof with specific functionality of the package limited to text reshaping.

Other R packages provide some of the same functionality. **textshape** differs from these packages in that it is designed to help the user take unstructured data (or implicitly structured), extract it into a structured format, and then restructure into common text analysis formats for use in the next stage of the text analysis pipeline. The implicit structure of seemingly unstructured data is often detectable/expressible by the researcher. **textshape** provides tools (e.g., `split_match`) to enable the researcher to convert this tacit knowledge into a form that can be used to reformat data into more structured formats. This package is meant to be used jointly with the [**textclean**](https://github.com/trinker/textclean) package, which provides cleaning and text normalization functionality.

# Functions

Most of the functions split, expand, grab, or tidy a `vector`, `list`, `data.frame`, or `DocumentTermMatrix`. The `combine`, `duration`, `mtabulate`, & `flatten` functions are notable exceptions. The table below describes the functions and their use:

| Function | Used On | Description |
|------------------|--------------------------------|--------------------------------------------------------------|
| `combine` | `vector`, `list`, `data.frame` | Combine and collapse elements |
| `tidy_list` | `list` of `vector`s or `data.frame`s | Row bind a list and repeat list names as id column |
| `tidy_vector` | `vector` | Column bind a named atomic `vector`'s names and values |
| `tidy_table` | `table` | Column bind a `table`'s names and values |
| `tidy_matrix` | `matrix` | Stack values, repeat column row names accordingly |
| `tidy_dtm`/`tidy_tdm` | `DocumentTermMatrix` | Tidy format `DocumentTermMatrix`/`TermDocumentMatrix` |
| `tidy_colo_dtm`/`tidy_colo_tdm` | `DocumentTermMatrix` | Tidy format of collocating words from a `DocumentTermMatrix`/`TermDocumentMatrix` |
| `duration` | `vector`, `data.frame` | Get duration (start-end times) for turns of talk in n words |
| `from_to` | `vector`, `data.frame` | Prepare speaker data for a flow network |
| `mtabulate` | `vector`, `list`, `data.frame` | Dataframe/list version of `tabulate` to produce count matrix |
| `flatten` | `list` | Flatten nested, named list to single tier |
| `unnest_text` | `data.frame` | Unnest a nested text column |
| `split_index` | `vector`, `list`, `data.frame` | Split at specified indices |
| `split_match` | `vector` | Split vector at specified character/regex match |
| `split_portion` | `vector`\* | Split data into portioned chunks |
| `split_run` | `vector`, `data.frame` | Split runs (e.g., "aaabbbbcdddd") |
| `split_sentence` | `vector`, `data.frame` | Split sentences |
| `split_speaker` | `data.frame` | Split combined speakers (e.g., "Josh, Jake, Jim") |
| `split_token` | `vector`, `data.frame` | Split words and punctuation |
| `split_transcript` | `vector` | Split speaker and dialogue (e.g., "greg: Who me") |
| `split_word` | `vector`, `data.frame` | Split words |
| `grab_index` | `vector`, `data.frame`, `list` | Grab from an index up to a second index |
| `grab_match` | `vector`, `data.frame`, `list` | Grab from a regex match up to a second regex match |
| `column_to_rownames` | `data.frame` | Add a column as rownames |
| `cluster_matrix` | `matrix` | Reorder column/rows of a matrix via hierarchical clustering |

\****Note:*** *Text vector accompanied by aggregating `grouping.var` argument, which can be in the form of a `vector`, `list`, or `data.frame`*

# Installation

To download the development version of **textshape**:

Download the [zip ball](https://github.com/trinker/textshape/zipball/master) or [tar ball](https://github.com/trinker/textshape/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textshape")
```

# Contact

You are welcome to:

* submit suggestions and bug-reports at:

# Contributing

Contributions are welcome from anyone subject to the following rules:

- Abide by the [code of conduct](https://github.com/trinker/textshape/blob/master/CODE_OF_CONDUCT.md).
- Follow the style conventions of the package (indentation, function & argument naming, commenting, etc.)
- All contributions must be consistent with the package license (GPL-2)
- Submit contributions as a pull request. Clearly state what the changes are and try to keep the number of changes per pull request as low as possible.
- If you make big changes, add your name to the 'Author' field.

# Examples

The main shaping functions can be broken into the categories of (a) binding, (b) combining, (c) tabulating, (d) spanning, (e) splitting, (f) grabbing & (e) tidying. The majority of functions in **textshape** fall into the category of splitting and expanding (the semantic opposite of combining). These sections will provide example uses of the functions from **textshape** within the three categories.

# Loading Dependencies

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, magrittr, ggstance, viridis, gridExtra, quanteda)
pacman::p_load_current_gh('trinker/gofastr', 'trinker/textshape')
```

## Tidying

The `tidy_xxx` functions convert untidy structures into [tidy format](https://www.jstatsoft.org/article/view/v059i10). Tidy formatted text data structures are particularly useful for interfacing with **ggplot2**, which expects this form.

The `tidy_list` function is used in the style of `do.call(rbind, list(x1, x2))` as a convenient way to bind together multiple named `data.frame`s or `vectors`s into a single `data.frame` with the `list` `names` acting as an id column. The `data.frame` bind is particularly useful for binding transcripts from different observations. Additionally, `tidy_vector` and `tidy_table` are provided for `cbinding` a `table`'s or named atomic `vector`'s values and names as separate columns in a `data.frame`. Lastly, `tidy_dtm`/`tidy_tdm` provide convenient ways to tidy a `DocumentTermMatrix` or `TermDocumentMatrix`.

#### A Vector

```{r}
x <- list(p=1:500, r=letters)
tidy_list(x)
```

#### A Dataframe

```{r}
x <- list(p=mtcars, r=mtcars, z=mtcars, d=mtcars)
tidy_list(x)
```

#### A Named Vector

```{r}
x <- setNames(
sample(LETTERS[1:6], 1000, TRUE),
sample(state.name[1:5], 1000, TRUE)
)
tidy_vector(x)
```

#### A Table

```{r}
x <- table(sample(LETTERS[1:6], 1000, TRUE))
tidy_table(x)
```

#### A Matrix

```{r}
mat <- matrix(1:16, nrow = 4,
dimnames = list(LETTERS[1:4], LETTERS[23:26])
)

mat

tidy_matrix(mat)
```

With clustering (column and row reordering) via the `cluster_matrix` function.

```{r}
## plot heatmap w/o clustering
wo <- mtcars %>%
cor() %>%
tidy_matrix('car', 'var') %>%
ggplot(aes(var, car, fill = value)) +
geom_tile() +
scale_fill_viridis(name = expression(r[xy])) +
theme(
axis.text.y = element_text(size = 8) ,
axis.text.x = element_text(size = 8, hjust = 1, vjust = 1, angle = 45),
legend.position = 'bottom',
legend.key.height = grid::unit(.1, 'cm'),
legend.key.width = grid::unit(.5, 'cm')
) +
labs(subtitle = "With Out Clustering")

## plot heatmap w clustering
w <- mtcars %>%
cor() %>%
cluster_matrix() %>%
tidy_matrix('car', 'var') %>%
mutate(
var = factor(var, levels = unique(var)),
car = factor(car, levels = unique(car))
) %>%
group_by(var) %>%
ggplot(aes(var, car, fill = value)) +
geom_tile() +
scale_fill_viridis(name = expression(r[xy])) +
theme(
axis.text.y = element_text(size = 8) ,
axis.text.x = element_text(size = 8, hjust = 1, vjust = 1, angle = 45),
legend.position = 'bottom',
legend.key.height = grid::unit(.1, 'cm'),
legend.key.width = grid::unit(.5, 'cm')
) +
labs(subtitle = "With Clustering")

grid.arrange(wo, w, ncol = 2)

```

#### A DocumentTermMatrix

The `tidy_dtm` and `tidy_tdm` functions convert a `DocumentTermMatrix` or `TermDocumentMatrix` into a tidied data set.

```{r, warning=FALSE}
my_dtm <- with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_")))

tidy_dtm(my_dtm) %>%
tidyr::extract(doc, c("time", "turn", "sentence"), "(\\d)_(\\d+)\\.(\\d+)") %>%
mutate(
time = as.numeric(time),
turn = as.numeric(turn),
sentence = as.numeric(sentence)
) %>%
tbl_df() %T>%
print() %>%
group_by(time, term) %>%
summarize(n = sum(n)) %>%
group_by(time) %>%
arrange(desc(n)) %>%
slice(1:10) %>%
mutate(term = factor(paste(term, time, sep = "__"), levels = rev(paste(term, time, sep = "__")))) %>%
ggplot(aes(x = n, y = term)) +
geom_barh(stat='identity') +
facet_wrap(~time, ncol=2, scales = 'free_y') +
scale_y_discrete(labels = function(x) gsub("__.+$", "", x))
```

#### A DocumentTermMatrix of Collocations

The `tidy_colo_dtm` and `tidy_colo_tdm` functions convert a `DocumentTermMatrix` or `TermDocumentMatrix` into a collocation matrix and then a tidied data set.

```{r}
my_dtm <- with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_")))
sw <- unique(c(
lexicon::sw_jockers,
lexicon::sw_loughran_mcdonald_long,
lexicon::sw_fry_1000
))

tidy_colo_dtm(my_dtm) %>%
tbl_df() %>%
filter(!term_1 %in% c('i', sw) & !term_2 %in% sw) %>%
filter(term_1 != term_2) %>%
unique_pairs() %>%
filter(n > 15) %>%
complete(term_1, term_2, fill = list(n = 0)) %>%
ggplot(aes(x = term_1, y = term_2, fill = n)) +
geom_tile() +
scale_fill_gradient(low= 'white', high = 'red') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

## Combining

The `combine` function acts like `paste(x, collapse=" ")` on vectors and lists of vectors. On dataframes multiple text cells are pasted together within grouping variables.

#### A Vector

```{r}
x <- c("Computer", "is", "fun", ".", "Not", "too", "fun", ".")
combine(x)
```

#### A Dataframe

```{r}
(dat <- split_sentence(DATA))
combine(dat[, 1:5, with=FALSE])
```

## Tabulating

`mtabulate` allows the user to transform data types into a dataframe of counts.

#### A Vector

```{r}
(x <- list(w=letters[1:10], x=letters[1:5], z=letters))
mtabulate(x)

## Dummy coding
mtabulate(mtcars$cyl[1:10])
```

#### A Dataframe

```{r}
(dat <- data.frame(matrix(sample(c("A", "B"), 30, TRUE), ncol=3)))
mtabulate(dat)
t(mtabulate(dat))
```

## Spanning

Often it is useful to know the duration (start-end) of turns of talk. The `duration` function calculates start-end durations as n words.

#### A Vector

```{r}
(x <- c(
"Mr. Brown comes! He says hello. i give him coffee.",
"I'll go at 5 p. m. eastern time. Or somewhere in between!",
"go there"
))
duration(x)
# With grouping variables
groups <- list(group1 = c("A", "B", "A"), group2 = c("red", "red", "green"))
duration(x, groups)
```

#### A Dataframe

```{r}
duration(DATA)
```

#### Gantt Plot

```{r}
library(ggplot2)
ggplot(duration(DATA), aes(x = start, xend = end, y = person, yend = person, color = sex)) +
geom_segment(size=4) +
xlab("Duration (Words)") +
ylab("Person")
```

## Splitting

The following section provides examples of available splitting functions.

### Indices

`split_index` allows the user to supply the integer indices of where to split a data type.

#### A Vector

```{r}
split_index(
LETTERS,
indices = c(4, 10, 16),
names = c("dog", "cat", "chicken", "rabbit")
)
```

#### A Dataframe

Here I calculate the indices of every time the `vs` variable in the `mtcars` data set changes and then split the dataframe on those indices. The `change_index` function is handy for extracting the indices of changes in runs within an atomic vector.

```{r}
(vs_change <- change_index(mtcars[["vs"]]))
split_index(mtcars, indices = vs_change)
```

### Matches

`split_match` splits on elements that match exactly or via a regular expression match.

#### Exact Match

```{r}
set.seed(15)
(x <- sample(c("", LETTERS[1:10]), 25, TRUE, prob=c(.2, rep(.08, 10))))

split_match(x)
split_match(x, split = "C")
split_match(x, split = c("", "C"))
## Don't include
split_match(x, include = 0)
## Include at beginning
split_match(x, include = 1)
## Include at end
split_match(x, include = 2)
```

#### Regex Match

Here I use the regex `"^I"` to break on any vectors containing the capital letter I as the first character.

```{r}
split_match(DATA[["state"]], split = "^I", regex=TRUE, include = 1)
```

### Portions

At times it is useful to split texts into portioned chunks, operate on the chunks and aggregate the results. `split_portion` allows the user to do this sort of text shaping. We can split into n chunks per grouping variable (via `n.chunks`) or into chunks of n length (via `n.words`).

#### A Vector

```{r}
with(DATA, split_portion(state, n.chunks = 10))
with(DATA, split_portion(state, n.words = 10))
```

#### A Dataframe

```{r}
with(DATA, split_portion(state, list(sex, adult), n.words = 10))
```

### Runs

`split_run` allows the user to split up runs of identical characters.

```{r}
x1 <- c(
"122333444455555666666",
NA,
"abbcccddddeeeeeffffff",
"sddfg",
"11112222333"
)

x <- c(rep(x1, 2), ">>???,,,,....::::;[[")

split_run(x)
```

#### Dataframe

```{r}
DATA[["run.col"]] <- x
split_run(DATA)
## Reset the DATA dataset
DATA <- textshape::DATA
```

### Sentences

`split_sentece` provides a mapping + regex approach to splitting sentences. It is less accurate than the Stanford parser but more accurate than a simple regular expression approach alone.

#### A Vector

```{r}
(x <- paste0(
"Mr. Brown comes! He says hello. i give him coffee. i will ",
"go at 5 p. m. eastern time. Or somewhere in between!go there"
))
split_sentence(x)
```

#### A Dataframe

```{r}
split_sentence(DATA)
```

### Speakers

Often speakers may talk in unison. This is often displayed in a single cell as a comma separated string of speakers. Some analysis may require this information to be parsed out and replicated as one turn per speaker. The `split_speaker` function accomplishes this.

```{r}
DATA$person <- as.character(DATA$person)
DATA$person[c(1, 4, 6)] <- c("greg, sally, & sam",
"greg, sally", "sam and sally")
DATA

split_speaker(DATA)
## Reset the DATA dataset
DATA <- textshape::DATA
```

### Tokens

The `split_token` function split data into words and punctuation.

#### A Vector

```{r}
(x <- c(
"Mr. Brown comes! He says hello. i give him coffee.",
"I'll go at 5 p. m. eastern time. Or somewhere in between!",
"go there"
))
split_token(x)
```

#### A Dataframe

```{r}
split_token(DATA)
```

### Transcript

The `split_transcript` function splits `vector`s with speaker prefixes (e.g., `c("greg: Who me", "sarah: yes you!")`) into a two column `data.frame`.

#### A Vector

```{r}
(x <- c(
"greg: Who me",
"sarah: yes you!",
"greg: well why didn't you say so?",
"sarah: I did but you weren't listening.",
"greg: oh :-/ I see...",
"dan: Ok let's meet at 4:30 pm for drinks"
))

split_transcript(x)
```

### Words

The `split_word` function splits data into words.

#### A Vector

```{r}
(x <- c(
"Mr. Brown comes! He says hello. i give him coffee.",
"I'll go at 5 p. m. eastern time. Or somewhere in between!",
"go there"
))
split_word(x)
```

#### A Dataframe

```{r}
split_word(DATA)
```

## Grabbing

The following section provides examples of available grabbing (from a starting point up to an ending point) functions.

### Indices

`grab_index` allows the user to supply the integer indices of where to grab (from - up to) a data type.

#### A Vector

```{r}
grab_index(DATA$state, from = 2, to = 4)
grab_index(DATA$state, from = 9)
grab_index(DATA$state, to = 3)
```

#### A Dataframe

```{r}
grab_index(DATA, from = 2, to = 4)
```

#### A List

```{r}
grab_index(as.list(DATA$state), from = 2, to = 4)
```

### Matches

`grab_match` grabs (from - up to) elements that match a regular expression.

#### A Vector

```{r}
grab_match(DATA$state, from = 'dumb', to = 'liar')
grab_match(DATA$state, from = '^What are')
grab_match(DATA$state, to = 'we do[?]')
grab_match(DATA$state, from = 'no', to = 'the', ignore.case = TRUE,
from.n = 'last', to.n = 'first')
```

#### A Dataframe

```{r}
grab_match(DATA, from = 'dumb', to = 'liar')
```

#### A List

```{r}
grab_match(as.list(DATA$state), from = 'dumb', to = 'liar')
```

## Putting It Together

Eduardo Flores blogged about [What the candidates say, analyzing republican debates using R](https://www.r-bloggers.com/2015/11/what-the-candidates-say-analyzing-republican-debates-using-r/) where he demonstrated some scraping and analysis techniques. Here I highlight a combination usage of **textshape** tools to scrape and structure the text from 4 of the 2015 Republican debates within a [**magrittr**](https://github.com/tidyverse/magrittr) pipeline. The result is a single [**data.table**](https://github.com/Rdatatable/data.table) containing the dialogue from all 4 debates. The code highlights the conciseness and readability of **textshape** by restructuring Flores scraping with **textshape** replacements.

```{r}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(rvest, magrittr, xml2)

debates <- c(
wisconsin = "110908",
boulder = "110906",
california = "110756",
ohio = "110489"
)

lapply(debates, function(x){
xml2::read_html(paste0("http://www.presidency.ucsb.edu/ws/index.php?pid=", x)) %>%
rvest::html_nodes("p") %>%
rvest::html_text() %>%
textshape::split_index(., grep("^[A-Z]+:", .)) %>%
#textshape::split_match("^[A-Z]+:", TRUE, TRUE) %>% #equal to line above
textshape::combine() %>%
textshape::split_transcript() %>%
textshape::split_sentence()
}) %>%
textshape::tidy_list("location")
```