---
title: "textreadr"
output:
  md_document:
    toc: true
---

![](tools/textreadr_logo/r_textreadr.png)

[![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Build Status](https://travis-ci.org/trinker/textreadr.svg?branch=master)](https://travis-ci.org/trinker/textreadr)
[![Coverage Status](https://coveralls.io/repos/trinker/textreadr/badge.svg?branch=master)](https://coveralls.io/github/trinker/textreadr)
[![](http://cranlogs.r-pkg.org/badges/textreadr)](https://cran.r-project.org/package=textreadr)

**textreadr** is a small collection of convenience tools for reading text documents into R. This is not meant to be an exhaustive collection; for more see the [**tm**](https://CRAN.R-project.org/package=tm) package.

```{r, eval=FALSE, echo=FALSE}
#render("README.Rmd", md_document(variant="markdown", toc=TRUE, preserve_yaml = TRUE))
```

# Functions

Most jobs in my workflow can be completed with `read_document` and `read_dir`. The former generically reads in a .docx, .doc, .pdf, .html, .pptx, or .txt file without specifying the extension. The latter reads in multiple .docx, .doc, .html, .odt, .pdf, .pptx, .rtf, or .txt files from a directory as a `data.frame` with a file and text column. This workflow is effective because most text documents I encounter are stored as a .docx, .doc, .html, .odt, .pdf, .pptx, .rtf, or .txt file. The remaining common storage formats I encounter include .csv, .xlsx, XML, structured .html, and SQL. For the first four of these I use the [**readr**](https://CRAN.R-project.org/package=readr), [**readxl**](https://CRAN.R-project.org/package=readxl), [**xml2**](https://CRAN.R-project.org/package=xml2), and [**rvest**](https://CRAN.R-project.org/package=rvest) packages, respectively. For SQL:

| R Package | SQL |
|-------------|------------------------|
| RODBC | Microsoft SQL Server |
| RMySQL | MySQL |
| ROracle | Oracle |
| RJDBC | JDBC |

These packages are already specialized to handle these very specific data formats. **textreadr** provides basic reading tools for the common file formats in which text data is stored.

The main functions, task category, & descriptions are summarized in the table below:

| Function | Task | Description |
|---------------------------|-------------|---------------------------------------|
| `read_transcript` | reading | Read 2 column transcripts |
| `read_docx` | reading | Read .docx |
| `read_doc` | reading | Read .doc |
| `read_rtf` | reading | Read .rtf |
| `read_document` | reading | Generic text reader for .doc, .docx, .rtf, .txt, .pdf |
| `read_html` | reading | Read .html |
| `read_pdf` | reading | Read .pdf |
| `read_odt` | reading | Read .odt |
| `read_dir` | reading | Read and format multiple .doc, .docx, .rtf, .txt, .pdf, .pptx, .odt files |
| `read_dir_transcript` | reading | Read and format multiple transcript files |
| `download` | downloading | Download documents |
| `peek` | viewing | Truncated viewing of `data.frame`s |

# Installation

To download the development version of **textreadr**:

Download the [zip ball](https://github.com/trinker/textreadr/zipball/master) or [tar ball](https://github.com/trinker/textreadr/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textreadr")
```

# Contact

You are welcome to:

* submit suggestions and bug-reports at:
* send a pull request on:
* compose a friendly e-mail to:

# Demonstration

## Load the Packages/Data

```{r, message=FALSE, warning=FALSE}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(textreadr, magrittr)
pacman::p_load_gh("trinker/pathr")

trans_docs <- dir(
    system.file("docs", package = "textreadr"),
    pattern = "^trans",
    full.names = TRUE
)

docx_doc <- system.file("docs/Yasmine_Interview_Transcript.docx", package = "textreadr")
doc_doc <- system.file("docs/Yasmine_Interview_Transcript.doc", package = "textreadr")
pdf_doc <- system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr")
html_doc <- system.file('docs/textreadr_creed.html', package = "textreadr")
txt_doc <- system.file('docs/textreadr_creed.txt', package = "textreadr")
pptx_doc <- system.file('docs/Hello_World.pptx', package = "textreadr")
odt_doc <- system.file('docs/Hello_World.odt', package = "textreadr")

rtf_doc <- download(
    'https://raw.githubusercontent.com/trinker/textreadr/master/inst/docs/trans7.rtf'
)

pdf_doc_img <- system.file("docs/McCune2002Choi2010.pdf", package = "textreadr")
```

## Download & Browse

The `download` and `browse` functions are utilities for downloading and opening files and directories.

### Download

`download` is simply a wrapper for `curl::curl_download` that allows multiple documents to be downloaded, has the `tempdir` pre-set as the `destfile` (named `loc` in **textreadr**), and also returns the path to the downloaded file for easy use in a **magrittr** chain.

Here I download a .docx file of a presidential debate from 2012.

```{r}
'https://github.com/trinker/textreadr/raw/master/inst/docs/pres.deb1.docx' %>%
download() %>%
read_docx() %>%
head(3)
```

### Browse

`browse` is a system-dependent tool for opening files and directories. In the example below we open the directory that contains the example documents used in this README.

```{r, eval=FALSE}
system.file("docs", package = "textreadr") %>%
browse()
```

We can open files as well:

```{r, eval = FALSE}
html_doc %>%
browse()
```

## Generic Document Reading

`read_document` is a generic wrapper for `read_docx`, `read_doc`, `read_html`, `read_odt`, `read_pdf`, `read_rtf`, and `read_pptx` that detects the file extension and chooses the correct reader. For most tasks that require reading a .docx, .doc, .html, .odt, .pdf, .pptx, .rtf, or .txt file this is the go-to function to get the job done. Below I demonstrate reading each of these eight file formats with `read_document`.

```{r}

doc_doc %>%
read_document() %>%
head(3)

docx_doc %>%
read_document() %>%
head(3)

html_doc %>%
read_document() %>%
head(3)

odt_doc %>%
read_document() %>%
head(3)

pdf_doc %>%
read_document() %>%
head(3)

pptx_doc %>%
read_document() %>%
head(3)

rtf_doc %>%
read_document() %>%
head(3)

txt_doc %>%
read_document() %>%
paste(collapse = "\n") %>%
cat()
```
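
The idea of detecting the file extension and choosing a reader can be sketched in base R. This is only an illustration under stated assumptions, not **textreadr**'s actual source: `read_by_extension` and the `readers` registry are hypothetical names, and only a .txt reader (base `readLines`) is registered so the example runs without extra packages.

```r
# Hypothetical sketch of extension-based dispatch (not textreadr's source).
# `readers` maps a lower-case file extension to a function that reads a path.
read_by_extension <- function(path, readers) {
    ext <- tolower(tools::file_ext(path))
    reader <- readers[[ext]]
    if (is.null(reader)) {
        stop("No reader registered for extension: '", ext, "'")
    }
    reader(path)
}

# Register a single reader; a fuller registry would add one entry per format.
readers <- list(txt = readLines)

# Write a small .txt file to a temporary location and read it back.
tmp <- tempfile(fileext = ".txt")
writeLines(c("line one", "line two"), tmp)
txt_lines <- read_by_extension(tmp, readers)
txt_lines
```

Keeping the readers in a named list means supporting another format is a matter of adding one entry rather than editing the dispatch function.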

## Read Directory Contents

Often there is a need to read multiple files in from a single directory. The `read_dir` function wraps other **textreadr** functions and `lapply` to create a data frame with a document and text column (one row per document). We will read the following documents from the 'pos' directory in **textreadr**'s system file:

```
levelName
pos
|--0_9.txt
|--1_7.txt
|--10_9.txt
|--11_9.txt
|--12_9.txt
|--13_7.txt
|--14_10.txt
|--15_7.txt
|--16_7.txt
|--17_9.txt
|--18_7.txt
|--19_10.txt
|--2_9.txt
|--3_10.txt
|--4_8.txt
|--5_10.txt
|--6_10.txt
|--7_7.txt
|--8_7.txt
\--9_7.txt
```

Here we have read the files in, one row per file.

```{r}
system.file("docs/Maas2011/pos", package = "textreadr") %>%
read_dir() %>%
peek(Inf, 40)
```
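
The pattern described in this section, a reader applied over every file in a directory and collected into one row per document, can be approximated in base R. The sketch below is an assumption-laden stand-in rather than `read_dir` itself: `read_dir_sketch` is a hypothetical name, it handles only .txt files, and the column names `document` and `text` are chosen to match the description above.

```r
# Hypothetical base R sketch of a directory reader (not textreadr's source).
# Reads every file matching `pattern` and returns one row per document.
read_dir_sketch <- function(path, pattern = "\\.txt$") {
    files <- list.files(path, pattern = pattern, full.names = TRUE)
    texts <- vapply(
        files,
        function(f) paste(readLines(f, warn = FALSE), collapse = " "),
        character(1)
    )
    data.frame(
        document = tools::file_path_sans_ext(basename(files)),
        text = unname(texts),
        stringsAsFactors = FALSE
    )
}

# Build a small demo directory with two made-up documents.
demo_dir <- file.path(tempdir(), "read_dir_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("A great film.", file.path(demo_dir, "0_9.txt"))
writeLines(c("Loved it.", "Would watch again."), file.path(demo_dir, "1_7.txt"))

out <- read_dir_sketch(demo_dir)
out
```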

## Basic Readers

### Read .doc

A .doc file is a bit trickier to read in than .docx but is made easy by the **antiword** package which wraps the [Antiword](http://www.winfield.demon.nl) program in an OS independent way.

```{r}
doc_doc %>%
read_doc() %>%
head()
```

```{r}
doc_doc %>%
read_doc(15) %>%
head(7)
```

### Read .docx

A .docx file is nothing but a fancy container. It can be parsed via XML. The `read_docx` function allows the user to read in a .docx file as plain text. Elements are essentially the p tags (explicitly `//w:t` tags collapsed with `//w:p` tags) in the markup.

```{r}
docx_doc %>%
read_docx() %>%
head(3)
```

```{r}
docx_doc %>%
read_docx(15) %>%
head(3)
```
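
To make the "text runs collapsed within paragraphs" idea concrete, here is a toy base R sketch that works on a hand-written fragment of WordprocessingML. It is not how `read_docx` is implemented: the `xml` string is made up, `docx_paragraphs_sketch` is a hypothetical name, and regular expressions stand in for a real XML parser (such as **xml2**) purely to keep the example dependency free.

```r
# A made-up fragment resembling the body of a .docx's word/document.xml.
xml <- paste0(
    "<w:document><w:body>",
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>",
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>",
    "</w:body></w:document>"
)

# Hypothetical sketch: collapse the text runs (w:t) inside each paragraph (w:p).
docx_paragraphs_sketch <- function(xml) {
    # Each <w:p>...</w:p> span becomes one output element.
    paragraphs <- regmatches(
        xml, gregexpr("<w:p( [^>]*)?>.*?</w:p>", xml, perl = TRUE)
    )[[1]]
    vapply(paragraphs, function(p) {
        # Pull out the <w:t> runs, strip the tags, and join the pieces.
        runs <- regmatches(
            p, gregexpr("<w:t( [^>]*)?>.*?</w:t>", p, perl = TRUE)
        )[[1]]
        paste(gsub("<[^>]+>", "", runs), collapse = "")
    }, character(1), USE.NAMES = FALSE)
}

paras <- docx_paragraphs_sketch(xml)
paras
```

For real documents a proper XML parser is the safer choice, since regular expressions will not cope with nesting, namespaces, or escaped entities.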

### Read .html

Often a researcher only wishes to grab the text from the body of .html files. The `read_html` function does exactly this task. For finer control over .html scraping the user may investigate the **xml2** & **rvest** packages for parsing .html and .xml files. Here I read in HTML with `read_html`.

```{r}
html_doc %>%
read_html()
```

### Read .odt

Open Document Texts (.odt) are rather similar to .docx files in how they behave. The `read_odt` function reads them in a similar way.

```{r}
odt_doc %>%
read_odt()
```

### Read .pdf

Like .docx, a .pdf file is simply a container. Reading PDFs is made easier with a number of command line tools. A few methods of PDF reading have been incorporated into R. Here I wrap **pdftools** `pdf_text` to produce `read_pdf`, a function with sensible defaults that is designed to read PDFs into R for as many folks as possible right out of the box.

Here I read in a PDF with `read_pdf`. Notice the result is a data frame with metadata, including page numbers and element (row) ids.

```{r}
pdf_doc %>%
read_pdf()
```

#### Image Based .pdf: OCR

Image-based .pdfs require optical character recognition (OCR) in order for the images to be converted to text. The `ocr` argument of `read_pdf` allows the user to read in image-based .pdf files and lets the [**tesseract**](https://CRAN.R-project.org/package=tesseract) package do the heavy lifting in the backend. You can look at the .pdf we'll be using by running:

```{r,eval=FALSE}
browse(pdf_doc_img)
```

First let's try the task without using OCR.

```{r, eval = FALSE}
pdf_doc_img %>%
read_pdf(ocr = FALSE)
```

```r
## Table: [0 x 3]
##
## [1] page_id element_id text
## <0 rows> (or 0-length row.names)
## ... ... ... ...
```

And now using OCR via **tesseract**. Note that `ocr = TRUE` is the default behavior of `read_pdf`.

```{r, eval = FALSE}
pdf_doc_img %>%
read_pdf(ocr = TRUE)
```

```r
## Converting page 1 to C:\Users\AppData\Local\Temp\RtmpKeJAnL/McCune2002Choi2010_01.png... done!
## Converting page 2 to C:\Users\AppData\Local\Temp\RtmpKeJAnL/McCune2002Choi2010_02.png... done!

## Table: [104 x 3]
##
## page_id element_id text
## 1 1 1 A Survey of Binary Similarity and Distan
## 2 1 2 Seung-Seok Choi, Sung-Hyuk Cha, Charles
## 3 1 3 Department of Computer Science, Pace Uni
## 4 1 4 New York, US
## 5 1 5 ABSTRACT ecological 25 sh species [2|].
## 6 1 6 conventional similarity measures to solv
## 7 1 7 The binary feature vector is one of the
## 8 1 8 representations of patterns and measurin
## 9 1 9 distance measures play a critical role i
## 10 1 10 such as clustering, classication, etc.
## .. ... ... ...
```
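
The two runs above suggest a simple control flow: try ordinary text extraction and turn to OCR only when nothing comes back. The base R sketch below illustrates that flow under loud assumptions. `read_pdf_sketch`, `no_text_layer`, and `fake_ocr` are hypothetical stand-ins so the example runs without PDF or OCR libraries, and the logic is a guess at the general idea, not a description of how `read_pdf` is written.

```r
# Hypothetical sketch of "extract text, fall back to OCR if empty".
# `extract_text` and `extract_ocr` are functions taking a path and
# returning a character vector; both are supplied by the caller.
read_pdf_sketch <- function(path, extract_text, extract_ocr, ocr = TRUE) {
    text <- extract_text(path)
    is_empty <- length(text) == 0 || !any(nzchar(trimws(text)))
    if (is_empty && ocr) {
        text <- extract_ocr(path)
    }
    text
}

# Stand-in extractors: an image-only PDF has no text layer to extract.
no_text_layer <- function(path) character(0)
fake_ocr <- function(path) "recognized text from page 1"

without_ocr <- read_pdf_sketch("scan.pdf", no_text_layer, fake_ocr, ocr = FALSE)
with_ocr <- read_pdf_sketch("scan.pdf", no_text_layer, fake_ocr, ocr = TRUE)
```

Here `without_ocr` stays empty, mirroring the zero-row table above, while `with_ocr` holds whatever the OCR stand-in returns.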

### Read .pptx

Like the .docx, a .pptx file is also nothing but a fancy container. Likewise, it can be parsed via XML. The `read_pptx` function allows the user to read in a .pptx file as a data.frame with plain text that tracks slide id numbers.

```{r}
pptx_doc %>%
read_pptx()
```

### Read .rtf

Rich text format (.rtf) is a plain text document with markup similar to LaTeX. The **striprtf** package provides the backend for `read_rtf`.

```{r}
rtf_doc %>%
read_rtf()
```

## Read Transcripts

Many researchers store their dialogue data (including interviews and observations) as a .docx or .xlsx file. Typically the data is in a two-column format with the person in the first column and the text in the second, separated by some sort of separator (often a colon). The `read_transcript` function wraps up many of these assumptions into a reader that will extract the data as a data frame with a person and text column. The `skip` argument is very important for correct parsing.
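
The assumptions above (skip the front matter, then split each line at the first separator into person and text) can be sketched in a few lines of base R. This is a simplified illustration, not `read_transcript`'s implementation: `parse_transcript_sketch` is a hypothetical name, the transcript lines are made up, and the column names `Person` and `Dialogue` are borrowed from the interview example later in this README.

```r
# Hypothetical sketch of two-column transcript parsing (not textreadr's source).
parse_transcript_sketch <- function(lines, sep = ":", skip = 0) {
    # Drop leading lines that are not part of the dialogue.
    if (skip > 0) lines <- lines[-seq_len(skip)]
    lines <- lines[nzchar(trimws(lines))]
    # Position of the first separator on each line (-1 when absent).
    pos <- regexpr(sep, lines, fixed = TRUE)
    if (any(pos == -1)) {
        stop("Separator not found on every line; check `sep` and `skip`.")
    }
    data.frame(
        Person = trimws(substr(lines, 1, pos - 1)),
        Dialogue = trimws(substr(lines, pos + nchar(sep), nchar(lines))),
        stringsAsFactors = FALSE
    )
}

# A made-up transcript with one line of front matter.
raw <- c(
    "Interview transcript, session 1",
    "Researcher: How did the class go?",
    "Teacher: It went well: better than last week."
)

parsed <- parse_transcript_sketch(raw, skip = 1)
parsed
```

Splitting only at the first separator keeps later colons inside the dialogue text. Calling the sketch without `skip = 1` fails on the title line, which is the same kind of problem `skip` addresses in the examples that follow.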

Here I read in and parse the different formats `read_transcript` handles. These are the files that will be read in:

```{r}
base_name(trans_docs)
```

### doc

```{r}
read_transcript(trans_docs[6], skip = 1)
```

### docx Simple

```{r}
read_transcript(trans_docs[1])
```

### docx With Skip

`skip` is important to capture the document structure. Here not skipping front end document matter throws an error, while `skip = 1` correctly parses the file.

```{r,error=TRUE}
read_transcript(trans_docs[2])
read_transcript(trans_docs[2], skip = 1)
```

### docx With Dash Separator

The colon is the default separator. At times other separators may be used to separate speaker and text. Here is an example where hyphens are used as a separator. Notice the poor parse with colon set as the default separator the first go round.

```{r}
read_transcript(trans_docs[3], skip = 1)
read_transcript(trans_docs[3], sep = "-", skip = 1)
```

### odt

```{r}
read_transcript(trans_docs[8])
```

### rtf

```{r}
read_transcript(rtf_doc, skip = 1)
```

### xls and xlsx

```{r}
read_transcript(trans_docs[4])
read_transcript(trans_docs[5])
```

### Reading Text

Like `read.table`, `read_transcript` also has a `text` argument which is useful for demoing code.

```{r}
read_transcript(
text=

"34 The New York Times reports a lot of words here.
12 Greenwire reports a lot of words.
31 Only three words.
2 The Financial Times reports a lot of words.
9 Greenwire short.
13 The New York Times reports a lot of words again.",

col.names = c("NO", "ARTICLE"), sep = " "
)

```

### Authentic Interview

Here I read in an authentic interview transcript:

```{r}
docx_doc %>%
read_transcript(c("Person", "Dialogue"), skip = 19)
```

## Pairing textreadr

**textreadr** is but one package used in text analysis (often the first package used). It pairs nicely with a variety of other text munging and analysis packages. In the example below I show just a few other package pairings that are used to extract case names (e.g., "Jones v. State of New York") from a [Supreme Court Database Code Book](http://scdb.wustl.edu/_brickFiles/2012_01/SCDB_2012_01_codebook.pdf). I demonstrate pairings with [**textshape**](https://github.com/trinker/textshape), [**textclean**](https://github.com/trinker/textclean), [**qdapRegex**](https://github.com/trinker/qdapRegex), and [**dplyr**](https://github.com/tidyverse/dplyr).

```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(qdapRegex)
library(textreadr)
library(textshape)
library(textclean)

## Read in pdf, split on variables
dat <- 'http://scdb.wustl.edu/_brickFiles/2012_01/SCDB_2012_01_codebook.pdf' %>%
    textreadr::download() %>%
    textreadr::read_pdf() %>%
    filter(page_id > 5 & page_id < 79) %>%
    mutate(
        loc = grepl('Variable Name', text, ignore.case=TRUE),
        text = textclean::replace_non_ascii(text)
    ) %>%
    textshape::split_index(which(.$loc) -1) %>%
    lapply(select, -loc)

## Function to extract cases
ex_vs <- qdapRegex::ex_(pattern = "((of|[A-Z][A-Za-z'.,-]+)\\s+)+([Vv]s?\\.\\s+)(([A-Z][A-Za-z'.,-]+\\s+)*((of|[A-Z][A-Za-z',.-]+),?($|\\s+|\\d))+)")

## Extract and filter cases
dat %>%
    lapply(function(x) {
        x$text %>%
            textshape::combine() %>%
            ex_vs() %>%
            c() %>%
            textclean::mgsub(c("^[ ,]+", "[ ,0-9]+$", "^(See\\s+|E\\.g\\.?,)"), "", fixed=FALSE)
    }) %>%
    setNames(seq_along(.)) %>%
    {.[sapply(., function(x) all(length(x) > 1 | !is.na(x)))]}
```

# Other Implementations

Some other implementations of text readers in R:

1. [tm](https://CRAN.R-project.org/package=tm)
1. [readtext](https://CRAN.R-project.org/package=readtext)