https://github.com/quanteda/readtext
an R package for reading text files
- Host: GitHub
- URL: https://github.com/quanteda/readtext
- Owner: quanteda
- Created: 2016-10-26T00:47:47.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2024-02-27T19:08:19.000Z (10 months ago)
- Last Synced: 2024-04-26T07:41:12.087Z (8 months ago)
- Topics: encoding, quanteda, r, text
- Language: R
- Homepage: https://readtext.quanteda.io
- Size: 15.5 MB
- Stars: 115
- Watchers: 13
- Forks: 28
- Open Issues: 31
Metadata Files:
- Readme: README.Rmd
Awesome Lists containing this project
- jimsghstars - quanteda/readtext - an R package for reading text files (R)
README
---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "##",
fig.path = "images/"
)
```
```{r echo=FALSE, results="hide", message=FALSE}
library("badger")
```

# readtext: Import and handling for plain and formatted text files
[![CRAN Version](https://www.r-pkg.org/badges/version/readtext)](https://CRAN.R-project.org/package=readtext)
`r badge_devel("quanteda/readtext", "royalblue")`
[![Downloads](https://cranlogs.r-pkg.org/badges/readtext)](https://CRAN.R-project.org/package=readtext)
[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/readtext?color=orange)](https://CRAN.R-project.org/package=readtext)
[![R-CMD-check](https://github.com/quanteda/readtext/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/quanteda/readtext/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/quanteda/readtext/branch/master/graph/badge.svg)](https://app.codecov.io/gh/quanteda/readtext?branch=master)

[1]: https://codecov.io/gh/quanteda/readtext/branch/master
An R package for reading text files in all their various formats, by Ken Benoit, Adam Obeng, Paul Nulty, Aki Matsuo, Kohei Watanabe, and Stefan Müller.
## Introduction
**readtext** is a one-function package that does exactly what it says on the tin: It reads files containing text, along with any associated document-level metadata, which we call "docvars", for document variables. Plain text files do not have docvars, but other forms such as .csv, .tab, .xml, and .json files usually do.
**readtext** accepts filemasks, so that you can specify a pattern to load multiple texts, and these texts can even be of multiple types. **readtext** is smart enough to process them correctly, returning a data.frame with a primary field "text" containing a character vector of the texts, and additional columns of the data.frame as found in the document variables from the source files.
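As a minimal sketch of the filemask behaviour described above, the following writes two small plain text files to a scratch directory and reads them with a single pattern. The directory name and file names are hypothetical and exist only for this illustration.

```r
library("readtext")

# Hypothetical scratch directory and file names, used only for illustration
tmp <- file.path(tempdir(), "readtext_glob_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines("First plain text document.", file.path(tmp, "doc1.txt"))
writeLines("Second plain text document.", file.path(tmp, "doc2.txt"))

# The "*" filemask matches both files in one call
rt <- readtext(file.path(tmp, "*.txt"))

is.data.frame(rt)  # the result is a data.frame
names(rt)          # column names, including the "text" field
nrow(rt)           # one row per matched file
```

Because these are plain text files, the result carries no docvars beyond the document identifier and the text itself.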
As encoding can also be a challenging issue for those reading in texts, we include functions for diagnosing encodings on a file-by-file basis, and allow you to specify vectorized input encodings to read in file types with individually set (and different) encodings. (All encoding functions are handled by the **stringi** package.)
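The sketch below illustrates reading files whose encodings differ, under stated assumptions: the two files and their contents are made up, and each file is paired explicitly with its encoding through `mapply()` rather than relying on a single vectorized call. It uses the `encoding` argument of `readtext()` one file at a time.

```r
library("readtext")

# Hypothetical scratch directory; file names and contents are illustrative only
tmp <- file.path(tempdir(), "readtext_encoding_demo")
dir.create(tmp, showWarnings = FALSE)
latin1_file <- file.path(tmp, "latin1.txt")
utf8_file <- file.path(tmp, "utf8.txt")

# Write the same kind of accented text in two different encodings
writeBin(iconv("caf\u00e9", from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]],
         latin1_file)
writeBin(charToRaw(enc2utf8("na\u00efve")), utf8_file)

# Pair each file with its own input encoding and read them one by one
files <- c(latin1_file, utf8_file)
encodings <- c("ISO-8859-1", "UTF-8")
texts <- mapply(function(f, enc) readtext(f, encoding = enc)$text,
                files, encodings)

# Both texts should now display correctly, whatever their source encoding
unname(texts)
```

Pairing files and encodings explicitly keeps the correspondence visible; when many files are involved, the vectorized form of the `encoding` argument mentioned above avoids the loop.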
## How to Install
1. From CRAN
```{r, eval = FALSE}
install.packages("readtext")
```

2. From GitHub, if you want the latest development version.
```{r, eval = FALSE}
# the remotes package is required to install readtext from GitHub
remotes::install_github("quanteda/readtext")
```

Linux note: Some dependencies may not be available by default on Linux systems. On Debian/Ubuntu, try installing them by running this command at the command line:
```{bash, eval = FALSE}
sudo apt-get install libpoppler-cpp-dev # for pdftools (reading PDF files)
```

## Demonstration: Reading one or more text files
**readtext** supports plain text files (.txt), data in some form of JavaScript Object Notation (.json), comma- or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF, Microsoft Word formatted files and other document formats (.pdf, .doc, .docx, .odt, .rtf). **readtext** also handles multiple files and file types specified through, for instance, a "glob" expression, as well as files from a URL or an archive file (.zip, .tar, .tar.gz, .tar.bz).
The file format is determined automatically from the filename extension. If a file has no extension, or its extension is not recognized, **readtext** will assume that it is plain text. The following command, for instance, will load all of the files from the subdirectory `txt/UDHR/`:
```{r}
library("readtext")
# get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")
# read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
```

For files that contain multiple documents, such as comma-separated-value documents, you will need to specify the column name containing the texts, using the `text_field` argument:
```{r}
# read in comma-separated values and specify text field
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
```

For a more complete demonstration, see the package [vignette](https://readtext.quanteda.io/articles/readtext_vignette.html).
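To see how `text_field` separates the texts from the document variables, the following sketch builds a small CSV by hand and reads it back. The column names `speech`, `speaker`, and `year`, and the file name, are hypothetical and chosen only for this example.

```r
library("readtext")

# Hypothetical data: the column names and values are made up for illustration
df <- data.frame(
    speech = c("We the people.", "Four score and seven years ago."),
    speaker = c("A", "B"),
    year = c(1787, 1863),
    stringsAsFactors = FALSE
)
csv_path <- file.path(tempdir(), "speeches_demo.csv")
write.csv(df, csv_path, row.names = FALSE)

# Name the column that holds the texts; the remaining columns become docvars
rt <- readtext(csv_path, text_field = "speech")

rt$text     # the texts, taken from the "speech" column
rt$speaker  # a docvar carried over from the CSV
rt$year     # another docvar
```

Each row of the CSV becomes one document, so a single file can yield many texts.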
## Inter-operability with other packages
### With **quanteda**
**readtext** was originally developed in early versions of the [**quanteda**](https://github.com/quanteda/quanteda) package for the quantitative analysis of textual data. Because **quanteda**'s corpus constructor recognizes the data.frame format returned by `readtext()`, it can construct a corpus directly from a readtext object, preserving all docvars and other metadata.
```{r}
library("quanteda")
# read in comma-separated values with readtext
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
# create quanteda corpus
corpus_csv <- corpus(rt_csv)
summary(corpus_csv, 5)
```

### Text Interchange Format compatibility
Because **readtext** returns a data.frame that is formatted as per the corpus structure of the [Text Interchange Format](https://github.com/ropenscilabs/tif), it can easily be used by other packages that accept a corpus in data.frame format.
If you only want a named `character` object, **readtext** also defines an `as.character()` method that takes its data.frame and returns just the named character vector of texts, conforming to the TIF definition of the character version of a corpus.
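A short sketch of both forms, assuming two throwaway text files in a hypothetical scratch directory: the data.frame returned by `readtext()` and the named character vector obtained from `as.character()`.

```r
library("readtext")

# Hypothetical scratch directory and file names, used only for illustration
tmp <- file.path(tempdir(), "readtext_tif_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines("Alpha text.", file.path(tmp, "a.txt"))
writeLines("Beta text.", file.path(tmp, "b.txt"))

rt <- readtext(file.path(tmp, "*.txt"))

# data.frame form of the corpus: document identifiers plus texts
names(rt)

# character form of the corpus: one named element per document
txt <- as.character(rt)
names(txt)
length(txt)
```

The character form is convenient when a downstream function expects plain text input and has no use for the docvars.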