https://github.com/quanteda/readtext

an R package for reading text files

Host: GitHub
URL: https://github.com/quanteda/readtext
Owner: quanteda
Created: 2016-10-26T00:47:47.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2024-02-27T19:08:19.000Z (almost 2 years ago)
Last Synced: 2024-04-26T07:41:12.087Z (over 1 year ago)
Topics: encoding, quanteda, r, text
Language: R
Homepage: https://readtext.quanteda.io
Size: 15.5 MB
Stars: 115
Watchers: 13
Forks: 28
Open Issues: 31
Metadata Files:
- Readme: README.Rmd

Awesome Lists containing this project

jimsghstars - quanteda/readtext - an R package for reading text files (R)

README

---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "images/"
)
```

```{r echo=FALSE, results="hide", message=FALSE}
library("badger")
```

# readtext: Import and handling for plain and formatted text files

[![CRAN Version](https://www.r-pkg.org/badges/version/readtext)](https://CRAN.R-project.org/package=readtext)

`r badge_devel("quanteda/readtext", "royalblue")`

[![Downloads](https://cranlogs.r-pkg.org/badges/readtext)](https://CRAN.R-project.org/package=readtext)

[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/readtext?color=orange)](https://CRAN.R-project.org/package=readtext)

[![R-CMD-check](https://github.com/quanteda/readtext/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/quanteda/readtext/actions/workflows/R-CMD-check.yaml)

[![Codecov test coverage](https://codecov.io/gh/quanteda/readtext/branch/master/graph/badge.svg)](https://app.codecov.io/gh/quanteda/readtext?branch=master)

[1]: https://codecov.io/gh/quanteda/readtext/branch/master

An R package for reading text files in all their various formats, by Ken Benoit, Adam Obeng, Paul Nulty, Aki Matsuo, Kohei Watanabe, and Stefan Müller.

## Introduction

**readtext** is a one-function package that does exactly what it says on the tin: it reads files containing text, along with any associated document-level metadata, which we call "docvars", for document variables.  Plain text files do not have docvars, but other formats such as .csv, .tab, .xml, and .json files usually do.

**readtext** accepts filemasks, so that you can specify a pattern to load multiple texts, and these texts can even be of multiple types.  **readtext** is smart enough to process them correctly, returning a data.frame with a primary field "text" containing a character vector of the texts, and additional columns for any document variables found in the source files.
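As a minimal sketch of the filemask behaviour described above, the following writes two small plain text files to a temporary directory and reads both with a single pattern. The directory name, file names, and contents are hypothetical and exist only for illustration.

```r
library("readtext")

# Hypothetical scratch directory and files, created only for this sketch
demo_dir <- file.path(tempdir(), "readtext_glob_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("First document.", file.path(demo_dir, "doc1.txt"))
writeLines("Second document.", file.path(demo_dir, "doc2.txt"))

# The "*" filemask matches both files, so one call reads them all
rt <- readtext(file.path(demo_dir, "*.txt"))

# The result is a data.frame with one row per file
names(rt)
nrow(rt)
```

Because plain text files carry no docvars, the result here should hold only the document identifier and the "text" field.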

As encoding can also be a challenging issue for those reading in texts, we include functions for diagnosing encodings on a file-by-file basis, and allow you to specify vectorized input encodings to read in files with individually set (and different) encodings.  (All encoding functions are handled by the **stringi** package.)
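The sketch below illustrates declaring an input encoding through the `encoding` argument of `readtext()`. It assumes a file that was saved as ISO-8859-1; the file name and its contents are hypothetical, and the file is written here only so that the example is self-contained.

```r
library("readtext")

# Hypothetical file saved in ISO-8859-1 (Latin-1), created for this sketch
latin1_file <- file.path(tempdir(), "latin1_demo.txt")
txt_utf8 <- "caf\u00e9 cr\u00e8me"
writeBin(
  iconv(txt_utf8, from = "UTF-8", to = "ISO-8859-1", toRaw = TRUE)[[1]],
  latin1_file
)

# Declare the source encoding so that the text is converted on input
rt <- readtext(latin1_file, encoding = "ISO-8859-1")
rt$text
```

When reading several files at once, a vector of encodings (one per file) can be supplied in the same way.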

## How to Install

1.  From CRAN

    ```{r, eval = FALSE}
    install.packages("readtext")
    ```

2.  From GitHub, if you want the latest development version.

    ```{r, eval = FALSE}
    # the remotes package is required to install readtext from GitHub
    remotes::install_github("quanteda/readtext")
    ```

Linux note: Some system dependencies may not be available by default on Linux systems. On Debian/Ubuntu, try installing the required library by running this command at the command line:

```{bash, eval = FALSE}
sudo apt-get install libpoppler-cpp-dev   # poppler library, needed by pdftools for reading PDFs
```

## Demonstration: Reading one or more text files

**readtext** supports plain text files (.txt), data in some form of JavaScript Object Notation (.json), comma- or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF, Microsoft Word, and other formatted document files (.pdf, .doc, .docx, .odt, .rtf). **readtext** also handles multiple files and file types, specified for instance with a "glob" expression, as well as files from a URL or inside an archive file (.zip, .tar, .tar.gz, .tar.bz).

The file format is determined automatically from the filename extension.  If a file has no extension, or its extension is not recognized, **readtext** will assume that it is plain text.  The following command, for instance, will load all of the files from the subdirectory `txt/UDHR/`:

```{r}
library("readtext")

# get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")

# read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
```

For files that contain multiple documents, such as comma-separated-value documents, you will need to specify the column name containing the texts, using the `text_field` argument:

```{r}
# read in comma-separated values and specify text field
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
```
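To make the role of `text_field` more concrete, here is a small self-contained sketch that builds a two-row CSV file and reads it back. The file name, column names, and values are hypothetical; the point is only that the named column becomes the "text" field while the remaining columns are kept as docvars.

```r
library("readtext")

# Hypothetical two-row CSV; the column names are illustrative only
csv_file <- file.path(tempdir(), "speeches_demo.csv")
write.csv(
  data.frame(
    speech  = c("We the people.", "Four score and seven years ago."),
    speaker = c("A", "B"),
    year    = c(1789, 1863)
  ),
  csv_file,
  row.names = FALSE
)

# Each row becomes one document; "speech" supplies the texts
rt <- readtext(csv_file, text_field = "speech")

names(rt)   # the remaining columns are carried along as docvars
rt$year
```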

For a more complete demonstration, see the package [vignette](https://readtext.quanteda.io/articles/readtext_vignette.html).

## Inter-operability with other packages

### With **quanteda**

**readtext** was originally developed in early versions of the [**quanteda**](https://github.com/quanteda/quanteda) package for the quantitative analysis of textual data. Because **quanteda**'s corpus constructor recognizes the data.frame format returned by `readtext()`, it can construct a corpus directly from a readtext object, preserving all docvars and other metadata.

```{r}
library("quanteda")

# read in comma-separated values with readtext
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")

# create quanteda corpus
corpus_csv <- corpus(rt_csv)
summary(corpus_csv, 5)
```
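One way to check that docvars survive the conversion is to inspect them on the resulting corpus. The sketch below uses a hypothetical two-row CSV written to a temporary file, with illustrative column names, and relies on the **quanteda** functions `ndoc()` and `docvars()`.

```r
library("readtext")
library("quanteda")

# Hypothetical CSV with one text column and one docvar column
csv_file <- file.path(tempdir(), "corpus_demo.csv")
write.csv(
  data.frame(
    speech = c("We the people.", "Four score and seven years ago."),
    year   = c(1789, 1863)
  ),
  csv_file,
  row.names = FALSE
)

rt <- readtext(csv_file, text_field = "speech")
corp <- corpus(rt)

ndoc(corp)      # number of documents in the corpus
docvars(corp)   # document variables carried over from the CSV columns
```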

### Text Interchange Format compatibility

Because **readtext** returns a data.frame that is formatted as per the corpus structure of the [Text Interchange Format](https://github.com/ropenscilabs/tif), it can easily be used by other packages that accept a corpus in data.frame format.

If you only want a named `character` object, **readtext** also defines an `as.character()` method that takes its data.frame and returns just the named character vector of texts, conforming to the TIF definition of the character version of a corpus.
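A brief sketch of the `as.character()` conversion, using a hypothetical single-document text file created in a temporary directory:

```r
library("readtext")

# Hypothetical one-document file, created only for this sketch
txt_file <- file.path(tempdir(), "tif_demo.txt")
writeLines("A single short document.", txt_file)

rt <- readtext(txt_file)

# Drop the data.frame structure and keep only the texts
txts <- as.character(rt)
names(txts)   # the names are taken from the document identifiers
```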
