Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ropenscilabs/tif
Text Interchange Formats
corpus natural-language-processing r r-package rstats term-frequency text-processing tokenizer
Last synced: 2 days ago
- Host: GitHub
- URL: https://github.com/ropenscilabs/tif
- Owner: ropenscilabs
- Created: 2017-04-22T15:45:55.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2023-11-26T21:57:13.000Z (about 1 year ago)
- Last Synced: 2024-12-23T00:23:46.223Z (about 2 months ago)
- Topics: corpus, natural-language-processing, r, r-package, rstats, term-frequency, text-processing, tokenizer
- Language: R
- Homepage: https://docs.ropensci.org/tif
- Size: 43.9 KB
- Stars: 35
- Watchers: 13
- Forks: 4
- Open Issues: 5
Metadata Files:
- Readme: README.md
README
## tif: Text Interchange Formats
[![R-CMD-check](https://github.com/ropensci/tif/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ropensci/tif/actions/workflows/R-CMD-check.yaml)
This package describes and validates formats for storing
common objects arising in text analysis as native R objects.
Representations of a text corpus, document term matrix, and
tokenized text are included. The tokenized text format is
extensible to include other annotations. There are two versions
of the corpus and tokens objects; packages should accept
both and return or coerce to at least one of these.

## Installation
You can install the development version using devtools:
```{r}
devtools::install_github("ropensci/tif")
```

## Usage
The package can be used to check that a particular object is in a valid
format. For example, here we see that the object `corpus` is a valid corpus
data frame:

```{r}
library(tif)
corpus <- data.frame(doc_id = c("doc1", "doc2", "doc3"),
                     text = c("Aujourd'hui, maman est morte.",
                              "It was a pleasure to burn.",
                              "All this happened, more or less."),
                     stringsAsFactors = FALSE)

tif_is_corpus_df(corpus)
```
```
TRUE
```

The package also has functions to convert between the list and data frame
formats for the corpus and tokens objects. For example:

```{r}
tif_as_corpus_character(corpus)
```
```
doc1 doc2
"Aujourd'hui, maman est morte." "It was a pleasure to burn."
doc3
"All this happened, more or less."
```

Note that extra metadata columns will be lost in the conversion from a data
frame to a named character vector.
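
For example, a minimal sketch using the two functions shown above (the extra `author` column and its values are made up for illustration, and the exact console output may differ):

```{r}
library(tif)

# a corpus data frame carrying an extra document-level metadata column
corpus_meta <- data.frame(doc_id = c("doc1", "doc2"),
                          text = c("It was a pleasure to burn.",
                                   "All this happened, more or less."),
                          author = c("author1", "author2"),
                          stringsAsFactors = FALSE)

tif_is_corpus_df(corpus_meta)        # extra columns are allowed, so this is still valid

# the conversion keeps only doc_id (as names) and text; author is dropped
tif_as_corpus_character(corpus_meta)
```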

## Details

This package describes and validates formats for storing
common objects arising in text analysis as native R objects.
Representations of a text corpus, document term matrix, and
tokenized text are included. The tokenized text format is
extensible to include other annotations. There are two versions
of the corpus and tokens objects; packages should accept and return
at least one of these.

**corpus** (data frame) - A valid corpus data frame object
is a data frame with at least two columns. The first column
is called doc_id and is a character vector with UTF-8 encoding. Document
ids must be unique. The second column is called text and
must also be a character vector in UTF-8 encoding. Each
individual document is represented by a single row in
the data frame. Additional document-level metadata columns
and corpus-level attributes are allowed but not required.
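
As a rough illustration of these rules, and assuming `tif_is_corpus_df()` enforces them as described above, a data frame with duplicated document ids or factor columns should fail validation:

```{r}
library(tif)

# duplicated doc_id values violate the uniqueness requirement
bad_ids <- data.frame(doc_id = c("doc1", "doc1"),
                      text = c("first text", "second text"),
                      stringsAsFactors = FALSE)
tif_is_corpus_df(bad_ids)     # expected to return FALSE

# factor columns are not UTF-8 character vectors, so this should also fail
bad_type <- data.frame(doc_id = c("doc1", "doc2"),
                       text = c("first text", "second text"),
                       stringsAsFactors = TRUE)
tif_is_corpus_df(bad_type)    # expected to return FALSE
```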

**corpus** (character vector) - A valid character vector corpus
object is a character vector with UTF-8 encoding. If it has
names, these should be unique character values, also in UTF-8
encoding. No other attributes should be present.
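
A minimal sketch of this format; the validator name `tif_is_corpus_character()` is assumed by analogy with `tif_is_corpus_df()` and may differ from the package's actual API:

```{r}
library(tif)

# a named UTF-8 character vector, one element per document
corpus_vec <- c(doc1 = "It was a pleasure to burn.",
                doc2 = "All this happened, more or less.")

# validator name assumed by analogy with tif_is_corpus_df()
tif_is_corpus_character(corpus_vec)
```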

**dtm** - A valid document term matrix is a sparse matrix with
rows representing documents and columns representing
terms. The row names are a character vector giving the
document ids with no duplicated entries. The column
names are a character vector giving the terms of the
matrix with no duplicated entries. The sparse matrix
should inherit from the Matrix class dgCMatrix.
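
A sketch of a conforming object built with the Matrix package; the validator name `tif_is_dtm()` is an assumption and is not shown in this README:

```{r}
library(Matrix)
library(tif)

# a small dgCMatrix with document ids as row names and terms as column names
dtm <- sparseMatrix(i = c(1, 1, 2, 3),
                    j = c(1, 2, 2, 3),
                    x = c(2, 1, 1, 3),
                    dims = c(3, 3),
                    dimnames = list(c("doc1", "doc2", "doc3"),
                                    c("burn", "pleasure", "happened")))
class(dtm)     # "dgCMatrix"

# validator name assumed; not confirmed by this README
tif_is_dtm(dtm)
```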

**tokens** (data frame) - A valid data frame tokens
object is a data frame with at least two columns. There must be
a column called doc_id that is a character vector
with UTF-8 encoding. Document ids must be unique.
There must also be a column called token, which must be a
character vector in UTF-8 encoding.
Each individual token is represented by a single row in
the data frame. Additional token-level metadata columns
are allowed but not required.
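
A sketch of a typical tokens data frame as produced by a tokenizer, with one row per token (so doc_id values repeat across a document's tokens); the validator name `tif_is_tokens_df()` is assumed by analogy and may differ from the actual exported function:

```{r}
library(tif)

# one row per token; doc_id links each token back to its source document
tokens_df <- data.frame(doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
                        token = c("it", "was", "a", "all", "this"),
                        stringsAsFactors = FALSE)

# validator name assumed by analogy with tif_is_corpus_df()
tif_is_tokens_df(tokens_df)
```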

**tokens** (list) - A valid corpus tokens object is a (possibly
named) list of character vectors. The character vectors, as
well as names, should be in UTF-8 encoding. No other
attributes should be present in either the list or any of its
elements.
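
A sketch of the list format; the validator name `tif_is_tokens_list()` is assumed and is not shown in this README:

```{r}
library(tif)

# a (possibly named) list of UTF-8 character vectors, one element per document
tokens_list <- list(doc1 = c("it", "was", "a", "pleasure", "to", "burn"),
                    doc2 = c("all", "this", "happened", "more", "or", "less"))

# validator name assumed by analogy with tif_is_corpus_df()
tif_is_tokens_list(tokens_list)
```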