[![Build Status](https://travis-ci.org/vspinu/mlvocab.svg?branch=master)](https://travis-ci.org/vspinu/mlvocab) [![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/mlvocab)](https://cran.r-project.org/package=mlvocab) [![CRAN version](http://www.r-pkg.org/badges/version/mlvocab)](https://cran.r-project.org/package=mlvocab)

## Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines (an R package)

The package provides the following two-step abstraction:

1. The vocabulary object is first built from the entire corpus with the `vocab()`, `update_vocab()` and `prune_vocab()` functions.
2. The vocabulary is then passed alongside the corpus to a variety of corpus preprocessing functions, as sketched below. Most `mlvocab` functions accept an `nbuckets` argument for partial or full hashing of the corpus.
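
A minimal sketch of the two steps, assuming a corpus given as a list of tokenized documents; the pruning argument name and the `nbuckets` value are illustrative, not the definitive API:

```r
library(mlvocab)

## A corpus as a named list of tokenized documents (assumed input layout)
corpus <- list(doc1 = c("the", "quick", "brown", "fox"),
               doc2 = c("the", "lazy", "dog", "and", "the", "fox"))

## Step 1: build and prune the vocabulary over the full corpus
v <- vocab(corpus)
v <- prune_vocab(v, max_terms = 1000)  # pruning argument name is illustrative

## Step 2: pass the vocabulary alongside the corpus to a preprocessing
## function; `nbuckets` hashes out-of-vocabulary terms into extra buckets
mat <- dtm(corpus, v, nbuckets = 10)
```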

Current functionality includes:

- __term index sequences__: `tix_seq()`, `tix_mat()` and `tix_df()` produce integer sequences suitable for direct consumption by various sequence models.
- __term matrices__: `dtm()`, `tdm()` and `tcm()` create document-term, term-document and term-co-occurrence matrices, respectively.
- __subsetting embedding matrices__: given pre-trained word vectors, `prune_embeddings()` creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.
- __tfidf weighting__: `tfidf()` computes various flavors of term-frequency inverse-document-frequency weighting of `dtm` and `tdm` matrices (see the combined sketch below).
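
A rough sketch of how these pieces might be combined, continuing with the `corpus` and vocabulary `v` from above; the exact argument lists (and whether `tfidf()` needs the vocabulary) are assumptions for illustration, and `embeddings` is a hypothetical pre-trained matrix with one row per term:

```r
## Integer term-index sequences for sequence models
seqs <- tix_seq(corpus, v)

## Term matrices: document-term, term-document, term-co-occurrence
d  <- dtm(corpus, v)
td <- tdm(corpus, v)
co <- tcm(corpus, v)

## TF-IDF weighting of a document-term matrix
## (passing the vocabulary here is an assumption)
w <- tfidf(d, v)

## Subset pre-trained word vectors to the vocabulary; `embeddings` is a
## hypothetical matrix of pre-trained word vectors
emb <- prune_embeddings(v, embeddings)
```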


## Stability

The package is in an alpha state. API changes are likely.