https://github.com/vspinu/mlvocab

Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines

corpus embeddings natural-language-processing r-package term-document-matrix vocabulary word2vec

Host: GitHub
URL: https://github.com/vspinu/mlvocab
Owner: vspinu
Created: 2018-04-10T20:48:51.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2021-06-06T06:12:28.000Z (almost 4 years ago)
Last Synced: 2025-04-01T18:57:26.154Z (2 months ago)
Topics: corpus, embeddings, natural-language-processing, r-package, term-document-matrix, vocabulary, word2vec
Language: C++
Size: 132 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: NEWS.md

[![Build Status](https://travis-ci.org/vspinu/mlvocab.svg?branch=master)](https://travis-ci.org/vspinu/mlvocab) [![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/mlvocab)](https://cran.r-project.org/package=mlvocab) [![CRAN version](http://www.r-pkg.org/badges/version/mlvocab)](https://cran.r-project.org/package=mlvocab)

## Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines (an R package)

The package provides a two-step abstraction:

  1. The vocabulary object is first built from the entire corpus with the `vocab()`, `update_vocab()` and `prune_vocab()` functions.

  2. The vocabulary is then passed alongside the corpus to a variety of corpus pre-processing functions. Most `mlvocab` functions accept an `nbuckets` argument for partial or full hashing of the corpus.
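
The two steps above can be sketched in a language-neutral way. The snippet below is written in Python purely for illustration; it does not call or reproduce the package's R API. The helper names (`build_vocab`, `prune_by_count`, `term_indices`), the `min_count` parameter, the 1-based index layout and the CRC32 bucket scheme are all assumptions made for this sketch, not a description of how `mlvocab` works internally.

```python
import zlib
from collections import Counter


def build_vocab(corpus):
    """Step 1a: count term occurrences over a tokenized corpus."""
    counts = Counter()
    for doc in corpus:
        counts.update(doc)
    return counts


def prune_by_count(counts, min_count=1):
    """Step 1b: keep terms seen at least `min_count` times.

    Returns a dict mapping each kept term to a 1-based index,
    most frequent term first (ties broken alphabetically).
    """
    kept = [(term, n) for term, n in counts.items() if n >= min_count]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return {term: i + 1 for i, (term, _) in enumerate(kept)}


def term_indices(doc, vocab, nbuckets=0):
    """Step 2: map tokens to integer indices using the vocabulary.

    Unknown tokens map to 0 when nbuckets == 0; otherwise they are
    hashed into one of `nbuckets` slots placed after the vocabulary.
    """
    out = []
    for tok in doc:
        if tok in vocab:
            out.append(vocab[tok])
        elif nbuckets > 0:
            bucket = zlib.crc32(tok.encode("utf-8")) % nbuckets
            out.append(len(vocab) + 1 + bucket)
        else:
            out.append(0)
    return out


corpus = [["the", "quick", "brown", "fox"],
          ["the", "lazy", "dog", "the", "fox"]]
vocab = prune_by_count(build_vocab(corpus), min_count=2)
known_only = term_indices(["the", "lazy", "fox"], vocab)
with_buckets = term_indices(["the", "lazy", "fox"], vocab, nbuckets=4)
```

In this sketch the vocabulary is built once from the whole corpus and then reused for every document, so index assignments stay consistent across the corpus. Hashing unknown terms into a fixed number of buckets keeps the index range bounded even when new text contains terms the vocabulary has never seen.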

Current functionality includes:

 - __term index sequences__: `tix_seq()`, `tix_mat()` and `tix_df()` produce integer sequences suitable for direct consumption by various sequence models.

 - __term matrices__: `dtm()`, `tdm()` and `tcm()` create document-term, term-document and term-co-occurrence matrices, respectively.

 - __subsetting embedding matrices__: given pre-trained word vectors, `prune_embeddings()` creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.

 - __tfidf weighting__: `tfidf()` computes various versions of term frequency-inverse document frequency (TF-IDF) weighting of `dtm` and `tdm` matrices.
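
To make the matrix and weighting items above concrete, here is a small Python sketch of a document-term count matrix and one common TF-IDF variant (raw count times `log(N / df)`). It is an illustration of the general technique only: the functions `doc_term_matrix` and `tfidf_weights` are hypothetical helpers defined here, they use dense lists rather than sparse matrices, and the weighting formula is an assumption that may differ from the variants `tfidf()` actually offers.

```python
import math


def doc_term_matrix(corpus, terms):
    """Dense count matrix: one row per document, one column per term."""
    column = {term: j for j, term in enumerate(terms)}
    rows = []
    for doc in corpus:
        row = [0] * len(terms)
        for tok in doc:
            j = column.get(tok)
            if j is not None:  # tokens outside the vocabulary are skipped
                row[j] += 1
        rows.append(row)
    return rows


def tfidf_weights(counts):
    """Scale each count by log(N / df), where df is the number of
    documents containing the term and N is the number of documents."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    return [[c * w for c, w in zip(row, idf)] for row in counts]


corpus = [["the", "quick", "fox"],
          ["the", "lazy", "dog", "the"]]
terms = ["the", "fox", "dog"]

counts = doc_term_matrix(corpus, terms)
weights = tfidf_weights(counts)

# A term-document matrix is the transpose of the document-term matrix.
transposed = [list(col) for col in zip(*counts)]
```

Under this variant a term that occurs in every document, such as "the" here, receives a weight of zero, while terms confined to a single document keep a positive weight.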

## Stability

The package is in an alpha state; API changes are likely.
