https://github.com/vspinu/mlvocab
Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines
- Host: GitHub
- URL: https://github.com/vspinu/mlvocab
- Owner: vspinu
- Created: 2018-04-10T20:48:51.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2021-06-06T06:12:28.000Z (almost 4 years ago)
- Last Synced: 2025-04-01T18:57:26.154Z (about 2 months ago)
- Topics: corpus, embeddings, natural-language-processing, r-package, term-document-matrix, vocabulary, word2vec
- Language: C++
- Size: 132 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: NEWS.md
README
## Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines (an R package)
The following two-step abstraction is provided by the package:
1. The vocabulary object is first built from the entire corpus with the help of the `vocab()`, `update_vocab()` and `prune_vocab()` functions.
2. Then, the vocabulary is passed alongside the corpus to a variety of corpus pre-processing functions. Most of the `mlvocab` functions accept an `nbuckets` argument for partial or full hashing of the corpus.

Current functionality includes:
- __term index sequences__: `tix_seq()`, `tix_mat()` and `tix_df()` produce integer sequences suitable for direct consumption by various sequence models.
- __term matrices__: `dtm()`, `tdm()` and `tcm()` create document-term, term-document and term-co-occurrence matrices, respectively.
- __subsetting embedding matrices__: given pre-trained word vectors, `prune_embeddings()` creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.
- __tfidf weighting__: `tfidf()` computes various versions of term-frequency, inverse-document-frequency weighting of `dtm` and `tdm` matrices.
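
To make the two-step abstraction concrete, here is a minimal base R sketch of the underlying idea. It deliberately does not call any `mlvocab` function, so it says nothing about the package's actual signatures or return types. The toy corpus, the variable names (`vocab_df`, `pruned`, `term_index_seq`, `doc_term`, `weighted`), the pruning threshold and the specific `log(N / df)` weighting are all assumptions made for illustration; the package itself offers several weighting variants and a hashing option that are not shown here.

```r
# Toy corpus: a named list of pre-tokenized documents (illustrative data).
corpus <- list(
  a = c("the", "quick", "brown", "fox"),
  b = c("the", "lazy", "dog"),
  c = c("the", "quick", "dog", "barks")
)

# Step 1: build the vocabulary from the entire corpus.
# term_count = total occurrences, doc_count = number of documents containing the term.
term_count <- table(unlist(corpus, use.names = FALSE))
doc_count  <- table(unlist(lapply(corpus, unique), use.names = FALSE))
vocab_df <- data.frame(
  term       = names(term_count),
  term_count = as.integer(term_count),
  doc_count  = as.integer(doc_count[names(term_count)]),
  stringsAsFactors = FALSE
)

# Prune rare terms (hypothetical threshold: keep terms seen at least twice).
pruned <- vocab_df[vocab_df$term_count >= 2, ]

# Step 2a: term index sequences. Terms missing from the pruned
# vocabulary are mapped to 0 in this sketch.
term_index_seq <- lapply(corpus, function(doc) {
  match(doc, pruned$term, nomatch = 0L)
})

# Step 2b: document-term matrix (documents in rows, vocabulary terms in columns).
doc_term <- t(sapply(corpus, function(doc) {
  table(factor(doc, levels = pruned$term))
}))

# Step 2c: one common tf-idf variant, idf = log(N / doc_count).
idf      <- log(length(corpus) / pruned$doc_count)
weighted <- sweep(doc_term, 2, idf, `*`)

print(pruned)
print(term_index_seq)
print(weighted)
```

Because the vocabulary is computed once and then reused by every downstream step, the index sequences and the matrices share a single, consistent term ordering. In this sketch a term that occurs in every document (here `the`) receives an idf of zero and therefore a zero weight.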
## Stability

Package is in alpha state. API changes are likely.