{"id":16273394,"url":"https://github.com/vspinu/mlvocab","last_synced_at":"2025-04-08T16:10:36.754Z","repository":{"id":145019208,"uuid":"128993124","full_name":"vspinu/mlvocab","owner":"vspinu","description":"Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines","archived":false,"fork":false,"pushed_at":"2021-06-06T06:12:28.000Z","size":135,"stargazers_count":0,"open_issues_count":3,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-01T18:57:26.154Z","etag":null,"topics":["corpus","embeddings","natural-language-processing","r-package","term-document-matrix","vocabulary","word2vec"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vspinu.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-04-10T20:48:51.000Z","updated_at":"2021-06-06T06:12:31.000Z","dependencies_parsed_at":null,"dependency_job_id":"3bd354ea-28be-4179-a273-e19e5d49c274","html_url":"https://github.com/vspinu/mlvocab","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vspinu%2Fmlvocab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vspinu%2Fmlvocab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vspinu%2Fmlvocab/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vspinu%2Fmlvocab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/owners/vspinu","download_url":"https://codeload.github.com/vspinu/mlvocab/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247878023,"owners_count":21011158,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corpus","embeddings","natural-language-processing","r-package","term-document-matrix","vocabulary","word2vec"],"created_at":"2024-10-10T18:24:06.533Z","updated_at":"2025-04-08T16:10:36.748Z","avatar_url":"https://github.com/vspinu.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.org/vspinu/mlvocab.svg?branch=master)](https://travis-ci.org/vspinu/mlvocab) [![CRAN RStudio mirror downloads](http://cranlogs.r-pkg.org/badges/mlvocab)](https://cran.r-project.org/package=mlvocab) [![CRAN version](http://www.r-pkg.org/badges/version/mlvocab)](https://cran.r-project.org/package=mlvocab)\n\n## Corpus and Vocabulary Preprocessing Utilities for Natural Language Pipelines (an R package)\n\nThe following two-step abstraction is provided by the package:\n\n  1. The vocabulary object is first built from the entire corpus with the help of `vocab()`, `update_vocab()` and `prune_vocab()` functions. \n  2. Then, the vocabulary is passed alongside the corpus to a variety of corpus pre-processing functions. 
Most of the `mlvocab` functions accept an `nbuckets` argument for partial or full hashing of the corpus.\n\nCurrent functionality includes:\n\n - __term index sequences__: `tix_seq()`, `tix_mat()` and `tix_df()` produce integer sequences suitable for direct consumption by various sequence models.\n - __term matrices__: `dtm()`, `tdm()` and `tcm()` create document-term, term-document and term-co-occurrence matrices, respectively.\n - __subsetting embedding matrices__: given pre-trained word vectors, `prune_embeddings()` creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.\n - __tfidf weighting__: `tfidf()` computes various versions of term frequency-inverse document frequency weighting of `dtm` and `tdm` matrices.\n \n \n## Stability\n\nThe package is in an alpha state. API changes are likely.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvspinu%2Fmlvocab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvspinu%2Fmlvocab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvspinu%2Fmlvocab/lists"}