{"id":15633941,"url":"https://github.com/rth/vtext","last_synced_at":"2025-04-06T06:12:21.689Z","repository":{"id":34523056,"uuid":"156188651","full_name":"rth/vtext","owner":"rth","description":"Simple NLP in Rust with Python bindings","archived":false,"fork":false,"pushed_at":"2023-07-06T21:58:30.000Z","size":280,"stargazers_count":150,"open_issues_count":17,"forks_count":10,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-30T05:07:01.652Z","etag":null,"topics":["bag-of-words","information-retrieval","nlp","tf-idf","tokenization"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rth.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-05T09:02:15.000Z","updated_at":"2025-01-04T19:21:18.000Z","dependencies_parsed_at":"2024-10-03T10:50:52.646Z","dependency_job_id":"921f1a3e-8afc-44ee-8339-b4fa5aaf873d","html_url":"https://github.com/rth/vtext","commit_stats":{"total_commits":143,"total_committers":5,"mean_commits":28.6,"dds":"0.26573426573426573","last_synced_commit":"908f9dd53b4c779b03e5b1081f77e101839bcba2"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rth%2Fvtext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rth%2Fvtext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rth%2Fvtext/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rt
h%2Fvtext/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rth","download_url":"https://codeload.github.com/rth/vtext/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247441059,"owners_count":20939239,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bag-of-words","information-retrieval","nlp","tf-idf","tokenization"],"created_at":"2024-10-03T10:50:41.767Z","updated_at":"2025-04-06T06:12:21.662Z","avatar_url":"https://github.com/rth.png","language":"Rust","readme":"# vtext\n\n[![Crates.io](https://img.shields.io/crates/v/vtext.svg)](https://crates.io/crates/vtext)\n[![PyPI](https://img.shields.io/pypi/v/vtext.svg)](https://pypi.org/project/vtext/)\n[![CircleCI](https://circleci.com/gh/rth/vtext/tree/master.svg?style=svg)](https://circleci.com/gh/rth/vtext/tree/master)\n[![Build Status](https://dev.azure.com/ryurchak/vtext/_apis/build/status/rth.vtext?branchName=master)](https://dev.azure.com/ryurchak/vtext/_build/latest?definitionId=1\u0026branchName=master)\n\nNLP in Rust with Python bindings\n\nThis package aims to provide a high-performance toolkit for ingesting textual data for\nmachine learning applications.\n\n### Features\n\n - Tokenization: Regexp tokenizer, Unicode segmentation + language-specific rules\n - Stemming: Snowball (in Python 15-20x faster than NLTK)\n - Token counting: converting token counts to sparse matrices for use\n   in machine learning libraries. 
Similar to `CountVectorizer` and\n   `HashingVectorizer` in scikit-learn but with less broad functionality.\n - Levenshtein edit distance; Sørensen-Dice, Jaro, and Jaro-Winkler string similarities\n\n## Usage\n\n### Usage in Python\n\nvtext requires Python 3.6+ and can be installed with,\n```\npip install vtext\n```\n\nBelow is a simple tokenization example,\n\n```python\n\u003e\u003e\u003e from vtext.tokenize import VTextTokenizer\n\u003e\u003e\u003e VTextTokenizer(\"en\").tokenize(\"Flights can't depart after 2:00 pm.\")\n[\"Flights\", \"ca\", \"n't\", \"depart\", \"after\", \"2:00\", \"pm\", \".\"]\n```\n\nFor more details see the project documentation: [vtext.io/doc/latest/index.html](https://vtext.io/doc/latest/index.html)\n\n### Usage in Rust\n\nAdd the following to `Cargo.toml`,\n```toml\n[dependencies]\nvtext = \"0.2.0\"\n```\n\nFor more details see the Rust documentation: [docs.rs/vtext](https://docs.rs/vtext)\n\n## Benchmarks\n\n#### Tokenization\n\nThe following benchmarks illustrate the tokenization accuracy (F1 score) on [UD treebanks](https://universaldependencies.org/),\n\n|  lang | dataset   |regexp    | spacy 2.1 | vtext    |\n|-------|-----------|----------|-----------|----------|\n|  en   | EWT       | 0.812    | 0.972     | 0.966    |\n|  en   | GUM       | 0.881    | 0.989     | 0.996    |\n|  de   | GSD       | 0.896    | 0.944     | 0.964    |\n|  fr   | Sequoia   | 0.844    | 0.968     | 0.971    |\n\nand the English tokenization speed,\n\n|                          |regexp | spacy 2.1 | vtext |\n|--------------------------|-------|-----------|-------|\n| **Speed** (10⁶ tokens/s) | 3.1   | 0.14      | 2.1   |\n\n\n#### Text vectorization\n\nBelow are benchmarks for converting\ntextual data to a sparse document-term matrix using the 20 newsgroups dataset,\nrun on an Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz,\n\n| Speed (MB/s)                  | scikit-learn 0.20.1 | vtext (n_jobs=1) | vtext (n_jobs=4) 
|\n|-------------------------------|---------------------|------------------|------------------|\n| CountVectorizer.fit           |  14                 | 104              | 225              |\n| CountVectorizer.transform     |  14                 | 82               | 303              |\n| CountVectorizer.fit_transform |  14                 | 70               | NA               |\n| HashingVectorizer.transform   |  19                 | 89               | 309              |\n\n\nNote, however, that these two estimators in vtext currently support only a fraction of\nscikit-learn's functionality.  See [benchmarks/README.md](./benchmarks/README.md)\nfor more details.\n\n\n## License\n\nvtext is released under the [Apache License, Version 2.0](./LICENSE).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frth%2Fvtext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frth%2Fvtext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frth%2Fvtext/lists"}