https://github.com/rth/vtext
Simple NLP in Rust with Python bindings
bag-of-words information-retrieval nlp tf-idf tokenization
- Host: GitHub
- URL: https://github.com/rth/vtext
- Owner: rth
- License: apache-2.0
- Created: 2018-11-05T09:02:15.000Z (about 6 years ago)
- Default Branch: main
- Last Pushed: 2023-07-06T21:58:30.000Z (over 1 year ago)
- Last Synced: 2025-01-13T05:05:55.712Z (14 days ago)
- Topics: bag-of-words, information-retrieval, nlp, tf-idf, tokenization
- Language: Rust
- Homepage:
- Size: 273 KB
- Stars: 150
- Watchers: 5
- Forks: 11
- Open Issues: 17
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# vtext
[![Crates.io](https://img.shields.io/crates/v/vtext.svg)](https://crates.io/crates/vtext)
[![PyPI](https://img.shields.io/pypi/v/vtext.svg)](https://pypi.org/project/vtext/)
[![CircleCI](https://circleci.com/gh/rth/vtext/tree/master.svg?style=svg)](https://circleci.com/gh/rth/vtext/tree/master)
[![Build Status](https://dev.azure.com/ryurchak/vtext/_apis/build/status/rth.vtext?branchName=master)](https://dev.azure.com/ryurchak/vtext/_build/latest?definitionId=1&branchName=master)

NLP in Rust with Python bindings

This package aims to provide a high performance toolkit for ingesting textual data for
machine learning applications.

### Features

- Tokenization: Regexp tokenizer, Unicode segmentation + language specific rules
- Stemming: Snowball (in Python 15-20x faster than NLTK)
- Token counting: converting token counts to sparse matrices for use
in machine learning libraries. Similar to `CountVectorizer` and
`HashingVectorizer` in scikit-learn but with less broad functionality.
- Levenshtein edit distance; Sørensen-Dice, Jaro, Jaro Winkler string similarities

## Usage

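As background for the token counting feature listed above, here is a minimal pure-Python sketch of turning tokenized documents into a sparse document-term count matrix. It is not vtext's implementation and does not use its API; the function name `count_tokens_csr` is hypothetical, and the choice of a sorted vocabulary is an assumption made only to keep the output reproducible.

```python
from collections import Counter


def count_tokens_csr(docs):
    """Build a CSR-style document-term count matrix from tokenized documents.

    Hypothetical helper for illustration only (not part of vtext).
    Returns (vocabulary, data, indices, indptr).
    """
    # Assumption: columns are assigned to tokens in sorted order.
    unique_tokens = sorted({tok for doc in docs for tok in doc})
    vocabulary = {tok: col for col, tok in enumerate(unique_tokens)}

    data, indices, indptr = [], [], [0]
    for doc in docs:
        # Count how often each vocabulary column occurs in this document.
        counts = Counter(vocabulary[tok] for tok in doc)
        for col in sorted(counts):
            indices.append(col)       # column index of a non-zero entry
            data.append(counts[col])  # its count
        # Row boundary: where the next document's entries start.
        indptr.append(len(indices))
    return vocabulary, data, indices, indptr


docs = [["the", "cat", "sat"], ["the", "cat", "saw", "the", "dog"]]
vocabulary, data, indices, indptr = count_tokens_csr(docs)
```

The three lists follow the compressed sparse row layout, so if SciPy is available they can be passed as `scipy.sparse.csr_matrix((data, indices, indptr))` to obtain a matrix usable by machine learning libraries.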
### Usage in Python
vtext requires Python 3.6+ and can be installed with,
```
pip install vtext
```

Below is a simple tokenization example,

```python
>>> from vtext.tokenize import VTextTokenizer
>>> VTextTokenizer("en").tokenize("Flights can't depart after 2:00 pm.")
["Flights", "ca", "n't", "depart", "after", "2:00", "pm", "."]
```

For more details see the project documentation: [vtext.io/doc/latest/index.html](https://vtext.io/doc/latest/index.html)

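For comparison with the `VTextTokenizer` output above, the following sketch shows what a plain regular-expression tokenizer does with the same sentence. It uses only the standard library `re` module, not vtext. The pattern is an assumption: it is the word pattern scikit-learn's `CountVectorizer` uses by default (two or more word characters), and the helper name `regexp_tokenize` is hypothetical.

```python
import re

# Assumption: scikit-learn's default word pattern, i.e. runs of two or more
# word characters delimited by word boundaries.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def regexp_tokenize(text):
    """Return every non-overlapping match of TOKEN_PATTERN in text.

    Hypothetical helper for illustration only (not part of vtext).
    """
    return TOKEN_PATTERN.findall(text)


tokens = regexp_tokenize("Flights can't depart after 2:00 pm.")
print(tokens)
```

With this pattern the contraction, the time and the final period are not preserved: the result is `['Flights', 'can', 'depart', 'after', '00', 'pm']`. This is the kind of case that language specific tokenization rules are meant to handle.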
### Usage in Rust
Add the following to `Cargo.toml`,
```toml
[dependencies]
vtext = "0.2.0"
```

For more details see the Rust documentation: [docs.rs/vtext](https://docs.rs/vtext)

## Benchmarks
#### Tokenization
The following benchmarks illustrate the tokenization accuracy (F1 score) on [UD treebanks](https://universaldependencies.org/),

| lang | dataset | regexp | spacy 2.1 | vtext |
|-------|-----------|----------|-----------|----------|
| en | EWT | 0.812 | 0.972 | 0.966 |
| en | GUM | 0.881 | 0.989 | 0.996 |
| de | GSD | 0.896 | 0.944 | 0.964 |
| fr | Sequoia | 0.844 | 0.968 | 0.971 |

and the English tokenization speed,

| | regexp | spacy 2.1 | vtext |
|--------------------------|-------|-----------|-------|
| **Speed** (10⁶ tokens/s) | 3.1 | 0.14 | 2.1 |

#### Text vectorization

Below are benchmarks for converting
textual data to a sparse document-term matrix using the 20 newsgroups dataset,
run on Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz,

| Speed (MB/s) | scikit-learn 0.20.1 | vtext (n_jobs=1) | vtext (n_jobs=4) |
|-------------------------------|---------------------|------------------|------------------|
| CountVectorizer.fit | 14 | 104 | 225 |
| CountVectorizer.transform | 14 | 82 | 303 |
| CountVectorizer.fit_transform | 14 | 70 | NA |
| HashingVectorizer.transform | 19 | 89 | 309 |

Note however that these two estimators in vtext currently support only a fraction of
scikit-learn's functionality. See [benchmarks/README.md](./benchmarks/README.md)
for more details.

## License

vtext is released under the [Apache License, Version 2.0](./LICENSE).