Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/greyblake/whatlang-rs
Natural language detection library for Rust. Try demo online: https://whatlang.org/
- Host: GitHub
- URL: https://github.com/greyblake/whatlang-rs
- Owner: greyblake
- License: mit
- Created: 2016-11-05T21:26:51.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2024-03-16T15:51:08.000Z (11 months ago)
- Last Synced: 2025-01-16T07:10:21.580Z (10 days ago)
- Topics: ai, algorithm, classifier, detect-language, language, language-recognition, nlp, rust, rustlang, text-analysis, text-classification, text-classifier, whatlang
- Language: Rust
- Homepage: https://whatlang.org/
- Size: 2.04 MB
- Stars: 987
- Watchers: 24
- Forks: 111
- Open Issues: 10
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Support: SUPPORTED_LANGUAGES.md
Awesome Lists containing this project
- awesome-rust-cn - greyblake/whatlang-rs (Libraries / Text processing)
- awesome-rust - greyblake/whatlang-rs (Libraries / Text processing)
- awesome-nlp - whatlang
- awesome-rust - greyblake/whatlang-rs
- awesome-rust-cn - greyblake/whatlang-rs
- awesome-rust-zh - greyblake/whatlang-rs - Natural language detection library based on trigrams (Libraries / Text processing)
README
Whatlang
Natural language detection for Rust with focus on simplicity and performance.
[![Stand With Ukraine](https://raw.githubusercontent.com/vshymanskyy/StandWithUkraine/main/banner2-direct.svg)](https://stand-with-ukraine.pp.ua/)
## Content
* [Features](#features)
* [Get started](#get-started)
* [Who uses Whatlang?](#who-uses-whatlang)
* [Documentation](https://docs.rs/whatlang)
* [Supported languages](https://github.com/greyblake/whatlang-rs/blob/master/SUPPORTED_LANGUAGES.md)
* [Feature toggles](#feature-toggles)
* [How does it work?](#how-does-it-work)
  * [How does the language recognition work?](#how-does-the-language-recognition-work)
  * [How is `is_reliable` calculated?](#how-is-is_reliable-calculated)
* [Make tasks](#make-tasks)
* [Comparison with alternatives](#comparison-with-alternatives)
* [Ports and clones](#ports-and-clones)
* [Donations](#donations)
* [Derivation](#derivation)
* [License](#license)
* [Contributors](#contributors)

## Features
* Supports [69 languages](https://github.com/greyblake/whatlang-rs/blob/master/SUPPORTED_LANGUAGES.md)
* 100% written in Rust
* Lightweight, fast and simple
* Recognizes not only a language, but also a script (Latin, Cyrillic, etc)
* Provides reliability information

## Get started
Example:
```rust
use whatlang::{detect, Lang, Script};

fn main() {
    let text = "Ĉu vi ne volas eklerni Esperanton? Bonvolu! Estas unu de la plej bonaj aferoj!";

    // `detect` returns an `Option`; it is `None` when no language can be determined.
    let info = detect(text).unwrap();
    assert_eq!(info.lang(), Lang::Epo);
    assert_eq!(info.script(), Script::Latin);
    assert_eq!(info.confidence(), 1.0);
    assert!(info.is_reliable());
}
```

For more details (e.g. how to exclude some languages from detection), please check the [documentation](https://docs.rs/whatlang).
## Who uses Whatlang?
Whatlang is used in the following large projects as a direct or indirect dependency for language recognition.
You'll be in great company using Whatlang:

* [Sonic](https://github.com/valeriansaliou/sonic) - fast, lightweight and schema-less search backend in Rust.
* [Meilisearch](https://github.com/meilisearch) - an open-source, easy-to-use, blazingly fast, and hyper-relevant search engine built in Rust.

## Feature toggles
| Feature | Description |
|-------------|---------------------------------------------------------------------------------------|
| `enum-map` | `Lang` and `Script` implement `Enum` trait from [enum-map](https://docs.rs/enum-map/) |
| `arbitrary` | Support [Arbitrary](https://crates.io/crates/arbitrary) |
| `serde` | Implements `Serialize` and `Deserialize` for `Lang` and `Script` |
| `dev`       | Enables the `whatlang::dev` module, which provides some internal API. It exists for profiling purposes, and normal users are discouraged from relying on this API. |

## How does it work?
### How does the language recognition work?
The algorithm is based on trigram language models, which are a particular case of n-grams.
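
As a rough illustration of the trigram idea, the sketch below builds a ranked trigram profile and compares two profiles with the "out-of-place" distance described by Cavnar and Trenkle. It is a minimal sketch under stated assumptions, not Whatlang's actual implementation: the function names, the lowercasing step, and the tie-breaking rule are all hypothetical choices made for this example.

```rust
use std::collections::HashMap;

/// Count overlapping character trigrams in `text` (lowercased for this sketch).
fn trigram_counts(text: &str) -> HashMap<String, usize> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    let mut counts = HashMap::new();
    for window in chars.windows(3) {
        let trigram: String = window.iter().collect();
        *counts.entry(trigram).or_insert(0) += 1;
    }
    counts
}

/// Trigrams ordered from most to least frequent, truncated to `max_len`.
/// Ties are broken alphabetically so the result is deterministic.
fn ranked_profile(text: &str, max_len: usize) -> Vec<String> {
    let mut items: Vec<(String, usize)> = trigram_counts(text).into_iter().collect();
    items.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    items.into_iter().take(max_len).map(|(t, _)| t).collect()
}

/// "Out-of-place" distance: for every trigram in the text profile, add the
/// difference between its rank there and its rank in the language profile.
/// Trigrams missing from the language profile get a fixed maximum penalty.
fn out_of_place_distance(text_profile: &[String], lang_profile: &[String]) -> usize {
    let max_penalty = lang_profile.len();
    text_profile
        .iter()
        .enumerate()
        .map(|(rank, trigram)| {
            match lang_profile.iter().position(|t| t == trigram) {
                Some(lang_rank) => rank.abs_diff(lang_rank),
                None => max_penalty,
            }
        })
        .sum()
}
```

To classify a text with these hypothetical helpers, one would compute `ranked_profile` for the input, measure `out_of_place_distance` against a precomputed profile for each candidate language, and pick the language with the smallest distance.
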
To understand the idea, please check the original whitepaper: [Cavnar and Trenkle '94: N-Gram-Based Text Categorization](https://www.researchgate.net/publication/2375544_N-Gram-Based_Text_Categorization).

### How is `is_reliable` calculated?
It is based on the following factors:
* How many unique trigrams are in the given text?
* How big is the difference between the first and the second (not returned) detected languages? This metric is called `rate` in the code base.

Therefore, it can be presented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas.
This function is a hyperbola.

For more details, please check the blog article [Introduction to Rust Whatlang Library and Natural Language Identification Algorithms](https://www.greyblake.com/blog/introduction-to-rust-whatlang-library-and-natural-language-identification-algorithms/).
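
The two factors above can be sketched as a simple threshold check. This is only an illustration of the hyperbola idea, assuming a curve of the form `A / n + B`. The constants `A` and `B` and the function names are placeholders chosen for this example, so consult the Whatlang source for the real calculation.

```rust
/// Illustrative constants for the threshold curve (assumptions for this sketch).
const A: f64 = 3.0;
const B: f64 = 0.015;

/// Minimum `rate` required for a result to count as reliable, given the
/// number of unique trigrams in the text. The curve `A / n + B` is a
/// hyperbola: short texts with few trigrams demand a large gap between the
/// top two languages, while longer texts need only a small one.
fn rate_threshold(unique_trigrams: usize) -> f64 {
    A / unique_trigrams as f64 + B
}

/// Returns true when the point (unique_trigrams, rate) lies above the curve,
/// i.e. in the "Reliable" area of the 2D space.
fn is_reliable_sketch(unique_trigrams: usize, rate: f64) -> bool {
    // Guard against an empty text, where the threshold is undefined.
    unique_trigrams > 0 && rate > rate_threshold(unique_trigrams)
}
```

With these placeholder constants, a text with 100 unique trigrams needs a `rate` above roughly 0.045 to be considered reliable, whereas a text with only 3 unique trigrams needs a `rate` above roughly 1.015.
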
## Make tasks
* `make bench` - run performance benchmarks
* `make doc` - generate and open doc
* `make test` - run tests
* `make watch` - watch changes and run tests

## Comparison with alternatives
| | Whatlang | CLD2 | CLD3 |
| ------------------------- | ---------- | ----------- | -------------- |
| Implementation language | Rust | C++ | C++ |
| Languages                 | 69         | 83          | 107            |
| Algorithm | trigrams | quadgrams | neural network |
| Supported Encoding | UTF-8 | UTF-8 | ? |
| HTML support              | no         | yes         | ?              |

## Ports and clones
* [whatlang-ffi](https://github.com/greyblake/whatlang-ffi) - C bindings
* [whatlanggo](https://github.com/abadojack/whatlanggo) - whatlang clone for Go language
* [whatlang-py](https://github.com/cathalgarvey/whatlang-py) - bindings for Python
* [whatlang-rb](https://gitlab.com/KitaitiMakoto/whatlang-rb) - bindings for Ruby
* [whatlangex](https://github.com/pierrelegall/whatlangex) - bindings for Elixir

## Donations
You can support the project by donating [NEAR tokens](https://near.org).
Our NEAR wallet address is `whatlang.near`
## Derivation
**Whatlang** is a derivative work of [Franc](https://github.com/wooorm/franc) (JavaScript, MIT) by [Titus Wormer](https://github.com/wooorm).
## License
[MIT](https://github.com/greyblake/whatlang-rs/blob/master/LICENSE) © [Sergey Potapov](http://greyblake.com/)
## Contributors
- [greyblake](https://github.com/greyblake) Potapov Sergey - creator, maintainer.
- [Dr-Emann](https://github.com/Dr-Emann) Zachary Dremann - optimization and improvements
- [BaptisteGelez](https://github.com/BaptisteGelez) Baptiste Gelez - improvements
- [Vishesh Chopra](https://github.com/KarmicKonquest) - designed the logo
- [Joel Natividad](https://github.com/jqnatividad) - support of Tagalog
- [ManyTheFish](https://github.com/ManyTheFish) - crazy optimization
- [Kerollmops](https://github.com/Kerollmops) Clément Renault - crazy optimization