https://github.com/abadojack/whatlanggo
Natural language detection library for Go
go language nlp text-processing
- Host: GitHub
- URL: https://github.com/abadojack/whatlanggo
- Owner: abadojack
- License: mit
- Created: 2017-02-20T17:32:01.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2023-03-28T08:08:05.000Z (over 1 year ago)
- Last Synced: 2024-10-25T05:24:22.061Z (about 2 months ago)
- Topics: go, language, nlp, text-processing
- Language: Go
- Size: 240 KB
- Stars: 640
- Watchers: 15
- Forks: 66
- Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE
- Support: SUPPORTED_LANGUAGES.md
Awesome Lists containing this project
- awesome-go - whatlanggo - Natural language detection package for Go. Supports 84 languages and 24 scripts (writing systems e.g. Latin, Cyrillic, etc). (Natural Language Processing / Language Detection)
- zero-alloc-awesome-go - whatlanggo - Natural language detection package for Go. Supports 84 languages and 24 scripts (writing systems e.g. Latin, Cyrillic, etc). (Natural Language Processing / Language Detection)
- go-awesome - whatlanggo - Natural Language Recognition (Open source library / Word Processing)
- awesome-go - whatlanggo - Natural language detection library for Go - ★ 304 (Natural Language Processing)
- awesome-go-extra - whatlanggo (Bot Building / Language Detection)
- awesome-go-zh - whatlanggo
- awesome-go - whatlanggo - Natural language detection package supporting 84 languages. (Language Detection and Processing / Interactive Tools)
README
# Whatlanggo
[![Build Status](https://travis-ci.org/abadojack/whatlanggo.svg?branch=master)](https://travis-ci.org/abadojack/whatlanggo) [![Go Report Card](https://goreportcard.com/badge/github.com/abadojack/whatlanggo)](https://goreportcard.com/report/github.com/abadojack/whatlanggo) [![GoDoc](https://godoc.org/github.com/abadojack/whatlanggo?status.png)](https://godoc.org/github.com/abadojack/whatlanggo) [![Coverage Status](https://coveralls.io/repos/github/abadojack/whatlanggo/badge.svg)](https://coveralls.io/github/abadojack/whatlanggo)
Natural language detection for Go.
## Features
* Supports [84 languages](https://github.com/abadojack/whatlanggo/blob/master/SUPPORTED_LANGUAGES.md)
* 100% written in Go
* No external dependencies
* Fast
* Recognizes not only a language, but also a script (Latin, Cyrillic, etc.)

## Getting started
Installation:
```sh
go get -u github.com/abadojack/whatlanggo
```

Simple usage example:
```go
package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	// Detect reports the language, the script, and a confidence value.
	info := whatlanggo.Detect("Foje funkcias kaj foje ne funkcias")
	fmt.Println("Language:", info.Lang.String(), " Script:", whatlanggo.Scripts[info.Script], " Confidence: ", info.Confidence)
}
```

## Blacklisting and whitelisting
```go
package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	// Blacklist: languages listed here are excluded from detection.
	options := whatlanggo.Options{
		Blacklist: map[whatlanggo.Lang]bool{
			whatlanggo.Ydd: true,
		},
	}

	info := whatlanggo.DetectWithOptions("האקדמיה ללשון העברית", options)
	fmt.Println("Language:", info.Lang.String(), "Script:", whatlanggo.Scripts[info.Script])

	// Whitelist: only the languages listed here are considered.
	options1 := whatlanggo.Options{
		Whitelist: map[whatlanggo.Lang]bool{
			whatlanggo.Epo: true,
			whatlanggo.Ukr: true,
		},
	}

	info = whatlanggo.DetectWithOptions("Mi ne scias", options1)
	fmt.Println("Language:", info.Lang.String(), " Script:", whatlanggo.Scripts[info.Script])
}
```
For more details, please check the [documentation](https://godoc.org/github.com/abadojack/whatlanggo).

## Requirements
Go 1.8 or higher

## How does it work?
### How does the language recognition work?
The algorithm is based on trigram language models, a particular case of n-gram models.
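As a rough illustration of the trigram idea, the sketch below builds a ranked trigram profile for a text and compares it with per-language profiles using an "out-of-place" rank distance. This is not whatlanggo's actual code: the function names `trigramRanks` and `outOfPlace`, the toy training sentences, and the limit of 300 trigrams are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigramRanks (hypothetical helper) maps each trigram of text to its
// frequency rank, where 0 is the most frequent. Only the top `limit`
// trigrams are kept.
func trigramRanks(text string, limit int) map[string]int {
	// Pad with spaces so that word boundaries become part of the trigrams.
	runes := []rune(" " + strings.ToLower(text) + " ")
	counts := map[string]int{}
	for i := 0; i+3 <= len(runes); i++ {
		counts[string(runes[i:i+3])]++
	}

	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	// Sort by descending count; break ties alphabetically for determinism.
	sort.Slice(keys, func(a, b int) bool {
		if counts[keys[a]] != counts[keys[b]] {
			return counts[keys[a]] > counts[keys[b]]
		}
		return keys[a] < keys[b]
	})
	if len(keys) > limit {
		keys = keys[:limit]
	}

	ranks := make(map[string]int, len(keys))
	for i, k := range keys {
		ranks[k] = i
	}
	return ranks
}

// outOfPlace (hypothetical helper) sums the rank differences between the
// text profile and a language profile. Trigrams missing from the language
// profile are charged maxPenalty.
func outOfPlace(text, profile map[string]int, maxPenalty int) int {
	dist := 0
	for tri, rank := range text {
		profRank, ok := profile[tri]
		if !ok {
			dist += maxPenalty
			continue
		}
		d := rank - profRank
		if d < 0 {
			d = -d
		}
		dist += d
	}
	return dist
}

func main() {
	// Toy profiles built from single sentences; real profiles come from large corpora.
	profiles := map[string]map[string]int{
		"english": trigramRanks("the quick brown fox jumps over the lazy dog and then the other thing", 300),
		"german":  trigramRanks("der schnelle braune fuchs springt über den faulen hund und dann das andere", 300),
	}

	input := trigramRanks("the dog and the fox", 300)
	for _, name := range []string{"english", "german"} {
		// A smaller distance means a closer match.
		fmt.Println(name, outOfPlace(input, profiles[name], 300))
	}
}
```

The language whose profile yields the smallest distance is the best candidate. In this toy setup the English profile ends up closer to the input than the German one.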
To understand the idea, please check the original paper, [Cavnar and Trenkle '94: N-Gram-Based Text Categorization](https://www.researchgate.net/publication/2375544_N-Gram-Based_Text_Categorization).

### How is _IsReliable_ calculated?
It is based on the following factors:
* How many unique trigrams are in the given text.
* How big the difference is between the first and the second (not returned) detected languages. This metric is called `rate` in the code base.

Therefore, it can be represented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas. This function is a hyperbola.

For more details, please check the blog article [Introduction to Rust Whatlang Library and Natural Language Identification Algorithms](https://www.greyblake.com/blog/2017-07-30-introduction-to-rust-whatlang-library-and-natural-language-identification-algorithms/).
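The reliability decision described above can be sketched as a hyperbolic threshold over the two factors. This is a minimal illustration and not the library's exact implementation: the function names `confidence` and `isReliable` and the constants `12.0`, `0.05`, and `0.8` are assumptions chosen for this example.

```go
package main

import "fmt"

// Illustrative constants; they are assumptions for this sketch,
// not values confirmed by the README.
const (
	hyperbolaScale    = 12.0 // controls how fast the threshold drops as text grows
	hyperbolaOffset   = 0.05 // minimum rate required even for long texts
	reliableThreshold = 0.8  // confidence above this counts as "Reliable"
)

// confidence (hypothetical helper) maps the number of unique trigrams and
// the rate (relative gap between the best and second-best language) to [0, 1].
func confidence(uniqueTrigrams int, rate float64) float64 {
	if uniqueTrigrams <= 0 || rate <= 0 {
		return 0
	}
	// Hyperbola: the required rate shrinks as the text provides more trigrams.
	confidentRate := hyperbolaScale/float64(uniqueTrigrams) + hyperbolaOffset
	if rate >= confidentRate {
		return 1
	}
	// Below the curve, scale confidence proportionally.
	return rate / confidentRate
}

// isReliable (hypothetical helper) applies the threshold to the confidence.
func isReliable(uniqueTrigrams int, rate float64) bool {
	return confidence(uniqueTrigrams, rate) > reliableThreshold
}

func main() {
	// Same rate, different text lengths.
	fmt.Println(isReliable(10, 0.2))  // false: short text needs a much larger gap
	fmt.Println(isReliable(200, 0.2)) // true: long text, the gap is sufficient
}
```

With these assumed constants, a short text needs a far larger gap between the top two candidates than a long text does before the result counts as reliable.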
## License
[MIT](https://github.com/abadojack/whatlanggo/blob/master/LICENSE)

## Derivation
whatlanggo is a derivative of [Franc](https://github.com/wooorm/franc) (JavaScript, MIT) by [Titus Wormer](https://github.com/wooorm).

## Acknowledgements
Thanks to [greyblake](https://github.com/greyblake) (Potapov Sergey) for creating [whatlang-rs](https://github.com/greyblake/whatlang-rs), from which I took the idea and algorithms.