Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Proteusiq/luga
Blazing fast language detection using fastText model
- Host: GitHub
- URL: https://github.com/Proteusiq/luga
- Owner: Proteusiq
- License: mit
- Created: 2021-11-13T13:39:31.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2022-12-18T17:29:43.000Z (about 2 years ago)
- Last Synced: 2024-07-25T19:26:38.341Z (6 months ago)
- Topics: detection, language, language-model, languages, machine-learning
- Language: Python
- Homepage:
- Size: 605 KB
- Stars: 23
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-blazingly-fast - luga - Blazing fast language detection using fastText model (Python)
README
Luga
==============================
- Blazing fast language detection using fastText's language models.

![Languages](https://user-images.githubusercontent.com/14926709/143822756-8fd6437f-6c99-4a9f-9718-37f086955583.png)

_Luga_ is a Swahili word for language. [fastText](https://github.com/facebookresearch/fastText) provides a blazing-fast
language detection tool. Lamentably, [fastText's](https://fasttext.cc/docs/en/support.html) API is beauty-less, and the documentation is a bit fuzzy.
It is also funky that we have to manually [download](https://fasttext.cc/docs/en/language-identification.html) and load models.

Here is where _luga_ comes in. We abstract unnecessary steps and allow you to do precisely one thing: detecting text language.

#### cover image
[Stand Still. Stay Silent](http://sssscomic.com/index.php) - The relationships between Indo-European and Uralic languages by Minna Sundberg.

### Show, don't tell
![Luga in Action](example.gif)

### Installation
```bash
python -m pip install -U luga
```

### Usage:
⚠️ Note: The first usage downloads the model for you, so the first import takes a bit longer, depending on your internet speed.
This happens only once.

```python
from luga import language

print(language("the world ended yesterday"))
# Language(name='en', score=0.98)
```

With a list of texts, we can create a mask for a filtering pipeline that can be used, for example, with DataFrames:
```python
from luga import languages  # the plural `languages` handles a list of texts
import pandas as pd

examples = ["Jeg har ikke en rød reje", "Det blæser en halv pelican", "We are not robots yet"]
languages(texts=examples, only_language=True, to_array=True) == "en"
# output
# array([False, False, True])

dataf = pd.DataFrame({"text": examples})
dataf.loc[lambda d: languages(texts=d["text"].to_list(), only_language=True, to_array=True) == "en"]
# output
# 2 We are not robots yet
# Name: text, dtype: object
```

### Without Luga:
Download the model
```bash
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -O /tmp/lid.176.bin
```

Load and use
```python
import fasttext

PATH_TO_MODEL = '/tmp/lid.176.bin'
fmodel = fasttext.load_model(PATH_TO_MODEL)
fmodel.predict(["the world has ended yesterday"])
# ([['__label__en']], [array([0.98046654], dtype=float32)])
```

### Dev:
```bash
poetry run pre-commit install
```

## Release Flow
```bash
# assumes git push is completed
git tag -l # lists tags
git tag v*.*.* # Major.Minor.Fix
git push origin tag v*.*.*

# to delete tag:
git tag -d v*.*.* && git push origin tag -d v*.*.*

# change pyproject.toml and __init__.py to reflect new version
```

#### TODO:
- [X] refactor artifacts.py
- [X] auto checkers with pre-commit | invoke
- [X] write more tests
- [X] write github actions
- [ ] create an intelligent data checker (a fast List[str], what to do with non-string values)
- [ ] make it faster with Cython
- [ ] get NDArray typing correctly
- [ ] fix `artifacts.py` line 111 cast to List[str] that causes issues
- [ ] remove nptyping when more packages move to numpy > 1.21
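
One possible shape for the data-checker item in the list above is sketched here. It is only an illustration, not code from luga: `clean_texts` and its `placeholder` parameter are hypothetical names, and the policy (replace anything that is not a non-empty string with a placeholder and report which entries were valid) is an assumption. Newlines are collapsed because fastText's Python `predict` expects single-line input.

```python
from typing import Any, Iterable, List, Tuple


def clean_texts(texts: Iterable[Any], placeholder: str = "") -> Tuple[List[str], List[bool]]:
    """Return (cleaned, is_valid), two lists with one entry per input item.

    Hypothetical helper: non-strings and blank strings become `placeholder`
    and are flagged as invalid, so positions still line up with the input.
    """
    cleaned: List[str] = []
    is_valid: List[bool] = []
    for text in texts:
        if isinstance(text, str) and text.strip():
            # Collapse newlines and repeated whitespace into single spaces.
            cleaned.append(" ".join(text.split()))
            is_valid.append(True)
        else:
            cleaned.append(placeholder)
            is_valid.append(False)
    return cleaned, is_valid


cleaned, is_valid = clean_texts(
    ["We are not robots yet", None, 42, "  ", "line one\nline two"]
)
print(cleaned)   # ['We are not robots yet', '', '', '', 'line one line two']
print(is_valid)  # [True, False, False, False, True]
```

Keeping the output the same length as the input means the validity flags can be combined with a language mask when filtering a DataFrame column.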