https://github.com/Proteusiq/luga

Blazing fast language detection using fastText model
https://github.com/Proteusiq/luga

detection language language-model languages machine-learning

Last synced: over 1 year ago


Host: GitHub
URL: https://github.com/Proteusiq/luga
Owner: Proteusiq
License: mit
Created: 2021-11-13T13:39:31.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2022-12-18T17:29:43.000Z (over 3 years ago)
Last Synced: 2024-07-25T19:26:38.341Z (almost 2 years ago)
Topics: detection, language, language-model, languages, machine-learning
Language: Python
Homepage:
Size: 605 KB
Stars: 23
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-blazingly-fast - luga - Blazing fast language detection using fastText model (Python)

README

Luga
==============================

- A blazing-fast language detection tool built on fastText's language models.

![Languages](https://user-images.githubusercontent.com/14926709/143822756-8fd6437f-6c99-4a9f-9718-37f086955583.png)

_Luga_ is a Swahili word for language. [fastText](https://github.com/facebookresearch/fastText) provides a blazing-fast language detection tool. Lamentably, [fastText's](https://fasttext.cc/docs/en/support.html) API is beauty-less, and the documentation is a bit fuzzy. It is also funky that we have to manually [download](https://fasttext.cc/docs/en/language-identification.html) and load models.

Here is where _luga_ comes in. We abstract away the unnecessary steps and allow you to do precisely one thing: detect the language of a text.

#### cover image

[Stand Still. Stay Silent](http://sssscomic.com/index.php) - The relationships between Indo-European and Uralic languages by Minna Sundberg.

### Show, don't tell

![Luga in Action](example.gif)

### Installation

```bash
python -m pip install -U luga
```

### Usage:

⚠️ Note: The first use downloads the model for you, so the first import takes a bit longer depending on your internet speed. This happens only once.

```python
from luga import language

print(language("the world ended yesterday"))
# Language(name='en', score=0.98)
```
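The "downloaded only once" behaviour in the note above is a plain download-if-missing cache. The following is a minimal sketch of that pattern, not luga's actual implementation: `ensure_model` and `download_model` are hypothetical names, and the URL is the one given in the "Without Luga" section.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve

# URL taken from the "Without Luga" section of this README.
MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"


def download_model(path: Path) -> None:
    """Fetch the fastText language-identification model to `path`."""
    urlretrieve(MODEL_URL, str(path))


def ensure_model(path: Path, fetch: Callable[[Path], None] = download_model) -> Path:
    """Return `path`, calling `fetch(path)` only when the file is missing.

    Hypothetical helper: it illustrates the idea of a one-time download,
    not how luga stores or locates its model.
    """
    if not path.exists():
        # Create the cache directory on first use, then download once.
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(path)
    return path
```

Passing the downloader in as `fetch` keeps the caching logic testable without network access; every call after the first finds the file on disk and returns immediately.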

Given a list of texts, we can create a mask for a filtering pipeline that can be used, for example, with DataFrames:

```python
# `languages` (plural) handles a list of texts
from luga import languages
import pandas as pd

examples = ["Jeg har ikke en rød reje", "Det blæser en halv pelican", "We are not robots yet"]

languages(texts=examples, only_language=True, to_array=True) == "en"
# output
# array([False, False, True])

dataf = pd.DataFrame({"text": examples})
dataf.loc[lambda d: languages(texts=d["text"].to_list(), only_language=True, to_array=True) == "en"]
# output
# 2    We are not robots yet
# Name: text, dtype: object
```
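The mask idea above does not depend on pandas or NumPy. Here is a rough, dependency-free sketch of the same filtering step. `language_mask`, `keep`, and `toy_detect` are hypothetical names, and `toy_detect` is a deliberately naive stand-in that only happens to work for these three sentences; it is not how luga or fastText detect languages.

```python
from typing import Callable, List, Sequence


def language_mask(texts: Sequence[str], detect: Callable[[str], str], wanted: str) -> List[bool]:
    """Return one boolean per text: True where the detected language equals `wanted`."""
    return [detect(text) == wanted for text in texts]


def keep(texts: Sequence[str], mask: Sequence[bool]) -> List[str]:
    """Keep only the texts whose mask entry is True."""
    return [text for text, selected in zip(texts, mask) if selected]


def toy_detect(text: str) -> str:
    """Hypothetical stand-in detector: pure-ASCII text counts as 'en', anything else as 'da'."""
    return "en" if text.isascii() else "da"


examples = ["Jeg har ikke en rød reje", "Det blæser en halv pelican", "We are not robots yet"]

mask = language_mask(examples, toy_detect, "en")
print(mask)                  # [False, False, True]
print(keep(examples, mask))  # ['We are not robots yet']
```

In a real pipeline, `detect` would be replaced by a call into an actual language detector; the mask and the filtering step stay the same.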

### Without Luga:

Download the model

```bash

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -O /tmp/lid.176.bin

```

Load and use

```python
import fasttext

PATH_TO_MODEL = '/tmp/lid.176.bin'
fmodel = fasttext.load_model(PATH_TO_MODEL)
fmodel.predict(["the world has ended yesterday"])
# ([['__label__en']], [array([0.98046654], dtype=float32)])
```
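The raw result above is a pair of parallel lists: labels carrying a `__label__` prefix, and scores. As an illustration of the kind of unpacking a wrapper has to do, here is a small sketch that turns that shape into name/score pairs. `Detection` and `parse_predictions` are hypothetical names, the sample literal uses plain floats instead of NumPy arrays so the snippet has no dependencies, and rounding to two decimals is an assumption made to match the `score=0.98` shown in the usage section.

```python
from typing import List, NamedTuple, Sequence, Tuple

LABEL_PREFIX = "__label__"


class Detection(NamedTuple):
    """Hypothetical result type, loosely modelled on the output shown in Usage."""
    name: str
    score: float


def parse_predictions(raw: Tuple[Sequence[Sequence[str]], Sequence[Sequence[float]]]) -> List[Detection]:
    """Convert a (labels, scores) pair into one Detection per input text."""
    labels, scores = raw
    results = []
    for text_labels, text_scores in zip(labels, scores):
        top_label = text_labels[0]  # the top-ranked label for this text
        if top_label.startswith(LABEL_PREFIX):
            top_label = top_label[len(LABEL_PREFIX):]
        results.append(Detection(name=top_label, score=round(float(text_scores[0]), 2)))
    return results


# Same shape as the fastText output above, with plain floats.
raw = ([["__label__en"]], [[0.98046654]])
print(parse_predictions(raw))
# [Detection(name='en', score=0.98)]
```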

### Dev:

```bash
poetry run pre-commit install
```

## Release Flow

```bash
# assumes git push is completed
git tag -l  # lists tags
git tag v*.*.*  # Major.Minor.Fix
git push origin tag v*.*.*

# to delete tag:
git tag -d v*.*.* && git push origin tag -d v*.*.*

# update pyproject.toml and __init__.py to reflect the new version
```

#### TODO:

- [X] refactor artifacts.py

- [X] auto checkers with pre-commit | invoke

- [X] write more tests

- [X] write github actions

- [ ] create an intelligent data checker (a fast List[str] check, and what to do with non-string values)

- [ ] make it faster with Cython

- [ ] get NDArray typing correctly

- [ ] fix `artifacts.py` line 111 cast to List[str] that causes issues

- [ ] remove nptyping when more packages move to numpy > 1.21
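The open "data checker" item could take several shapes. One possible direction, sketched here purely as an assumption and not as the project's plan, is to normalize the input before detection: replace non-string items with a placeholder so positions stay aligned with the original list, and collapse newlines, since fastText's `predict` processes one line at a time and rejects text containing `\n`. `clean_texts` is a hypothetical name.

```python
from typing import Any, Iterable, List


def clean_texts(items: Iterable[Any], placeholder: str = "") -> List[str]:
    """Return a list of single-line strings, one per input item.

    Non-string items (None, numbers, ...) become `placeholder`, so the output
    keeps the same length and order as the input and any mask built from it
    still lines up with the original data.
    """
    cleaned = []
    for item in items:
        if isinstance(item, str):
            # split() with no arguments breaks on any whitespace, including
            # newlines, so joining with spaces yields a single trimmed line.
            cleaned.append(" ".join(item.split()))
        else:
            cleaned.append(placeholder)
    return cleaned


print(clean_texts(["hello\nworld", None, 3, "  ok  "]))
# ['hello world', '', '', 'ok']
```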
