https://github.com/proycon/lingua-cli

Very small simple command-line interface for language detection using lingua-rs

[![Crate](https://img.shields.io/crates/v/lingua-cli.svg)](https://crates.io/crates/lingua-cli)
[![GitHub release](https://img.shields.io/github/release/proycon/lingua-cli.svg)](https://github.com/proycon/lingua-cli/releases/)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
![Technology Readiness Level 7/9 - Release Candidate - Technology ready enough and in initial use by end-users in intended scholarly environments. Further validation in progress.](https://w3id.org/research-technology-readiness-levels/Level7ReleaseCandidate.svg)

# Lingua-cli

This is a small command-line tool for language detection. It is a simple
wrapper around the [lingua-rs](https://github.com/pemistahl/lingua-rs/) library
for Rust; see that project for extensive documentation. A distinguishing feature
of lingua-rs is that it works better on short texts than many other libraries.

## Installation

Ensure you have Rust's package manager `cargo`, then download, compile and install `lingua-cli` in one go as follows:

``$ cargo install lingua-cli``

## Usage

Pass text as a parameter:

``$ lingua-cli bonjour à tous``

Pass text via standard input:

``$ echo "bonjour à tous" | lingua-cli``

Constrain the languages you want to detect using `-l` with a comma-separated
list of ISO 639-1 language codes. Constraining the list improves accuracy. Pass
`-L` to see a list of supported languages.

``$ echo "bonjour à tous" | lingua-cli -l "fr,de,es,nl,en"``

To classify input line-by-line, pass ``-n``.

``$ echo -e "bonjour à tous\nhola a todos\nhallo allemaal" | lingua-cli -n -l "fr,de,es,nl,en"``

```
fr 0.9069164472389637 bonjour à tous
es 0.918273871035807 hola a todos
nl 0.988293648761749 hallo allemaal
```

Output is TSV and consists of an ISO 639-1 language code, a confidence score and, in line-by-line mode, a copy of the line.
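If you want to post-process this output in a script, the following Python sketch shows one way to parse it. It is not part of `lingua-cli`; the names `Detection` and `parse_detections` are made up for illustration, and it assumes the three fields are separated by tab characters, as the TSV description suggests. The sample data is copied from the example output above.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """One line of line-by-line output (hypothetical helper type)."""
    lang: str          # ISO 639-1 code
    confidence: float  # confidence score
    text: str          # copy of the input line


def parse_detections(tsv_text: str) -> list[Detection]:
    """Parse tab-separated output into Detection records."""
    detections = []
    for line in tsv_text.split("\n"):
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        # Split on the first two tabs only, so tabs in the text are kept.
        lang, confidence, text = line.split("\t", 2)
        detections.append(Detection(lang, float(confidence), text))
    return detections


# Sample output from the example above, assumed to be tab-separated.
SAMPLE = (
    "fr\t0.9069164472389637\tbonjour à tous\n"
    "es\t0.918273871035807\thola a todos\n"
    "nl\t0.988293648761749\thallo allemaal\n"
)

detections = parse_detections(SAMPLE)

# Keep only detections above an arbitrary, illustrative threshold.
confident = [d for d in detections if d.confidence >= 0.95]

for d in confident:
    print(d.lang, d.text)
```

With the sample above, only the Dutch line passes the 0.95 threshold. The threshold itself is arbitrary; a suitable value depends on your data and on how many languages you allow with `-l`.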

You can also classify mixed-language text using the ``--multi`` option. This will then output UTF-8 byte offsets for each detected segment:

```
$ lingua-cli --multi -l fr,de,en < /tmp/test.txt
0 23 fr Parlez-vous français?
23 73 de Ich spreche ein bisschen spreche Französisch ja.
73 110 en A little bit is better than nothing.
```
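Because the offsets count UTF-8 bytes rather than characters, slice the encoded bytes when mapping them back onto the input. The Python sketch below illustrates this under two assumptions: the content of `/tmp/test.txt` is not shown above, so `text` is a reconstruction that happens to match the offsets in the example, and the end offset is treated as exclusive.

```python
# Assumed content of /tmp/test.txt, reconstructed from the example output
# (including a trailing newline); the real file is not shown.
text = (
    "Parlez-vous français? "
    "Ich spreche ein bisschen spreche Französisch ja. "
    "A little bit is better than nothing.\n"
)

# (begin, end, language) triples copied from the example output above.
spans = [(0, 23, "fr"), (23, 73, "de"), (73, 110, "en")]

# Offsets refer to bytes of the UTF-8 encoding, so slice the bytes first
# and decode each slice afterwards.
data = text.encode("utf-8")
segments = [(lang, data[begin:end].decode("utf-8")) for begin, end, lang in spans]

for lang, segment in segments:
    print(lang, repr(segment))

# Slicing the str directly counts characters instead of bytes and drifts
# as soon as a multi-byte character such as "ç" or "ö" appears.
wrong = text[0:23]
```

In this reconstruction, `text[0:23]` already spills one character into the German sentence, because `ç` occupies two bytes but only one character.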