The most accurate natural language detection library for Python, suitable for short text and mixed-language text
- Host: GitHub
- URL: https://github.com/pemistahl/lingua-py
- Owner: pemistahl
- License: apache-2.0
- Created: 2021-07-13T09:52:34.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-10-31T11:02:56.000Z (2 months ago)
- Last Synced: 2024-12-27T01:02:33.804Z (13 days ago)
- Topics: language-classification, language-detection, language-identification, language-recognition, natural-language-processing, nlp, python-library
- Language: Python
- Homepage:
- Size: 283 MB
- Stars: 1,197
- Watchers: 10
- Forks: 45
- Open Issues: 25
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
![lingua](https://raw.githubusercontent.com/pemistahl/lingua-py/main/images/logo.png)
[![build status](https://github.com/pemistahl/lingua-rs/actions/workflows/python-build.yml/badge.svg)](https://github.com/pemistahl/lingua-rs/actions/workflows/python-build.yml)
[![codecov](https://codecov.io/gh/pemistahl/lingua-rs/branch/main/graph/badge.svg)](https://codecov.io/gh/pemistahl/lingua-rs)
[![supported languages](https://img.shields.io/badge/supported%20languages-75-green.svg)](#3-which-languages-are-supported)
![supported Python versions](https://img.shields.io/badge/Python-%3E%3D%203.8-blue)
[![pypi](https://img.shields.io/badge/PYPI-v2.0.2-blue)](https://pypi.org/project/lingua-language-detector)
[![license](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
## 1. What does this library do?
Its task is simple: It tells you which language some text is written in.
This is very useful as a preprocessing step for linguistic data
in natural language processing applications such as text classification and
spell checking. Other use cases, for instance, might include routing e-mails
to the right geographically located customer service department, based on the
e-mails' languages.

## 2. Why does this library exist?

Language detection is often done as part of large machine learning frameworks
or natural language processing applications. In cases where you don't need
the full-fledged functionality of those systems or don't want to learn the
ropes of those, a small flexible library comes in handy.

Python is widely used in natural language processing, so there are a couple
of comprehensive open source libraries for this task, such as Google's
[*CLD 2*](https://github.com/CLD2Owners/cld2) and
[*CLD 3*](https://github.com/google/cld3),
[*Langid*](https://github.com/saffsd/langid.py),
[*FastText*](https://fasttext.cc/docs/en/language-identification.html),
[*FastSpell*](https://github.com/mbanon/fastspell),
[*Simplemma*](https://github.com/adbar/simplemma) and
[*Langdetect*](https://github.com/Mimino666/langdetect).
Unfortunately, most of them have two major drawbacks:

1. Detection only works with quite lengthy text fragments. For very short
   text snippets such as Twitter messages, they do not provide adequate results.
2. The more languages take part in the decision process, the less accurate
   the detection results are.

*Lingua* aims at eliminating these problems. She needs almost no
configuration and yields pretty accurate results on both long and short text,
even on single words and phrases. She draws on both rule-based and statistical
methods but does not use any dictionaries of words. She does not need a
connection to any external API or service either. Once the library has been
downloaded, it can be used completely offline.

## 3. A short history of this library

This library started as a pure Python implementation. Python's quick prototyping
capabilities made an important contribution to its improvements. Unfortunately,
there was always a tradeoff between performance and memory consumption. At first,
*Lingua's* language models were stored in dictionaries during runtime. This led
to quick performance at the cost of large memory consumption (more than 3 GB).
Because of that, the language models were then stored in NumPy arrays instead of
dictionaries. Memory consumption fell to approximately 800 MB, but CPU
performance dropped significantly. Neither approach was satisfactory.

Starting from version 2.0.0, the pure Python implementation was replaced with
compiled Python bindings to the native
[Rust implementation](https://github.com/pemistahl/lingua-rs) of *Lingua*.
This decision has led to both quick performance and a small memory
footprint of less than 1 GB. The pure Python implementation is still available
in a [separate branch](https://github.com/pemistahl/lingua-py/tree/pure-python-impl)
in this repository and will be kept up-to-date in subsequent 1.* releases.
Both 1.* and 2.* versions will remain available on the Python Package Index (PyPI).

## 4. Which languages are supported?

Compared to other language detection libraries, *Lingua's* focus is on
*quality over quantity*, that is, getting detection right for a small set of
languages first before adding new ones. Currently, the following 75 languages
are supported:

- A
  - Afrikaans
  - Albanian
  - Arabic
  - Armenian
  - Azerbaijani
- B
  - Basque
  - Belarusian
  - Bengali
  - Norwegian Bokmal
  - Bosnian
  - Bulgarian
- C
  - Catalan
  - Chinese
  - Croatian
  - Czech
- D
  - Danish
  - Dutch
- E
  - English
  - Esperanto
  - Estonian
- F
  - Finnish
  - French
- G
  - Ganda
  - Georgian
  - German
  - Greek
  - Gujarati
- H
  - Hebrew
  - Hindi
  - Hungarian
- I
  - Icelandic
  - Indonesian
  - Irish
  - Italian
- J
  - Japanese
- K
  - Kazakh
  - Korean
- L
  - Latin
  - Latvian
  - Lithuanian
- M
  - Macedonian
  - Malay
  - Maori
  - Marathi
  - Mongolian
- N
  - Norwegian Nynorsk
- P
  - Persian
  - Polish
  - Portuguese
  - Punjabi
- R
  - Romanian
  - Russian
- S
  - Serbian
  - Shona
  - Slovak
  - Slovene
  - Somali
  - Sotho
  - Spanish
  - Swahili
  - Swedish
- T
  - Tagalog
  - Tamil
  - Telugu
  - Thai
  - Tsonga
  - Tswana
  - Turkish
- U
  - Ukrainian
  - Urdu
- V
  - Vietnamese
- W
  - Welsh
- X
  - Xhosa
- Y
  - Yoruba
- Z
  - Zulu

## 5. How accurate is it?

*Lingua* is able to report accuracy statistics for some bundled test data
available for each supported language. The test data for each language is split
into three parts:

1. a list of single words with a minimum length of 5 characters
2. a list of word pairs with a minimum length of 10 characters
3. a list of complete grammatical sentences of various lengths

Both the language models and the test data have been created from separate
documents of the [Wortschatz corpora](https://wortschatz.uni-leipzig.de)
offered by Leipzig University, Germany. Data crawled from various news websites
have been used for training, each corpus comprising one million sentences.
For testing, corpora made of arbitrarily chosen websites have been used, each
comprising ten thousand sentences. From each test corpus, a random unsorted
subset of 1000 single words, 1000 word pairs and 1000 sentences has been
extracted, respectively.

Given the generated test data, I have compared the detection results of
*Lingua*, *FastText*, *FastSpell*, *Langdetect*, *Langid*, *Simplemma*, *CLD 2* and *CLD 3*
running over the data of *Lingua's* supported 75 languages. Languages that are
not supported by the other detectors are simply ignored for them during the
detection process.

Each of the following sections contains two plots. The bar plot shows the detailed accuracy
results for each supported language. The box plot illustrates the distributions of the
accuracy values for each classifier. The boxes themselves represent the areas which the
middle 50 % of data lie within. Within the colored boxes, the horizontal lines mark the
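median of the distributions.

### 5.1 Single word detection

[Plots: box plot and bar plot of single word detection accuracy]

### 5.2 Word pair detection

[Plots: box plot and bar plot of word pair detection accuracy]

### 5.3 Sentence detection

[Plots: box plot and bar plot of sentence detection accuracy]

### 5.4 Average detection

[Plots: box plot and bar plot of average detection accuracy]
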
### 5.5 Mean, median and standard deviation
The table below shows detailed statistics for each language and classifier
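including mean, median and standard deviation.

The columns form four blocks of ten classifiers each: Average (Avg), Single Words (Words),
Word Pairs (Pairs) and Sentences (Sent.). "high" and "low" denote *Lingua's* high and low
accuracy modes, "cons." and "aggr." denote FastSpell's conservative and aggressive modes.
Values are accuracies in percent.

| Language | Avg: Lingua (high) | Avg: Lingua (low) | Avg: Langdetect | Avg: FastText | Avg: FastSpell (cons.) | Avg: FastSpell (aggr.) | Avg: Langid | Avg: CLD3 | Avg: CLD2 | Avg: Simplemma | Words: Lingua (high) | Words: Lingua (low) | Words: Langdetect | Words: FastText | Words: FastSpell (cons.) | Words: FastSpell (aggr.) | Words: Langid | Words: CLD3 | Words: CLD2 | Words: Simplemma | Pairs: Lingua (high) | Pairs: Lingua (low) | Pairs: Langdetect | Pairs: FastText | Pairs: FastSpell (cons.) | Pairs: FastSpell (aggr.) | Pairs: Langid | Pairs: CLD3 | Pairs: CLD2 | Pairs: Simplemma | Sent.: Lingua (high) | Sent.: Lingua (low) | Sent.: Langdetect | Sent.: FastText | Sent.: FastSpell (cons.) | Sent.: FastSpell (aggr.) | Sent.: Langid | Sent.: CLD3 | Sent.: CLD2 | Sent.: Simplemma |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Afrikaans | 79 | 64 | 67 | 36 | 70 | 73 | 30 | 55 | 55 | - | 58 | 38 | 37 | 11 | 49 | 50 | 1 | 22 | 13 | - | 81 | 62 | 66 | 23 | 67 | 74 | 10 | 46 | 56 | - | 97 | 93 | 98 | 74 | 94 | 95 | 80 | 98 | 96 | - |
| Albanian | 88 | 80 | 79 | 66 | 66 | 66 | 65 | 55 | 65 | 20 | 69 | 54 | 53 | 35 | 35 | 35 | 33 | 18 | 18 | 21 | 95 | 86 | 84 | 66 | 66 | 66 | 63 | 48 | 77 | 17 | 100 | 99 | 99 | 98 | 98 | 98 | 98 | 98 | 99 | 23 |
| Arabic | 98 | 94 | 97 | 96 | 96 | 96 | 91 | 90 | 67 | - | 96 | 88 | 94 | 89 | 89 | 89 | 84 | 79 | 19 | - | 99 | 96 | 98 | 98 | 98 | 98 | 90 | 92 | 82 | - | 100 | 99 | 100 | 100 | 100 | 100 | 98 | 100 | 99 | - |
| Armenian | 100 | 100 | - | 100 | 100 | 100 | 94 | 99 | 100 | 22 | 100 | 100 | - | 100 | 100 | 100 | 83 | 100 | 100 | 36 | 100 | 100 | - | 100 | 100 | 100 | 99 | 100 | 100 | 14 | 100 | 100 | - | 100 | 100 | 100 | 100 | 97 | 100 | 14 |
| Azerbaijani | 90 | 82 | - | 78 | 69 | 85 | 68 | 81 | 72 | - | 77 | 71 | - | 57 | 43 | 67 | 36 | 62 | 34 | - | 92 | 78 | - | 80 | 69 | 90 | 69 | 82 | 82 | - | 99 | 96 | - | 98 | 94 | 100 | 98 | 99 | 99 | - |
| Basque | 84 | 75 | - | 71 | 71 | 71 | 52 | 62 | 61 | - | 71 | 56 | - | 44 | 44 | 44 | 18 | 33 | 23 | - | 87 | 76 | - | 70 | 70 | 70 | 52 | 62 | 69 | - | 93 | 92 | - | 100 | 100 | 100 | 86 | 92 | 91 | - |
| Belarusian | 97 | 92 | - | 85 | 92 | 95 | 85 | 84 | 76 | - | 92 | 80 | - | 69 | 81 | 87 | 69 | 67 | 42 | - | 99 | 95 | - | 88 | 94 | 98 | 87 | 86 | 87 | - | 100 | 100 | - | 98 | 99 | 100 | 99 | 100 | 99 | - |
| Bengali | 100 | 100 | 100 | 98 | 98 | 98 | 92 | 99 | 63 | - | 100 | 100 | 100 | 94 | 94 | 94 | 92 | 98 | 19 | - | 100 | 100 | 100 | 99 | 99 | 99 | 88 | 99 | 69 | - | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99 | 99 | - |
| Bokmal | 58 | 50 | - | - | 69 | 75 | 13 | - | - | 50 | 39 | 27 | - | - | 53 | 55 | 3 | - | - | 15 | 59 | 47 | - | - | 70 | 77 | 12 | - | - | 45 | 77 | 75 | - | - | 85 | 91 | 23 | - | - | 90 |
| Bosnian | 35 | 29 | - | 9 | 54 | 65 | 5 | 33 | 19 | - | 29 | 23 | - | 9 | 54 | 54 | 2 | 19 | 4 | - | 35 | 29 | - | 10 | 64 | 76 | 4 | 28 | 15 | - | 41 | 36 | - | 8 | 44 | 64 | 8 | 52 | 36 | - |
| Bulgarian | 87 | 78 | 72 | 78 | 89 | 92 | 67 | 70 | 66 | 68 | 70 | 56 | 51 | 56 | 80 | 83 | 46 | 45 | 32 | 44 | 91 | 81 | 68 | 81 | 88 | 95 | 62 | 66 | 72 | 67 | 99 | 96 | 96 | 99 | 98 | 99 | 93 | 98 | 93 | 91 |
| Catalan | 70 | 58 | 54 | 57 | 63 | 66 | 38 | 48 | 38 | 59 | 51 | 33 | 25 | 33 | 42 | 44 | 5 | 19 | 4 | 32 | 74 | 60 | 51 | 57 | 63 | 67 | 29 | 42 | 30 | 62 | 87 | 82 | 86 | 83 | 85 | 88 | 81 | 84 | 79 | 81 |
| Chinese | 100 | 100 | 64 | 71 | 71 | 71 | 96 | 92 | 33 | - | 100 | 100 | 39 | 46 | 46 | 46 | 90 | 92 | - | - | 100 | 100 | 56 | 68 | 68 | 68 | 97 | 83 | 2 | - | 100 | 100 | 97 | 100 | 100 | 100 | 100 | 100 | 98 | - |
| Croatian | 73 | 60 | 73 | 47 | 72 | 81 | 48 | 42 | 51 | - | 53 | 36 | 49 | 28 | 62 | 64 | 16 | 26 | 34 | - | 74 | 57 | 72 | 42 | 79 | 87 | 38 | 42 | 47 | - | 90 | 86 | 97 | 72 | 76 | 93 | 90 | 58 | 73 | - |
| Czech | 80 | 71 | 71 | 76 | 76 | 80 | 66 | 64 | 74 | 50 | 66 | 54 | 52 | 58 | 61 | 64 | 44 | 39 | 50 | 31 | 84 | 72 | 73 | 79 | 78 | 83 | 69 | 65 | 80 | 44 | 91 | 87 | 88 | 92 | 88 | 92 | 86 | 88 | 91 | 76 |
| Danish | 81 | 70 | 70 | 62 | 76 | 78 | 60 | 58 | 59 | 50 | 61 | 45 | 50 | 35 | 56 | 58 | 33 | 26 | 27 | 20 | 84 | 70 | 68 | 57 | 75 | 78 | 61 | 54 | 56 | 47 | 98 | 95 | 93 | 95 | 98 | 99 | 86 | 95 | 94 | 83 |
| Dutch | 77 | 64 | 58 | 78 | 71 | 78 | 64 | 58 | 47 | 58 | 55 | 36 | 27 | 55 | 46 | 55 | 34 | 29 | 11 | 32 | 81 | 61 | 49 | 81 | 70 | 81 | 61 | 47 | 42 | 50 | 96 | 94 | 98 | 100 | 97 | 99 | 98 | 97 | 90 | 92 |
| English | 81 | 63 | 60 | 96 | 96 | 96 | 85 | 54 | 56 | 65 | 55 | 29 | 22 | 90 | 90 | 90 | 84 | 22 | 12 | 27 | 89 | 62 | 58 | 98 | 98 | 98 | 71 | 44 | 55 | 69 | 99 | 97 | 99 | 100 | 100 | 100 | 99 | 97 | 100 | 98 |
| Esperanto | 84 | 66 | - | 76 | 76 | 76 | 44 | 57 | 50 | - | 67 | 44 | - | 51 | 51 | 51 | 5 | 22 | 7 | - | 85 | 61 | - | 79 | 79 | 79 | 30 | 51 | 46 | - | 98 | 93 | - | 100 | 100 | 100 | 96 | 98 | 98 | - |
| Estonian | 92 | 83 | 83 | 73 | 73 | 73 | 67 | 70 | 65 | 71 | 80 | 62 | 62 | 50 | 50 | 50 | 37 | 41 | 24 | 44 | 96 | 88 | 87 | 73 | 73 | 73 | 67 | 69 | 73 | 70 | 100 | 99 | 100 | 96 | 97 | 97 | 98 | 99 | 99 | 97 |
| Finnish | 96 | 91 | 93 | 92 | 93 | 93 | 83 | 80 | 77 | 76 | 90 | 77 | 84 | 82 | 82 | 82 | 62 | 58 | 44 | 47 | 98 | 95 | 95 | 96 | 96 | 96 | 88 | 84 | 89 | 81 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 98 | 100 |
| French | 89 | 77 | 75 | 83 | 83 | 83 | 71 | 55 | 46 | 65 | 74 | 52 | 48 | 62 | 62 | 62 | 42 | 22 | 12 | 34 | 94 | 83 | 78 | 86 | 86 | 86 | 74 | 49 | 48 | 68 | 99 | 98 | 99 | 99 | 99 | 99 | 98 | 94 | 80 | 94 |
| Ganda | 91 | 84 | - | - | - | - | - | - | 61 | - | 79 | 65 | - | - | - | - | - | - | 23 | - | 95 | 87 | - | - | - | - | - | - | 62 | - | 100 | 100 | - | - | - | - | - | - | 99 | - |
| Georgian | 100 | 100 | - | 99 | 99 | 99 | 99 | 98 | 100 | 4 | 100 | 100 | - | 97 | 97 | 97 | 97 | 99 | 100 | 11 | 100 | 100 | - | 99 | 99 | 99 | 100 | 100 | 100 | 2 | 100 | 100 | - | 100 | 100 | 100 | 100 | 96 | 100 | 0 |
| German | 89 | 80 | 73 | 89 | 89 | 89 | 81 | 66 | 64 | 72 | 74 | 57 | 49 | 76 | 76 | 76 | 61 | 40 | 27 | 38 | 94 | 84 | 70 | 93 | 93 | 93 | 81 | 62 | 66 | 78 | 100 | 99 | 100 | 100 | 100 | 100 | 100 | 98 | 98 | 99 |
| Greek | 100 | 100 | 100 | 99 | 99 | 99 | 100 | 100 | 100 | 75 | 100 | 100 | 100 | 98 | 98 | 98 | 100 | 100 | 100 | 74 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 60 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 92 |
| Gujarati | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | 100 | 100 | 100 | 99 | 99 | 99 | 100 | 99 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - |
| Hebrew | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | - | - | 100 | 100 | 100 | 99 | 99 | 99 | 100 | - | - | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | - | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | - | - |
| Hindi | 73 | 33 | 68 | 87 | 72 | 88 | 60 | 58 | 77 | 5 | 61 | 11 | 44 | 74 | 53 | 77 | 41 | 34 | 56 | 2 | 64 | 20 | 60 | 88 | 65 | 89 | 47 | 45 | 76 | 4 | 94 | 67 | 99 | 99 | 96 | 99 | 92 | 95 | 99 | 11 |
| Hungarian | 95 | 90 | 88 | 92 | 92 | 92 | 83 | 76 | 75 | 72 | 87 | 77 | 73 | 80 | 80 | 80 | 64 | 53 | 41 | 58 | 98 | 94 | 91 | 96 | 96 | 96 | 86 | 76 | 85 | 62 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | 95 |
| Icelandic | 93 | 88 | - | 65 | 70 | 71 | 66 | 71 | 66 | 64 | 83 | 72 | - | 39 | 49 | 50 | 33 | 42 | 26 | 43 | 97 | 92 | - | 57 | 64 | 65 | 66 | 70 | 73 | 59 | 100 | 99 | - | 98 | 99 | 99 | 99 | 99 | 99 | 90 |
| Indonesian | 61 | 47 | 80 | 69 | 68 | 77 | 51 | 46 | 62 | 26 | 39 | 25 | 56 | 43 | 52 | 56 | 16 | 26 | 36 | 20 | 61 | 46 | 84 | 68 | 73 | 82 | 54 | 45 | 63 | 26 | 83 | 71 | 100 | 95 | 78 | 93 | 82 | 66 | 88 | 32 |
| Irish | 91 | 85 | - | 60 | 66 | 69 | 63 | 67 | 66 | 77 | 82 | 70 | - | 35 | 41 | 47 | 28 | 42 | 29 | 66 | 94 | 90 | - | 57 | 66 | 68 | 64 | 66 | 78 | 76 | 96 | 95 | - | 89 | 93 | 93 | 97 | 94 | 92 | 90 |
| Italian | 87 | 71 | 77 | 89 | 89 | 89 | 66 | 62 | 44 | 58 | 69 | 42 | 50 | 74 | 74 | 74 | 28 | 31 | 7 | 24 | 92 | 74 | 81 | 92 | 92 | 92 | 70 | 57 | 32 | 57 | 100 | 98 | 99 | 100 | 100 | 100 | 100 | 98 | 93 | 94 |
| Japanese | 100 | 100 | 100 | 87 | 87 | 87 | 86 | 98 | 33 | - | 100 | 100 | 99 | 72 | 72 | 72 | 61 | 97 | - | - | 100 | 100 | 100 | 89 | 89 | 89 | 96 | 96 | - | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - |
| Kazakh | 96 | 94 | - | 88 | 76 | 91 | 80 | 82 | 77 | - | 89 | 88 | - | 72 | 52 | 79 | 67 | 62 | 43 | - | 98 | 94 | - | 90 | 80 | 94 | 78 | 83 | 88 | - | 100 | 100 | - | 100 | 96 | 100 | 96 | 99 | 99 | - |
| Korean | 100 | 100 | 100 | 99 | 99 | 99 | 100 | 99 | 100 | - | 100 | 100 | 100 | 98 | 98 | 98 | 100 | 100 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | - |
| Latin | 87 | 73 | - | 50 | 50 | 50 | 21 | 62 | 46 | 63 | 72 | 49 | - | 24 | 24 | 24 | - | 44 | 9 | 33 | 93 | 76 | - | 41 | 41 | 41 | 2 | 58 | 42 | 63 | 97 | 94 | - | 85 | 86 | 86 | 61 | 83 | 88 | 93 |
| Latvian | 93 | 87 | 89 | 82 | 82 | 84 | 83 | 75 | 72 | 45 | 85 | 75 | 76 | 65 | 66 | 69 | 64 | 51 | 33 | 36 | 97 | 90 | 92 | 83 | 84 | 86 | 86 | 77 | 84 | 33 | 99 | 97 | 99 | 97 | 97 | 98 | 98 | 98 | 98 | 65 |
| Lithuanian | 95 | 87 | 87 | 81 | 81 | 81 | 80 | 72 | 70 | 66 | 86 | 76 | 71 | 61 | 61 | 61 | 58 | 42 | 30 | 50 | 98 | 89 | 91 | 83 | 83 | 83 | 85 | 75 | 82 | 62 | 100 | 98 | 100 | 99 | 99 | 99 | 99 | 99 | 99 | 88 |
| Macedonian | 84 | 72 | 86 | 74 | 86 | 93 | 51 | 60 | 60 | 13 | 66 | 52 | 71 | 51 | 77 | 83 | 15 | 30 | 27 | 12 | 86 | 70 | 88 | 72 | 83 | 96 | 44 | 54 | 70 | 11 | 99 | 95 | 100 | 100 | 97 | 99 | 94 | 97 | 84 | 15 |
| Malay | 31 | 31 | - | 15 | 39 | 52 | 11 | 22 | 18 | 13 | 26 | 22 | - | 14 | 36 | 38 | 2 | 11 | 9 | 3 | 38 | 36 | - | 19 | 52 | 64 | 9 | 22 | 22 | 10 | 28 | 35 | - | 12 | 29 | 54 | 22 | 34 | 23 | 26 |
| Maori | 92 | 83 | - | - | - | - | - | 52 | 61 | - | 84 | 64 | - | - | - | - | - | 22 | 12 | - | 92 | 88 | - | - | - | - | - | 43 | 72 | - | 99 | 98 | - | - | - | - | - | 91 | 98 | - |
| Marathi | 85 | 39 | 88 | 80 | 8 | 75 | 80 | 84 | 83 | - | 74 | 16 | 77 | 61 | 9 | 61 | 70 | 69 | 65 | - | 85 | 30 | 89 | 81 | 15 | 69 | 79 | 84 | 86 | - | 96 | 72 | 98 | 99 | 1 | 95 | 91 | 98 | 99 | - |
| Mongolian | 97 | 95 | - | 81 | 85 | 89 | 86 | 83 | 78 | - | 92 | 88 | - | 59 | 66 | 72 | 68 | 63 | 43 | - | 99 | 98 | - | 86 | 91 | 94 | 90 | 87 | 92 | - | 99 | 99 | - | 98 | 99 | 100 | 99 | 99 | 100 | - |
| Nynorsk | 66 | 52 | - | 29 | 63 | 70 | 32 | - | 54 | 24 | 41 | 25 | - | 8 | 42 | 43 | 5 | - | 18 | 6 | 66 | 49 | - | 18 | 58 | 70 | 16 | - | 50 | 22 | 91 | 81 | - | 61 | 87 | 96 | 75 | - | 93 | 45 |
| Persian | 90 | 80 | 81 | 90 | 79 | 92 | 92 | 76 | 61 | 12 | 78 | 62 | 64 | 79 | 57 | 84 | 83 | 57 | 13 | 12 | 94 | 80 | 80 | 92 | 81 | 94 | 94 | 70 | 72 | 5 | 100 | 98 | 100 | 100 | 98 | 99 | 100 | 99 | 99 | 18 |
| Polish | 95 | 90 | 89 | 92 | 92 | 92 | 89 | 77 | 75 | 86 | 85 | 77 | 74 | 80 | 80 | 80 | 73 | 51 | 38 | 72 | 98 | 93 | 93 | 97 | 97 | 97 | 93 | 80 | 87 | 87 | 100 | 99 | 100 | 100 | 100 | 100 | 100 | 99 | 99 | 99 |
| Portuguese | 81 | 69 | 60 | 73 | 81 | 84 | 54 | 53 | 54 | 61 | 59 | 42 | 29 | 47 | 66 | 67 | 19 | 21 | 20 | 26 | 85 | 70 | 54 | 71 | 81 | 85 | 44 | 40 | 48 | 60 | 99 | 95 | 98 | 99 | 96 | 99 | 98 | 97 | 94 | 97 |
| Punjabi | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | 100 | 100 | 100 | 99 | 99 | 99 | 100 | 99 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - |
| Romanian | 87 | 72 | 77 | 64 | 64 | 64 | 61 | 53 | 54 | 57 | 69 | 49 | 56 | 38 | 38 | 38 | 31 | 24 | 11 | 34 | 92 | 74 | 79 | 60 | 60 | 60 | 60 | 48 | 53 | 51 | 99 | 94 | 97 | 95 | 95 | 95 | 92 | 88 | 96 | 86 |
| Russian | 90 | 78 | 84 | 94 | 94 | 97 | 75 | 71 | 60 | 66 | 76 | 59 | 70 | 86 | 88 | 92 | 60 | 48 | 26 | 54 | 95 | 84 | 87 | 98 | 97 | 99 | 75 | 72 | 68 | 62 | 98 | 92 | 96 | 100 | 98 | 99 | 91 | 93 | 87 | 83 |
| Serbian | 88 | 78 | - | 76 | 53 | 76 | 64 | 78 | 69 | - | 74 | 62 | - | 54 | 47 | 54 | 39 | 63 | 29 | - | 90 | 80 | - | 76 | 58 | 76 | 63 | 75 | 78 | - | 99 | 92 | - | 98 | 52 | 98 | 89 | 95 | 99 | - |
| Shona | 91 | 81 | - | - | - | - | - | 76 | 65 | - | 78 | 56 | - | - | - | - | - | 51 | 24 | - | 96 | 86 | - | - | - | - | - | 79 | 71 | - | 100 | 100 | - | - | - | - | - | 99 | 99 | - |
| Slovak | 84 | 75 | 74 | 65 | 80 | 83 | 68 | 63 | 71 | 68 | 64 | 49 | 50 | 41 | 63 | 64 | 40 | 32 | 38 | 45 | 90 | 78 | 75 | 62 | 81 | 86 | 66 | 61 | 76 | 66 | 99 | 97 | 98 | 91 | 97 | 98 | 97 | 96 | 99 | 93 |
| Slovene | 82 | 67 | 73 | 59 | 75 | 77 | 63 | 63 | 48 | 72 | 61 | 39 | 48 | 32 | 56 | 57 | 33 | 29 | 8 | 48 | 87 | 68 | 72 | 54 | 74 | 78 | 61 | 60 | 42 | 72 | 99 | 93 | 98 | 90 | 96 | 97 | 95 | 99 | 92 | 96 |
| Somali | 92 | 85 | 90 | 24 | 51 | 52 | - | 69 | 70 | - | 82 | 64 | 76 | 4 | 18 | 20 | - | 38 | 27 | - | 96 | 90 | 95 | 15 | 46 | 48 | - | 70 | 83 | - | 100 | 100 | 100 | 52 | 89 | 89 | - | 100 | 99 | - |
| Sotho | 86 | 72 | - | - | - | - | - | 49 | 54 | - | 67 | 43 | - | - | - | - | - | 15 | 13 | - | 90 | 75 | - | - | - | - | - | 33 | 54 | - | 100 | 97 | - | - | - | - | - | 98 | 95 | - |
| Spanish | 70 | 56 | 56 | 74 | 64 | 73 | 65 | 48 | 43 | 50 | 44 | 26 | 25 | 51 | 48 | 52 | 37 | 16 | 12 | 16 | 69 | 49 | 46 | 72 | 60 | 74 | 59 | 32 | 34 | 41 | 97 | 94 | 98 | 100 | 85 | 94 | 98 | 96 | 85 | 92 |
| Swahili | 81 | 70 | 73 | 41 | 41 | 41 | 42 | 57 | 57 | 46 | 60 | 43 | 47 | 7 | 7 | 7 | 3 | 25 | 16 | 26 | 84 | 68 | 74 | 24 | 24 | 24 | 24 | 49 | 59 | 41 | 98 | 97 | 99 | 92 | 92 | 92 | 98 | 98 | 97 | 72 |
| Swedish | 84 | 72 | 68 | 76 | 79 | 81 | 65 | 61 | 53 | 59 | 64 | 46 | 40 | 51 | 57 | 59 | 35 | 30 | 14 | 29 | 88 | 76 | 67 | 78 | 82 | 85 | 63 | 56 | 52 | 62 | 99 | 94 | 96 | 98 | 98 | 99 | 96 | 96 | 93 | 87 |
| Tagalog | 78 | 66 | 76 | 45 | 46 | 46 | 42 | - | 50 | 12 | 52 | 36 | 51 | 11 | 11 | 11 | 2 | - | 9 | 9 | 83 | 67 | 78 | 28 | 28 | 28 | 26 | - | 44 | 11 | 98 | 96 | 99 | 98 | 98 | 98 | 98 | - | 95 | 15 |
| Tamil | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | - |
| Telugu | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | - |
| Thai | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | - | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | - |
| Tsonga | 84 | 72 | - | - | - | - | - | - | 61 | - | 66 | 46 | - | - | - | - | - | - | 19 | - | 89 | 73 | - | - | - | - | - | - | 68 | - | 98 | 97 | - | - | - | - | - | - | 97 | - |
| Tswana | 84 | 71 | - | - | - | - | - | - | 56 | - | 65 | 44 | - | - | - | - | - | - | 17 | - | 88 | 73 | - | - | - | - | - | - | 57 | - | 99 | 96 | - | - | - | - | - | - | 94 | - |
| Turkish | 94 | 87 | 82 | 86 | 86 | 86 | 67 | 69 | 66 | 76 | 84 | 71 | 63 | 70 | 70 | 70 | 50 | 41 | 30 | 55 | 98 | 91 | 84 | 88 | 88 | 88 | 67 | 70 | 71 | 78 | 100 | 100 | 100 | 100 | 100 | 100 | 84 | 97 | 97 | 96 |
| Ukrainian | 92 | 86 | 83 | 91 | 95 | 98 | 76 | 81 | 77 | 78 | 84 | 75 | 66 | 78 | 90 | 94 | 54 | 62 | 46 | 62 | 97 | 92 | 84 | 94 | 95 | 98 | 77 | 83 | 88 | 75 | 95 | 93 | 98 | 100 | 100 | 100 | 96 | 98 | 99 | 97 |
| Urdu | 90 | 79 | 83 | 63 | 75 | 80 | 58 | 61 | 61 | - | 80 | 65 | 67 | 40 | 59 | 68 | 30 | 39 | 8 | - | 94 | 78 | 83 | 50 | 68 | 74 | 46 | 53 | 75 | - | 96 | 94 | 97 | 99 | 99 | 99 | 99 | 92 | 99 | - |
| Vietnamese | 91 | 87 | 93 | 89 | 89 | 89 | 86 | 66 | 63 | - | 79 | 76 | 81 | 71 | 71 | 71 | 65 | 26 | - | - | 94 | 87 | 98 | 97 | 97 | 97 | 93 | 74 | 90 | - | 99 | 98 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | - |
| Welsh | 91 | 82 | 85 | 64 | 69 | 72 | 49 | 69 | 72 | 69 | 78 | 61 | 69 | 35 | 41 | 46 | 11 | 43 | 34 | 58 | 96 | 87 | 88 | 61 | 71 | 74 | 39 | 66 | 85 | 60 | 99 | 99 | 99 | 96 | 96 | 97 | 95 | 98 | 98 | 90 |
| Xhosa | 82 | 69 | - | - | - | - | 53 | 66 | 71 | - | 64 | 45 | - | - | - | - | 13 | 40 | 45 | - | 85 | 67 | - | - | - | - | 49 | 65 | 71 | - | 98 | 94 | - | - | - | - | 96 | 92 | 97 | - |
| Yoruba | 74 | 62 | - | 8 | 8 | 8 | - | 15 | 37 | - | 50 | 33 | - | 1 | 1 | 1 | - | 5 | 1 | - | 77 | 61 | - | 1 | 1 | 1 | - | 11 | 22 | - | 96 | 92 | - | 21 | 22 | 22 | - | 28 | 88 | - |
| Zulu | 81 | 70 | - | - | - | - | 6 | 63 | 54 | - | 62 | 45 | - | - | - | - | 0 | 35 | 18 | - | 83 | 72 | - | - | - | - | 6 | 63 | 51 | - | 97 | 94 | - | - | - | - | 11 | 92 | 93 | - |
| Mean | 86 | 78 | 82 | 74 | 77 | 81 | 68 | 69 | 65 | 52 | 74 | 61 | 65 | 58 | 62 | 66 | 48 | 48 | 34 | 34 | 89 | 78 | 82 | 74 | 77 | 82 | 65 | 67 | 68 | 50 | 96 | 93 | 98 | 92 | 91 | 96 | 90 | 93 | 94 | 73 |
| Median | 89.0 | 79.0 | 82.5 | 78.0 | 79.0 | 83.0 | 67.0 | 68.0 | 63.0 | 59.0 | 74.0 | 57.0 | 63.5 | 57.5 | 61.0 | 67.0 | 41.5 | 41.0 | 26.5 | 33.0 | 94.0 | 81.0 | 84.0 | 81.0 | 81.0 | 86.0 | 67.0 | 66.0 | 71.5 | 60.0 | 99.0 | 97.0 | 99.0 | 99.0 | 98.0 | 99.0 | 98.0 | 98.0 | 98.0 | 90.0 |
| Standard Deviation | 13.12 | 17.34 | 13.43 | 23.07 | 19.9 | 17.0 | 24.61 | 19.04 | 18.57 | 23.46 | 18.48 | 25.01 | 23.72 | 28.52 | 25.31 | 24.22 | 32.33 | 27.86 | 28.74 | 18.94 | 13.14 | 18.95 | 15.64 | 26.45 | 21.67 | 19.67 | 28.5 | 21.83 | 22.7 | 24.48 | 11.05 | 11.91 | 2.78 | 19.46 | 19.1 | 11.78 | 20.21 | 13.95 | 12.25 | 31.91 |
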
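
As a small worked example of the three summary statistics reported in the last rows, the
following sketch computes them with Python's `statistics` module for five values taken from
the table: the average accuracy of *Lingua's* high accuracy mode for Afrikaans, Albanian,
Arabic, Armenian and Azerbaijani. The variable names are made up for illustration, and
whether the reported standard deviation is the population or the sample variant is not
stated here, so the use of `pstdev` is an assumption.

```python
import statistics

# Average accuracy (in percent) of Lingua's high accuracy mode for
# Afrikaans, Albanian, Arabic, Armenian and Azerbaijani, copied from the table.
accuracies = [79, 88, 98, 100, 90]

mean_value = statistics.mean(accuracies)
median_value = statistics.median(accuracies)
# Assumption: population standard deviation. The sample variant would be
# statistics.stdev(accuracies) and yields a slightly larger number.
std_value = statistics.pstdev(accuracies)

print(f"mean={mean_value}, median={median_value}, std={std_value:.2f}")
```

The figures in the table are computed over all languages a classifier supports, so a
five-value subset like this one only illustrates the arithmetic and does not reproduce them.
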
## 6. How fast is it?
The accuracy reporter script measures the time each language detector needs
to classify 3000 input texts for each of the supported 75 languages. The results
below have been produced on an iMac 3.6 GHz 8-Core Intel Core i9 with 40 GB RAM.

Lingua in [multi-threaded mode](https://github.com/pemistahl/lingua-py#117-single-threaded-versus-multi-threaded-language-detection)
is one of the fastest algorithms in this comparison. CLD 2, CLD 3 and FastText
are similarly fast as they have been implemented in C or C++. Pure Python libraries
such as Simplemma, Langid or Langdetect are significantly slower.

| Detector                                     | Time             |
|----------------------------------------------|-----------------:|
| Lingua (low accuracy mode, multi-threaded) | 3.00 sec |
| Lingua (high accuracy mode, multi-threaded) | 7.97 sec |
| CLD 2 | 8.65 sec |
| FastText | 10.50 sec |
| CLD 3 | 16.77 sec |
| Lingua (low accuracy mode, single-threaded) | 20.46 sec |
| Lingua (high accuracy mode, single-threaded) | 51.88 sec |
| FastSpell (aggressive mode) | 51.92 sec |
| FastSpell (conservative mode) | 52.32 sec |
| Simplemma | 2 min 36.44 sec |
| Langid | 3 min 50.40 sec |
| Langdetect                                   | 10 min 43.96 sec |

## 7. Why is it better than other libraries?

Every language detector uses a probabilistic
[n-gram](https://en.wikipedia.org/wiki/N-gram) model trained on the character
distribution in some training corpus. Most libraries only use n-grams of size 3
(trigrams) which is satisfactory for detecting the language of longer text
fragments consisting of multiple sentences. For short phrases or single words,
however, trigrams are not enough. The shorter the input text is, the fewer
n-grams are available. The probabilities estimated from so few n-grams are not
reliable. This is why *Lingua* makes use of n-grams of sizes 1 up to 5, which
results in much more accurate prediction of the correct language.

A second important difference is that *Lingua* does not only use such a
statistical model, but also a rule-based engine. This engine first determines
the alphabet of the input text and searches for characters which are unique
in one or more languages. If exactly one language can be reliably chosen this
way, the statistical model is not necessary anymore. In any case, the
rule-based engine filters out languages that do not satisfy the conditions of
the input text. Only then, in a second step, the probabilistic n-gram model is
taken into consideration. This makes sense because loading fewer language models
means less memory consumption and better runtime performance.

In general, it is always a good idea to restrict the set of languages to be
considered in the classification process using the respective API methods.
If you know beforehand that certain languages are never to occur in an input
text, do not let those take part in the classification process. The filtering
mechanism of the rule-based engine is quite good. However, filtering based on
your own knowledge of the input text is always preferable.

## 8. Test report generation

If you want to reproduce the accuracy results above, you can generate the test
reports yourself for all classifiers and languages by installing
[Poetry](https://python-poetry.org) and executing:

    poetry install --no-root --only script
    poetry run python3 scripts/accuracy_reporter.py

For each detector and language, a test report file is then written into
[`/accuracy-reports`](https://github.com/pemistahl/lingua-py/tree/main/accuracy-reports).
As an example, here is the current output of the *Lingua* German report:

```
##### German #####

>>> Accuracy on average: 89.27%

>> Detection of 1000 single words (average length: 9 chars)
Accuracy: 74.20%
Erroneously classified as Dutch: 2.30%, Danish: 2.20%, English: 2.20%, Latin: 1.80%, Bokmal: 1.60%, Italian: 1.30%, Basque: 1.20%, Esperanto: 1.20%, French: 1.20%, Swedish: 0.90%, Afrikaans: 0.70%, Finnish: 0.60%, Nynorsk: 0.60%, Portuguese: 0.60%, Yoruba: 0.60%, Sotho: 0.50%, Tsonga: 0.50%, Welsh: 0.50%, Estonian: 0.40%, Irish: 0.40%, Polish: 0.40%, Spanish: 0.40%, Tswana: 0.40%, Albanian: 0.30%, Icelandic: 0.30%, Tagalog: 0.30%, Bosnian: 0.20%, Catalan: 0.20%, Croatian: 0.20%, Indonesian: 0.20%, Lithuanian: 0.20%, Romanian: 0.20%, Swahili: 0.20%, Zulu: 0.20%, Latvian: 0.10%, Malay: 0.10%, Maori: 0.10%, Slovak: 0.10%, Slovene: 0.10%, Somali: 0.10%, Turkish: 0.10%, Xhosa: 0.10%

>> Detection of 1000 word pairs (average length: 18 chars)
Accuracy: 93.90%
Erroneously classified as Dutch: 0.90%, Latin: 0.90%, English: 0.70%, Swedish: 0.60%, Danish: 0.50%, French: 0.40%, Bokmal: 0.30%, Irish: 0.20%, Tagalog: 0.20%, Tsonga: 0.20%, Afrikaans: 0.10%, Esperanto: 0.10%, Estonian: 0.10%, Finnish: 0.10%, Italian: 0.10%, Maori: 0.10%, Nynorsk: 0.10%, Somali: 0.10%, Swahili: 0.10%, Turkish: 0.10%, Welsh: 0.10%, Zulu: 0.10%

>> Detection of 1000 sentences (average length: 111 chars)
Accuracy: 99.70%
Erroneously classified as Dutch: 0.20%, Latin: 0.10%
```

## 9. How to add it to your project?

*Lingua* is available in the [Python Package Index](https://pypi.org/project/lingua-language-detector)
and can be installed with:

    pip install lingua-language-detector

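
To confirm that the installation succeeded, one option is to query the installed
distribution with the standard library's `importlib.metadata`, which is available from
Python 3.8 onwards. This is only a sketch: the helper name `installed_version` is
hypothetical and not part of *Lingua*.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The distribution name is the one passed to pip above.
    version = installed_version("lingua-language-detector")
    if version is None:
        print("lingua-language-detector is not installed")
    else:
        print(f"lingua-language-detector {version} is installed")
```
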
## 10. How to build locally?
*Lingua* requires Python >= 3.8.
First [download](https://pypi.org/project/lingua-language-detector/#files)
the correct Python wheel for your platform on PyPI and put it in the `lingua` directory.
Then create a virtualenv and install the Python wheel with `pip`.

```
git clone https://github.com/pemistahl/lingua-py.git
cd lingua-py/lingua
# Put the downloaded wheel file in this directory
cd ../
python3 -m venv .venv
source .venv/bin/activate
pip install --find-links=lingua lingua-language-detector
```

In the scripts directory, there are Python scripts for writing accuracy reports,
drawing plots and writing accuracy values in an HTML table. The dependencies
for these scripts are managed by [Poetry](https://python-poetry.org) which
you need to install if you have not done so yet. In order to install the script
dependencies in your virtualenv, run

    poetry install --no-root --only script

The project makes use of type annotations which allow for static type checking with
[Mypy](http://mypy-lang.org). Run the following commands for checking the types:

    poetry install --no-root --only dev
    poetry run mypy

The Python source code is formatted with [Black](https://github.com/psf/black):

    poetry run black .

## 11. How to use?
### 11.1 Basic usage
```python
>>> from lingua import Language, LanguageDetectorBuilder
>>> languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
>>> detector = LanguageDetectorBuilder.from_languages(*languages).build()
>>> language = detector.detect_language_of("languages are awesome")
>>> language
Language.ENGLISH
>>> language.iso_code_639_1
IsoCode639_1.EN
>>> language.iso_code_639_1.name
'EN'
>>> language.iso_code_639_3
IsoCode639_3.ENG
>>> language.iso_code_639_3.name
'ENG'
```

### 11.2 Minimum relative distance

By default, *Lingua* returns the most likely language for a given input text.
However, there are certain words that are spelled the same in more than one
language. The word *prologue*, for instance, is both a valid English and French
word. *Lingua* would output either English or French which might be wrong in
the given context. For cases like that, it is possible to specify a minimum
relative distance that the logarithmized and summed up probabilities for
each possible language have to satisfy. It can be stated in the following way:

```python
>>> from lingua import Language, LanguageDetectorBuilder
>>> languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
>>> detector = LanguageDetectorBuilder.from_languages(*languages)\
...     .with_minimum_relative_distance(0.9)\
...     .build()
>>> print(detector.detect_language_of("languages are awesome"))
None
```

Be aware that the distance between the language probabilities is dependent on
the length of the input text. The longer the input text, the larger the
distance between the languages. So if you want to classify very short text
phrases, do not set the minimum relative distance too high. Otherwise, `None`
will be returned most of the time as in the example above. This is the return
value for cases where language detection is not reliably possible.

### 11.3 Confidence values

Knowing about the most likely language is nice but how reliable is the computed
likelihood? And how much less likely are the other examined languages in comparison
to the most likely one? These questions can be answered as well:

```python
>>> from lingua import Language, LanguageDetectorBuilder
>>> languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
>>> detector = LanguageDetectorBuilder.from_languages(*languages).build()
>>> confidence_values = detector.compute_language_confidence_values("languages are awesome")
>>> for confidence in confidence_values:
...     print(f"{confidence.language.name}: {confidence.value:.2f}")
ENGLISH: 0.93
FRENCH: 0.04
GERMAN: 0.02
SPANISH: 0.01
```

In the example above, a list is returned containing those languages which the
calling instance of `LanguageDetector` has been built from, sorted by
their confidence value in descending order. Each value is a probability between
0.0 and 1.0. The probabilities of all languages will sum to 1.0.
If the language is unambiguously identified by the rule engine, the value 1.0
will always be returned for this language. The other languages will receive a
value of 0.0.

There is also a method for returning the confidence value for one specific
language only:

```python
>>> from lingua import Language, LanguageDetectorBuilder
>>> languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
>>> detector = LanguageDetectorBuilder.from_languages(*languages).build()
>>> confidence_value = detector.compute_language_confidence("languages are awesome", Language.FRENCH)
>>> print(f"{confidence_value:.2f}")
0.04
```

The value that this method computes is a number between 0.0 and 1.0. If the
language is unambiguously identified by the rule engine, the value 1.0 will
always be returned. If the given language is not supported by this detector
instance, the value 0.0 will always be returned.

### 11.4 Eager loading versus lazy loading

By default, *Lingua* uses lazy-loading to load only those language models on
demand which are considered relevant by the rule-based filter engine. For web
services, for instance, it is rather beneficial to preload all language models
into memory to avoid unexpected latency while waiting for the service response.
If you want to enable the eager-loading mode, you can do it like this:

```python
LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models().build()
```

Multiple instances of `LanguageDetector` share the same language models in
memory which are accessed asynchronously by the instances.

### 11.5 Low accuracy mode versus high accuracy mode

*Lingua's* high detection accuracy comes at the cost of being noticeably slower
than other language detectors. The large language models also consume significant
amounts of memory. These requirements might not be feasible for systems running low
on resources. If you want to classify mostly long texts or need to save resources,
you can enable a *low accuracy mode* that loads only a small subset of the language
models into memory:

```python
LanguageDetectorBuilder.from_all_languages().with_low_accuracy_mode().build()
```

The downside of this approach is that detection accuracy for short texts consisting
of fewer than 120 characters will drop significantly. However, detection accuracy for
texts which are longer than 120 characters will remain mostly unaffected.

In high accuracy mode (the default), the language detector consumes approximately
1 GB of memory if all language models are loaded. In low accuracy mode, memory
consumption is reduced to approximately 103 MB.

An alternative for a smaller memory footprint and faster performance is to reduce the set
of languages when building the language detector. In most cases, it is not advisable to
build the detector from all supported languages. When you have knowledge about
the texts you want to classify, you can almost always rule out certain languages as impossible
or unlikely to occur.

### 11.6 Detection of multiple languages in mixed-language texts

In contrast to most other language detectors, *Lingua* is able to detect multiple languages
in mixed-language texts. This feature can yield quite reasonable results but it is still
in an experimental state and therefore the detection result is highly dependent on the input
text. It works best in high-accuracy mode with multiple long words for each language.
The shorter the phrases and their words are, the less accurate the results are. Reducing the
set of languages when building the language detector can also improve accuracy for this task
if the languages occurring in the text are equal to the languages supported by the respective
language detector instance.

```python
>>> from lingua import Language, LanguageDetectorBuilder
>>> languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN]
>>> detector = LanguageDetectorBuilder.from_languages(*languages).build()
>>> sentence = "Parlez-vous français? " + \
...            "Ich spreche Französisch nur ein bisschen. " + \
...            "A little bit is better than nothing."
>>> for result in detector.detect_multiple_languages_of(sentence):
...     print(f"{result.language.name}: '{sentence[result.start_index:result.end_index]}'")
FRENCH: 'Parlez-vous français? '
GERMAN: 'Ich spreche Französisch nur ein bisschen. '
ENGLISH: 'A little bit is better than nothing.'
```

In the example above, a list of
[`DetectionResult`](https://github.com/pemistahl/lingua-py/blob/pure-python-impl/lingua/detector.py#L148)
is returned. Each entry in the list describes a contiguous single-language text section,
providing start and end indices of the respective substring.

### 11.7 Single-threaded versus multi-threaded language detection

The `LanguageDetector` methods explained above all operate in a single thread.
If you want to classify a very large set of texts, you will probably want to
use all available CPU cores efficiently in multiple threads for maximum performance.

Every single-threaded method has a multi-threaded equivalent that accepts a list of texts
and returns a list of results.

| Single-threaded                      | Multi-threaded                                   |
|--------------------------------------|--------------------------------------------------|
| `detect_language_of`                 | `detect_languages_in_parallel_of`                |
| `detect_multiple_languages_of`       | `detect_multiple_languages_in_parallel_of`       |
| `compute_language_confidence_values` | `compute_language_confidence_values_in_parallel` |
| `compute_language_confidence`        | `compute_language_confidence_in_parallel`        |

### 11.8 Methods to build the LanguageDetector

There might be classification tasks where you know beforehand that your
language data is definitely not written in Latin, for instance. The detection
accuracy can become better in such cases if you exclude certain languages from
the decision process or just explicitly include relevant languages:

```python
from lingua import LanguageDetectorBuilder, Language, IsoCode639_1, IsoCode639_3

# Include all languages available in the library.
LanguageDetectorBuilder.from_all_languages()

# Include only languages that are not yet extinct (= currently excludes Latin).
LanguageDetectorBuilder.from_all_spoken_languages()

# Include only languages written with Cyrillic script.
LanguageDetectorBuilder.from_all_languages_with_cyrillic_script()

# Exclude only the Spanish language from the decision algorithm.
LanguageDetectorBuilder.from_all_languages_without(Language.SPANISH)

# Only decide between English and German.
LanguageDetectorBuilder.from_languages(Language.ENGLISH, Language.GERMAN)

# Select languages by ISO 639-1 code.
LanguageDetectorBuilder.from_iso_codes_639_1(IsoCode639_1.EN, IsoCode639_1.DE)

# Select languages by ISO 639-3 code.
LanguageDetectorBuilder.from_iso_codes_639_3(IsoCode639_3.ENG, IsoCode639_3.DEU)
```

## 12. What's next for version 2.1.0?

Take a look at the [planned issues](https://github.com/pemistahl/lingua-py/milestone/6).
## 13. Contributions
Any contributions to *Lingua* are very much appreciated. Please read the instructions
in [`CONTRIBUTING.md`](https://github.com/pemistahl/lingua-rs/blob/main/CONTRIBUTING.md)
in the repository of the Rust implementation for how to add new languages to the library.