Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/danieljdufour/language-detector
Detect the language of text
arabic farsi french german kurdish kurmanci language language-detector nlp sorani spanish turkish
Last synced: 10 days ago
JSON representation
Detect the language of text
- Host: GitHub
- URL: https://github.com/danieljdufour/language-detector
- Owner: DanielJDufour
- License: apache-2.0
- Created: 2015-12-04T22:35:57.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2020-06-07T02:40:09.000Z (over 4 years ago)
- Last Synced: 2024-10-12T22:49:58.709Z (about 1 month ago)
- Topics: arabic, farsi, french, german, kurdish, kurmanci, language, language-detector, nlp, sorani, spanish, turkish
- Language: Python
- Size: 1.35 MB
- Stars: 34
- Watchers: 7
- Forks: 12
- Open Issues: 5
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
README
[![Build Status](https://travis-ci.org/DanielJDufour/language-detector.svg?branch=master)](https://travis-ci.org/DanielJDufour/language-detector)
# language-detector
language-detector detects the language of text.

# Installation
```
pip install language-detector
```

# Python Version
Works with both Python 2 and 3.

# Use
```
from language_detector import detect_language
text = "I arrived in that city on January 4, 1937"
language = detect_language(text)
print(language)  # prints English
```

# Features
| Languages Supported |
| ------------------- |
| Arabic |
| English |
| Farsi |
| French |
| German |
| Khmer |
| Kurmanci (Kurdish) |
| Mandarin |
| Russian |
| Sorani (Kurdish) |
| Spanish |
| Turkish |

# Testing
To test the package, run:
```
python -m unittest language_detector.tests.test
```

# Comparison
The following compares how well language-detector and langid identify languages in the [data sources](language_detector/prep/sources):
| package | language-detector | langid |
| ------- | ----------------- | ------ |
| test-duration (in seconds)| 0.10 | 3.83 |
| accuracy | 96.77% | 67.74% |

# Excluding Languages
If you don't want language-detector to look for certain languages, you can monkey-patch the code. For example, in order to exclude English:
```
import language_detector
language_detector.char_language = [
    cl for cl in language_detector.char_language if cl[1] != "English"
]
# proceed as normal
```

# Datasets
The following is a list of datasets used for each language:

| Language | Datasets |
| ------------------- | -------------------------- |
| Arabic | [UN Corpora](http://www.uncorpora.org/) |
| English | [UN Corpora](http://www.uncorpora.org/) |
| Farsi | [BBC News Persian](https://www.bbc.com/persian) |
| French | [UN Corpora](http://www.uncorpora.org/) |
| German | [Deutsche Welle](https://www.dw.com/de) |
| Khmer | [Cambodia Daily](https://www.cambodiadaily.com) |
| Kurmanci (Kurdish) | [Rudaw](https://rudaw.net/kurmanci) |
| Mandarin | [UN Corpora](http://www.uncorpora.org/) |
| Russian | [UN Corpora](http://www.uncorpora.org/) |
| Sorani (Kurdish) | [Rudaw](https://www.rudaw.net/sorani) |
| Spanish | [UN Corpora](http://www.uncorpora.org/) |
| Turkish | [BBC News Türkçe](https://www.bbc.com/turkce) |

# How Does It Work?
When training the model, we scan all the data sources and compute how often each character appears in each specific language, as well as how often it appears across the data sources for all languages. For each language, we then calculate a score for each character as `frequency_in_language / frequency_in_all_languages` and save the ten highest-scoring characters.
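This character-scoring idea can be sketched as follows. The toy corpora and the `detect` helper below are illustrative assumptions, not the package's actual training data or implementation:

```python
from collections import Counter

# Toy corpora standing in for the real data sources (illustrative only).
corpora = {
    "English": "the quick brown fox jumps over the lazy dog",
    "German": "über den wolken muß die freiheit wohl grenzenlos sein",
    "Spanish": "el veloz murciélago hindú comía feliz cardillo y kiwi",
}

# Training: count how often each character appears per language and overall.
per_language = {lang: Counter(text) for lang, text in corpora.items()}
overall = Counter()
for counts in per_language.values():
    overall.update(counts)

# Score each character as frequency_in_language / frequency_in_all_languages
# and keep the ten highest-scoring characters per language.
top_chars = {
    lang: sorted(
        ((ch, counts[ch] / overall[ch]) for ch in counts),
        key=lambda pair: pair[1],
        reverse=True,
    )[:10]
    for lang, counts in per_language.items()
}

# Detection: each saved character casts a weighted vote for its language.
def detect(text):
    votes = Counter()
    for lang, chars in top_chars.items():
        for ch, score in chars:
            votes[lang] += text.count(ch) * score
    return votes.most_common(1)[0][0]

print(detect("grenzenlos über"))  # prints German
```

Characters that are nearly exclusive to one language (like ü and ß for German in this toy setup) score close to 1.0, so even a short string containing them produces a decisive vote.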
When detecting a language, we simply iterate through the saved characters (ten per language) and add each one's score as a weighted vote for its language. Whichever language has the highest total score is selected as the winner.

# Contributing
If you'd like to contribute a new language, please consult [CONTRIBUTING.md](CONTRIBUTING.md).

# Support
Contact the package author, Daniel J. Dufour, at [email protected]