{"id":18417788,"url":"https://github.com/danieljdufour/language-detector","last_synced_at":"2025-04-07T12:32:56.159Z","repository":{"id":62575065,"uuid":"47432249","full_name":"DanielJDufour/language-detector","owner":"DanielJDufour","description":"Detect the language of text","archived":false,"fork":false,"pushed_at":"2020-06-07T02:40:09.000Z","size":1416,"stargazers_count":36,"open_issues_count":5,"forks_count":12,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-19T18:06:00.299Z","etag":null,"topics":["arabic","farsi","french","german","kurdish","kurmanci","language","language-detector","nlp","sorani","spanish","turkish"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DanielJDufour.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-12-04T22:35:57.000Z","updated_at":"2025-01-08T10:47:44.000Z","dependencies_parsed_at":"2022-11-03T18:50:06.529Z","dependency_job_id":null,"html_url":"https://github.com/DanielJDufour/language-detector","commit_stats":null,"previous_names":[],"tags_count":24,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielJDufour%2Flanguage-detector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielJDufour%2Flanguage-detector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielJDufour%2Flanguage-detector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielJDufour%2Flanguage-detector/manifests","owner_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/owners/DanielJDufour","download_url":"https://codeload.github.com/DanielJDufour/language-detector/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247653388,"owners_count":20973821,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arabic","farsi","french","german","kurdish","kurmanci","language","language-detector","nlp","sorani","spanish","turkish"],"created_at":"2024-11-06T04:11:16.320Z","updated_at":"2025-04-07T12:32:51.149Z","avatar_url":"https://github.com/DanielJDufour.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.org/DanielJDufour/language-detector.svg?branch=master)](https://travis-ci.org/DanielJDufour/language-detector)\n\n# language-detector\nlanguage-detector detects the language of text\n\n# Installation\n```\npip install language-detector\n```\n\n# Python Version\nWorks with both Python 2 and 3\n\n# Use\n```\nfrom language_detector import detect_language\ntext = \"I arrived in that city on January 4, 1937\"\nlanguage = detect_language(text)\nprint(language)\n# prints English\n```\n\n# Features\n| Languages Supported |\n| ------------------- |\n| Arabic |\n| English |\n| Farsi |\n| French |\n| German |\n| Khmer |\n| Kurmanci (Kurdish) |\n| Mandarin |\n| Russian |\n| Sorani (Kurdish) |\n| Spanish |\n| Turkish |\n\n# Testing\nTo test the package, run:\n```\npython -m unittest language_detector.tests.test\n```\n\n# Comparison\nThe test compares how well language-detector and langid identify languages in the 
[data sources](language_detector/prep/sources). \n \n| package | language-detector | langid |\n| ------- | ----------------- | ------ |\n| test-duration (in seconds) | 0.10 | 3.83 |\n| accuracy | 96.77% | 67.74% |\n\n\n# Excluding Languages\nIf you don't want language-detector to look for certain languages, you can monkey-patch the code.  For example, to exclude English:\n```\nimport language_detector\n\n# drop every entry whose language (cl[1]) is English\nlanguage_detector.char_language = [cl for cl in language_detector.char_language if cl[1] != \"English\"]\n\n# proceed as normal\n``` \n\n# Datasets\nThe following is a list of datasets used for each language:  \n\n| Language | Datasets |\n| ------------------- | -------------------------- |\n| Arabic | [UN Corpora](http://www.uncorpora.org/) |\n| English |  [UN Corpora](http://www.uncorpora.org/) |\n| Farsi | [BBC News Persian](https://www.bbc.com/persian) |\n| French | [UN Corpora](http://www.uncorpora.org/) |\n| German | [Deutsche Welle](https://www.dw.com/de) |\n| Khmer | [Cambodia Daily](https://www.cambodiadaily.com) |\n| Kurmanci (Kurdish) | [Rudaw](https://rudaw.net/kurmanci) |\n| Mandarin | [UN Corpora](http://www.uncorpora.org/) |\n| Russian | [UN Corpora](http://www.uncorpora.org/) |\n| Sorani (Kurdish) | [Rudaw](https://www.rudaw.net/sorani) |\n| Spanish | [UN Corpora](http://www.uncorpora.org/) |\n| Turkish | [BBC News Türkçe](https://www.bbc.com/turkce) |\n\n# How Does It Work?\nWhen training the model, we scan all the data sources and compute how often each character appears in each language.  We also compute how often each character appears across the data sources for all languages combined.  For each language, we then calculate a score for each character as `frequency_in_language / frequency_in_all_languages`.  We then save the ten highest-scoring characters for each language.  
\nWhen detecting a language, we iterate through the saved characters (ten for each language) and add each character's score as a weighted vote for its language.  Whichever language has the highest total score is selected as the winner.\n\n# Contributing\nIf you'd like to contribute a new language, please consult [CONTRIBUTING.md](CONTRIBUTING.md)\n\n# Support\nContact the package author, Daniel J. Dufour, at daniel.j.dufour@gmail.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanieljdufour%2Flanguage-detector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanieljdufour%2Flanguage-detector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanieljdufour%2Flanguage-detector/lists"}