{"id":13617439,"url":"https://github.com/optimaize/language-detector","last_synced_at":"2026-04-06T03:36:52.943Z","repository":{"id":14551243,"uuid":"17266622","full_name":"optimaize/language-detector","owner":"optimaize","description":"Language Detection Library for Java","archived":false,"fork":false,"pushed_at":"2022-07-23T20:22:31.000Z","size":2102,"stargazers_count":568,"open_issues_count":57,"forks_count":165,"subscribers_count":38,"default_branch":"master","last_synced_at":"2024-11-08T02:32:56.599Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/optimaize.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-02-27T22:05:31.000Z","updated_at":"2024-10-13T09:58:47.000Z","dependencies_parsed_at":"2022-08-07T08:00:21.333Z","dependency_job_id":null,"html_url":"https://github.com/optimaize/language-detector","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/optimaize%2Flanguage-detector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/optimaize%2Flanguage-detector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/optimaize%2Flanguage-detector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/optimaize%2Flanguage-detector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/optimaize","download_url":"https://codeload.github.com/optimaize/language-detector/tar.gz/refs/heads/master","host":{"name":"Git
Hub","url":"https://github.com","kind":"github","repositories_count":248835022,"owners_count":21169139,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T20:01:41.720Z","updated_at":"2025-12-17T00:31:24.646Z","avatar_url":"https://github.com/optimaize.png","language":"Java","funding_links":[],"categories":["Java","人工智能"],"sub_categories":["自然语言处理"],"readme":"# language-detector\n\nLanguage Detection Library for Java\n\n    \u003cdependency\u003e\n        \u003cgroupId\u003ecom.optimaize.languagedetector\u003c/groupId\u003e\n        \u003cartifactId\u003elanguage-detector\u003c/artifactId\u003e\n        \u003cversion\u003e0.6\u003c/version\u003e\n    \u003c/dependency\u003e\n\n\n## Language Support\n\n### 71 Built-in Language Profiles\n\n1. af Afrikaans\n1. an Aragonese\n1. ar Arabic\n1. ast Asturian\n1. be Belarusian\n1. br Breton\n1. ca Catalan\n1. bg Bulgarian\n1. bn Bengali\n1. cs Czech\n1. cy Welsh\n1. da Danish\n1. de German\n1. el Greek\n1. en English\n1. es Spanish\n1. et Estonian\n1. eu Basque\n1. fa Persian\n1. fi Finnish\n1. fr French\n1. ga Irish\n1. gl Galician\n1. gu Gujarati\n1. he Hebrew\n1. hi Hindi\n1. hr Croatian\n1. ht Haitian\n1. hu Hungarian\n1. id Indonesian\n1. is Icelandic\n1. it Italian\n1. ja Japanese\n1. km Khmer\n1. kn Kannada\n1. ko Korean\n1. lt Lithuanian\n1. lv Latvian\n1. mk Macedonian\n1. ml Malayalam\n1. mr Marathi\n1. ms Malay\n1. mt Maltese\n1. ne Nepali\n1. nl Dutch\n1. no Norwegian\n1. oc Occitan\n1. pa Punjabi\n1. pl Polish\n1. pt Portuguese\n1. ro Romanian\n1. ru Russian\n1. sk Slovak\n1. sl Slovene\n1. 
so Somali\n1. sq Albanian\n1. sr Serbian\n1. sv Swedish\n1. sw Swahili\n1. ta Tamil\n1. te Telugu\n1. th Thai\n1. tl Tagalog\n1. tr Turkish\n1. uk Ukrainian\n1. ur Urdu\n1. vi Vietnamese\n1. wa Walloon\n1. yi Yiddish\n1. zh-cn Simplified Chinese\n1. zh-tw Traditional Chinese\n\nUser danielnaber has made a profile for Esperanto available on his website; see the open tasks.\n\nThere are two kinds of profiles: the standard ones, created from Wikipedia articles and similar text,\nand the \"short text\" profiles, created from Twitter tweets. Fewer short-text profiles exist so far;\nmore could be made available, see https://github.com/optimaize/language-detector/issues/57\n\n### Other Languages\n\nYou can easily create a language profile for your own language.\nSee https://github.com/optimaize/language-detector/blob/master/src/main/resources/README.md\n\n\n## How it Works\n\nThe software uses language profiles that were created from common text in each language.\nN-grams (see http://en.wikipedia.org/wiki/N-gram) were extracted from that text, and those n-grams are what the profiles store.\n\nTo determine the language of a given text, the program goes through the same process:\nit extracts the same kind of n-grams from the input text, compares their relative frequencies with those in the profiles,\nand picks the language that matches best.\n\n\n### Challenges\n\nThis software works less well when the input text is short or unclean, for example tweets.\n\nWhen a text is written in multiple languages, the default algorithm of this software is not appropriate.\nYou can try to split the text (by sentence or paragraph) and detect the individual parts. 
Running the language guesser\non the whole text will, at best, just tell you the most dominant language.\n\nThis software does not cope well with input text that is in none of the expected (and supported) languages.\nFor example, if you load only the English and German language profiles but the text is written in French,\nthe program may pick the more likely of the two, or say it doesn't know. (An improvement would be to clearly detect that\nthe text is unlikely to be in any of the supported languages.)\n\nIf you are looking for a language detector / language guesser library in Java, this seems to be the best open source\nlibrary you can get at this time. If it doesn't need to be Java, you may want to take a look at https://code.google.com/p/cld2/\n\n\n## How to Use\n\n#### Language Detection for your Text\n\n    //load all languages:\n    List\u003cLanguageProfile\u003e languageProfiles = new LanguageProfileReader().readAllBuiltIn();\n\n    //build language detector:\n    LanguageDetector languageDetector = LanguageDetectorBuilder.create(NgramExtractors.standard())\n            .withProfiles(languageProfiles)\n            .build();\n\n    //create a text object factory\n    TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();\n\n    //query:\n    TextObject textObject = textObjectFactory.forText(\"my text\");\n    Optional\u003cLdLocale\u003e lang = languageDetector.detect(textObject);\n\n\n#### Creating Language Profiles for your Training Text\n\nSee https://github.com/optimaize/language-detector/wiki/Creating-Language-Profiles\n\n\n## How You Can Help\n\nIf your language is not supported yet, you can provide clean \"training text\", that is, common text written in your\nlanguage. The text should be fairly long (a couple of pages at the very least). If you can provide that, please open\na ticket.\n\nIf your language is already supported but not always identified correctly, you can still provide such training\ntext. 
We might then be able to improve detection for your language.\n\nIf you're a programmer, dig into the source and see what you can improve. Check the open tasks.\n\n\n## Memory Consumption\n\nLoading all 71 language profiles uses 74 MB of RAM to store the data in memory.\nFor memory considerations, see https://github.com/optimaize/language-detector/wiki/Memory-Consumption\n\n\n## History and Changes\n\nThis project is a fork of a fork; the original author is Nakatani Shuyo.\nFor details, see https://github.com/optimaize/language-detector/wiki/History-and-Changes\n\n\n## Where it's used\n\nAn adapted version of this is used by the http://www.NameAPI.org server.\n\nhttps://www.languagetool.org/ is proofreading software for LibreOffice/OpenOffice, for the desktop and for Firefox.\n\n\n\n## License\n\nApache 2 (business friendly)\n\n\n\n## Authors\n\nNakatani Shuyo, Fabian Kessler, Francois ROLAND, Robert Theis\n\nFor details, see https://github.com/optimaize/language-detector/wiki/Authors\n\n\n## For Maven Users\n\nThe project is in Maven Central, see http://search.maven.org/#artifactdetails%7Ccom.optimaize.languagedetector%7Clanguage-detector%7C0.4%7Cjar\nThe latest version is:\n\n    \u003cdependency\u003e\n        \u003cgroupId\u003ecom.optimaize.languagedetector\u003c/groupId\u003e\n        \u003cartifactId\u003elanguage-detector\u003c/artifactId\u003e\n        \u003cversion\u003e0.6\u003c/version\u003e\n    \u003c/dependency\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foptimaize%2Flanguage-detector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foptimaize%2Flanguage-detector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foptimaize%2Flanguage-detector/lists"}