{"id":30065222,"url":"https://github.com/xi/tiny-lang-detect","last_synced_at":"2025-08-08T05:50:06.269Z","repository":{"id":291716999,"uuid":"978538681","full_name":"xi/tiny-lang-detect","owner":"xi","description":"Generate tiny models for language detection","archived":false,"fork":false,"pushed_at":"2025-05-26T18:53:48.000Z","size":24,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-26T19:51:41.235Z","etag":null,"topics":["langdetect","language-identification"],"latest_commit_sha":null,"homepage":"https://xi.github.io/tiny-lang-detect/demo/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-06T06:24:35.000Z","updated_at":"2025-05-26T18:53:51.000Z","dependencies_parsed_at":"2025-05-26T19:37:04.143Z","dependency_job_id":"e5bddb11-7e2a-47e1-8d35-62fdbd5f1603","html_url":"https://github.com/xi/tiny-lang-detect","commit_stats":null,"previous_names":["xi/tiny-lang-detect"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/xi/tiny-lang-detect","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xi%2Ftiny-lang-detect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xi%2Ftiny-lang-detect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xi%2Ftiny-lang-detect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xi%2Ftiny-lang-detect
/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xi","download_url":"https://codeload.github.com/xi/tiny-lang-detect/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xi%2Ftiny-lang-detect/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269373068,"owners_count":24406313,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-08T02:00:09.200Z","response_time":72,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["langdetect","language-identification"],"created_at":"2025-08-08T05:50:01.183Z","updated_at":"2025-08-08T05:50:06.221Z","avatar_url":"https://github.com/xi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tiny language detection\n\nLanguage detection libraries like\n[langdetect](https://github.com/DoodleBears/langdetect/) usually come with\nlarge models. 
But if we just want to distinguish between a small set of\nlanguages, the size of the model can be reduced significantly.\n\nThis is an experiment to generate tiny models that only contain the most\nsignificant n-grams needed to distinguish between two languages.\n\nExample usage:\n\n```sh\n$ ./download_data.sh\n$ python gen_model.py en de -n 10 \u003e en_de.json\n$ python test.py en_de.json\n981 out of 1000 samples were detected correctly (98.1%)\n```\n\nA model might look like this:\n\n```json\n{\n  \"ngrams\": [\"o\", \"e\", \"a\", \"en \", \"er\", \" th\", \"ch\", \" t\", \"en\", \"ei\"],\n  \"freq\": {\n    \"en\": [0.0716, 0.1067, 0.0897, 0.0023, 0.0135, 0.0161, 0.0036, 0.0164, 0.0079, 0.0009],\n    \"de\": [0.0311, 0.1466, 0.0574, 0.0202, 0.0299, 0.0002, 0.0195, 0.0006, 0.0233, 0.0159]\n  }\n}\n```\n\nYou can use the model like this:\n\n```py\nimport math\n\ndef probability(p, q):\n    return math.prod(qi ** pi * (1 - qi) ** (1 - pi) for pi, qi in zip(p, q))\n\ndef classify(model, text):\n    n = len(text) + 1\n    freq = [text.count(g) / (n - len(g)) for g in model['ngrams']]\n    return max(model['freq'], key=lambda lang: probability(freq, model['freq'][lang]))\n```\n\n## An even simpler classifier\n\nTo take this idea to the extreme, you could reduce the model to the single most\nsignificant n-gram:\n\n```py\ndef classify(text):\n    freq = text.count('o') / len(text)\n    return 'en' if freq \u003e 0.05 else 'de'\n```\n\nThis classifier still has an accuracy of 82.1% on the test data.\n\n## How does it work?\n\n`langdetect` works by comparing n-gram frequencies. For example, the 3-gram\n\" th\" is much more common in English than in German.\n\nBefore counting n-grams, it does some pre-processing, e.g. removing\npunctuation, URLs, or Latin characters in non-Latin texts. Then it uses\nBayesian methods to find the most likely language for those frequencies.\n\nThe examples in this repo are much simpler though. They do not do any\npre-processing. 
This is ultimately a trade-off between accuracy and simplicity.\n\nTo simplify the model, `gen_model.py` filters out all but the most significant\nn-grams. N-grams are considered more significant if their frequencies have a\nlarge absolute difference between the candidate languages.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxi%2Ftiny-lang-detect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxi%2Ftiny-lang-detect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxi%2Ftiny-lang-detect/lists"}