{"id":18654350,"url":"https://github.com/llmkira/fast-langdetect","last_synced_at":"2025-04-05T05:02:48.188Z","repository":{"id":217810693,"uuid":"744392434","full_name":"LlmKira/fast-langdetect","owner":"LlmKira","description":"⚡️ 80x faster Fasttext language detection out of the box | Split text by language","archived":false,"fork":false,"pushed_at":"2025-03-04T02:33:35.000Z","size":940,"stargazers_count":180,"open_issues_count":1,"forks_count":8,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-29T04:04:13.199Z","etag":null,"topics":["detect-languages","fasttext","i18n","language-identification","languagedetector","svc","tts"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LlmKira.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-17T07:54:01.000Z","updated_at":"2025-03-27T21:05:00.000Z","dependencies_parsed_at":"2024-08-04T04:06:52.918Z","dependency_job_id":"cfa02697-4c6d-46b4-8ad2-b6a0109d9538","html_url":"https://github.com/LlmKira/fast-langdetect","commit_stats":null,"previous_names":["llmkira/fast-langdetect"],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LlmKira%2Ffast-langdetect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LlmKira%2Ffast-langdetect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LlmKira%2Ffast-langdetect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/LlmKira%2Ffast-langdetect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LlmKira","download_url":"https://codeload.github.com/LlmKira/fast-langdetect/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247289409,"owners_count":20914464,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["detect-languages","fasttext","i18n","language-identification","languagedetector","svc","tts"],"created_at":"2024-11-07T07:14:43.851Z","updated_at":"2025-04-05T05:02:48.182Z","avatar_url":"https://github.com/LlmKira.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# fast-langdetect 🚀\n\n[![PyPI version](https://badge.fury.io/py/fast-langdetect.svg)](https://badge.fury.io/py/fast-langdetect)\n[![Downloads](https://pepy.tech/badge/fast-langdetect)](https://pepy.tech/project/fast-langdetect)\n[![Downloads](https://pepy.tech/badge/fast-langdetect/month)](https://pepy.tech/project/fast-langdetect/)\n\n## Overview\n\n**`fast-langdetect`** is an ultra-fast and highly accurate language detection library based on FastText, a library developed by Facebook. 
It is about 80x faster than conventional methods and delivers up to 95% accuracy.\n\n- Supports Python `3.9` to `3.13`.\n- Works offline in low-memory mode.\n- No `numpy` required (thanks to @dalf).\n\n\u003e ### Background\n\u003e \n\u003e This project builds upon [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect#benchmark) with enhancements in packaging.\n\u003e For more information about the underlying model, see the official FastText documentation: [Language Identification](https://fasttext.cc/docs/en/language-identification.html).\n\n\u003e ### Possible memory usage\n\u003e \n\u003e *This library requires at least **200MB memory** in low-memory mode.*\n\n## Installation 💻\n\nTo install fast-langdetect, you can use either `pip` or `pdm`:\n\n### Using pip\n\n```bash\npip install fast-langdetect\n```\n\n### Using pdm\n\n```bash\npdm add fast-langdetect\n```\n\n## Usage 🖥️\n\nIn scenarios **where accuracy is important**, do not rely on the detection results of the small model; use `low_memory=False` to download the larger model.\n\n### Prerequisites\n\n- If the sample is too long or too short, accuracy will be reduced.\n- The model is downloaded to the system temporary directory by default. 
You can customize it by:\n  - Setting `FTLANG_CACHE` environment variable\n  - Using `LangDetectConfig(cache_dir=\"your/path\")`\n\n### Native API (Recommended)\n\n```python\nfrom fast_langdetect import detect, detect_multilingual, LangDetector, LangDetectConfig, DetectError\n\n# Simple detection\nprint(detect(\"Hello, world!\"))\n# Output: {'lang': 'en', 'score': 0.12450417876243591}\n\n# Using large model for better accuracy\nprint(detect(\"Hello, world!\", low_memory=False))\n# Output: {'lang': 'en', 'score': 0.98765432109876}\n\n# Custom configuration with fallback mechanism\nconfig = LangDetectConfig(\n    cache_dir=\"/custom/cache/path\",  # Custom model cache directory\n    allow_fallback=True             # Enable fallback to small model if large model fails\n)\ndetector = LangDetector(config)\n\ntry:\n    result = detector.detect(\"Hello world\", low_memory=False)\n    print(result)  # {'lang': 'en', 'score': 0.98}\nexcept DetectError as e:\n    print(f\"Detection failed: {e}\")\n\n# How to deal with multiline text\nmultiline_text = \"\"\"\nHello, world!\nThis is a multiline text.\n\"\"\"\nmultiline_text = multiline_text.replace(\"\\n\", \" \")  \nprint(detect(multiline_text))\n# Output: {'lang': 'en', 'score': 0.8509423136711121}\n\n# Multi-language detection\nresults = detect_multilingual(\n    \"Hello 世界 こんにちは\", \n    low_memory=False,  # Use large model for better accuracy\n    k=3               # Return top 3 languages\n)\nprint(results)\n# Output: [\n#     {'lang': 'ja', 'score': 0.4}, \n#     {'lang': 'zh', 'score': 0.3}, \n#     {'lang': 'en', 'score': 0.2}\n# ]\n```\n\n#### Fallbacks\n\nWe provide a fallback mechanism: when `allow_fallback=True`, if the program fails to load the **large model** (`low_memory=False`), it will fall back to the offline **small model** to complete the prediction task.\n\n```python\n# Disable fallback - will raise error if large model fails to load\n# But fallback disabled when custom_model_path is not None, because its 
a custom model, we will directly use it.\nimport tempfile\nfrom fast_langdetect import LangDetector, LangDetectConfig, DetectError\n\nconfig = LangDetectConfig(\n    allow_fallback=False,\n    custom_model_path=None,\n    cache_dir=tempfile.gettempdir(),\n)\ndetector = LangDetector(config)\n\ntry:\n    result = detector.detect(\"Hello world\", low_memory=False)\nexcept DetectError as e:\n    print(f\"Model loading failed and fallback is disabled: {e}\")\n```\n\n### Convenient `detect_language` Function\n\n```python\nfrom fast_langdetect import detect_language\n\n# Single language detection\nprint(detect_language(\"Hello, world!\"))\n# Output: EN\n\nprint(detect_language(\"Привет, мир!\"))\n# Output: RU\n\nprint(detect_language(\"你好，世界！\"))\n# Output: ZH\n```\n\n### Load Custom Models\n\n```python\nfrom fast_langdetect import LangDetector, LangDetectConfig\n\n# Load model from local file\nconfig = LangDetectConfig(\n    custom_model_path=\"/path/to/your/model.bin\",  # Use local model file\n    disable_verify=True                     # Skip MD5 verification\n)\ndetector = LangDetector(config)\nresult = detector.detect(\"Hello world\")\n```\n\n### Splitting Text by Language 🌐\n\nFor text splitting based on language, please refer to the [split-lang](https://github.com/DoodleBears/split-lang)\nrepository.\n\n## Benchmark 📊\n\nFor detailed benchmark results, refer\nto [zafercavdar/fasttext-langdetect#benchmark](https://github.com/zafercavdar/fasttext-langdetect#benchmark).\n\n## References 📚\n\n[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification\n\n```bibtex\n@article{joulin2016bag,\n  title={Bag of Tricks for Efficient Text Classification},\n  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},\n  journal={arXiv preprint arXiv:1607.01759},\n  year={2016}\n}\n```\n\n[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. 
Mikolov, FastText.zip: Compressing text classification models\n\n```bibtex\n@article{joulin2016fasttext,\n  title={FastText.zip: Compressing text classification models},\n  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\\'e}gou, Herv{\\'e} and Mikolov, Tomas},\n  journal={arXiv preprint arXiv:1612.03651},\n  year={2016}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fllmkira%2Ffast-langdetect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fllmkira%2Ffast-langdetect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fllmkira%2Ffast-langdetect/lists"}