{"id":18119116,"url":"https://github.com/zjaume/heliport","last_synced_at":"2025-07-17T13:35:12.434Z","repository":{"id":244810159,"uuid":"719087944","full_name":"ZJaume/heliport","owner":"ZJaume","description":"Fast and accurate language identifier","archived":false,"fork":false,"pushed_at":"2025-04-14T11:01:28.000Z","size":117149,"stargazers_count":4,"open_issues_count":4,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-14T17:11:56.729Z","etag":null,"topics":["language-detection","language-identification","nlp","python","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ZJaume.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-15T12:32:17.000Z","updated_at":"2025-04-14T11:01:31.000Z","dependencies_parsed_at":"2025-04-07T14:42:18.732Z","dependency_job_id":null,"html_url":"https://github.com/ZJaume/heliport","commit_stats":null,"previous_names":["zjaume/heli-otr"],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fheliport","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fheliport/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fheliport/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZJaume%2Fheliport/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ZJaume","download_url":"https://code
load.github.com/ZJaume/heliport/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248923765,"owners_count":21183953,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-detection","language-identification","nlp","python","rust"],"created_at":"2024-11-01T05:14:47.328Z","updated_at":"2025-07-17T13:35:12.427Z","avatar_url":"https://github.com/ZJaume.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# heliport\n![License](https://img.shields.io/github/license/zjaume/heliport?color=blue)\n![PyPi-version](https://img.shields.io/pypi/v/heliport)\n![Python-version](https://img.shields.io/python/required-version-toml?tomlFilePath=https%3A%2F%2Fgithub.com%2FZJaume%2Fheliport%2Fraw%2Frefs%2Fheads%2Fmain%2Fpyproject.toml)\n![Supported-languages](https://img.shields.io/badge/supported_languages-220-green)\n\n\nA language identification tool which aims for both speed and accuracy, with support for [220 languages](LANGS.md) (or [add your own languages!](docs/train.md)).\n\nThis tool is an efficient [HeLI-OTS](https://aclanthology.org/2022.lrec-1.416/) port to Rust,\nachieving a 25x speedup while producing almost identical output.\n\n## Installation\n### From PyPi\nInstall it in your environment:\n```\npip install heliport\n```\n\nNOTE: Since version 0.8, models no longer need to be downloaded.\n\n### From source\nInstall the requirements:\n - Python\n - PIP\n - [Rust](https://rustup.rs)\n\nClone the repo, build the package and binarize the model:\n```\ngit clone 
https://github.com/ZJaume/heliport\ncd heliport\npip install .\n```\n\n## Usage\n### CLI\nJust run the `heliport identify` command that reads lines from stdin\n```\ncat sentences.txt | heliport identify\n```\n```\neng\ncat\nrus\n...\n```\n\n```\nIdentify languages of input text\n\nUsage: heliport identify [OPTIONS] [INPUT_FILE] [OUTPUT_FILE]\n\nArguments:\n  [INPUT_FILE]   Input file, default: stdin\n  [OUTPUT_FILE]  Output file, default: stdout\n\nOptions:\n  -j, --threads \u003cTHREADS\u003e                Number of parallel threads to use.\n                                         0 means no multi-threading\n                                         1 means running the identification in a separated thread\n                                         \u003e1 run multithreading [default: 0]\n  -b, --batch-size \u003cBATCH_SIZE\u003e          Number of text segments to pre-load for parallel processing [default:\n                                         100000]\n  -c, --ignore-confidence                Ignore confidence thresholds. Predictions under the thresholds will not be\n                                         labeled as 'und'\n  -s, --print-scores                     Print confidence score (higher is better) or raw score (lower is better) in\n                                         case '-c' is provided\n  -n, --not-strict                       Do not be strict when loading confidence thresholds (do not fail if one\n                                         language is missing)\n  -p, --precision \u003cPRECISION\u003e            Number of decimals precision when printing scores [default: 4]\n  -m, --model-dir \u003cMODEL_DIR\u003e            Model directory containing binarized model or plain text model. Default is\n                                         Python module path or './LanguageModels' if relevant languages are requested\n  -l, --relevant-langs \u003cRELEVANT_LANGS\u003e  Load only relevant languages. 
Specify a comma-separated list of language\n                                         codes. Needs plain text model directory\n  -h, --help                             Print help\n```\n\n### Python package\n```python\n\u003e\u003e\u003e from heliport import Identifier\n\u003e\u003e\u003e i = Identifier()\n\u003e\u003e\u003e i.identify(\"L'aigua clara\")\n'cat'\n```\n\nFor further information on the available functions and parameters, take a look at the module docs:\n```python\n\u003e\u003e\u003e import heliport\n\u003e\u003e\u003e help(heliport)\n```\n\n### Rust crate\n```rust\nuse std::path::PathBuf;\nuse heliport::identifier::Identifier;\nuse heliport::lang::Lang;\n\nlet identifier = Identifier::load(\n    PathBuf::from(\"/path/to/model_dir\"),\n    None,\n);\nlet (lang, score) = identifier.identify(\"L'aigua clara\");\nassert_eq!(lang, Lang::cat);\n```\n\n## Differences with HeLI-OTS\nAlthough `heliport` currently uses the same models as HeLI-OTS 2.0 and the\nidentification algorithm is almost the same, there are a few differences\n(mainly during pre-processing) that may cause different results.\nHowever, in most cases, these should not decrease accuracy and should not happen frequently.\n\n**Note**: Both tools have a pre-processing step for each identified text to\nremove all non-alphabetic characters.\n\nThe implementation differences that can change results are:\n - During preprocessing, `HeLI` removes URLs and words beginning with `@`, while `heliport` does not.\n - Since 1.5, during preprocessing, HeLI repeats every word that does not start with a capital letter. This is probably to penalize proper nouns. However, in our tests, we have not found a significant improvement from this. Therefore, to avoid almost doubling the cost of prediction, this has not been implemented. 
In the future, it might be implemented if there is a need for it and it can be done efficiently.\n - The Rust and Java implementations show small precision differences because Rust accumulates probabilities with double-precision floats.\n\n## Benchmarks\nSpeed benchmarks with 100k random sentences from [OpenLID](https://github.com/laurieburchell/open-lid-dataset), all the tools running single-threaded:\n\n| tool | time (s) |\n| :--------- | ---------: |\n| CLD2 | 1.12 |\n| HeLI-OTS | 60.37 |\n| lingua all high preloaded | 56.29 |\n| lingua all low preloaded | 23.34 |\n| fasttext openlid193 | 8.44 |\n| heliport | 2.33 |\n\n___\n\n![Connecting Europe Facility](https://www.paracrawl.eu/images/logo_en_cef273x39.png)\n\nAll documents and software contained in this repository reflect only the authors' view. The Innovation and Networks Executive Agency of the European Union is not responsible for any use that may be made of the information it contains.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjaume%2Fheliport","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzjaume%2Fheliport","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjaume%2Fheliport/lists"}