{"id":13497021,"url":"https://github.com/quickwit-oss/whichlang","last_synced_at":"2025-05-15T20:07:32.539Z","repository":{"id":152332244,"uuid":"610607251","full_name":"quickwit-oss/whichlang","owner":"quickwit-oss","description":"A blazingly fast and lightweight language detection library for Rust","archived":false,"fork":false,"pushed_at":"2025-01-22T19:42:55.000Z","size":527,"stargazers_count":404,"open_issues_count":7,"forks_count":19,"subscribers_count":20,"default_branch":"main","last_synced_at":"2025-05-11T14:44:45.155Z","etag":null,"topics":["language-detection","natural-language-processing","rust-lang"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/quickwit-oss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-03-07T05:32:29.000Z","updated_at":"2025-05-08T11:12:57.000Z","dependencies_parsed_at":null,"dependency_job_id":"be7e7f7f-6ca7-451c-8fc8-db10d00dfe1d","html_url":"https://github.com/quickwit-oss/whichlang","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Fwhichlang","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Fwhichlang/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Fwhichlang/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/quickwit-oss%2Fwhichlang/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/quickwit-oss",
"download_url":"https://codeload.github.com/quickwit-oss/whichlang/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254414501,"owners_count":22067272,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-detection","natural-language-processing","rust-lang"],"created_at":"2024-07-31T20:00:21.091Z","updated_at":"2025-05-15T20:07:27.512Z","avatar_url":"https://github.com/quickwit-oss.png","language":"Rust","funding_links":[],"categories":["Core Libraries","Rust","NLP"],"sub_categories":[],"readme":"# Whichlang\n\nThis is a language detection library, aiming for both precision and performance.\n\n# Why build this?\nWhile building [Quickwit](https://github.com/quickwit-oss/quickwit), a search engine tailored for log and tracing data, we found ourselves needing a light, fast, and precise language detection library in Rust that works well with our high throughput requirement. 
The full story and how it works are detailed in this [blog post](https://quickwit.io/blog/whichlang-language-detection-library).\n\n# Features\n\n- No dependencies\n- Throughput above 100 MB/s for both short and long strings.\n- Good accuracy (99.5% on my validation dataset, though it depends heavily on the length of your input).\n- Supported languages: Arabic, Dutch, English, French, German, Hindi, Italian, Japanese, Korean, Mandarin, Portuguese, Russian, Spanish, Swedish, Turkish, and Vietnamese.\n\n# How does it work?\n\nIt uses a multiclass logistic regression model over:\n- 2-, 3-, and 4-grams of ASCII letters\n- codepoint / 128\n- a slightly smarter projection of codepoints over a given class.\n\nWe use the hashing trick and project these features over a space of size `4_096`.\n\nThe logistic regression is trained in the attached Python notebook\nand then used to generate `weight.rs`.\n\n# Comparison with [Whatlang](https://github.com/greyblake/whatlang-rs)\n\nThe following compares throughput, measured with the simple benchmark found in this repository, and accuracy, measured with the [whatlang-accuracy-benchmark](https://github.com/evanxg852000/whatlang-accuracy-benchmark). Overall, Whichlang is about 10x faster and slightly more accurate than Whatlang.\n\n### Throughput\n\nTo generate the throughput benchmark, we ported the benchmark available in [this repository](https://github.com/quickwit-oss/whichlang/blob/main/benches/bench.rs). 
Please check this [repository](https://github.com/evanxg852000/whatlang-accuracy-benchmark) to see our changes.\n\n| Benchmark                 | Processing Time (µs) | Throughput (MiB/s) |\n| ------------------------- | -------------------- | ------------------ |\n| whatlang/short            | 16.62                | 1.66               |\n| whatlang/long             | 62.00                | 9.42               |\n| whichlang/short           | 0.26                 | 105.69             |\n| whichlang/long            | 5.21                 | 112.31             |\n\n### Accuracy\n\nTo generate the accuracy benchmark, we changed the [whatlang-accuracy-benchmark](https://github.com/whatlang/whatlang-accuracy-benchmark) to add support for Whichlang. Given that Whatlang supports more languages, we used its FilterList feature to restrict its analysis to the languages that Whichlang supports. We also use the `trigram` method in Whatlang. Please check this [repository](https://github.com/evanxg852000/whatlang-accuracy-benchmark) to see our changes.\n\n```\nCrate: Whatlang\nAVG: 91.69%\n\n| LANG       | AVG    | \u003c= 20   | 21-50  | 51-100 | \u003e 100   |\n|------------|--------|---------|--------|--------|---------|\n| Arabic     | 99.68% | 99.51%  | 99.64% | 99.83% | 99.76%  |\n| Mandarin   | 96.09% | 97.54%  | 96.92% | 95.45% | 94.43%  |\n| German     | 88.57% | 70.00%  | 88.53% | 96.61% | 99.16%  |\n| English    | 85.99% | 57.82%  | 88.37% | 97.97% | 99.78%  |\n| French     | 90.88% | 72.84%  | 92.51% | 98.54% | 99.65%  |\n| Hindi      | 99.80% | 100.00% | 99.83% | 99.78% | 99.61%  |\n| Italian    | 87.75% | 66.67%  | 87.74% | 97.04% | 99.54%  |\n| Japanese   | 94.37% | 93.97%  | 96.04% | 94.30% | 93.18%  |\n| Korean     | 99.17% | 98.88%  | 99.69% | 99.44% | 98.66%  |\n| Dutch      | 89.68% | 72.13%  | 89.78% | 97.40% | 99.40%  |\n| Portuguese | 88.08% | 72.90%  | 85.76% | 95.22% | 98.44%  |\n| Russian    | 99.98% | 100.00% | 99.96% | 99.98% | 100.00% |\n| Spanish    | 82.91% | 55.45%  | 82.24% | 94.85% | 99.10%  |\n| Swedish    | 84.16% | 58.33%  | 83.78% | 96.35% | 98.18%  |\n| Turkish    | 86.73% | 61.01%  | 88.94% | 97.32% | 99.63%  |\n| Vietnamese | 93.23% | 82.84%  | 92.96% | 97.88% | 99.24%  |\n| AVG        | 91.69% | 78.74%  | 92.04% | 97.37% | 98.61%  |\n```\n\n```\nCrate: Whichlang\nAVG: 97.03%\n\n| LANG       | AVG     | \u003c= 20   | 21-50   | 51-100  | \u003e 100   |\n|------------|---------|---------|---------|---------|---------|\n| Arabic     | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |\n| Mandarin   | 98.65%  | 98.69%  | 98.48%  | 98.55%  | 98.87%  |\n| German     | 94.20%  | 80.00%  | 97.47%  | 99.49%  | 99.84%  |\n| English    | 97.15%  | 91.84%  | 97.25%  | 99.57%  | 99.93%  |\n| French     | 97.59%  | 93.83%  | 97.61%  | 99.20%  | 99.71%  |\n| Hindi      | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |\n| Italian    | 97.20%  | 93.06%  | 97.33%  | 98.85%  | 99.57%  |\n| Japanese   | 94.92%  | 88.95%  | 95.14%  | 97.74%  | 97.85%  |\n| Korean     | 99.83%  | 99.44%  | 99.98%  | 99.97%  | 99.94%  |\n| Dutch      | 97.08%  | 92.84%  | 96.98%  | 98.91%  | 99.60%  |\n| Portuguese | 94.07%  | 83.87%  | 94.89%  | 98.18%  | 99.36%  |\n| Russian    | 99.92%  | 99.69%  | 99.99%  | 100.00% | 100.00% |\n| Spanish    | 92.12%  | 76.36%  | 93.78%  | 98.65%  | 99.70%  |\n| Swedish    | 95.37%  | 90.28%  | 94.94%  | 97.76%  | 98.51%  |\n| Turkish    | 95.51%  | 88.24%  | 98.11%  | 98.38%  | 97.33%  |\n| Vietnamese | 98.79%  | 96.57%  | 98.87%  | 99.77%  | 99.96%  |\n| AVG        | 97.03%  | 92.10%  | 97.55%  | 99.06%  | 99.39%  |\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquickwit-oss%2Fwhichlang","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fquickwit-oss%2Fwhichlang","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fquickwit-oss%2Fwhichlang/lists"}