{"id":15003651,"url":"https://github.com/nitotm/efficient-language-detector-js","last_synced_at":"2025-04-05T05:02:29.183Z","repository":{"id":171379207,"uuid":"646225068","full_name":"nitotm/efficient-language-detector-js","owner":"nitotm","description":"Fast and accurate natural language detection. Detector written in Javascript. Nito-ELD, ELD. ","archived":false,"fork":false,"pushed_at":"2024-10-30T18:36:03.000Z","size":14199,"stargazers_count":60,"open_issues_count":3,"forks_count":10,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-29T04:03:35.996Z","etag":null,"topics":["javascript","language","language-detection","language-detector","language-identification","natural-language","natural-language-processing","nlp","nodejs"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nitotm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-27T17:30:09.000Z","updated_at":"2025-03-20T13:47:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"6c7f18e9-2eb9-4c63-bfd0-9d1d52508227","html_url":"https://github.com/nitotm/efficient-language-detector-js","commit_stats":null,"previous_names":["nitotm/efficient-language-detector-js"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nitotm%2Fefficient-language-detector-js","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nitotm%2Fefficient-language-detector-js/tags","releases_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nitotm%2Fefficient-language-detector-js/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nitotm%2Fefficient-language-detector-js/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nitotm","download_url":"https://codeload.github.com/nitotm/efficient-language-detector-js/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247289409,"owners_count":20914464,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["javascript","language","language-detection","language-detector","language-identification","natural-language","natural-language-processing","nlp","nodejs"],"created_at":"2024-09-24T18:59:44.121Z","updated_at":"2025-04-05T05:02:29.164Z","avatar_url":"https://github.com/nitotm.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Efficient Language Detector\n\n\u003cdiv align=\"center\"\u003e\n\t\n![supported Javascript versions](https://img.shields.io/badge/JS-%3E%3D%20ES2015-blue)\n![supported Javascript versions](https://img.shields.io/badge/Node.js-%3E%3D%2016-blue)\n[![license](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)\n[![supported languages](https://img.shields.io/badge/supported%20languages-60-brightgreen.svg)](#languages)\n\t\n\u003c/div\u003e\n\nEfficient language detector (*Nito-ELD* or *ELD*) is a fast and accurate language detector, is one of the fastest non compiled detectors, while its 
accuracy is within the range of the heaviest and slowest detectors.\n\nIt's 100% Javascript (vanilla), easy installation and no dependencies.  \nELD is also available in [Python](https://github.com/nitotm/efficient-language-detector-py) and [PHP](https://github.com/nitotm/efficient-language-detector).\n\n1. [Install](#install)\n2. [How to use](#how-to-use)\n3. [Benchmarks](#benchmarks)\n4. [Languages](#languages)\n\n## Install\n\n- For *Node.js*\n```bash\n$ npm install eld\n```\n- For Web, just download or clone the files  \n`git clone https://github.com/nitotm/efficient-language-detector-js`\n\n## How to use?\n\n### Load ELD\n\n- At Node.js REPL\n```javascript\nconst { eld } = await import('eld')\n```\n- At Node.js\n```javascript\nimport { eld } from 'eld' // use .mjs extension for version \u003c18\n```\n- At the Web Browser\n\n```html\n\n\u003cscript type=\"module\" charset=\"utf-8\"\u003e\n    import { eld } from './src/languageDetector.js' // Update path.\n    /* code */\n\u003c/script\u003e\n```\n- To load the minified version, which is not a module\n```html\n\u003cscript src=\"minified/eld.M60.min.js\" charset=\"utf-8\"\u003e\u003c/script\u003e\n```\n\n### Usage\n\n`detect()` expects a UTF-8 string, and returns an object, with a 'language' variable, with a ISO 639-1 code or empty string\n```javascript\nconsole.log( eld.detect('Hola, cómo te llamas?') )\n// { language: 'es', getScores(): {'es': 0.5, 'et': 0.2}, isReliable(): true }\n// returns { language: string, getScores(): Object, isReliable(): boolean } \n\nconsole.log( eld.detect('Hola, cómo te llamas?').language )\n// 'es'\n```\n - To reduce the languages to be detected, there are 2 options, they only need to be executed once. 
(Check available [languages](#languages) below)\n```javascript\nlet langSubset = ['en', 'es', 'fr', 'it', 'nl', 'de']\n\n// Option 1 \n// Setting dynamicLangSubset(), detect() executes normally but finally filters the excluded languages\neld.dynamicLangSubset(langSubset) // Returns an Object with the validated languages of the subset\n// to remove the subset\neld.dynamicLangSubset(false)\n\n// Option 2\n// The optimal way to regularly use the same subset, is using saveSubset() to download a new database\neld.saveSubset(langSubset) // ONLY for the Web Browser, and not included at minified files\n// We can load any Ngrams database saved at src/ngrams/, including subsets. Returns true if success\nawait eld.loadNgrams('ngramsL60.js') // eld.loadNgrams('file').then((loaded) =\u003e { if (loaded) { } })\n// To modify the preloaded database, edit the filename loadNgrams('filename') at languageDetector.js\n```\n- Also, we can get the current status of eld: languages, database type and subset\n```javascript\n  console.log( eld.info() )\n```\n## Benchmarks\n\nI compared *ELD* with a different variety of detectors, since the interesting part is the algorithm.\n\n| URL                                                       | Version       | Language     |\n|:----------------------------------------------------------|:--------------|:-------------|\n| https://github.com/nitotm/efficient-language-detector-js/ | 0.9.0         | Javascript   |\n| https://github.com/nitotm/efficient-language-detector/    | 1.0.0         | PHP          |\n| https://github.com/pemistahl/lingua-py                    | 1.3.2         | Python       |\n| https://github.com/CLD2Owners/cld2                        | Aug 21, 2015  | C++          |\n| https://github.com/google/cld3                            | Aug 28, 2020  | C++          |\n| https://github.com/wooorm/franc                           | 6.1.0         | Javascript   |\n\nBenchmarks: **Tweets**: *760KB*, short sentences of 140 chars max.; **Big 
test**: *10MB*, sentences in all 60 languages supported; **Sentences**: *8MB*, this is the *Lingua* sentences test, minus unsupported languages.  \nShort sentences is what *ELD* and most detectors focus on, as very short text is unreliable, but I included the *Lingua* **Word pairs** *1.5MB*, and **Single words** *880KB* tests to see how they all compare beyond their reliable limits.\n\nThese are the results, first, accuracy and then execution time.\n\n\u003c!-- Accuracy table\n|                     | Tweets       | Big test     | Sentences    | Word pairs   | Single words |\n|:--------------------|:------------:|:------------:|:------------:|:------------:|:------------:|\n| **Nito-ELD**        | 99.3%        | 99.4%        | 98.8%        | 87.6%        | 73.3%        |\n| **Nito-ELD-L**      | 99.4%        | 99.4%        | 98.7%        | 89.4%        | 76.1%        |\n| **Nito-ELD-xs**     | 99.2%        | 99.4%        | 98.4%        | 84.4%        | 66.8%        |\n| **Lingua**          | 98.8%        | 99.1%        | 98.6%        | 93.1%        | 80.0%        |\n| **CLD2**            | 93.8%        | 97.2%        | 97.2%        | 87.7%        | 69.6%        |\n| **Lingua low**      | 96.0%        | 97.2%        | 96.3%        | 83.7%        | 68.0%        |\n| **CLD3**            | 92.2%        | 95.8%        | 94.7%        | 69.0%        | 51.5%        |\n| **franc**           | 89.8%        | 92.0%        | 90.5%        | 65.9%        | 52.9%        |\n--\u003e\n\u003cimg alt=\"accuracy table\" width=\"800\" src=\"https://raw.githubusercontent.com/nitotm/efficient-language-detector-js/main/misc/table_accuracy_js.svg\"\u003e\n\n\u003c!--- Time table\n|                     | Tweets       | Big test     | Sentences    | Word pairs   | Single words |\n|:--------------------|:------------:|:------------:|:------------:|:------------:|:------------:|\n| **Nito-ELD-js**     |     0.58\"    |      5.1\"    |      4.3\"    |     1.2\"     |     0.73\"    |\n| 
**Nito-ELD-L-js**   |     0.59\"    |      5.2\"    |      4.5\"    |     1.2\"     |     0.77\"    |\n| **Nito-ELD-XS-js**  |     0.5\"     |      4.6\"    |      4\"      |     1.1\"     |     0.71\"    |\n| **Lingua**          |  4790\"       |  24000\"      |  18700\"      |  8450\"       |  6700\"       |\n| **CLD2**            |     0.35\"    |      2\"      |      1.7\"    |     0.98\"    |     0.8\"     |\n| **Lingua low**      |    64\"       |    370\"      |    308\"      |   108\"       |    85\"       |\n| **CLD3**            |     3.9\"     |     29\"      |     26\"      |    12\"       |    11\"       |\n| **franc**           |     1.2\"     |      8\"      |      7.8\"    |     2.8\"     |     2\"       |\n| **Nito-ELD-php**    |     0.31\"    |      2.5\"    |      2.2\"    |     0.66\"    |     0.48\"    |\n--\u003e\n\u003cimg alt=\"time table\" width=\"800\" src=\"https://raw.githubusercontent.com/nitotm/efficient-language-detector-js/main/misc/table_time_js.svg\"\u003e\n\n\u003csup style=\"color:#08e\"\u003e1.\u003c/sup\u003e \u003csup style=\"color:#777\"\u003eLingua could have a small advantage as it participates with 54 languages, 6 less.\u003c/sup\u003e  \n\u003csup style=\"color:#08e\"\u003e2.\u003c/sup\u003e \u003csup style=\"color:#777\"\u003eCLD2 and CLD3, return a list of languages, the ones not included in this test where discarded, but usually they return one language, I believe they have a disadvantage. \nAlso, I confirm the results of CLD2 for short text are correct, contrary to the test on the *Lingua* page, they did not use the parameter \"bestEffort = True\", their benchmark for CLD2 is unfair.\n\n*Lingua* is the average accuracy winner, but at what cost, the same test that in *ELD* or *CLD2* is below 6 seconds, in Lingua takes more than 5 hours! It acts like a brute-force software. 
\nAlso, its lead comes from single words and word pairs, which are unreliable regardless.\n\nI added *ELD-L* for comparison, which has a 2.3x bigger database, but only increases execution time marginally, a testament to the efficiency of the algorithm. *ELD-L* is not the main database as it does not improve language detection in sentences.\n\nFor a client-side solution, I included an all-in-one detector+Ngrams minified file, of the standard version (M), and XS which still performs great for sentences. \nThe XS version only weighs 865kb, when gzipped it's only 245kb. The standard version is 486kb gzipped.\n\nHere is the average, per benchmark, of Tweets, Big test \u0026 Sentences.\n\n![Sentences tests average](https://raw.githubusercontent.com/nitotm/efficient-language-detector-js/main/misc/sentences-tests-avg-js.png)\n\u003c!--- Sentences average\n|                     | Time         | Accuracy     |\n|:--------------------|:------------:|:------------:|\n| **Nito-ELD-js**     |      3.32\"   | 99.16%       |\n| **Nito-ELD-php**    |      1.65\"   | 99.16%       |\n| **Lingua**          |  15800\"      | 98.84%       |\n| **CLD2**            |      1.35\"   | 96.08%       |\n| **Lingua low**      |    247\"      | 96.51%       |\n| **CLD3**            |     19.6\"    | 94.19%       |\n| **franc**           |      5.7\"    | 90.79%       |\n--\u003e\n\n## Languages\n\nThese are the *ISO 639-1 codes* of the 60 supported languages for *Nito-ELD* v1\n\n\u003e 'am', 'ar', 'az', 'be', 'bg', 'bn', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fr', 'gu', 'he', 'hi', 'hr', 'hu', 'hy', 'is', 'it', 'ja', 'ka', 'kn', 'ko', 'ku', 'lo', 'lt', 'lv', 'ml', 'mr', 'ms', 'nl', 'no', 'or', 'pa', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tl', 'tr', 'uk', 'ur', 'vi', 'yo', 'zh'\n\n\nFull name languages:\n\n\u003e Amharic, Arabic, Azerbaijani (Latin), Belarusian, Bulgarian, Bengali, Catalan, Czech, Danish, German, Greek, English, Spanish, 
Estonian, Basque, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Armenian, Icelandic, Italian, Japanese, Georgian, Kannada, Korean, Kurdish (Arabic), Lao, Lithuanian, Latvian, Malayalam, Marathi, Malay (Latin), Dutch, Norwegian, Oriya, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovene, Albanian, Serbian (Cyrillic), Swedish, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Yoruba, Chinese\n\n\n## Future improvements\n\n- Train from bigger datasets, and more languages.\n- The tokenizer could separate characters from languages that have their own alphabet, potentially improving accuracy and reducing the N-grams database. Retraining and testing is needed.\n\n**Donate / Hire**   \nIf you wish to Donate for open source improvements, Hire me for private modifications / upgrades, or to Contact me, use the following link: https://linktr.ee/nitotm","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnitotm%2Fefficient-language-detector-js","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnitotm%2Fefficient-language-detector-js","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnitotm%2Fefficient-language-detector-js/lists"}