{"id":29061725,"url":"https://github.com/chattylabs/language-detector","last_synced_at":"2025-06-27T08:09:03.144Z","repository":{"id":57101352,"uuid":"144708667","full_name":"chattylabs/language-detector","owner":"chattylabs","description":"Package to detect the language of a given text (focusing on short \"sms\" type text used on tweets, facebook, WhatsApp, etc)","archived":false,"fork":false,"pushed_at":"2018-09-20T20:13:15.000Z","size":3770,"stargazers_count":11,"open_issues_count":2,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-05-23T20:41:03.510Z","etag":null,"topics":["algorithm","detect-language","javascript","language-detection","language-detector","n-grams","node","reducers","translate"],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chattylabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-08-14T11:04:24.000Z","updated_at":"2023-06-28T14:42:56.000Z","dependencies_parsed_at":"2022-08-20T16:20:57.930Z","dependency_job_id":null,"html_url":"https://github.com/chattylabs/language-detector","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chattylabs%2Flanguage-detector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chattylabs%2Flanguage-detector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chattylabs%2Flanguage-detector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chattylabs%2F
language-detector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chattylabs","download_url":"https://codeload.github.com/chattylabs/language-detector/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chattylabs%2Flanguage-detector/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":257795098,"owners_count":22604256,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithm","detect-language","javascript","language-detection","language-detector","n-grams","node","reducers","translate"],"created_at":"2025-06-27T08:05:03.569Z","updated_at":"2025-06-27T08:09:03.135Z","avatar_url":"https://github.com/chattylabs.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# \u003cimg src=\"art/logo.png\" width=\"250px\"/\u003e\n\nThis package helps detect the language of a given text.\n\nThe end goal is to detect the language of any text, no matter how short or obscure (think messages from Twitter, WhatsApp, Instagram, SMS, etc.), and return an object describing the language that best matches it.\n\n```\n{\n  language: 'en',\n  country: 'gb'\n}\n```\n\nThis is obtained with a combination of \"reducing\" and \"matching\". 
Given a piece of text, we can reduce it to a set of potential languages by checking for common patterns (see `src/utils/reducers.js`). Additionally, we can match the n-grams of said text against a set of pre-compiled language profiles generated through \"learning\" (processing known samples).\n\n\n## Usage\n\n### With built-in language detection\n\n```\nconst detect = require('@chattylabs/language-detection')\nconst result = detect('some text to detect')\nconst language = result.language\n```\n\n### With custom language profiles\n\n```\nconst detect = require('@chattylabs/language-detection')\nconst customLanguageProfiles = require('../path/to/data/languageProfiles.json')\n\nconst result = detect(text, {\n  languageProfiles: customLanguageProfiles\n})\nconst language = result.language\n```\n\nNOTE: only the languages in the profiles you provide will be used. You can also merge your profiles with the base ones:\n\n```\nconst combinedProfiles = {\n  ...require('@chattylabs/language-detection').languageProfiles,\n  ...customLanguageProfiles\n}\n```\n\n#### Generating your own language profiles\n\nYou will need to write a \"training\" script that analyses all your sample data and generates the language profiles object.\n\nYour sample data should be a set of `.txt` files containing as much text as possible, similar to the text you will be detecting. Provide one file per language or locale, e.g. 
`data/samples/en.txt`, `data/samples/fr.txt`, `data/samples/cn.txt` or `data/samples/en_GB.txt` (a locale code that includes a country identifier must use an underscore `_` separator).\n\n```\n// bin/train.js\nconst train = require('@chattylabs/language-detection').train\ntrain('./path/to/custom/samples/*.txt', './path/to/custom/export/languageProfiles.json')\n```\n\nThen execute it via the CLI with `node bin/train.js`, or via an npm script.\n\nNOTE: filenames determine the language. A filename such as `en_GB` will be split into language and country in the response.\n\n\n### With custom reducers\n\n```\nconst detect = require('@chattylabs/language-detection')\nconst customLanguageProfiles = require('../path/to/data/languageProfiles.json')\nconst customReducers = require('../path/to/your/reducers')\n\nconst result = detect(text, {\n  languageProfiles: customLanguageProfiles,\n  reducers: customReducers\n})\nconst language = result.language\n```\n\n#### Writing reducers\n\nReducers are a collection of objects, each of which maps a regex to an array of languages. They help reduce the number of languages we need to run the n-gram matching on, by finding intersections of known patterns.\n\nFor example, imagine we provide the following reducers:\n\n```\n// /path/to/your/reducers.js\nmodule.exports = [\n  {\n    regex: /[ñ]+/i,\n    languages: ['es', 'gn', 'gl']\n  },\n  {\n    regex: /[áéíóú]+/i,\n    languages: ['fr', 'es', 'it', 'cn', 'nl', 'fo', 'is', 'pt', 'vi', 'cy', 'el', 'gl']\n  }\n]\n```\n\nFrom the above, we would reduce the words \"Alimentación de niño\" to the languages ['es', 'gl'] (the intersection of the two matching reducers), and only run n-gram matching on those. If the reducers were to leave just one language, that would be our result.\n\nNOTE: providing your own reducers will override the base ones. 
If you choose not to provide your own reducers but do use your own language profiles, languages that are not in your profiles will not be taken into account.\n\nYou can also combine your own reducers with the base ones:\n\n```\nconst combinedReducers = [\n  ...require('@chattylabs/language-detection').reducers,\n  ...customReducers\n]\n```\n\n\n## References\n\n- https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html\n- https://shuyo.wordpress.com/2012/02/21/language-detection-for-twitter-with-99-1-accuracy/\n- http://cloudmark.github.io/Language-Detection/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchattylabs%2Flanguage-detector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchattylabs%2Flanguage-detector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchattylabs%2Flanguage-detector/lists"}