{"id":32154158,"url":"https://github.com/nusretipek/languagefinder","last_synced_at":"2025-10-21T11:52:01.756Z","repository":{"id":61798462,"uuid":"323973406","full_name":"nusretipek/LanguageFinder","owner":"nusretipek","description":"A simple to use language detection package written in Julia using bigarms, trigrams and quadrigrams. 25 default languages with a built-in option to train new ones.","archived":false,"fork":false,"pushed_at":"2020-12-27T21:06:35.000Z","size":504,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-05-22T09:45:18.170Z","etag":null,"topics":["detection","julia","language","languagedetection","ngrams","nlp"],"latest_commit_sha":null,"homepage":"","language":"Julia","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nusretipek.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-12-23T18:21:37.000Z","updated_at":"2023-02-08T09:16:37.000Z","dependencies_parsed_at":"2022-10-21T11:15:30.049Z","dependency_job_id":null,"html_url":"https://github.com/nusretipek/LanguageFinder","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/nusretipek/LanguageFinder","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nusretipek%2FLanguageFinder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nusretipek%2FLanguageFinder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nusretipek%2FLanguageFinder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nusretipek%2FLanguageF
inder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nusretipek","download_url":"https://codeload.github.com/nusretipek/LanguageFinder/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nusretipek%2FLanguageFinder/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280256223,"owners_count":26299342,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-21T02:00:06.614Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["detection","julia","language","languagedetection","ngrams","nlp"],"created_at":"2025-10-21T11:52:00.604Z","updated_at":"2025-10-21T11:52:01.751Z","avatar_url":"https://github.com/nusretipek.png","language":"Julia","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LanguageFinder\n\n[![Build Status](https://travis-ci.com/nusretipek/LanguageFinder.svg?branch=main)](https://travis-ci.com/nusretipek/LanguageFinder)\n[![Check Status](https://img.shields.io/github/checks-status/nusretipek/LanguageFinder/main)](https://img.shields.io/github/checks-status/nusretipek/LanguageFinder/main)\n[![Lang Status](https://img.shields.io/github/languages/top/nusretipek/LanguageFinder?color=blueviolet)](https://img.shields.io/github/languages/top/nusretipek/LanguageFinder?color=blueviolet)\n\n*A simple Julia package for language detection using 
bigrams, trigrams and quadrigrams.*\n\nThe package is designed to detect the most common languages accurately and to train, on demand, any language that has Wikipedia pages (\u003e200). It uses a consensus approach to guess the language, rather than trigrams alone, to improve accuracy. It is the first Julia package that uses quadrigrams in language detection.\n\n## Installation Instructions\n\n```\nusing Pkg\nPkg.add(\"LanguageFinder\")\n```\n\n## Basic Usage\n\n```\nusing LanguageFinder\n\nL = LanguageFinder.LanguageFind\nL(\"This is a ship.\", 0).lang\n```\n\nThe struct takes two parameters: text and ngram. ngram = 0 is the default and uses a consensus of bigrams, trigrams and quadrigrams. It is slower than a single n-gram evaluation but more accurate. If speed is a concern, the ngram parameter can be set to 1, 2, 3 or 4, selecting a unigram, bigram, trigram or quadrigram check respectively. Trigrams and quadrigrams are reliable. Prefer bigrams for languages like Chinese or Japanese, where a single character can represent a word and the training set is limited. \n\nThere are 25 default languages, each trained on approximately 500 Wikipedia articles. The included languages are:\n1. AR - Arabic\n2. CS - Czech\n3. DA - Danish\n4. DE - German\n5. EL - Greek\n6. EN - English\n7. ES - Spanish\n8. FA - Persian\n9. FI - Finnish\n10. FR - French\n11. HE - Hebrew\n12. HI - Hindi\n13. HU - Hungarian\n14. IT - Italian\n15. JP - Japanese\n16. KO - Korean\n17. NL - Dutch\n18. NO - Norwegian\n19. PL - Polish\n20. PT - Portuguese\n21. RU - Russian\n22. SV - Swedish\n23. TR - Turkish\n24. UK - Ukrainian\n25. ZH - Chinese\n\n## Training New Languages / Improving Existing Weights\nIn some systems, the package directory may be read-only. Make sure that the *C:\\Users\\USERNAME\\.julia\\packages\\LanguageFinder* folder is **not** read-only. \n\n```\ntrain_wikipedia_text(\"eo\", 5, 15)\n```\n\nThe function takes three parameters: the language code, the number of pages to train on and the number of seconds to rest. 
\nPlease see [List of Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias) for possible language codes (WP Code). There is no default number of pages. The sleep interval defaults to 15 seconds but can be changed. It is there to make sure that the program treats Wikipedia servers fairly. \n\nThe function can not only train a new language but also override the default weights of an existing one. \n\n```\ntrain_wikipedia_text(\"es\", 1000, 5)\n```\n\nThis would override the n-gram files for Spanish by using 1,000 Wikipedia pages instead of 500.  \n\n*If you train your corpus using Wikipedia servers, please consider supporting/donating to the non-profit organization: https://wikimediafoundation.org/support/*\n\n\u003chr\u003e\n\nRelease v0.1.1 - Relative paths are corrected for the Linux and macOS environments.  \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnusretipek%2Flanguagefinder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnusretipek%2Flanguagefinder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnusretipek%2Flanguagefinder/lists"}