{"id":27426953,"url":"https://github.com/hyper-node/language_detector","last_synced_at":"2025-06-24T07:07:15.503Z","repository":{"id":82615118,"uuid":"191208685","full_name":"Hyper-Node/language_detector","owner":"Hyper-Node","description":"Python program for detecting language of texts","archived":false,"fork":false,"pushed_at":"2019-07-18T18:38:37.000Z","size":16213,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-14T12:57:29.048Z","etag":null,"topics":["language-detection","language-processing"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Hyper-Node.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-06-10T16:51:37.000Z","updated_at":"2019-07-18T18:38:39.000Z","dependencies_parsed_at":null,"dependency_job_id":"33aee253-5a4d-4e01-8868-656b29101c2b","html_url":"https://github.com/Hyper-Node/language_detector","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Hyper-Node/language_detector","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hyper-Node%2Flanguage_detector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hyper-Node%2Flanguage_detector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hyper-Node%2Flanguage_detector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hyper-Node%
2Flanguage_detector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Hyper-Node","download_url":"https://codeload.github.com/Hyper-Node/language_detector/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hyper-Node%2Flanguage_detector/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261624960,"owners_count":23186118,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-detection","language-processing"],"created_at":"2025-04-14T12:45:38.179Z","updated_at":"2025-06-24T07:07:15.493Z","avatar_url":"https://github.com/Hyper-Node.png","language":"Python","readme":"## Language detector\nThis is an implementation of software that can detect the language of texts. \nIt is designed for detecting the language of longer texts such as scientific papers, which can come in *.txt\nor *.pdf format. The language detector introduces a custom algorithm for language detection, referred to as \"spatial language identificator (sli) comparison\".\nIt aims to use as few external libraries for the classification as possible. \n### Program overview \n![Overview of workflow of langdet and structure of sli](/docs/graphics_langdet/workflow_langdet.png)\n\n\n### Test Results \nTwelve documents were checked for testing. The document pool consists of scientific papers, Wikipedia articles and brochures on countries in different languages. The pool contained 3 English, 3 German, 3 French, 2 Thai and 1 Spanish document. These documents are not provided here because of licensing. 
All of these documents were classified correctly. The system was trained with input data in the same languages. \n\nThe first document was a scientific paper on sentiment analysis, which is available [here](https://www.aclweb.org/anthology/D13-1170). \nThe result output can be seen below. \n\n```\nResults______________________________________________________\nInput:     io_data/langdet/document_rsocher.pdf\nDet. lang: en                       \n_____________________________________________________________\nlanguage   distance                  likeliness               \nen         6375.422232185437         68.11193438031768        \nfr         7032.559338922403         64.82511973218834        \nsp         7586.859986309798         62.05266407777083        \nde         7835.992906497068         60.8065714085122         \nth         19993.129430373243        0.0            \n```\n\n### Pros and Cons \nPros: \n- learning language datasets is fast, under 3 minutes for 6 languages\n- the saved slis for identification take nearly no disk space (below 85 kB for 6 languages)\n- making a comparison is fast and does not take much CPU to process\n- the learning dataset is freely available in nearly any language (the Bible)\n- detection with a test dataset of 12 documents in different languages was 100% accurate \n- the algorithm is simple and could easily be ported to other programming languages or even used on microcontrollers\n\nCons: \n- much more text is required compared to a dictionary approach \n- there is no measure that a text is not in any of the supported languages\n\n### Usage \n- Adapt the input file configuration in \"configurations/language_detector.conf\". \n- Execute 'langedet_create_dataset.py' to create the comparison slis. 
\n- Execute 'langdet_check_language.py' for language identification of your specified documents; results appear on stdout\n\nThe IDE used for development was PyCharm Community Edition by JetBrains.\n\n### Possible future improvements  \n- instead of using least-mean-square comparison, train a neural classifier for distinguishing input texts. It could be trained with chapter-wise information from the training data \n- pre-filter non-text information from PDF data \n- filter the charset used for comparison in the sli objects\n- provide more language support by learning in more Bibles \n- make an adaptive web interface which allows adding correctly classified slis to the comparison dataset\n- create an n-gram based sli to take character follow-up sequences into account \n- use one international version of the Bible in many languages as the learn-in dataset\n- find a threshold or algorithm to detect texts that are not in any of the supported languages \n- find an alternative to the tika web service for PDF reading \n\n\n### References used\n\n- book graphic in the diagram by Abilngeorge, under a CC license, [here](https://de.wikipedia.org/wiki/Datei:Indian_Election_Symbol_Book.svg)\n- the [tika library](https://github.com/chrismattmann/tika-python) is used to get *.pdf file content; the library contacts a web service for that\n- [yEd](https://www.yworks.com/products/yed) was used for creating the documentation diagrams \n- Bibles for learning in can be obtained from [ebible.org](https://ebible.org/find/)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyper-node%2Flanguage_detector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhyper-node%2Flanguage_detector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyper-node%2Flanguage_detector/lists"}