{"id":16209463,"url":"https://github.com/alichtman/text-language-identifier","last_synced_at":"2025-07-29T05:32:34.002Z","repository":{"id":208569158,"uuid":"132475221","full_name":"alichtman/text-language-identifier","owner":"alichtman","description":"Accurately identify written English, French or Italian text with up to 99% accuracy.","archived":false,"fork":false,"pushed_at":"2018-06-30T00:32:02.000Z","size":6166,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-07T20:51:23.805Z","etag":null,"topics":["bigram-model","language-identification","language-model","linguistic-analysis","n-grams","text-classification-python","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alichtman.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-05-07T14:48:17.000Z","updated_at":"2023-11-22T05:05:12.000Z","dependencies_parsed_at":"2023-11-22T07:35:57.858Z","dependency_job_id":null,"html_url":"https://github.com/alichtman/text-language-identifier","commit_stats":null,"previous_names":["alichtman/text-language-identifier"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/alichtman/text-language-identifier","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alichtman%2Ftext-language-identifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alichtman%2Ftext-language-identifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alichtman%2Ftext-language-ide
ntifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alichtman%2Ftext-language-identifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alichtman","download_url":"https://codeload.github.com/alichtman/text-language-identifier/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alichtman%2Ftext-language-identifier/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267633670,"owners_count":24118777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigram-model","language-identification","language-model","linguistic-analysis","n-grams","text-classification-python","text-processing"],"created_at":"2024-10-10T10:29:46.789Z","updated_at":"2025-07-29T05:32:33.961Z","avatar_url":"https://github.com/alichtman.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Text Language Identifier\n\n`text-language-identifier` accurately identifies English, French and Italian written text with up to 99% accuracy.\n\nSince this project was used to better understand `n-gram analysis`, no natural language processing modules were imported -- everything was implemented from first principles.\n\n![GIF demo](img/demo.gif)\n\n### Usage\n\n0. 
Download this repo as a `.zip`\n1. `cd src`\n2. `$ python3 lang-bigram-id.py`\n\nAccuracy information for each model will be displayed in the terminal after the analysis is complete.\n\nDiff the output files to see which lines were predicted differently by certain pairs of models.\nHere are some commands to try:\n```shell\n$ diff ../output/letter-bigram-laplace-smoothing-predictions.txt ../output/letter-bigram-no-smoothing-predictions.txt\n$ diff ../output/letter-bigram-laplace-smoothing-predictions.txt ../output/word-bigram-no-smoothing-predictions.txt\n$ diff ../output/letter-bigram-laplace-smoothing-predictions.txt ../output/word-bigram-laplace-smoothing-predictions.txt\n```\n\n### How does it work?\n\nThis program creates a probabilistic model of each language based on bigram analyses of French, English and Italian sample corpora. To predict the language of a test sentence, it creates another probabilistic model to represent the sentence and chooses the language whose model is most similar to the sentence model, as measured by root-mean-square error (RMSE).\n\n### Why?\n\nThis was a computational linguistics experiment to see which language model, word or letter bigrams, performs the best. I also tested the impact of Laplace smoothing on the predictive accuracy of the model.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falichtman%2Ftext-language-identifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falichtman%2Ftext-language-identifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falichtman%2Ftext-language-identifier/lists"}