{"id":28599096,"url":"https://github.com/alphacep/vosk","last_synced_at":"2025-08-29T03:33:24.847Z","repository":{"id":39351760,"uuid":"176828893","full_name":"alphacep/vosk","owner":"alphacep","description":"VOSK Speech Recognition Toolkit","archived":false,"fork":false,"pushed_at":"2022-07-13T13:15:38.000Z","size":43,"stargazers_count":458,"open_issues_count":3,"forks_count":56,"subscribers_count":29,"default_branch":"master","last_synced_at":"2025-08-02T16:45:12.050Z","etag":null,"topics":["lifelong-learning","multilingual","python","semi-supervised-learning","speech-recognition","speech-to-text","voice-recognition"],"latest_commit_sha":null,"homepage":"http://alphacephei.com","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alphacep.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-03-20T22:49:51.000Z","updated_at":"2025-07-25T16:25:49.000Z","dependencies_parsed_at":"2022-07-11T21:31:08.730Z","dependency_job_id":null,"html_url":"https://github.com/alphacep/vosk","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/alphacep/vosk","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alphacep%2Fvosk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alphacep%2Fvosk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alphacep%2Fvosk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alphacep%2Fvosk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alphacep","download_url":"https://codeload.github.com/al
phacep/vosk/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alphacep%2Fvosk/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272619474,"owners_count":24965416,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-29T02:00:10.610Z","response_time":87,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lifelong-learning","multilingual","python","semi-supervised-learning","speech-recognition","speech-to-text","voice-recognition"],"created_at":"2025-06-11T12:13:04.424Z","updated_at":"2025-08-29T03:33:24.824Z","avatar_url":"https://github.com/alphacep.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# For Kaldi API for Android and Linux please see [Vosk API](https://github.com/alphacep/vosk-api). This is a server project.\n\nThis is Vosk, the lifelong speech recognition system.\n\n## Concepts\n\nAs of 2019, the neural network based speech recognizers are pretty\nlimited in terms of amount of the speech data they can use in training\nand require enormous computing power and time to train and optimize the\nparameters. Neural networks have problems with human-like one shot\nlearning, their decisions are not very robust to unseen conditions and\nhard to understand and correct.\n\nThat is why we decided to build a system based on large signal database\nconcept. 
We apply an audio fingerprinting scheme. The audio is segmented into\nchunks, and the chunks are stored in the database keyed by their LSH hash value.\nDuring decoding we simply look up the chunks in the database to get an\nidea of what the possible phones are. That helps us make a proper decision\non decoding results.\n\nThe advantages of this approach are:\n\n  - We can quickly train on 100000 hours of speech data on very simple hardware\n  - We can easily correct recognizer behavior just by adding samples\n  - We can make sure that a recognition result is correct because it is sufficiently\n    represented in the training dataset\n  - We can parallelize training across thousands of nodes\n  - We support the lifelong learning paradigm\n  - We can use this method together with more common neural network training to improve recognition accuracy\n  - The system is robust against noise\n\nThe disadvantages are:\n\n  - The index is really huge; it is not expected to fit in the memory of a single server\n  - The generalization capabilities of the model are quite questionable; at the same time,\n    the generalization capabilities of neural networks are also questionable.\n  - For now the segmentation requires conventional ASR, but in the future we might do the segmentation ourselves.\n\nThe nice-to-have things in the future would be:\n\n  - Multilingual training\n  - Our own segmentation\n  - A tool to reduce the model to fit on mobile devices\n  - Specialized hardware to implement this AI paradigm\n\n## Usage\n\nTo install the requirements run\n\n```\npip3 install -r requirements.txt\n```\n\nTo prepare the training/verification data create the following two files:\n\n  - `wav.scp` list to map utterances to wav files in the filesystem\n  - `phones.txt` the CTM file with phonemes and timings. 
It could be a CTM file from the alignment or\n    it could be a CTM file from the decoding\n\nYou can create them with the [Kaldi ASR toolkit](http://kaldi-asr.org)\n\n### Indexing\n\nTo add the data to the database run\n\n```\npython3 index.py wavs-train.txt phones-train.txt data.idx\n```\n\nThat will add the data to the database data.idx, or create a new one\n\n### Verification\n\nTo verify decoding results run\n\n```\npython3 verify.py wavs-test.txt phones-test.txt data.idx\n```\n\nThe tool will search for segments in the index and report suspicious\nsegments, which you can additionally check and later add to the database\nto improve the accuracy of recognition.\n\n### Related papers and links\n\n - [VOSK presentation at NSU (in Russian)](https://www.youtube.com/watch?v=gsOMU1UTF7s)\n - [Memory, Modularity, and the Theory of Deep Learnability. Google Tech Talk by Rina Panigrahy](https://www.youtube.com/watch?v=bP5oyH_5nMU) shows the importance of memory for learning complex functions.\n - [Large Language Models in Machine Translation by Thorsten Brants et al.](https://aclweb.org/anthology/D07-1090.pdf) Google's paper on simple backoff terascale LM.\n - [Deep Learning of Binary Hash Codes for Fast Image Retrieval by Kevin Lin et al.](https://www.iis.sinica.edu.tw/~kevinlin311.tw/cvprw15.pdf) a nice deephash [implementation](https://github.com/flyingpot/pytorch_deephash)\n - [Episodic Memory in Lifelong Language Learning](https://arxiv.org/pdf/1906.01076.pdf)\n - [Extreme Classification in Log Memory using Count-Min Sketch: A Case Study of Amazon Search with 50M Products](https://arxiv.org/abs/1910.13830)\n - [On-device Supermarket Product Recognition](https://ai.googleblog.com/2020/07/on-device-supermarket-product.html) Google's good example of kNN for mobile search\n - [Hash-Routed Neural Networks](https://github.com/ma3oun/hrn) Great idea and solid math\n - [Towards Lifelong Learning of End-to-end ASR](https://arxiv.org/pdf/2104.01616.pdf) Methods get more publicity\n - 
[Building Scalable, Explainable, and Adaptive NLP Models with Retrieval](http://ai.stanford.edu/blog/retrieval-based-NLP)\n - [Continual Learning for Monolingual End-to-End Automatic Speech Recognition](https://arxiv.org/abs/2112.09427)\n - [Mammoth - An Extendible (General) Continual Learning Framework for Pytorch](https://github.com/aimagelab/mammoth)\n - [Progressive Continual Learning for Spoken Keyword Spotting](https://arxiv.org/abs/2201.12546)\n - [Online Continual Learning of End-to-End Speech Recognition Models](https://arxiv.org/abs/2207.05071)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falphacep%2Fvosk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falphacep%2Fvosk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falphacep%2Fvosk/lists"}