{"id":13793433,"url":"https://github.com/shenasa-ai/speech2text","last_synced_at":"2025-05-12T20:31:01.791Z","repository":{"id":40959409,"uuid":"265961360","full_name":"shenasa-ai/speech2text","owner":"shenasa-ai","description":"A Deep-Learning-Based Persian Speech Recognition System ","archived":false,"fork":false,"pushed_at":"2023-05-22T22:54:39.000Z","size":22913,"stargazers_count":207,"open_issues_count":6,"forks_count":29,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-11-18T08:53:18.945Z","etag":null,"topics":["attention","attention-mechanism","ctc","keras","mozilla-deepspeech","python","speech-recognition","speech-to-text","teacher-forcing","tensorflow2"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shenasa-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"License.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-21T22:07:24.000Z","updated_at":"2024-11-06T20:44:34.000Z","dependencies_parsed_at":"2024-08-03T23:11:41.202Z","dependency_job_id":null,"html_url":"https://github.com/shenasa-ai/speech2text","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shenasa-ai%2Fspeech2text","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shenasa-ai%2Fspeech2text/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shenasa-ai%2Fspeech2text/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/shenasa-ai%2Fspeech2text/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shenasa-ai","download_url":"https://codeload.github.com/shenasa-ai/speech2text/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253816687,"owners_count":21968867,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","attention-mechanism","ctc","keras","mozilla-deepspeech","python","speech-recognition","speech-to-text","teacher-forcing","tensorflow2"],"created_at":"2024-08-03T23:00:21.263Z","updated_at":"2025-05-12T20:31:00.512Z","avatar_url":"https://github.com/shenasa-ai.png","language":"Jupyter Notebook","readme":"# Speech to Text 🚀\nThis repo builds an ASR system using existing toolkits PLUS our own implementations.\n\nToolkits used:\n- [Mozilla Deep Speech](#mozilla-deep-speech)\n- [DeepSpeech2](#deepspeech2)\n- [Wav2vec](#wav2vec-20)\n- [Our Implementations](#my-own-implementations)\n\n\u003cbr\u003e\n\nTIP: If you only want to use the scripts (text cleaning / data collecting / creating the final dataset CSV files), simply use requirements.txt to install the required dependencies.\n```\npip3 install -r requirements.txt\n```\n\n# Prerequisites 📋\n\nYou need to know about RNNs, the attention mechanism, CTC, a little pandas/NumPy, TensorFlow, Keras, and NLP topics (e.g. transformers, text cleaning, etc.). Knowing about spectrograms, MFCCs, and filter banks will also help you understand the audio preprocessing. 
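To make the spectrogram idea above concrete, here is a minimal NumPy sketch (not code from this repo; the 25 ms frame / 10 ms hop values are common speech-feature defaults I am assuming, matching the 16 kHz audio used elsewhere in this README):

```python
# Toy illustration of a spectrogram: slice a waveform into
# overlapping frames and take the FFT magnitude of each frame.
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    # 400 samples = 25 ms and 160 samples = 10 ms at 16 kHz.
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (num_frames, frame_len // 2 + 1)

# One second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 201); energy peaks in bin 11 (= 440 Hz / 40 Hz per bin)
```

MFCCs and filter banks are further transforms on top of such a spectrogram; libraries like python_speech_features or librosa (both used in this repo) compute them for you.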
\u003cbr\u003e\u003cbr\u003e\nIf you don't know any of these topics, check the Wiki page; it collects many useful links.\n\n\u003cbr\u003e\n\n# Models Output 🎯\n\u003ctable width=\"100%\"\u003e\n \u003ctr\u003e\n   \u003cth width=\"30%\"\u003e Model Name \u003c/th\u003e \n   \u003cth width=\"40%\"\u003e Dataset \u003c/th\u003e \n   \u003cth width=\"30%\"\u003e Loss \u003c/th\u003e \n \u003c/tr\u003e\n\n\u003ctr\u003e\n   \u003ctd width=\"25%\"\u003e Our own implementation (first try) \u003c/td\u003e\n   \u003ctd width=\"25%\"\u003e common_voice_en (400h) \u003c/td\u003e\n \u003ctd width=\"25%\"\u003e 74 (very poor results; many models tested) \u003c/td\u003e\n \u003c/tr\u003e\n\n \u003ctr\u003e\n   \u003ctd width=\"25%\"\u003eDeep Speech 1: Mozilla (first try) \u003c/td\u003e\n   \u003ctd width=\"25%\"\u003e common_voice + TV programs + radio programs (300h total) \u003c/td\u003e\n \u003ctd width=\"25%\"\u003e 28 \u003c/td\u003e\n \u003c/tr\u003e\n\n  \u003ctr\u003e\n   \u003ctd width=\"25%\"\u003eDeep Speech 1: Mozilla (second try) \u003c/td\u003e\n   \u003ctd width=\"25%\"\u003e common_voice + TV programs + radio programs (300h total) \u003c/td\u003e\n \u003ctd width=\"25%\"\u003e 25\u003c/td\u003e\n \u003c/tr\u003e\n\n  \u003ctr\u003e\n   \u003ctd width=\"25%\"\u003eDeep Speech 1: Mozilla + transfer learning (third try) \u003c/td\u003e\n   \u003ctd width=\"25%\"\u003e common_voice + TV programs + radio programs (300h total) \u003c/td\u003e\n \u003ctd width=\"25%\"\u003e 24\u003c/td\u003e\n \u003c/tr\u003e\n\n  \u003ctr\u003e\n   \u003ctd width=\"25%\"\u003eDeep Speech 1: Mozilla + transfer learning (third try) \u003c/td\u003e\n   \u003ctd width=\"25%\"\u003e common_voice + TV programs + radio programs (+1000h total) \u003c/td\u003e\n \u003ctd width=\"25%\"\u003e 22 \u003c/td\u003e\n \u003c/tr\u003e\n\n\u003c/table\u003e\n\n\n# Dataset we used 📁\nThere are many public datasets for English. 
But for Persian there are not enough free STT datasets, so we created our own data crawler to collect data.\n\nThe [common voice dataset](https://voice.mozilla.org/en/datasets) is a rich free dataset.\u003cbr\u003e\n\n### How to use our script to collect data\nThis repo contains one relevant folder:\n- **crawler**\n\n**crawler**: this folder holds a single script that crawls a radio archive and collects the data we need. You can adapt the crawler to download from other websites too. [For more info, check the README file in the crawler folder.]\n\n\n# Full DataSet 📁 ⚡🔥\nHere at Hamtech Company, we decided to open-source our ASR dataset. The dataset is nearly 200 GB of audio plus CSV files containing the transcriptions (some parts ship a TXT file instead of a CSV). Each CSV has a column named Confidence_level indicating how reliable a transcription is; you can use LMs (language models) or any other approach to clean the transcriptions further. In summary:\n\n* You may use techniques such as LMs or human annotators to further clean the transcriptions.\n* You can use the confidence_level column in the CSV file to select more accurate rows.\n* WAV files total nearly 200 GB. \n* Audio format: WAV / channels: mono / sample rate: 16000 Hz\n\n\u003cbr\u003e \nNote: 9 GB of data is lost. 
:( \n\nLinks: \n\u003cbr\u003e\nVersion 1 parts contain WAV + TXT files; Version 2 parts contain ZIP + CSV files.\n- Dataset_part_1_v1 : https://drive.google.com/drive/folders/1jdR4joj1BsU_LYHXriUW0xT4tgXicaqv?usp=share_link\n- Dataset_part_2_v1 : https://drive.google.com/drive/folders/1tVOrcwpxVcfrON5t9rdFSLZ5LNwJuCrV?usp=share_link\n- Dataset_part_3_v1 : https://drive.google.com/drive/folders/1FWY3MTrpMF-WrqFbMSM-fNmLn0FiUoNx?usp=share_link\n- Dataset_part_4_v1 : Data is lost :( (this part was 9.7 GB of zipped WAV/TXT files)\n- Dataset_part_1_v2 : https://drive.google.com/drive/folders/1ZsTMb_V-UAXxxi-wRE-g4hXXntonA_P3?usp=share_link\n- Dataset_part_2_v2 : https://drive.google.com/drive/folders/1eAPjF_DVU9j4nQ8S0aWQTbCbTI5sBrYp?usp=share_link\n- Dataset_part_3_v2 : https://drive.google.com/drive/folders/1rMNYwKtkyz8tprhwErrcDT-TLKtWA0OB?usp=share_link\n- Dataset_part_4_v2 : https://drive.google.com/drive/folders/1Lxq8ouA6UWEOkHfNjxJ7Kf5k51D5t2V8?usp=share_link\n\n\u003cbr\u003e\n\nNOTE: if you need more tips, don't hesitate to email me: masoudparpanchi@gmail.com\n\n# Part of Our Dataset V0.1 📁 ⚡🔥\nHere at Hamtech Company, we decided to open-source a challenging part of our ASR dataset. This dataset is nearly 30 hours of audio plus a CSV file containing the transcriptions. The CSV has a column named Confidence_level indicating how reliable each transcription is; you can use LMs (language models) or any other approach to clean the transcriptions further. Speaker variety in this dataset is limited, but the audio quality is good enough. Check the Dataset folder in this repo. In summary:\n\n* The dataset is nearly 30 hours.\n* Transcriptions are not an exact match. \n* Speaker variety is limited.\n* You may use some techniques ( like using LMs, using annotator, etc. 
) to further clean the transcriptions.\n* You can use the confidence_level column in the CSV file to select more accurate rows.\n* WAV files total nearly 2 GB. \n* Audio format: WAV / channels: mono / sample rate: 16000 Hz\n* Google Drive dataset URL: https://drive.google.com/drive/folders/1BLOYLBOUSWI50k4RTnpTc7Ni4rYxjVi2?usp=sharing\n* To download just the CSV file, check the Dataset folder in this repo or the [Google Drive link](https://drive.google.com/file/d/1vqvn0F0YYhEFbzLgP9wJ36vyInUnO5b5/view?usp=sharing)\n\n# Mozilla Deep Speech\n\nLast checkpoints of the trained speech-to-text model (these are not ready for commercial use cases; only fine-tuned models for you to use in your own project):\n- Mozilla DeepSpeech checkpoints (WARNING: check the experiments log for hyperparameters when you want to fine-tune from these checkpoints): https://drive.google.com/drive/folders/1FyLFudV_o71WeBQEQIn_-GficBpEMIzM?usp=share_link\n- pb format of the checkpoint: https://drive.google.com/file/d/1RhISAEUwG9MwkLIFyrb1sIFi4UhNTKL6/view?usp=share_link\n\n### Start using DeepSpeech: clone and download the common voice dataset \nTo use this toolkit, first follow [this link](https://deepspeech.readthedocs.io/en/latest/TRAINING.html) or the short installation below.\n\nCurrently we are using DeepSpeech v0.9.3.\n\n#### Short Installation\n- Clone and create a virtual environment\n```\ngit clone --branch v0.9.3 https://github.com/mozilla/DeepSpeech\ncd DeepSpeech\npip3 install --upgrade pip==20.2.2 wheel==0.34.2 setuptools==49.6.0\npip3 install --upgrade -e .\npip install tensorflow-gpu==1.15.4\n```\n\nAfter cloning and installing the dependencies, you need the [common voice dataset](https://voice.mozilla.org/en/datasets).\nDownload the proper language and then preprocess it. 
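For orientation, the manifest CSVs produced by import_cv2.py use the columns wav_filename, wav_filesize, and transcript. Below is a stdlib-only sketch of reading and sanity-checking such a manifest; treat the exact schema as an assumption and verify it against your own preprocessed files (the 44-byte threshold is just the size of an empty WAV header):

```python
# Minimal sketch: read a DeepSpeech-style manifest CSV and do a
# basic sanity check. Column names follow the import_cv2.py output
# format; verify them against your own files.
import csv
import io

sample = (
    'wav_filename,wav_filesize,transcript\n'
    'clips/a1.wav,64044,hello world\n'
    'clips/a2.wav,12,\n'  # suspiciously small file, empty transcript
)

rows = list(csv.DictReader(io.StringIO(sample)))

# Keep only rows with a non-empty transcript and a plausible file size
clean = [r for r in rows
         if r['transcript'].strip() and int(r['wav_filesize']) > 44]

print(len(rows), len(clean))  # 2 rows in, 1 usable row
```

The same kind of filter is where the confidence_level column of our own dataset CSVs would plug in: drop or down-weight rows below a chosen confidence threshold before training.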
All the steps to preprocess the common voice dataset are also documented [here](https://deepspeech.readthedocs.io/en/latest/TRAINING.html).\n\n- Start training from scratch (no pretrained model)\n```\npython3 DeepSpeech.py --train_files ../data/CV/en/clips/train.csv --dev_files \\\n../data/CV/en/clips/dev.csv --test_files ../data/CV/en/clips/test.csv\n```\n\n\n### Your own dataset\nIf you want to create your own dataset, keep these TIPS in mind: \n*   All your audio files must be in WAV format.\n*   Your audio must be 16 kHz / 16 bits per sample / mono / no noise (optional). \n*   Don't use more than 30 minutes of audio recorded by one person.\n*   You must have at least 300h of data.\n*   You need a CSV file covering all your audio files. You can see the structure of the CSV file after preprocessing the common voice files with import_cv2.py (this Python file ships with the DeepSpeech repo; you'll have it after cloning).\n*   You also need to clean your transcripts (some more NLP work :) ).\n*   (optional) If you want to use my crawler to create a dataset, check crawler.py (in the crawler folder) and parallel.py (in the data_collector folder).\n*   (optional) If you want to use my transcription desktop application, check the [cicada repository](https://github.com/shenasa-ai/cicada-audio-annotation-tool).\n*  (optional) You can use my scripts to clean transcripts and build the final CSV file; they are available on GitHub.\n* Don't worry about creating spectrograms/FFTs/MFCCs/filter banks; Mozilla will do it all for you.\n* TODO: need to remember more tips.\n\n### Language model\nYou need a language model for testing the model. 
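Before reaching for KenLM, it can help to see what a scorer conceptually is: a model that assigns probabilities to candidate transcripts. A toy word-bigram model with add-one smoothing, in plain Python (purely illustrative; this is not KenLM and is not usable for real decoding):

```python
# Toy word-bigram language model with add-one smoothing, to show
# conceptually what the scorer does during decoding.
import math
from collections import Counter

corpus = 'the cat sat on the mat the cat ran'.split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def log_prob(sentence):
    words = sentence.split()
    score = 0.0
    for prev, cur in zip(words, words[1:]):
        # add-one smoothed P(cur | prev)
        score += math.log((bigrams[(prev, cur)] + 1) /
                          (unigrams[prev] + vocab))
    return score

# A fluent candidate outscores a scrambled one
print(log_prob('the cat sat') > log_prob('sat the cat'))  # True
```

KenLM does the same job with higher-order n-grams, better smoothing, and a compact binary format suitable for decoding at scale.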
The language model is trained with [KenLM](https://github.com/kpu/kenlm).\n\nThe steps to train a language model are [here](https://deepspeech.readthedocs.io/en/v0.9.3/Scorer.html).\n\nText file size: 2 GB\n\nTest file to optimize: 20 MB\n\n\u003cbr\u003e\nKenLM checkpoint link (this checkpoint is just a toy language model, trained on 2.5 GB of Persian text without deep optimization): https://drive.google.com/file/d/1IGL_SXNQdYINWEP93JnbAw1NjxtmZ-Hw/view?usp=share_link\n\u003cbr\u003e\n\nA text dataset to train KenLM can be found here (nearly 80 GB of Persian text): https://nlpdataset.ir/farsi/raw_text_corpora.html\n\n\u003cbr\u003e\n\u003cbr\u003e\n\nAfter all these steps your dataset is ready and you can train. If you are training on English there are no more steps, **BUT** if you are working in another language (like Persian) you need to check the transfer learning part of [this link](https://deepspeech.readthedocs.io/en/latest/TRAINING.html#transfer-learning-new-alphabet).\nTIP: don't forget to change alphabet.txt.\n\nQuestion: can I use other languages' checkpoints to start transfer learning? Sure, do it. But remember to drop the weights of the last N layers.\n\nYou may need to know the meaning of the flags to use all the abilities of Mozilla DeepSpeech; [check their documentation](https://deepspeech.readthedocs.io/en/v0.9.3/Flags.html).\n\nTip: if you run into CUDA/cuDNN errors, try using conda to install the proper versions.\n\n#### Where to find Persian Pretrained Checkpoints \n\u003cbr\u003e\nLast checkpoints of the trained speech-to-text model (these are not ready for commercial use cases; 
only fine-tuned models for you to use in your own project):\n- Mozilla DeepSpeech checkpoints (WARNING: check the experiments log for hyperparameters when you want to fine-tune from these checkpoints): https://drive.google.com/drive/folders/1FyLFudV_o71WeBQEQIn_-GficBpEMIzM?usp=share_link\n- pb format of the checkpoint: https://drive.google.com/file/d/1RhISAEUwG9MwkLIFyrb1sIFi4UhNTKL6/view?usp=share_link\n\n\u003cbr\u003e\n\u003cbr\u003e\n\u003chr\u003e\n\n## DeepSpeech2\nUsing TensorSpeech.\n[Link to repository](https://github.com/TensorSpeech/TensorFlowASR)\nTheir repo is really complete and you can follow their steps to train a model, but here are some tips: \n\n*   To change any option, edit the config.yml file.\n*   Remember to change the alphabet: you need to change the vocabulary in the config.yml file.\n* The dataset format in this repo is a little different: you must have a TSV file, and the columns have different names and values.\n*   (optional) To prepare your own dataset for this approach you can use my script; it is available on GitHub.\n\n\n\n\u003cbr\u003e\n\u003chr\u003e\n\n## Wav2vec 2.0\nUsing the Facebook fairseq toolkit.\n\u003cbr\u003e\nThis wav2vec2 checkpoint is trained on 30 GB of our speech dataset (all data with 90 percent or higher confidence): https://drive.google.com/file/d/1DX4R3wyjDiDyQ6-0EKv_0P3WV_co13H6/view?usp=share_link\n\n\n\u003cbr\u003e\n\u003chr\u003e\n\n\n## My Own Implementations\n\n### Installation 🔧\n\nSome libraries you need to install. 
I'll list the most important ones here: \u003cbr\u003e\n* ffmpeg\n* pydub\n* python_speech_features\n* numpy v1.18.1\n* pandas v1.0.1\n* tensorflow v2.1.0\n* sklearn v0.22.2.post1\n* librosa v0.7.2\n\u003cbr\u003e \u003cbr\u003e\nBelow I show how to install three of them (I used these commands in Kaggle Notebooks too).\u003cbr\u003e\n\n```\npip install pydub\n```\n\n```\npip install python_speech_features\n```\n\n```\n!apt-get install -y ffmpeg\n```\n\nThe code is developed for common voice data; make sure your data is in that format.\n\n\n# Hyperparameters\n\nAll experiments and all hyperparameters: https://drive.google.com/file/d/1h7DhMsS_AGAguKypI_jhjv3JNT2Naemq/view?usp=share_link\n\n\u003cbr\u003e\n\n\u003cbr\u003e\n\n# WIKI page 📖\nVisit our wiki page for more info about tutorials, useful links, hardware info, results, and other things. \n\n\n\u003cbr\u003e\n\n# Contributing 🖇️\n\nIf you want to help us build better models and new approaches, please contact us; we will be happy to hear from you.\n\u003cbr\u003e\nEmail: masoudparpanchi@gmail.com\n","funding_links":[],"categories":["Jupyter Notebook","Persian"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshenasa-ai%2Fspeech2text","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshenasa-ai%2Fspeech2text","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshenasa-ai%2Fspeech2text/lists"}