{"id":13825688,"url":"https://github.com/prateekralhan/Speech2Text-for-Long-Audio-Files","last_synced_at":"2025-07-08T22:32:18.278Z","repository":{"id":110868948,"uuid":"226992763","full_name":"prateekralhan/Speech2Text-for-Long-Audio-Files","owner":"prateekralhan","description":"Perform SOTA Speech2Text on Long Audio Files with/without diarization Using Google Cloud Speech API","archived":false,"fork":false,"pushed_at":"2022-02-21T19:54:58.000Z","size":40923,"stargazers_count":12,"open_issues_count":0,"forks_count":4,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-08-05T09:14:23.644Z","etag":null,"topics":["gcp","googlecloud","opensourceforgood","python3","speech-recognition","speech-to-text"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/prateekralhan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-12-10T00:11:16.000Z","updated_at":"2023-08-07T02:41:33.000Z","dependencies_parsed_at":"2023-04-23T21:53:57.763Z","dependency_job_id":null,"html_url":"https://github.com/prateekralhan/Speech2Text-for-Long-Audio-Files","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prateekralhan%2FSpeech2Text-for-Long-Audio-Files","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prateekralhan%2FSpeech2Text-for-Long-Audio-Files/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prateekralhan%2FSpeech2Text-for-Long-Audio-Files/releases"
,"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prateekralhan%2FSpeech2Text-for-Long-Audio-Files/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/prateekralhan","download_url":"https://codeload.github.com/prateekralhan/Speech2Text-for-Long-Audio-Files/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225470631,"owners_count":17479366,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gcp","googlecloud","opensourceforgood","python3","speech-recognition","speech-to-text"],"created_at":"2024-08-04T09:01:25.379Z","updated_at":"2025-07-08T22:32:18.272Z","avatar_url":"https://github.com/prateekralhan.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Speech2Text-for-Long-Audio-Files\n\nSpeech recognition is a fun task. Many speech APIs are available today, which makes it easy for users to pick one. However, processing lengthy audio files is still quite challenging. I have used the Google Speech-to-Text API for this.\n\n## A simple Demo: \n***(Use Google Chrome/Microsoft Edge to view the demo)***\n\nhttps://user-images.githubusercontent.com/29462447/155017259-31b057d9-361b-48dd-8e48-52db5b72480a.mp4\n\nGoogle Speech-to-Text has three types of API requests based on audio content:\n![speech1](https://user-images.githubusercontent.com/29462447/70484067-a4691c00-1b10-11ea-9e14-87be7e40a4ad.png)\n\n### 1. 
Synchronous Request:\nThe audio should be no longer than approximately 1 minute for a synchronous request. In this type of request, the user does not have to upload the data to Google Cloud. This gives users the flexibility to keep the audio file on their local computer or server and call the API to get the text.\n\n### 2. Asynchronous Request:\nThe audio can be up to approximately 480 minutes (8 hours) long. In this type of request, the user has to upload the data to Google Cloud. **(This is the request type I am using here.)**\n\n### 3. Streaming Request:\nIt is suitable for streaming data, where the user talks directly into a microphone and needs the speech transcribed. This type of request is apt for chatbots. Again, the streaming data should be approximately a minute long for this type of request.\n\n## Initial Setup\n* Before we begin, we need to set up the API client and store the credential details that you will need later. Please follow this [link](https://cloud.google.com/speech-to-text/docs/quickstart-client-libraries?source=post_page-----1c886f4eb3e9----------------------) \n\n* Once we create the API client, the next step is to create a [storage bucket](https://accounts.google.com/signin/v2/identifier?service=cloudconsole\u0026passive=1209600\u0026osid=1\u0026continue=https%3A%2F%2Fconsole.cloud.google.com%2Fstorage%2F%3Fsource%3Dpost_page-----1c886f4eb3e9----------------------\u0026followup=https%3A%2F%2Fconsole.cloud.google.com%2Fstorage%2F%3Fsource%3Dpost_page-----1c886f4eb3e9----------------------\u0026flowName=GlifWebSignIn\u0026flowEntry=ServiceLogin). 
\n\nMy methodology for converting speech to text:\n* **Importing the necessary packages.**\n* **Audio file encoding.** \nYou can read about it [here.](https://cloud.google.com/speech-to-text/docs/encoding?source=post_page-----1c886f4eb3e9----------------------)\n\n![speech2](https://user-images.githubusercontent.com/29462447/70484068-a4691c00-1b10-11ea-950a-c7c4937b081d.png)\n\n* **Audio file specifications**\nOne other limitation is that the API does not support stereo audio files, so we need to convert a **stereo** file to a **mono** file before using the API. In addition, we have to provide the **audio frame rate** of the file. The code already includes a function that converts the audio files to **.wav** format.\n* **Upload files to Google storage**\nTo perform an asynchronous request, the file is uploaded to Google Cloud.\n* **Delete files in Google storage**\nOnce the speech-to-text operation is completed, the file can be deleted from Google Cloud to avoid unnecessary costs.\n* **Transcribe**\nConvert the speech to plain text and save the results as separate transcripts (text files). A sample transcript looks like this:\n\n![trascribe](https://user-images.githubusercontent.com/29462447/70485181-37f01c00-1b14-11ea-987a-f5ad4dd2810b.png)\n\n### What if I have more than 1 speaker in my audio file? Like a conversation!!?\nSpeaker diarization is the process of distinguishing speakers in an audio file. I use the Google Speech-to-Text API to perform speaker diarization, which is provided as a separate script. The final transcript generated by Google after speaker diarization looks like this:\n\n![trascribe2](https://user-images.githubusercontent.com/29462447/70485182-37f01c00-1b14-11ea-8c09-cc4ca9a98858.png)\n\n#### Now, why not go ahead and record some voice notes or meetings of your own and transcribe them using Speech2Text?? 
:)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprateekralhan%2FSpeech2Text-for-Long-Audio-Files","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprateekralhan%2FSpeech2Text-for-Long-Audio-Files","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprateekralhan%2FSpeech2Text-for-Long-Audio-Files/lists"}