{"id":13625966,"url":"https://github.com/goberoi/cloud_speech_experiments","last_synced_at":"2025-04-16T11:30:59.982Z","repository":{"id":217073307,"uuid":"66494854","full_name":"goberoi/cloud_speech_experiments","owner":"goberoi","description":"Scripts to experiment with cloud speech vendors like Google Cloud Speech.","archived":false,"fork":false,"pushed_at":"2016-08-24T20:25:28.000Z","size":7,"stargazers_count":12,"open_issues_count":0,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-11-08T15:46:54.684Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/goberoi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-08-24T19:50:24.000Z","updated_at":"2019-11-06T06:35:21.000Z","dependencies_parsed_at":null,"dependency_job_id":"47c78cc2-e94b-42f8-8b55-33f9c2c125c5","html_url":"https://github.com/goberoi/cloud_speech_experiments","commit_stats":null,"previous_names":["goberoi/cloud_speech_experiments"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goberoi%2Fcloud_speech_experiments","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goberoi%2Fcloud_speech_experiments/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goberoi%2Fcloud_speech_experiments/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goberoi%2Fcloud_speech_experiments/manifests","own
er_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/goberoi","download_url":"https://codeload.github.com/goberoi/cloud_speech_experiments/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249235032,"owners_count":21235133,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T21:02:06.728Z","updated_at":"2025-04-16T11:30:59.744Z","avatar_url":"https://github.com/goberoi.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Experimenting with Cloud Speech Vendors\n\nThis was a quick experiment to see how Google Cloud Speech's transcription output looks for long sound files. As input, I used a ~63 minute FLAC file of an episode from the podcast [Acquired](http://www.acquired.fm/episodes/2016/8/3/acquired-episode-17-waze) (thanks [Ben](https://twitter.com/gilbert)!).\n\nThe code is a modified example from Google to run an asynchronous speech processing request. I had to call asyncRecognize because this is the only option if the audio length is greater than 1 minute. Furthermore, the format of the file must be LINEAR16 signed-integer little-endian encoded raw... see the audio format section, and resources sections below for processing tips.\n\nThis script currently only supports Google Cloud Speech for long recordings (\u003e1m), and with the stated limitations above. I may or may not modify it later to include other speech vendor examples. 
For a more comprehensive tool to do this, see [Speech Recognition](https://github.com/Uberi/speech_recognition).\n\n\n## Learnings\n\n1. The output was not at all what I expected, and fairly disappointing; take a glance below.\n1. I've had much better results with a short bit of speech, say 20s on their test page here: https://cloud.google.com/speech/\n1. One possible issue is that the post-processing on the podcast audio created artifacts that led to inaccuracy. There is some indication this could be true based on their [best practices](https://cloud.google.com/speech/docs/best-practices), e.g. \"All noise reduction processing should be disabled.\".\n1. Another concern is that Google Cloud Speech will try to return up to 30 guesses... unclear how long each guess can be. In the output below they are short phrases, but in my tests with 20 second long phrases on their site, I see much longer sentences per guess.\n  * See \"maxAlternatives\" deep down in their API docs here: https://cloud.google.com/speech/reference/rest/v1beta1/RecognitionConfig\n1. Based on the little I've seen here, and the fact that most of Google's Cloud Speech API is focused on \u003c1m long recordings... my guess is that long form transcription is not really Google's goal. They want to build something that one can use to create listening bots (like Siri or Alexa), and not something like a transcription service. 
Please correct me if you know more and find this to be an incorrect conclusion.\n\n## Sample Output\n\nThe result for this [5 minute section](https://storage.cloud.google.com/example-content/5m-ben-podcast-waze.raw?_ga=1.96114268.1595850234.1461692478) of the podcast was:\n```\n{\n  \"response\": {\n    \"@type\": \"type.googleapis.com/google.cloud.speech.v1beta1.AsyncRecognizeResponse\",\n    \"results\": [\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.31786352,\n            \"transcript\": \"Mississauga\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.92226779,\n            \"transcript\": \"weather for Bakersfield College\"\n          }\n        ]\n      }\n    ]\n  },\n  \"done\": true,\n  \"name\": \"4920998656671129691\",\n  \"metadata\": {\n    \"lastUpdateTime\": \"2016-08-24T18:38:09.049576Z\",\n    \"@type\": \"type.googleapis.com/google.cloud.speech.v1beta1.AsyncRecognizeMetadata\",\n    \"startTime\": \"2016-08-24T18:33:03.320765Z\",\n    \"progressPercent\": 100\n  }\n}\n```\n\nFor the [entire hour long podcast](https://storage.cloud.google.com/example-content/ben-podcast-waze.raw?_ga=1.53665000.1595850234.1461692478) it was:\n```\n{\n  \"response\": {\n    \"@type\": \"type.googleapis.com/google.cloud.speech.v1beta1.AsyncRecognizeResponse\",\n    \"results\": [\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.31786352,\n            \"transcript\": \"Mississauga\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.92226779,\n            \"transcript\": \"weather for Bakersfield College\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.578178,\n            \"transcript\": \"Bed Bath \u0026 Beyond\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n     
       \"confidence\": 0.70186234,\n            \"transcript\": \"ESPN\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.2595036,\n            \"transcript\": \"Bakersfield High School\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.93311942,\n            \"transcript\": \"White Castle\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.81802225,\n            \"transcript\": \"Kelly Services in Columbia Maryland\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.51821041,\n            \"transcript\": \"weather Chicago\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.30766535,\n            \"transcript\": \"Yahoo\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.31368122,\n            \"transcript\": \"Mexican grocery store\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.90448982,\n            \"transcript\": \"Facebook\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.26522613,\n            \"transcript\": \"sexy\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.69060469,\n            \"transcript\": \"restaurants in Los Angeles\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.49876788,\n            \"transcript\": \"American toxicology\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.63318396,\n            \"transcript\": 
\"Dumb Ways to Die\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.59668612,\n            \"transcript\": \"facebook.com\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.95454544,\n            \"transcript\": \"661 Sheldon\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.95454544,\n            \"transcript\": \"weather\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.3689521,\n            \"transcript\": \"open Facebook\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.57749414,\n            \"transcript\": \"restaurants\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.4529644,\n            \"transcript\": \"dictionary.com\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.95454544,\n            \"transcript\": \"browser\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.83804256,\n            \"transcript\": \"transgender\"\n          }\n        ]\n      },\n      {\n        \"alternatives\": [\n          {\n            \"confidence\": 0.63055462,\n            \"transcript\": \"Dublin Toyota\"\n          }\n        ]\n      }\n    ]\n  },\n  \"done\": true,\n  \"name\": \"6402258854541648563\",\n  \"metadata\": {\n    \"lastUpdateTime\": \"2016-08-24T06:47:00.016904Z\",\n    \"@type\": \"type.googleapis.com/google.cloud.speech.v1beta1.AsyncRecognizeMetadata\",\n    \"startTime\": \"2016-08-24T05:40:58.702306Z\",\n    \"progressPercent\": 100\n  }\n}\n```\n\n\n\n## Setup\n\n### Get an API key from Google\n\n1. 
[Visit the Google Developer Console](https://console.developers.google.com/apis/dashboard) and create a new project.\n2. For that project, click \"Enable API\" and add \"Google Cloud Speech API\"\n3. For that project, click \"Credentials\" and create an \"API Key\".\n4. Copy the key, and set it as the env variable GOOGLE_API_KEY to run the script.\n\n### Get audio into the right format\n\nTo process Audio greater than \n\n```\n# Install SOX on OSX\nbrew install sox --with-lame --with-flac --with-libvorbis\n\n# Convert FLAC file I got from Ben to PCM LINEAR16 format\nsox input/ben-podcast-waze.flac --channels=1 --bits=16 --rate=44100 --encoding=signed-integer --endian=little input/ben-podcast-waze.raw\n\n# Optional: trim it down to try a smaller sample.\nsox --rate 44100 --bits 16 --encoding signed-integer input/ben-podcast-waze.raw input/5m-ben-podcast-waze.raw trim 0 05:00\n```\n\n### Upload Audio\n\n1. Visit https://console.cloud.google.com/storage/browser\n2. Upload files and make them public.\n3. 
URL is of the format: gs://yourbucketname/yourfilename, e.g.: gs://example-content/5m-ben-podcast-waze.raw\n\n### Install dependencies for script\n\n```\npip install --upgrade google-api-python-client\n```\n\n### Run script\n\n```\n# Kick off job to process file hosted on Google Storage\npython google_speech.py -u gs://yourbucketname/yourfilename.raw\n\n# Check status of job\npython google_speech.py -n somenumber\n```\n\n\n## Useful resources\n\nAudio cleanup:\n* http://apple.stackexchange.com/questions/137108/how-can-i-add-support-for-flac-files-in-sox\n* http://stackoverflow.com/questions/38926114/flac-file-with-google-cloud-speech-api-fails\n* http://stackoverflow.com/questions/9667081/how-do-you-trim-the-audio-files-end-using-sox\n\nGoogle dev:\n* https://console.developers.google.com\n* https://console.cloud.google.com/storage/browser\n\nGoogle speech:\n* https://cloud.google.com/speech/\n* https://cloud.google.com/speech/reference/rest/v1beta1/RecognitionAudio\n* https://cloud.google.com/speech/limits\n* https://cloud.google.com/speech/support\n\nExample code:\n\n\n## Related businesses\n\nPodcast transcription:\n* https://www.popuparchive.com/\n* https://www.audiosear.ch/\n* https://castingwords.com/\n\n## License\n\nApache License, Version 2.0 (the \"License\"): http://www.apache.org/licenses/LICENSE-2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoberoi%2Fcloud_speech_experiments","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoberoi%2Fcloud_speech_experiments","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoberoi%2Fcloud_speech_experiments/lists"}