{"id":29170205,"url":"https://github.com/maniam/speak-io","last_synced_at":"2026-04-29T06:36:58.379Z","repository":{"id":300276469,"uuid":"1002534129","full_name":"ManiAm/Speak-IO","owner":"ManiAm","description":"A web API for speech-to-text (STT) and text-to-speech (TTS) that integrates with existing engines, supporting real-time audio streaming and modular engine selection.","archived":false,"fork":false,"pushed_at":"2025-07-01T11:05:59.000Z","size":13194,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-04-29T06:36:42.279Z","etag":null,"topics":["bark-tts","chatterbox-tts","conqui-tts","fast-whisper","fastapi","piper-tts","vosk","websocket","whisper-ai","whisper-cpp"],"latest_commit_sha":null,"homepage":"https://blog.homelabtech.dev/content/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ManiAm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-15T17:12:21.000Z","updated_at":"2026-04-09T12:00:08.000Z","dependencies_parsed_at":"2025-06-27T09:39:26.226Z","dependency_job_id":null,"html_url":"https://github.com/ManiAm/Speak-IO","commit_stats":null,"previous_names":["maniam/speakio","maniam/speak-io"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/ManiAm/Speak-IO","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ManiAm%2FSpeak-IO","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ManiAm%2FSpeak-IO/tags","releases_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/ManiAm%2FSpeak-IO/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ManiAm%2FSpeak-IO/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ManiAm","download_url":"https://codeload.github.com/ManiAm/Speak-IO/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ManiAm%2FSpeak-IO/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32414422,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T06:29:02.080Z","status":"ssl_error","status_checked_at":"2026-04-29T06:29:00.631Z","response_time":110,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bark-tts","chatterbox-tts","conqui-tts","fast-whisper","fastapi","piper-tts","vosk","websocket","whisper-ai","whisper-cpp"],"created_at":"2025-07-01T12:39:25.430Z","updated_at":"2026-04-29T06:36:58.363Z","avatar_url":"https://github.com/ManiAm.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Speak-IO\n\nSpeak-IO is a modular, containerized system for bi-directional speech processing. It offers both speech-to-text (STT) and text-to-speech (TTS) capabilities through a unified, offline-first architecture. 
The system is divided into multiple Docker containers, each encapsulating a specific responsibility to ensure scalability, maintainability, and ease of deployment.\n\nNote that Speak-IO does not provide or train any speech models. Instead, it offers a microservice-based architecture that integrates and orchestrates existing open-source STT and TTS engines. Speak-IO handles model selection, loading, and inference - allowing you to experiment with and compare different models through a consistent and extensible API and user interface.\n\n## System Architecture\n\nThe core components of Speak-IO include:\n\n- **speech_to_text** container handles real-time or file-based transcription of spoken audio into text using a variety of STT engines such as `OpenAI Whisper`, `Whisper.cpp`, `Faster-Whisper`, and `Vosk`. This component provides both WebSocket and HTTP interfaces for streaming and batch transcription.\n\n- **text_to_speech** container synthesizes natural-sounding audio from input text using pluggable TTS engines such as `Coqui`, `Piper`, `Bark`, and `Chatterbox`. It supports model loading, multi-language synthesis, and WAV output suitable for immediate playback.\n\nEach service is isolated but interoperable via REST APIs and WebSocket endpoints, making Speak-IO extensible and well-suited for both local experimentation and production use.\n\n\u003cimg src=\"pics/speak-io.jpeg\" alt=\"segment\" width=\"800\"\u003e\n\nFor further details, refer to the following documentation:\n\n- [Speech to Text](speech_to_text/README.md)\n- [Text to Speech](text_to_speech/README.md)\n\n## Getting Started\n\nTo run Speak-IO locally, ensure you have Docker and Docker Compose installed. 
The project is containerized and managed via Docker Compose, allowing you to spin up all services with minimal setup.\n\nFrom the project root directory, build all Speak-IO containers using the following command:\n\n    docker compose build\n\nOnce the build completes, launch all containers in the background:\n\n    docker compose up -d\n\nWait for the containers to fully initialize. You can check the status using:\n\n    docker compose ps\n    docker logs \u003ccontainer/name\u003e\n\nThese URLs provide access to the speech-to-text service:\n\n- Swagger API docs: http://localhost:5000/api/docs\n- API Base URL: http://localhost:5000/api/stt/\n\nThese URLs provide access to the text-to-speech service:\n\n- Swagger API docs: http://localhost:5500/api/docs\n- API Base URL: http://localhost:5500/api/tts/\n\nOnce the containers are up and running, access the Voice UI at:\n\n    http://localhost:5700\n\nFrom the web interface, you can begin experimenting with speech-to-text and text-to-speech features.\n\n## Voice UI\n\nVoice UI is the web interface for the Speak-IO project, designed to provide an intuitive and interactive way for users to access and test its speech capabilities. The interface is divided into two primary tabs: one for Speech-to-Text and another for Text-to-Speech.\n\n### Speech-to-Text Tab\n\nIn the Speech-to-Text tab, users can select a supported engine and model combination. Once selected, clicking the \"Load Model\" button instructs the Speak-IO backend to download and prepare the specified model for transcription. Pressing \"Start\" requests microphone access from the browser and begins recording. While recording is in progress, the browser streams audio data to the backend in real time over a WebSocket connection. When the user clicks \"Stop\", the frontend sends a signal to the backend indicating that audio capture has ended and transcription should begin. 
The backend processes the streamed audio and returns the resulting text, which is then displayed in the web interface.\n\nTo demonstrate its effectiveness, we include a sample use case in which the following TikTok video was played through a microphone and transcribed by Speak-IO using two different engines. This showcases the system’s flexibility and accuracy in real-world scenarios.\n\n[sample.mp4](https://github.com/user-attachments/assets/9ee05d9f-1035-47a4-83b7-c85eb8ba5b80)\n\nHere is the sample transcribed output:\n\n\u003cimg src=\"pics/stt.jpg\" alt=\"segment\" width=\"850\"\u003e\n\n### Text-to-Speech (TTS) Tab\n\nIn the Text-to-Speech tab, users can explore and evaluate a variety of voice synthesis models supported by different engines. The interface allows users to first select a TTS engine, and then choose relevant parameters such as language, voice, and model quality - depending on the capabilities and structure of the selected engine.\n\nOnce the desired configuration is selected, clicking \"Load Model\" instructs the Speak-IO backend to download and initialize the model for use. Users may then enter any custom text they wish to synthesize or choose from a set of predefined sample paragraphs available in multiple popular languages, designed to highlight the clarity, tone, and pronunciation capabilities of each model.\n\nWhen the \"Speak\" button is pressed, the typed or selected text is sent to the backend. The backend processes the text using the loaded model, synthesizes it into audio in WAV format, and streams it back to the web interface. 
The resulting audio is played in the browser, allowing users to hear the synthesized speech and compare the output across different models and languages.\n\n\u003cimg src=\"pics/tts.jpg\" alt=\"segment\" width=\"850\"\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaniam%2Fspeak-io","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaniam%2Fspeak-io","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaniam%2Fspeak-io/lists"}