{"id":25763309,"url":"https://github.com/prakash-aryan/speech_command_server","last_synced_at":"2026-05-17T02:04:40.926Z","repository":{"id":279351304,"uuid":"938524020","full_name":"prakash-aryan/speech_command_server","owner":"prakash-aryan","description":"A real-time voice command detection system that recognizes \"play\" and \"pause\" commands using Vosk speech recognition.","archived":false,"fork":false,"pushed_at":"2025-02-25T05:04:03.000Z","size":52,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-29T03:41:13.242Z","etag":null,"topics":["fastapi","python","transcription","uvicorn","vosk","websockets"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/prakash-aryan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-25T05:01:00.000Z","updated_at":"2025-02-25T05:05:38.000Z","dependencies_parsed_at":"2025-02-25T06:28:38.548Z","dependency_job_id":null,"html_url":"https://github.com/prakash-aryan/speech_command_server","commit_stats":null,"previous_names":["prakash-aryan/speech_command_server"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/prakash-aryan/speech_command_server","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prakash-aryan%2Fspeech_command_server","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prakash-aryan%2Fspeech_command_server/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/prakash-aryan%2Fspeech_command_server/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prakash-aryan%2Fspeech_command_server/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/prakash-aryan","download_url":"https://codeload.github.com/prakash-aryan/speech_command_server/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prakash-aryan%2Fspeech_command_server/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33125184,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-16T18:38:32.183Z","status":"online","status_checked_at":"2026-05-17T02:00:05.366Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastapi","python","transcription","uvicorn","vosk","websockets"],"created_at":"2025-02-26T20:16:17.150Z","updated_at":"2026-05-17T02:04:40.911Z","avatar_url":"https://github.com/prakash-aryan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Speech Command Detector\n\nA real-time voice command detection system that recognizes \"play\" and \"pause\" commands using Vosk speech recognition.\n\n[Demo](https://github.com/user-attachments/assets/6401734d-8ede-4605-810c-c4d6c2280e5c)\n\n\n![System Architecture](sysArch.png)\n\n## Features\n\n- Real-time speech recognition with WebSockets\n- Low-latency command 
detection\n- Responsive web interface with visual feedback\n- Standalone command-line interface option\n- Works in modern browsers (Chrome, Firefox, Edge)\n\n## System Architecture\n\nAs shown in the architecture diagram above, the system consists of a browser client and a Python server, plus a standalone command-line version:\n\n1. **Client Side (Browser)**:\n   - **Audio Capture**: Converts microphone input to 16 kHz, 16-bit PCM\n   - **WebSocket Client**: Handles bidirectional communication with the server\n   - **User Interface**: Displays command state and transcriptions\n\n2. **Server Side (Python)**:\n   - **WebSocket Server**: FastAPI and Uvicorn handle connections\n   - **Audio Processor**: Buffers and processes incoming audio\n   - **Speech Model (Vosk)**: Converts audio to text\n   - **Command Handler**: Detects commands in the transcription\n\n3. **Standalone CLI Version**:\n   - Uses the same core components without the WebSocket and UI layers\n\n## Installation\n\n1. Clone the repository:\n   ```bash\n   git clone git@github.com:prakash-aryan/speech_command_server.git\n   cd speech_command_server\n   ```\n\n2. Create a virtual environment and install dependencies:\n   ```bash\n   python -m venv .venv\n   source .venv/bin/activate  # On Windows: .venv\\Scripts\\activate\n   pip install -r requirements.txt\n   ```\n\n3. Download and extract the Vosk speech recognition model:\n   ```bash\n   mkdir -p models/data\n   cd models/data\n   wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip\n   unzip vosk-model-small-en-us-0.15.zip\n   cd ../..\n   ```\n\n## Usage\n\n### Web Interface\n\n1. Start the server:\n   ```bash\n   python app.py\n   ```\n\n2. Open a web browser and navigate to:\n   ```\n   http://localhost:8080\n   ```\n\n3. Click \"Start Listening\" and speak a command such as \"play\" or \"pause\".\n\n### Command Line Interface\n\nTo detect commands from the terminal without running the web server:\n\n```bash\npython simple_command_detector.py\n```\n\n## How It Works\n\n1. 
The browser captures audio from the microphone using the Web Audio API\n2. Audio is processed, resampled to 16kHz, and converted to 16-bit PCM format\n3. Audio data is sent to the server via WebSocket\n4. The server processes the audio through several components:\n   - Audio Processor prepares and buffers the data\n   - Speech Model (Vosk) transcribes the audio to text\n   - Command Handler detects \"play\" or \"pause\" commands\n5. Commands are sent back to the browser, which updates the UI accordingly\n6. Transcriptions are also sent back for real-time feedback\n\n## Project Structure\n\n```\nspeech_command_server/\n│\n├── app.py                    # Main FastAPI server application\n├── simple_command_detector.py # Standalone CLI tool\n├── requirements.txt          # Python dependencies\n├── README.md                 # Documentation\n├── sysArch.png               # System architecture diagram\n│\n├── models/\n│   ├── __init__.py           # Makes models a package\n│   ├── asr_model.py          # Speech recognition model\n│   └── data/                 # Speech model data\n│       └── vosk-model-small-en-us-0.15/\n│\n├── utils/\n│   ├── __init__.py           # Makes utils a package\n│   ├── audio_processor.py    # Audio processing utilities\n│   └── command_handler.py    # Command detection logic\n│\n└── static/\n    └── index.html            # Web interface\n```\n\n## Extending the System\n\nThis project can be extended in several ways:\n\n### 1. 
Add More Voice Commands\n\nTo add new commands, extend the `commands` dictionary in the `CommandHandler` constructor (`utils/command_handler.py`). Each key is a command name, and its value lists the keywords and synonyms that trigger it:\n\n```python\ndef __init__(self):\n    \"\"\"Initialize the command handler\"\"\"\n    # Define command keywords and synonyms\n    self.commands = {\n        \"play\": [\"play\", \"start\", \"begin\", \"resume\", \"go\"],\n        \"pause\": [\"pause\", \"stop\", \"halt\", \"freeze\", \"wait\"],\n        # Add new commands here:\n        \"next\": [\"next\", \"skip\", \"forward\"],\n        \"previous\": [\"previous\", \"back\", \"backward\"],\n        \"volume_up\": [\"louder\", \"increase volume\", \"volume up\"],\n        \"volume_down\": [\"quieter\", \"decrease volume\", \"volume down\"]\n    }\n```\n\n### 2. Integrate with External Systems\n\nYou can extend the command handling to control real applications. For example, on Linux the `playerctl` utility controls MPRIS-compatible media players:\n\n```python\nimport subprocess\nfrom typing import Optional\n\n\ndef _apply_cooldown(self, command: str) -\u003e Optional[str]:\n    \"\"\"Apply cooldown logic and handle system integration\"\"\"\n    # [Existing cooldown code]\n\n    # Forward the detected command to the active media player\n    if command == \"play\":\n        subprocess.run([\"playerctl\", \"play\"])\n    elif command == \"pause\":\n        subprocess.run([\"playerctl\", \"pause\"])\n\n    return command\n```\n\n### 3. Improve Speech Recognition\n\nYou can improve recognition accuracy by:\n\n1. Using a larger Vosk model\n2. 
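Restricting recognition to a fixed set of command words with a Vosk grammar. Small Vosk models support changing the vocabulary at runtime, while large models typically do not:\n\n```python\nfrom typing import Optional\n\nfrom vosk import KaldiRecognizer\n\n\ndef __init__(self, model_path: Optional[str] = None):\n    # [Existing initialization code]\n\n    # Create recognizer with model\n    self.recognizer = KaldiRecognizer(self.vosk_model, self.sample_rate)\n\n    # Include per-word details (timing, confidence) in final and partial results\n    self.recognizer.SetWords(True)\n    self.recognizer.SetPartialWords(True)\n\n    # Limit output to the listed words; \"[unk]\" absorbs out-of-grammar speech.\n    # Words missing from the grammar (e.g. synonyms like \"resume\") can no longer be recognized.\n    grammar = '[\"play\", \"pause\", \"next\", \"previous\", \"stop\", \"[unk]\"]'\n    self.recognizer.SetGrammar(grammar)\n```\n\n### 4. Add Authentication\n\nFor multi-user deployments, add authentication to the WebSocket endpoint. Browsers cannot attach custom headers (such as `Authorization`) to WebSocket requests, so a common pattern is to pass a token as a query parameter and reject the connection if it is missing or invalid. The token check below is a placeholder:\n\n```python\nfrom typing import Optional\n\nfrom fastapi import Depends, Query, WebSocket, WebSocketException, status\n\n# [Existing app setup, e.g. app = FastAPI()]\n\n\nasync def get_current_user(token: Optional[str] = Query(None)) -\u003e str:\n    # Placeholder: replace with real validation (e.g. verifying a JWT)\n    if token is None:\n        raise WebSocketException(code=status.WS_1008_POLICY_VIOLATION)\n    return token\n\n\n@app.websocket(\"/ws\")\nasync def websocket_endpoint(websocket: WebSocket, current_user: str = Depends(get_current_user)):\n    # Only clients that pass get_current_user reach this point\n    # [Existing WebSocket code]\n```\n\n### 5. 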
Implement Voice Profiles\n\nAs a starting point for speaker-specific adaptation, you could collect voice samples per user. The class below is only a skeleton; how `train()` would use the samples is left open:\n\n```python\nclass UserProfile:\n    \"\"\"Collects raw audio samples for a single user (skeleton only).\"\"\"\n\n    def __init__(self, user_id: str):\n        self.user_id = user_id\n        self.voice_samples: list[bytes] = []\n\n    def add_sample(self, audio_data: bytes) -\u003e None:\n        self.voice_samples.append(audio_data)\n\n    def train(self) -\u003e None:\n        # Placeholder: process voice samples to create user-specific model adjustments\n        pass\n```\n\n## Troubleshooting\n\n- **Browser doesn't detect microphone**: Browsers only allow microphone access in a secure context, so open the page via https:// or http://localhost, use a modern browser, and grant microphone permission when prompted\n- **No transcription appears**: Check that your microphone is working and selected as the input device in the browser\n- **Server doesn't start**: Make sure the Vosk model is downloaded and extracted to `models/data/vosk-model-small-en-us-0.15/`\n- **WebSocket disconnects**: Check your network connection and firewall settings\n\n## Requirements\n\n- Python 3.10+\n- Vosk speech recognition model\n- Modern web browser\n- Microphone\n\n## License\n\n[MIT License](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprakash-aryan%2Fspeech_command_server","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprakash-aryan%2Fspeech_command_server","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprakash-aryan%2Fspeech_command_server/lists"}