{"id":44975913,"url":"https://github.com/k-l-lambda/m4t","last_synced_at":"2026-02-18T17:04:32.684Z","repository":{"id":324874197,"uuid":"1098888188","full_name":"k-l-lambda/m4t","owner":"k-l-lambda","description":null,"archived":false,"fork":false,"pushed_at":"2025-12-02T13:01:21.000Z","size":162,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-05T10:38:34.753Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/k-l-lambda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-18T09:22:02.000Z","updated_at":"2025-12-02T13:01:25.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/k-l-lambda/m4t","commit_stats":null,"previous_names":["k-l-lambda/m4t"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/k-l-lambda/m4t","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/k-l-lambda%2Fm4t","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/k-l-lambda%2Fm4t/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/k-l-lambda%2Fm4t/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/k-l-lambda%2Fm4t/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/k-l-lambda","download_url":"https://codeload.
github.com/k-l-lambda/m4t/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/k-l-lambda%2Fm4t/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29587066,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-18T16:55:40.614Z","status":"ssl_error","status_checked_at":"2026-02-18T16:55:37.558Z","response_time":162,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-18T17:04:31.648Z","updated_at":"2026-02-18T17:04:32.656Z","avatar_url":"https://github.com/k-l-lambda.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SeamlessM4T Inference API\n\nMultilingual speech and text translation API using Meta's **SeamlessM4T v2** model.\n\n## Features\n\n- **9 Translation \u0026 Speech Tasks:**\n  - 🎤→📝 **S2TT**: Speech-to-Text Translation (e.g., Japanese audio → Chinese text)\n  - 🎤→🔊 **S2ST**: Speech-to-Speech Translation (e.g., Japanese audio → Chinese audio)\n  - 🎤→📝 **ASR**: Automatic Speech Recognition (e.g., Japanese audio → Japanese text)\n  - 📝→📝 **T2TT**: Text-to-Text Translation (e.g., English text → Chinese text)\n  - 📝→🔊 **TTS**: Text-to-Speech (e.g., Chinese text → Chinese audio)\n  - 🎙️ **VAD**: Voice Activity Detection (detect speech segments in audio)\n  - 🎵 **Vocal Separation**: 
Extract vocals from background music (optional Spleeter)\n  - 🎭 **Voice Cloning**: Clone speaker voice with GPT-SoVITS (direct Python integration)\n  - 🎼 **Audio Split**: Split audio into vocals + accompaniment (two separate streams)\n\n- **Wide Language Support:** 101 languages for speech, 96 for text\n- **High Quality:** 2.3B parameter model with state-of-the-art translation quality\n- **Easy Deployment:** Standalone Python or Docker container\n- **RESTful API:** FastAPI with automatic OpenAPI documentation\n\n## Quick Start\n\n### Option 1: Development Mode (Standalone Python)\n\n```bash\n# Start server\n./start_dev.sh\n\n# Or manually:\npython3 -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\npython server.py\n```\n\n### Option 2: Docker (with GPT-SoVITS integrated)\n\n**Latest version: v1.1.0** includes GPT-SoVITS voice cloning support built-in.\n\n#### Method 1: Auto-download models (easiest)\n\n```bash\n# Build image with integrated GPT-SoVITS\n./build_docker.sh v1.1.0\n\n# Run container (models will download automatically on first start)\ndocker run -d --name m4t-server \\\n  --gpus all \\\n  -p 8000:8000 \\\n  kllambda/m4t:v1.1.0\n\n# First startup takes ~5-10 minutes to download models (~1.2 GB)\n# Subsequent starts are instant as models are cached in container\ndocker logs -f m4t-server\n```\n\n#### Method 2: Volume-mount pretrained models (faster startup)\n\n```bash\n# If you already have pretrained models on host:\ndocker run -d --name m4t-server \\\n  --gpus all \\\n  -p 8000:8000 \\\n  -v /path/to/pretrained_models:/app/third_party/GPT-SoVITS/GPT_SoVITS/pretrained_models \\\n  -e SKIP_MODEL_DOWNLOAD=true \\\n  kllambda/m4t:v1.1.0\n\n# Example: Mount from host system\n# -v ~/work/hf-GPT-SoVITS:/app/third_party/GPT-SoVITS/GPT_SoVITS/pretrained_models\n```\n\n#### Method 3: Using Docker Compose (legacy)\n\n```bash\n# Using Docker Compose (without GPT-SoVITS)\ndocker-compose up -d\n\n# Or using the 
script\n./start_docker.sh\n```\n\n#### Model Files\n\nThe container includes both v1 and **v3 models** (~2.0 GB total):\n\n**v1 models** (baseline):\n- `s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt` (~155 MB) - GPT v1 model\n- `s2G488k.pth` (~106 MB) - SoVITS v1 generator\n- `s2D488k.pth` (~94 MB) - SoVITS v1 discriminator\n\n**v3 models** (improved quality, included by default):\n- `s1v3.ckpt` (~149 MB) - GPT v3 model with better duration control\n- `s2Gv3.pth` (~734 MB) - SoVITS v3 generator (7x larger, higher quality)\n- `G2PWModel/` (~562 MB) - G2PW ONNX model for advanced text processing\n\n**Shared models**:\n- `chinese-hubert-base/` - Chinese HuBERT model\n- `chinese-roberta-wwm-ext-large/` - Chinese RoBERTa model for tokenization\n\n**v3 Model Advantages**:\n- Better voice quality and naturalness\n- Improved pronunciation with ML-powered G2PW text processing\n- More stable voice characteristics across different texts\n- Better handling of Chinese text (tone and pronunciation)\n\nModels are automatically downloaded from HuggingFace (`k-l-lambda/GPT-SoVITS-pretrained-models`) on first container start, or can be volume-mounted to skip download.\n\n## Configuration\n\nThe server can be configured using environment variables or a `.env.local` file in the project root.\n\n### Configuration File (.env.local)\n\nCreate a `.env.local` file to customize server settings:\n\n```bash\n# Server Configuration\nSERVER_HOST=0.0.0.0\nSERVER_PORT=8000\n\n# Audio and Text Limits\nMAX_AUDIO_LENGTH=300  # seconds\nMAX_TEXT_LENGTH=2000  # characters\n\n# GPT-SoVITS Configuration (for voice cloning)\nGPTSOVITS_API_URL=http://localhost:9880\n\n# Proxy Configuration (for model downloads)\nHTTP_PROXY=http://localhost:1091\nHTTPS_PROXY=http://localhost:1091\n\n# Logging\nLOG_LEVEL=INFO  # DEBUG, INFO, WARNING, ERROR, CRITICAL\n```\n\n### Environment Variables\n\nAll configuration options can also be set via environment variables:\n\n```bash\n# Change server port\nexport 
SERVER_PORT=9000\npython server.py\n\n# Or inline\nSERVER_PORT=9000 python server.py\n```\n\n### Restart Script\n\nA convenience script is provided to restart the server with proper cache clearing:\n\n```bash\n./restart.sh\n```\n\nThis script will:\n- Stop existing server processes\n- Clear Python cache\n- Start server with nohup (runs in background)\n- Wait for server readiness\n- Check GPT-SoVITS availability\n- Display server status and configuration\n\n## API Endpoints\n\n### Base URL: `http://localhost:8000`\n\n| Endpoint | Method | Description |\n|----------|--------|-------------|\n| `/docs` | GET | Interactive API documentation (Swagger UI) |\n| `/health` | GET | Health check |\n| `/languages` | GET | List supported languages |\n| `/tasks` | GET | List supported tasks |\n| `/v1/speech-to-text-translation` | POST | Translate speech to text (S2TT) |\n| `/v1/speech-to-speech-translation` | POST | Translate speech to speech (S2ST) |\n| `/v1/transcribe` | POST | Transcribe speech (ASR) |\n| `/v1/text-to-text-translation` | POST | Translate text to text (T2TT) |\n| `/v1/text-to-speech` | POST | Convert text to speech (TTS) |\n| `/v1/detect-voice` | POST | Detect speech segments (VAD) |\n| `/v1/separate-vocals` | POST | Separate vocals from music (requires Spleeter) |\n\n## Usage Examples\n\n### 1. Text-to-Text Translation (T2TT)\n\n```bash\ncurl -X POST \"http://localhost:8000/v1/text-to-text-translation\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"text\": \"こんにちは、今日は良い天気ですね。\",\n    \"source_lang\": \"jpn\",\n    \"target_lang\": \"cmn\"\n  }'\n```\n\n**Response:**\n```json\n{\n  \"task\": \"t2tt\",\n  \"source_language\": \"jpn\",\n  \"target_language\": \"cmn\",\n  \"input_text\": \"こんにちは、今日は良い天気ですね。\",\n  \"output_text\": \"你好，今天天气真好。\",\n  \"processing_time\": 0.45\n}\n```\n\n### 2. 
Speech-to-Text Translation (S2TT)\n\n```bash\ncurl -X POST \"http://localhost:8000/v1/speech-to-text-translation\" \\\n  -F \"audio=@japanese_audio.wav\" \\\n  -F \"target_lang=cmn\" \\\n  -F \"source_lang=jpn\"\n```\n\n**Response:**\n```json\n{\n  \"task\": \"s2tt\",\n  \"source_language\": \"jpn\",\n  \"target_language\": \"cmn\",\n  \"input_duration\": 5.2,\n  \"output_text\": \"你好，今天天气真好。\",\n  \"processing_time\": 1.85\n}\n```\n\n### 3. Transcription (ASR)\n\n```bash\ncurl -X POST \"http://localhost:8000/v1/transcribe\" \\\n  -F \"audio=@japanese_audio.wav\" \\\n  -F \"language=jpn\"\n```\n\n### 4. Speech-to-Speech Translation (S2ST)\n\n```bash\n# Get audio file directly\ncurl -X POST \"http://localhost:8000/v1/speech-to-speech-translation\" \\\n  -F \"audio=@japanese_audio.wav\" \\\n  -F \"target_lang=cmn\" \\\n  -F \"source_lang=jpn\" \\\n  -F \"response_format=audio\" \\\n  -o translated_audio.wav\n\n# Or get JSON with base64-encoded audio\ncurl -X POST \"http://localhost:8000/v1/speech-to-speech-translation\" \\\n  -F \"audio=@japanese_audio.wav\" \\\n  -F \"target_lang=cmn\" \\\n  -F \"response_format=json\"\n```\n\n### 5. 
Text-to-Speech (TTS)\n\n```bash\ncurl -X POST \"http://localhost:8000/v1/text-to-speech\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"text\": \"你好，今天天气很好\",\n    \"source_lang\": \"cmn\"\n  }'\n```\n\n**Response:**\n```json\n{\n  \"task\": \"tts\",\n  \"language\": \"cmn\",\n  \"input_text\": \"你好，今天天气很好\",\n  \"output_audio\": [0.001, -0.002, 0.003, ...],\n  \"output_sample_rate\": 16000,\n  \"processing_time\": 9.11\n}\n```\n\n**Save audio to file (Python):**\n```python\nimport requests\nimport numpy as np\nimport soundfile as sf\n\nresponse = requests.post(\n    \"http://localhost:8000/v1/text-to-speech\",\n    json={\n        \"text\": \"Hello, how are you today?\",\n        \"source_lang\": \"eng\"\n    }\n)\n\nresult = response.json()\naudio_array = np.array(result['output_audio'], dtype=np.float32)\nsample_rate = result['output_sample_rate']\n\n# Save to WAV file\nsf.write('output_speech.wav', audio_array, sample_rate)\n```\n\n### 6. Voice Activity Detection (VAD)\n\nDetect speech segments in audio files with precise timestamps.\n\n**Basic usage:**\n```bash\ncurl -X POST \"http://localhost:8000/v1/detect-voice\" \\\n  -F \"audio=@audio_file.wav\"\n```\n\n**With parameters:**\n```bash\ncurl -X POST \"http://localhost:8000/v1/detect-voice\" \\\n  -F \"audio=@audio_file.wav\" \\\n  -F \"threshold=0.5\" \\\n  -F \"min_speech_duration_ms=250\" \\\n  -F \"min_silence_duration_ms=300\"\n```\n\n**Parameters:**\n- `threshold` (float, default: 0.5): Speech detection threshold (0.0-1.0). 
Lower = more sensitive\n- `min_speech_duration_ms` (int, default: 250): Minimum speech segment duration in milliseconds\n- `min_silence_duration_ms` (int, default: 300): Minimum silence duration between segments\n\n**Response:**\n```json\n{\n  \"task\": \"vad\",\n  \"total_duration\": 3.18,\n  \"speech_segments\": [\n    {\n      \"start\": 0.258,\n      \"end\": 2.846,\n      \"duration\": 2.588\n    }\n  ],\n  \"segment_count\": 1,\n  \"total_speech_duration\": 2.588,\n  \"processing_time\": 0.064\n}\n```\n\n**Use cases:**\n- Intelligent audio segmentation for long recordings\n- Silence removal for efficient processing\n- Speech quality analysis\n- Preprocessing for translation workflows\n\n**Performance:** ~40-50x faster than real-time (3s audio processed in 0.06s)\n\n### 7. Vocal Separation (Audio Preprocessing)\n\nSeparate vocals from background music using Spleeter. Useful for improving speech recognition/translation quality on audio with music.\n\n**Note:** Requires Spleeter installation: `pip install spleeter`\n\n```bash\ncurl -X POST \"http://localhost:8000/v1/separate-vocals\" \\\n  -F \"audio=@audio_with_music.wav\" \\\n  | python3 -c \"import sys, json, base64; d=json.load(sys.stdin); open('vocals.wav','wb').write(base64.b64decode(d['vocals_audio_base64']))\"\n```\n\n**Response:**\n```json\n{\n  \"task\": \"separate\",\n  \"input_duration\": 5.2,\n  \"vocals_audio_base64\": \"UklGRiQAAABXQVZFZm10...\",\n  \"sample_rate\": 16000,\n  \"processing_time\": 3.45,\n  \"separator_available\": true\n}\n```\n\n**Optional preprocessing for translation:**\n\nYou can automatically separate vocals before translation by adding `separate_vocals=true`:\n\n```bash\n# Speech-to-text translation with vocal separation\ncurl -X POST \"http://localhost:8000/v1/speech-to-text-translation\" \\\n  -F \"audio=@audio_with_music.wav\" \\\n  -F \"target_lang=cmn\" \\\n  -F \"source_lang=jpn\" \\\n  -F \"separate_vocals=true\"\n\n# Speech-to-speech translation with vocal 
separation\ncurl -X POST \"http://localhost:8000/v1/speech-to-speech-translation\" \\\n  -F \"audio=@audio_with_music.wav\" \\\n  -F \"target_lang=cmn\" \\\n  -F \"separate_vocals=true\" \\\n  -F \"response_format=audio\" \\\n  -o translated_vocals.wav\n```\n\nIf Spleeter is not installed, the parameter is ignored and processing continues without separation (with a warning).\n\n### 8. Audio Split (Source Separation)\n\n**Endpoint:** `POST /v1/audio-split`\n\nSplit audio into two separate streams: vocals and accompaniment (background music).\n\n**Use cases:**\n- Extract clean vocals for further processing\n- Create karaoke versions (accompaniment only)\n- Remix or mashup production\n- Audio analysis and research\n\n**Features:**\n- ✅ Automatic chunked processing for long audio files (\u003e10 minutes)\n- ✅ No duration limit - processes audio of any length\n- ✅ GPU-accelerated separation using Spleeter's 2-stem model\n\n**Note:** Requires Spleeter installation: `pip install spleeter`\n\n**Basic usage:**\n\n```bash\n# Split audio and save both streams\ncurl -X POST \"http://localhost:8000/v1/audio-split\" \\\n  -F \"audio=@song.wav\" \\\n  -o output.json\n\n# Extract vocals only\njq -r '.vocals_audio_base64' output.json | base64 -d \u003e vocals.wav\n\n# Extract accompaniment only\njq -r '.accompaniment_audio_base64' output.json | base64 -d \u003e accompaniment.wav\n\n# Extract both streams in one command\ncat output.json | jq -r '.vocals_audio_base64' | base64 -d \u003e vocals.wav \u0026\u0026 \\\ncat output.json | jq -r '.accompaniment_audio_base64' | base64 -d \u003e accompaniment.wav\n```\n\n**One-liner to extract both streams:**\n\n```bash\ncurl -s -X POST \"http://localhost:8000/v1/audio-split\" \\\n  -F \"audio=@song.wav\" \\\n  | tee \u003e(jq -r '.vocals_audio_base64' | base64 -d \u003e vocals.wav) \\\n  | jq -r '.accompaniment_audio_base64' | base64 -d \u003e accompaniment.wav\n```\n\n**Response:**\n\n```json\n{\n  \"task\": \"audio_split\",\n  
\"input_duration\": 5.2,\n  \"vocals_audio_base64\": \"UklGRiQAAABXQVZFZm10...\",\n  \"accompaniment_audio_base64\": \"UklGRkgBAABXQVZFZm10...\",\n  \"sample_rate\": 16000,\n  \"processing_time\": 3.45,\n  \"separator_available\": true\n}\n```\n\n**Python example:**\n\n```python\nimport requests\nimport base64\n\n# Split audio\nwith open(\"song.wav\", \"rb\") as f:\n    response = requests.post(\n        \"http://localhost:8000/v1/audio-split\",\n        files={\"audio\": f}\n    )\n\nresult = response.json()\n\n# Save vocals\nwith open(\"vocals.wav\", \"wb\") as f:\n    f.write(base64.b64decode(result['vocals_audio_base64']))\n\n# Save accompaniment\nwith open(\"accompaniment.wav\", \"wb\") as f:\n    f.write(base64.b64decode(result['accompaniment_audio_base64']))\n\nprint(f\"Separated {result['input_duration']:.2f}s audio in {result['processing_time']:.2f}s\")\n```\n\n**Performance:**\n- Short audio (\u003c10 min): ~0.6-0.7x real-time (5s audio processed in 3.5s on GPU)\n- Long audio (\u003e10 min): Processed in 5-minute chunks with automatic concatenation\n\n**Difference from `/v1/separate-vocals`:**\n- `/v1/separate-vocals`: Returns vocals only (single stream)\n- `/v1/audio-split`: Returns BOTH vocals and accompaniment (two streams)\n\n### 9. 
Voice Cloning (GPT-SoVITS)\n\n**Endpoint:** `POST /v1/voice-clone`\n\nClone a speaker's voice from a reference audio and generate speech with the same voice characteristics.\n\n**Features:**\n- Direct Python integration (no external service needed)\n- Supports multiple languages: Chinese (zh), English (en), Japanese (ja), Korean (ko), etc.\n- **Automatic language code mapping**: Accepts both SeamlessM4T codes (eng, cmn, jpn) and GPT-SoVITS codes (en, zh, ja)\n- High-quality voice cloning using GPT-SoVITS\n- Auto-downloads language detection models on first use\n\n**Installation:**\n\nThe voice cloning feature requires additional dependencies:\n\n```bash\n# Install GPT-SoVITS dependencies\ncd /home/camus/work/m4t\n./env/bin/pip install cn2an num2words eng_to_ipa fugashi[unidic-lite] unidic-lite\n\n# GPT-SoVITS models are already included in third_party/GPT-SoVITS/\n```\n\n**First-time setup:**\nOn first use with Chinese text, fast-langdetect will automatically download a 126MB language detection model. 
This only happens once.\n\n**Example (save directly to WAV file):**\n\n```bash\n# Using SeamlessM4T language codes (eng, cmn, jpn)\ncurl -s -X POST \"http://localhost:8000/v1/voice-clone\" \\\n  -F \"audio=@reference_audio.wav\" \\\n  -F \"text=Hello, this is a voice cloning test.\" \\\n  -F \"text_language=eng\" \\\n  -F \"prompt_text=Original text from reference audio\" \\\n  -F \"prompt_language=eng\" \\\n  | jq -r '.output_audio_base64' | base64 -d \u003e cloned_voice.wav\n\n# Using GPT-SoVITS language codes (en, zh, ja) - also supported\ncurl -s -X POST \"http://localhost:8000/v1/voice-clone\" \\\n  -F \"audio=@reference_audio.wav\" \\\n  -F \"text=Hello, this is a voice cloning test.\" \\\n  -F \"text_language=en\" \\\n  -F \"prompt_text=Original text from reference audio\" \\\n  -F \"prompt_language=en\" \\\n  | jq -r '.output_audio_base64' | base64 -d \u003e cloned_voice.wav\n\n# Chinese text with English reference audio\ncurl -s -X POST \"http://localhost:8000/v1/voice-clone\" \\\n  -F \"audio=@reference_audio.wav\" \\\n  -F \"text=你好，这是一个中文语音克隆测试。\" \\\n  -F \"text_language=cmn\" \\\n  -F \"prompt_text=Original English text from reference\" \\\n  -F \"prompt_language=eng\" \\\n  | jq -r '.output_audio_base64' | base64 -d \u003e chinese_cloned.wav\n\n# With fixed seed for reproducible results\ncurl -s -X POST \"http://localhost:8000/v1/voice-clone\" \\\n  -F \"audio=@reference_audio.wav\" \\\n  -F \"text=Hello, this is a test.\" \\\n  -F \"text_language=eng\" \\\n  -F \"prompt_text=Original text from reference audio\" \\\n  -F \"prompt_language=eng\" \\\n  -F \"seed=42\" \\\n  | jq -r '.output_audio_base64' | base64 -d \u003e reproducible_voice.wav\n```\n\n**Parameters:**\n- `audio`: Reference audio file (WAV format recommended, 5-30 seconds)\n- `text`: Text to synthesize in the target language\n- `text_language`: Language code - Supports both:\n  - SeamlessM4T codes: `eng` (English), `cmn` (Chinese), `jpn` (Japanese), `kor` (Korean)\n  - GPT-SoVITS codes: 
`en` (English), `zh` (Chinese), `ja` (Japanese), `ko` (Korean)\n- `prompt_text`: Transcription of the reference audio (what is being said)\n- `prompt_language`: Language of the reference audio (supports both code formats)\n- `cut_punc` (optional): Punctuation for text segmentation\n- `seed` (optional): Random seed for reproducibility\n  - `-1` (default): Random generation (different result each time)\n  - `0-1000000`: Fixed seed for reproducible results (same result with same seed)\n\n**Response:**\n```json\n{\n  \"task\": \"voice_clone\",\n  \"output_audio_base64\": \"UklGRiQAAABXQVZFZm10...\",\n  \"output_sample_rate\": 32000,\n  \"text_length\": 35,\n  \"output_duration\": 3.5,\n  \"processing_time\": 2.1,\n  \"service_available\": true\n}\n```\n\n**Performance:**\n- English text: ~1-2 seconds processing time\n- Chinese text (first time): ~28 seconds (includes model download)\n- Chinese text (subsequent): ~1-2 seconds\n- Output: 32kHz mono WAV audio\n\n**Notes:**\n- Reference audio should be clear and noise-free for best results\n- Longer reference audio (10-30 seconds) generally produces better quality\n- The cloned voice will maintain the speaker's characteristics but speak the new text\n- GPU recommended for faster processing (uses ~4GB VRAM)\n- **Reproducibility**: Set `seed` parameter to a fixed value (e.g., 42) for reproducible results across multiple runs\n  - Useful for A/B testing, debugging, or when consistent output is needed\n  - Use default `-1` for natural variety in voice generation\n\n### Python Client Example\n\n```python\nimport requests\nimport numpy as np\nimport soundfile as sf\n\n# Text translation\nresponse = requests.post(\n    \"http://localhost:8000/v1/text-to-text-translation\",\n    json={\n        \"text\": \"Hello\",\n        \"source_lang\": \"eng\",\n        \"target_lang\": \"cmn\"\n    }\n)\nprint(response.json()[\"output_text\"])\n\n# Speech translation\nwith open(\"audio.wav\", \"rb\") as f:\n    response = requests.post(\n    
    \"http://localhost:8000/v1/speech-to-text-translation\",\n        files={\"audio\": f},\n        data={\"target_lang\": \"cmn\", \"source_lang\": \"jpn\"}\n    )\nprint(response.json()[\"output_text\"])\n\n# Text-to-Speech\nresponse = requests.post(\n    \"http://localhost:8000/v1/text-to-speech\",\n    json={\n        \"text\": \"你好，今天天气很好\",\n        \"source_lang\": \"cmn\"\n    }\n)\nresult = response.json()\naudio_array = np.array(result['output_audio'], dtype=np.float32)\nsf.write('chinese_speech.wav', audio_array, result['output_sample_rate'])\n```\n\n**Full example script:** See `tts_example.py` for complete TTS usage examples.\n\n## Supported Languages\n\nCommon language codes:\n\n| Code | Language |\n|------|----------|\n| `jpn` | Japanese |\n| `cmn` | Chinese (Simplified) |\n| `cmn_Hant` | Chinese (Traditional) |\n| `yue` | Cantonese |\n| `kor` | Korean |\n| `eng` | English |\n| `fra` | French |\n| `deu` | German |\n| `spa` | Spanish |\n| `rus` | Russian |\n| `ara` | Arabic |\n| `hin` | Hindi |\n| `tha` | Thai |\n| `vie` | Vietnamese |\n\n**Full list:** GET `/languages` or check `config.py`\n\n## System Requirements\n\n### Hardware\n- **GPU:** Recommended (NVIDIA with 24GB+ VRAM)\n  - Model uses ~7GB VRAM in FP16\n  - Can run on CPU but much slower\n- **RAM:** 16GB+ recommended\n- **Disk:** 10GB+ (for model cache)\n\n### Software\n- Python 3.8+\n- CUDA 11.8+ (for GPU)\n- Docker + nvidia-docker (for Docker deployment)\n\n## Testing\n\nRun test suite:\n\n```bash\npython test_api.py\n```\n\nTests include:\n- Health check\n- Language and task listing\n- Text-to-text translation\n- Speech-to-text translation (requires audio file)\n- Transcription (requires audio file)\n- Speech-to-speech translation (requires audio file)\n\nAdd test audio file at `examples/test_audio.wav` for full testing.\n\n## Configuration\n\nEdit `config.py` to customize:\n\n- **Model:** Change `MODEL_NAME` for different model sizes\n- **Device:** Set `DEVICE` to \"cuda\" or 
\"cpu\"\n- **Proxy:** Configure `HTTP_PROXY` for model downloads\n- **Server:** Modify `SERVER_HOST` and `SERVER_PORT`\n- **Languages:** Add/remove from `SUPPORTED_LANGUAGES`\n\n## Troubleshooting\n\n### Model Download Issues\n\nIf model download fails or hangs:\n\n```bash\n# Set proxy\nexport HTTP_PROXY=http://localhost:1091\nexport HTTPS_PROXY=http://localhost:1091\n\n# Pre-download model\nhuggingface-cli download facebook/seamless-m4t-v2-large\n```\n\n### GPU Memory Issues\n\nIf running out of VRAM:\n1. Use smaller model: `facebook/seamless-m4t-medium`\n2. Reduce batch size (process one audio at a time)\n3. Use CPU mode (set `DEVICE=\"cpu\"` in config.py)\n\n### Audio Format Issues\n\nSupported formats: WAV, MP3, FLAC, M4A, OGG\n\nIf audio fails to process:\n- Ensure audio is mono or stereo (will be converted)\n- Check sample rate (will be resampled to 16kHz)\n- Verify file is not corrupted\n\n## API Documentation\n\nInteractive documentation available at:\n- **Swagger UI:** http://localhost:8000/docs\n- **ReDoc:** http://localhost:8000/redoc\n\n## Performance\n\nTypical processing times on NVIDIA H20 GPU:\n\n| Task | Duration | Processing Time | Throughput |\n|------|----------|-----------------|------------|\n| T2TT | - | 0.3-0.5s | ~100 requests/min |\n| TTS | - | 8-10s | ~6-8 requests/min |\n| S2TT | 5s audio | 1-2s | ~30 audio files/min |\n| ASR | 5s audio | 1-2s | ~30 audio files/min |\n| S2ST | 5s audio | 3-5s | ~15 audio files/min |\n\n## Project Structure\n\n```\nm4t/\n├── config.py              # Configuration settings\n├── models.py              # Model loading and inference\n├── server.py              # FastAPI application\n├── voice_detector.py      # Voice activity detection (Silero VAD)\n├── audio_separator.py     # Audio source separation (Spleeter)\n├── requirements.txt       # Python dependencies\n├── Dockerfile            # Docker image definition\n├── docker-compose.yml    # Docker Compose configuration\n├── start_dev.sh          # 
Development startup script\n├── start_docker.sh       # Docker startup script\n├── test_api.py           # API test suite\n├── tts_example.py        # Text-to-Speech examples\n├── commands.local.sh     # Quick test commands\n├── README.md             # This file\n└── examples/             # Example files\n    └── test_audio.wav    # Test audio file\n```\n\n## License\n\nThis API wrapper is open source. The SeamlessM4T v2 model is released under the CC BY-NC 4.0 license (non-commercial use only).\n\n## References\n\n- **SeamlessM4T:** https://github.com/facebookresearch/seamless_communication\n- **Model Card:** https://huggingface.co/facebook/seamless-m4t-v2-large\n- **Paper:** [SeamlessM4T—Massively Multilingual \u0026 Multimodal Machine Translation](https://ai.meta.com/research/publications/seamless-m4t/)\n\n## Support\n\nFor issues and questions:\n- Check the `/docs` endpoint for API documentation\n- Review logs: `docker logs seamless-m4t-api`\n- Test with `test_api.py`\n\n---\n\n**Built with ❤️ using Meta SeamlessM4T v2**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fk-l-lambda%2Fm4t","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fk-l-lambda%2Fm4t","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fk-l-lambda%2Fm4t/lists"}