{"id":27853542,"url":"https://github.com/Blaizzy/mlx-audio","last_synced_at":"2025-05-04T07:01:36.032Z","repository":{"id":279991963,"uuid":"895253710","full_name":"Blaizzy/mlx-audio","owner":"Blaizzy","description":"A text-to-speech (TTS) and Speech-to-Speech (STS) library built on Apple's MLX framework, providing efficient speech synthesis on Apple Silicon.","archived":false,"fork":false,"pushed_at":"2025-05-01T19:20:06.000Z","size":3607,"stargazers_count":723,"open_issues_count":36,"forks_count":65,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-05-01T20:29:14.916Z","etag":null,"topics":["apple-silicon","audio-processing","mlx","multimodal","speech-recognition","speech-synthesis","speech-to-text","text-to-speech","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Blaizzy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":"Blaizzy","patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"lfx_crowdfunding":null,"polar":null,"buy_me_a_coffee":null,"thanks_dev":null,"custom":null}},"created_at":"2024-11-27T21:14:34.000Z","updated_at":"2025-05-01T09:15:51.000Z","dependencies_parsed_at":"2025-02-28T20:58:18.657Z","dependency_job_id":"132bf025-700f-438e-8ada-e3465789b1c6","html_url":"https://github.com/Blaizzy/mlx-audio","commit_stats":null,"previous_names":["blaizzy/mlx-audio"],"tags_count":5,"template":false,"template_full_name":null,"reposi
tory_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-audio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-audio/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-audio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blaizzy%2Fmlx-audio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Blaizzy","download_url":"https://codeload.github.com/Blaizzy/mlx-audio/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252299530,"owners_count":21725721,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple-silicon","audio-processing","mlx","multimodal","speech-recognition","speech-synthesis","speech-to-text","text-to-speech","transformers"],"created_at":"2025-05-04T07:01:34.910Z","updated_at":"2025-05-04T07:01:36.012Z","avatar_url":"https://github.com/Blaizzy.png","language":"Python","funding_links":["https://github.com/sponsors/Blaizzy"],"categories":["Python","text-to-speech","Repos","语音识别与合成_其他","🤖 AI \u0026 Machine Learning","Audio \u0026 Speech"],"sub_categories":["资源传输下载"],"readme":"# MLX-Audio\n\nA text-to-speech (TTS) and Speech-to-Speech (STS) library built on Apple's MLX framework, providing efficient speech synthesis on Apple Silicon.\n\n## Features\n\n- Fast inference on Apple Silicon (M series chips)\n- Multiple language support\n- Voice customization options\n- Adjustable speech speed control (0.5x to 2.0x)\n- Interactive web interface 
with 3D audio visualization\n- REST API for TTS generation\n- Quantization support for optimized performance\n- Direct access to output files via Finder/Explorer integration\n\n## Installation\n\n```bash\n# Install the package\npip install mlx-audio\n\n# For web interface and API dependencies\npip install -r requirements.txt\n```\n\n### Quick Start\n\nTo generate audio from the command line, use:\n\n```bash\n# Basic usage\nmlx_audio.tts.generate --text \"Hello, world\"\n\n# Specify prefix for output file\nmlx_audio.tts.generate --text \"Hello, world\" --file_prefix hello\n\n# Adjust speaking speed (0.5-2.0)\nmlx_audio.tts.generate --text \"Hello, world\" --speed 1.4\n```\n\n### How to call from Python\n\nTo generate audio from Python, use:\n\n```python\nfrom mlx_audio.tts.generate import generate_audio\n\n# Example: Generate an audiobook chapter as WAV audio\ngenerate_audio(\n    text=(\"In the beginning, the universe was created...\\n\"\n        \"...or the simulation was booted up.\"),\n    model_path=\"prince-canuma/Kokoro-82M\",\n    voice=\"af_heart\",\n    speed=1.2,\n    lang_code=\"a\", # Kokoro: (a)f_heart, or comment out for auto\n    file_prefix=\"audiobook_chapter1\",\n    audio_format=\"wav\",\n    sample_rate=24000,\n    join_audio=True,\n    verbose=True  # Set to False to disable print messages\n)\n\nprint(\"Audiobook chapter successfully generated!\")\n\n```\n\n### Web Interface \u0026 API Server\n\nMLX-Audio includes a web interface with a 3D visualization that reacts to audio frequencies. The interface allows you to:\n\n1. Generate TTS with different voices and speed settings\n2. Upload and play your own audio files\n3. Visualize audio with an interactive 3D orb\n4. Automatically save generated audio files to the outputs directory in the current working folder\n5. 
Open the output folder directly from the interface (when running locally)\n\n#### Features\n\n- **Multiple Voice Options**: Choose from different voice styles (AF Heart, AF Nova, AF Bella, BF Emma)\n- **Adjustable Speech Speed**: Control the speed of speech generation with an interactive slider (0.5x to 2.0x)\n- **Real-time 3D Visualization**: A responsive 3D orb that reacts to audio frequencies\n- **Audio Upload**: Play and visualize your own audio files\n- **Auto-play Option**: Automatically play generated audio\n- **Output Folder Access**: Convenient button to open the output folder in your system's file explorer\n\nTo start the web interface and API server:\n\n```bash\n# Using the command-line interface\nmlx_audio.server\n\n# With custom host and port\nmlx_audio.server --host 0.0.0.0 --port 9000\n\n# With verbose logging\nmlx_audio.server --verbose\n```\n\nAvailable command line arguments:\n- `--host`: Host address to bind the server to (default: 127.0.0.1)\n- `--port`: Port to bind the server to (default: 8000)\n\nThen open your browser and navigate to:\n```\nhttp://127.0.0.1:8000\n```\n\n#### API Endpoints\n\nThe server provides the following REST API endpoints:\n\n- `POST /tts`: Generate TTS audio\n  - Parameters (form data):\n    - `text`: The text to convert to speech (required)\n    - `voice`: Voice to use (default: \"af_heart\")\n    - `speed`: Speech speed from 0.5 to 2.0 (default: 1.0)\n  - Returns: JSON with filename of generated audio\n\n- `GET /audio/{filename}`: Retrieve generated audio file\n\n- `POST /play`: Play audio directly from the server\n  - Parameters (form data):\n    - `filename`: The filename of the audio to play (required)\n  - Returns: JSON with status and filename\n\n- `POST /stop`: Stop any currently playing audio\n  - Returns: JSON with status\n\n- `POST /open_output_folder`: Open the output folder in the system's file explorer\n  - Returns: JSON with status and path\n  - Note: This feature only works when running the server 
locally\n\n\u003e Note: Generated audio files are stored in `~/.mlx_audio/outputs` by default, or in a fallback directory if that location is not writable.\n\n## Models\n\n### Kokoro\n\nKokoro is a multilingual TTS model that supports various languages and voice styles.\n\n#### Example Usage\n\n```python\nfrom mlx_audio.tts.models.kokoro import KokoroPipeline\nfrom mlx_audio.tts.utils import load_model\nfrom IPython.display import Audio, display\nimport soundfile as sf\n\n# Initialize the model\nmodel_id = 'prince-canuma/Kokoro-82M'\nmodel = load_model(model_id)\n\n# Create a pipeline with American English\npipeline = KokoroPipeline(lang_code='a', model=model, repo_id=model_id)\n\n# Generate audio\ntext = \"The MLX King lives. Let him cook!\"\nfor _, _, audio in pipeline(text, voice='af_heart', speed=1, split_pattern=r'\\n+'):\n    # Display audio in notebook (if applicable)\n    display(Audio(data=audio, rate=24000, autoplay=0))\n\n    # Save audio to file\n    sf.write('audio.wav', audio[0], 24000)\n```\n\n#### Language Options\n\n- 🇺🇸 `'a'` - American English\n- 🇬🇧 `'b'` - British English\n- 🇯🇵 `'j'` - Japanese (requires `pip install misaki[ja]`)\n- 🇨🇳 `'z'` - Mandarin Chinese (requires `pip install misaki[zh]`)\n\n### CSM (Conversational Speech Model)\n\nCSM is a model from Sesame that supports text-to-speech and lets you customize voices using reference audio samples.\n\n#### Example Usage\n\n```bash\n# Generate speech using the CSM-1B model with reference audio\npython -m mlx_audio.tts.generate --model mlx-community/csm-1b --text \"Hello from Sesame.\" --play --ref_audio ./conversational_a.wav\n```\n\nYou can pass any audio file to clone the voice from, or download a sample audio file from [here](https://huggingface.co/mlx-community/csm-1b/tree/main/prompts).\n\n## Advanced Features\n\n### Quantization\n\nYou can quantize models for improved performance:\n\n```python\nfrom mlx_audio.tts.utils import quantize_model, load_model\nimport json\nimport mlx.core as mx\n\nmodel = 
load_model(repo_id='prince-canuma/Kokoro-82M')\nconfig = model.config\n\n# Quantize to 8-bit\ngroup_size = 64\nbits = 8\nweights, config = quantize_model(model, config, group_size, bits)\n\n# Save quantized model\nwith open('./8bit/config.json', 'w') as f:\n    json.dump(config, f)\n\nmx.save_safetensors(\"./8bit/kokoro-v1_0.safetensors\", weights, metadata={\"format\": \"mlx\"})\n```\n\n## Requirements\n\n- MLX\n- Python 3.8+\n- Apple Silicon Mac (for optimal performance)\n- For the web interface and API:\n  - FastAPI\n  - Uvicorn\n  \n## License\n\n[MIT License](LICENSE)\n\n## Acknowledgements\n\n- Thanks to the Apple MLX team for providing a great framework for building TTS and STS models.\n- This project uses the Kokoro model architecture for text-to-speech synthesis.\n- The 3D visualization uses Three.js for rendering.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBlaizzy%2Fmlx-audio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FBlaizzy%2Fmlx-audio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FBlaizzy%2Fmlx-audio/lists"}