{"id":31730886,"url":"https://github.com/openmoss/moss-speech","last_synced_at":"2025-10-09T07:45:24.463Z","repository":{"id":317352696,"uuid":"1066945228","full_name":"OpenMOSS/MOSS-Speech","owner":"OpenMOSS","description":"Official implementation of MOSS-Speech, a true speech-to-speech large language model without text guidance.","archived":false,"fork":false,"pushed_at":"2025-09-30T11:43:14.000Z","size":12134,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-30T11:45:25.973Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenMOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-30T07:08:51.000Z","updated_at":"2025-09-30T11:43:17.000Z","dependencies_parsed_at":null,"dependency_job_id":"9aa8953e-fef7-4152-b08f-aa5eec6c5ccb","html_url":"https://github.com/OpenMOSS/MOSS-Speech","commit_stats":null,"previous_names":["openmoss/moss-speech"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/OpenMOSS/MOSS-Speech","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FMOSS-Speech","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FMOSS-Speech/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FMOSS-Speech/releases","manifests_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FMOSS-Speech/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenMOSS","download_url":"https://codeload.github.com/OpenMOSS/MOSS-Speech/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FMOSS-Speech/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279000975,"owners_count":26082974,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-09T02:00:07.460Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-09T07:45:20.062Z","updated_at":"2025-10-09T07:45:24.457Z","avatar_url":"https://github.com/OpenMOSS.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance\n\n\u003cdiv align=\"center\" style=\"line-height: 1;\"\u003e\n    \u003ca href=\"https://huggingface.co/spaces/fnlp/MOSS-Speech\" target=\"_blank\" style=\"margin: 2px;\"\u003e\n        \u003cimg alt=\"Chat\" src=\"https://img.shields.io/badge/🤖%20Demo-MOSS--Speech-536af5?color=ffc107\u0026logoColor=white\" style=\"display: inline-block; vertical-align: middle;\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://moss-speech.open-moss.com/\" target=\"_blank\" style=\"margin: 
2px;\"\u003e\n    \u003cimg alt=\"Video Demo\" src=\"https://img.shields.io/badge/📹%20Video%20Demo-MOSS--Speech-536af5?color=1ae3f5\u0026logoColor=white\" style=\"display: inline-block; vertical-align: middle;\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://arxiv.org/abs/2510.00499\" target=\"_blank\" style=\"margin: 2px;\"\u003e\n    \u003cimg alt=\"Technical Report\" src=\"https://img.shields.io/badge/📄%20Technical%20Report-MOSS--Speech-4caf50?color=4caf50\u0026logoColor=white\" style=\"display: inline-block; vertical-align: middle;\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://huggingface.co/collections/fnlp/moss-speech-68dbab23bc98501afede0cd3\" target=\"_blank\" style=\"margin: 2px;\"\u003e\n        \u003cimg alt=\"Hugging Face\" src=\"https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-MOSS--Speech-ffc107?color=ffc107\u0026logoColor=white\" style=\"display: inline-block; vertical-align: middle;\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://x.com/Open_MOSS\" target=\"_blank\" style=\"margin: 2px;\"\u003e\n    \u003cimg alt=\"X Follow\" src=\"https://img.shields.io/badge/Twitter-OpenMOSS-black?logo=x\u0026logoColor=white\" style=\"display: inline-block; vertical-align: middle;\"/\u003e\n    \u003c/a\u003e\n\u003c/div\u003e\n\n![Logo](assets/logo-large.png)\n\n\n\nRead the [Chinese](./README_ZH.md) version.\n\n---\n\n## 📖 Introduction\n\nMOSS-Speech introduces true end-to-end speech interaction. Unlike cascaded pipelines or text-guided models, it directly generates speech without first producing text. 
Our design not only overcomes the limitation of generated speech being constrained by a text bottleneck, but also inherits the knowledge of the pretrained text language model, thereby enabling more natural and efficient speech-to-speech dialogue.\n\n![Architecture Comparison](assets/compare.png)\n\nWe add modality-based layer-splitting to a pretrained text LLM, and follow a frozen pre-training strategy to preserve the LLM's capabilities while extending it to the speech modality.\n\n![Architecture](assets/arch.png)\n\nCheck out our [video demo](https://moss-speech.open-moss.com/) and [live demo](https://huggingface.co/spaces/fnlp/MOSS-Speech).\n\nThe technical report is available at [arXiv:2510.00499](https://arxiv.org/abs/2510.00499).\n\n---\n\n## 🔑 Key Features\n\n\n- **True Speech-to-Speech Modeling**: No text guidance required.\n- **Layer-Splitting Architecture**: Integrates modality-specific layers on top of pretrained text LLM backbones.\n- **Frozen Pre-Training Strategy**: Preserves LLM abilities while extending to the speech modality.\n- **State-of-the-Art Performance**: Excels in spoken question answering and speech-to-speech tasks.\n\n---\n\n## 🛠️ Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/OpenMOSS/MOSS-Speech\ncd MOSS-Speech\n\n# Install dependencies\npip install -r requirements.txt\ngit submodule update --init --recursive\n```\n\n---\n\n## 🚀 Usage\n\nLaunch the web demo:\n\n```sh\npython3 gradio_demo.py\n```\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/gradio.jpg\" width=\"80%\"\u003e \u003cbr\u003e\n\u003c/p\u003e\n\n\n---\n\n## Next Steps\n\n- [ ] **Open source base model**: Release the MOSS-Speech-Base model for community use\n- [ ] **Support streaming output in Gradio**: Implement streaming output for lower response latency in the web demo\n\n---\n\n## License\n- The code in this repository is released under the [Apache 2.0](LICENSE) license.\n\n---\n\n## Acknowledgements\n- [Qwen](https://github.com/QwenLM/Qwen3): 
We use Qwen3-8B as the base model.\n- We thank an anonymous colleague for Character Voice!\n\n---\n\n## 📜 Citation\n\nIf you use this repository or model in your research, please cite:\n\n```bibtex\n@misc{zhao2025mossspeechtruespeechtospeechmodels,\n      title={MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance}, \n      author={Xingjian Zhao and Zhe Xu and Luozhijie Jin and Yang Wang and Hanfu Chen and Yaozhou Jiang and Ke Chen and Ruixiao Li and Mingshu Chen and Ruiming Wang and Wenbo Zhang and Yiyang Zhang and Donghua Yu and Yang Gao and Xiaogui Yang and Yitian Gong and Yuanfan Xu and Qinyuan Cheng and Zhaoye Fei and Shimin Li and Yaqian Zhou and Xuanjing Huang and Xipeng Qiu},\n      year={2025},\n      eprint={2510.00499},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2510.00499}, \n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenmoss%2Fmoss-speech","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenmoss%2Fmoss-speech","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenmoss%2Fmoss-speech/lists"}