{"id":42994031,"url":"https://github.com/openmoss/mova","last_synced_at":"2026-02-04T07:00:40.258Z","repository":{"id":335267781,"uuid":"1144211118","full_name":"OpenMOSS/MOVA","owner":"OpenMOSS","description":"MOVA: Towards Scalable and Synchronized Video–Audio Generation","archived":false,"fork":false,"pushed_at":"2026-01-31T03:07:59.000Z","size":3391,"stargazers_count":410,"open_issues_count":9,"forks_count":24,"subscribers_count":6,"default_branch":"main","last_synced_at":"2026-02-02T13:23:29.281Z","etag":null,"topics":["diffusion-models","multimodal","sglang","video-audio-generation"],"latest_commit_sha":null,"homepage":"https://mosi.cn/models/mova","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenMOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-28T12:31:50.000Z","updated_at":"2026-02-02T13:20:25.000Z","dependencies_parsed_at":"2026-02-01T04:00:45.002Z","dependency_job_id":null,"html_url":"https://github.com/OpenMOSS/MOVA","commit_stats":null,"previous_names":["openmoss/mova"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/OpenMOSS/MOVA","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FMOVA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FMOVA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FMOVA/releases","manifests_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FMOVA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenMOSS","download_url":"https://codeload.github.com/OpenMOSS/MOVA/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenMOSS%2FMOVA/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29035219,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-03T02:28:16.591Z","status":"ssl_error","status_checked_at":"2026-02-03T02:27:48.904Z","response_time":96,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["diffusion-models","multimodal","sglang","video-audio-generation"],"created_at":"2026-01-31T03:12:19.533Z","updated_at":"2026-02-04T07:00:40.023Z","avatar_url":"https://github.com/OpenMOSS.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./assets/logo.png\" width=\"400\"/\u003e\n\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003ca href=\"https://github.com/OpenMOSS/MOVA\"\u003e\u003cimg src=\"https://img.shields.io/badge/Github-Star-yellow?logo=Github\u0026amp\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://huggingface.co/collections/OpenMOSS-Team/mova\"\u003e\u003cimg 
src=\"https://img.shields.io/badge/Huggingface-Download-orange?logo=Huggingface\u0026amp\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://mosi.cn/models/mova\"\u003e\u003cimg src=\"https://img.shields.io/badge/Website-View-blue?logo=Website\u0026amp\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/OpenMOSS/MOVA\"\u003e\u003cimg src=\"https://img.shields.io/badge/Arxiv-Coming soon-red?logo=Arxiv\u0026amp\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\u003cdiv align=\"center\"\u003e\n    \u003ca href=\"https://discord.gg/J2BBgVMRVZ\"\u003e\u003cimg src=\"https://img.shields.io/badge/Discord-Join-blueviolet?logo=discord\u0026amp\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://x.com/Open_MOSS\"\u003e\u003cimg src=\"https://img.shields.io/badge/X-Follow-blue?logo=x\u0026amp\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://gist.github.com/user-attachments/assets/abf31f41-55d3-4e4e-9f25-966bf6d23fc1\"\u003e\u003cimg src=\"https://img.shields.io/badge/Wechat-Join-green?logo=wechat\u0026amp\"\u003e\u003c/a\u003e\n\u003c/div\u003e\n\n## MOVA: Towards Scalable and Synchronized Video–Audio Generation\nWe introduce **MOVA** (**MO**SS **V**ideo and **A**udio), a foundation model designed to break the \"silent era\" of open-source video generation. Unlike cascaded pipelines that generate sound as an afterthought, MOVA synthesizes video and audio simultaneously for perfect alignment.\n\n🌟Key Highlights\n- **Native Bimodal Generation**: Moves beyond clunky cascaded pipelines. MOVA generates high-fidelity video and synchronized audio in a single inference pass, eliminating error accumulation.\n- **Precise Lip-Sync \u0026 Sound FX**: Achieves state-of-the-art performance in multilingual lip-synchronization and environment-aware sound effects.\n- **Fully Open-Source**: In a field dominated by closed-source models (Sora 2, Veo 3, Kling), we are releasing model weights, inference code, training pipelines, and LoRA fine-tuning scripts. 
\n- **Asymmetric Dual-Tower Architecture**: Leverages the power of pre-trained video and audio towers, fused via a bidirectional cross-attention mechanism for rich modality interaction.\n\n## 🔥News!!!\n- 2026/01/29: 🎉We released **MOVA**, an open-source foundation model for high-fidelity synchronized video–audio generation!!!\n\n## 🎬Demo\n\u003cdiv align=\"center\"\u003e\n  \u003cvideo src=\"https://gist.github.com/user-attachments/assets/cee573cc-56ce-4987-beef-0b374e1ed3b7\" width=\"70%\" poster=\"\"\u003e \u003c/video\u003e\n\u003c/div\u003e\n\nSingle-person speech:\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n  \u003cvideo src=\"https://gist.github.com/user-attachments/assets/118a6597-054b-4bb9-812a-c225e93f12f7\" width=\"70%\"\u003e\u003c/video\u003e\n\u003c/details\u003e\n\nMulti-person speech:\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand\u003c/summary\u003e\n  \u003cvideo src=\"https://gist.github.com/user-attachments/assets/a11b1d1e-b0da-4c45-9aeb-c74a64131b6d\" width=\"70%\"\u003e\u003c/video\u003e\n\u003c/details\u003e\n\nView more demos on our [website](https://mosi.cn/models/mova).\n\n## 🚀Quick Start\n### Environment Setup\n```\nconda create -n mova python=3.13 -y\nconda activate mova\npip install -e .\n```\n\n### Model Downloading\n| Model    | Download Link                                                  | Note |\n|-----------|----------------------------------------------------------------|------|\n| MOVA-360p | 🤗 [Huggingface](https://huggingface.co/OpenMOSS-Team/MOVA-360p) | Supports TI2VA |\n| MOVA-720p | 🤗 [Huggingface](https://huggingface.co/OpenMOSS-Team/MOVA-720p) | Supports TI2VA |\n\n```\nhf download OpenMOSS-Team/MOVA-360p --local-dir /path/to/MOVA-360p\nhf download OpenMOSS-Team/MOVA-720p --local-dir /path/to/MOVA-720p\n```\n\n### Inference\nGenerate a video of single-person speech:\n```\nexport CP_SIZE=1\nexport CKPT_PATH=/path/to/MOVA-360p/\n\ntorchrun \\\n    --nproc_per_node=$CP_SIZE \\\n 
   scripts/inference_single.py \\\n    --ckpt_path $CKPT_PATH \\\n    --cp_size $CP_SIZE \\\n    --height 352 \\\n    --width 640 \\\n    --prompt \"A man in a blue blazer and glasses speaks in a formal indoor setting, framed by wooden furniture and a filled bookshelf. Quiet room acoustics underscore his measured tone as he delivers his remarks. At one point, he says, \\\"I would also say that this election in Germany wasn’t surprising.\\\"\" \\\n    --ref_path \"./assets/single_person.jpg\" \\\n    --output_path \"./data/samples/single_person.mp4\" \\\n    --seed 42 \\\n    --offload cpu\n```\n\nGenerate a video of multi-person speech:\n```\nexport CP_SIZE=1\nexport CKPT_PATH=/path/to/MOVA-360p/\n\ntorchrun \\\n    --nproc_per_node=$CP_SIZE \\\n    scripts/inference_single.py \\\n    --ckpt_path $CKPT_PATH \\\n    --cp_size $CP_SIZE \\\n    --height 352 \\\n    --width 640 \\\n    --prompt \"The scene shows a man and a child walking together through a park, surrounded by open greenery and a calm, everyday atmosphere. As they stroll side by side, the man turns his head toward the child and asks with mild curiosity, in English, \\\"What do you want to do when you grow up?\\\" The boy answers with clear confidence, saying, \\\"A bond trader. That's what Don does, and he took me to his office.\\\" The man lets out a soft chuckle, then responds warmly, \\\"It's a good profession.\\\" as their walk continues at an unhurried pace, the conversation settling into a quiet, reflective moment.\" \\\n    --ref_path \"./assets/multi_person.png\" \\\n    --output_path \"./data/samples/multi_person.mp4\" \\\n    --seed 42 \\\n    --offload cpu\n```\nPlease refer to the [**inference script**](./scripts/inference_single.py) for more argument usage.\n\n#### Key optional arguments (`scripts/inference_single.py`)\n`--offload cpu`: component-wise CPU offload to reduce **VRAM**, typically slower and uses more **Host RAM**.  
\n`--offload group`: finer-grained layerwise/group offload, often achieves lower **VRAM** but is usually slower and increases **Host RAM** pressure (see the benchmark table below).  \n`--remove_video_dit`: frees the stage-1 `video_dit` reference after switching to the low-noise `video_dit_2`, which can free ~28GB of **Host RAM** when offload is enabled.\n\n### Inference Performance Reference\nWe provide inference benchmarks for generating an **8-second 360p** video under different offloading strategies. Note that actual performance may vary depending on hardware configurations, driver versions, and PyTorch/CUDA builds.\n\n| Offload Strategy | VRAM (GB) | Host RAM (GB) | Hardware    | Step Time (s) |\n|---------------------------|----------|-------------|-------------|--------------|\n| Component-wise offload    | 48       | 66.7        | RTX 4090    | 37.5         |\n| Component-wise offload    | 48       | 66.7        | H100        | 9.0         |\n| Layerwise (group offload) | 12       | 76.7        | RTX 4090    | 42.3         |\n| Layerwise (group offload) | 12       | 76.7        | H100        | 22.8         |\n\n### Ascend NPU support\n\nWe also support **NPU**s. For more details about NPU training/inference, please refer to **[this document](https://github.com/OpenMOSS/MOVA/blob/feat/npu/ASCEND_SUPPORTS.md)**.\n\n## Evaluation\nWe evaluate our model through both objective benchmarks and subjective human evaluations.\n\n### Evaluation on Verse-Bench\n\nWe provide a quantitative comparison of audiovisual generation performance on Verse-Bench. The Audio and AV-Align metrics are evaluated on all subsets; the Lip Sync and Speech metrics are evaluated on Verse-Bench Set3; and ASR Acc is evaluated on a multi-speaker subset proposed by our team. Boldface and underlined numbers indicate the best and second-best results, respectively.\n\nIn the lip-sync task, which shows the largest performance gap, MOVA demonstrates a clear advantage. According to the Lip Sync Error metric, with Dual CFG enabled, MOVA-720p achieves an LSE-D score of 7.094 and an LSE-C score of 7.452. Furthermore, MOVA also attains the best performance on the cpCER metric, which reflects speech recognition accuracy and speaker-switching accuracy.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./assets/verse_bench.jpg\" alt=\"verse-bench\" width=\"100%\"/\u003e\n\u003c/p\u003e\n\n\n### Human Evaluation\nBelow are the Elo scores and win rates comparing MOVA to existing open-source models.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./assets/elo.png\" alt=\"Elo scores comparison\" width=\"60%\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"./assets/winrate.png\" alt=\"Win rate comparison\" width=\"100%\"/\u003e\n\u003c/p\u003e\n\n## SGLang Integration\nMOVA can also be run through SGLang for multi-GPU parallel inference:\n```\nsglang generate \\\n  --model-path OpenMOSS-Team/MOVA-720p \\\n  --prompt \"A man in a blue blazer and glasses speaks in a formal indoor setting, \\\n  framed by wooden furniture and a filled bookshelf. \\\n  Quiet room acoustics underscore his measured tone as he delivers his remarks. 
\\\n  At one point, he says, \\\"I would also say that this election in Germany wasn’t surprising.\\\"\" \\\n  --image-path \"https://github.com/OpenMOSS/MOVA/raw/main/assets/single_person.jpg\" \\\n  --adjust-frames false \\\n  --num-gpus 8 \\\n  --ring-degree 2 \\\n  --ulysses-degree 4 \\\n  --num-frames 193 \\\n  --fps 24 \\\n  --seed 67 \\\n  --num-inference-steps 25 \\\n  --enable-torch-compile \\\n  --save-output\n```\n\n## Training\n### LoRA Fine-tuning\nThe following commands show how to launch LoRA training in different modes; for detailed memory and performance numbers, see the **LoRA Resource \u0026 Performance Reference** section below.\n\n#### Training Preparation\n\n- **Model checkpoints**: Download MOVA weights to your local path and update the `diffusion_pipeline` section of the corresponding config.\n- **Dataset**: Configure your video+audio dataset and transforms in the `data` section of the corresponding config (e.g., `mova_train_low_resource.py`); see `mova/datasets/video_audio_dataset.py` for the expected fields.\n- **Environment**: Use the same environment as inference, then install training-only extras: `pip install -e \".[train]\"` (includes `torchcodec` and `bitsandbytes`).\n- **Configs**: Choose one of the training configs below and edit LoRA, optimizer, and scheduler settings as needed.\n\n#### Low-resource LoRA (single GPU, most memory-efficient)\n\n- **Config**: `configs/training/mova_train_low_resource.py`\n- **Script**:\n\n```bash\nbash scripts/training_scripts/example/low_resource_train.sh\n```\n\n#### Accelerate LoRA (1 GPU)\n\n- **Config**: `configs/training/mova_train_accelerate.py`\n- **Script**:\n\n```bash\nbash scripts/training_scripts/example/accelerate_train.sh\n```\n\n#### Accelerate + FSDP LoRA (8 GPUs)\n\n- **Config**: `configs/training/mova_train_accelerate_8gpu.py`\n- **Accelerate config**: `configs/training/accelerate/fsdp_8gpu.yaml`\n- **Script**:\n\n```bash\nbash 
scripts/training_scripts/example/accelerate_train_8gpu.sh\n```\n\nAll hyper-parameters (LoRA rank/alpha, target modules, optimizer, offload strategy, etc.) are defined in the corresponding config files; the example scripts only take the config path as input.\n\n### LoRA Resource \u0026 Performance Reference\n\nAll peak usage numbers below are measured on **360p, 8-second** video training settings and will vary with resolution, duration, and batch size.\n\n| Mode | VRAM (GB/GPU) | Host RAM (GB) | Hardware    | Step Time (s) |\n|--------------------------------------|-------------|-------------|-------------|-------------|\n| Low-resource LoRA (single GPU)       | ≈18         | ≈80         |  RTX 4090   | 600         |\n| Accelerate LoRA (1 GPU)              | ≈100        | ≥128        |  H100       |  N/A        |\n| Accelerate + FSDP LoRA (8 GPUs)      | ≈50         | ≥128        |  H100       | 22.2        |\n\n\u003e **Note**: Training 8-second 360p videos on RTX 4090 is **not recommended** due to high resource requirements and slow training speed. We strongly suggest reducing the video resolution (e.g., to 240p) or the total frame count to accelerate training and reduce resource consumption.\n\n\n## 📑TODO List\n- [x] Checkpoints\n- [x] Multi-GPU inference\n- [x] LoRA fine-tuning\n- [x] Ascend NPU fine-tuning\n- [x] Ascend NPU inference\n- [x] SGLang Integration\n- [ ] Technical Report\n- [ ] Generation Workflow\n- [ ] Diffusers Integration\n\n## Acknowledgement\nWe would like to thank the contributors to [Wan](https://github.com/Wan-Video/Wan2.2), [SGLang](https://github.com/sgl-project/sglang), [diffusers](https://huggingface.co/docs/diffusers/en/index), [HuggingFace](https://huggingface.co/), [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio), and [HunyuanVideo-Foley](https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley) for their great open-source work, which has been helpful to this project.\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=OpenMOSS/MOVA\u0026type=date\u0026legend=top-left)](https://www.star-history.com/#OpenMOSS/MOVA\u0026type=date\u0026legend=top-left)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenmoss%2Fmova","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenmoss%2Fmova","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenmoss%2Fmova/lists"}