{"id":13456514,"url":"https://github.com/Alpha-VLLM/Lumina-T2X","last_synced_at":"2025-03-24T10:32:51.159Z","repository":{"id":230284596,"uuid":"778819999","full_name":"Alpha-VLLM/Lumina-T2X","owner":"Alpha-VLLM","description":"Lumina-T2X is a unified framework for Text to Any Modality Generation","archived":false,"fork":false,"pushed_at":"2024-08-06T02:20:09.000Z","size":60826,"stargazers_count":2062,"open_issues_count":52,"forks_count":87,"subscribers_count":30,"default_branch":"main","last_synced_at":"2024-10-29T15:39:32.282Z","etag":null,"topics":["aigc","diffusion","diffusion-model","diffusion-models","diffusion-transformer","generation-models","transformer","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Alpha-VLLM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-28T13:23:28.000Z","updated_at":"2024-10-29T14:06:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"8b4bf026-4824-435a-8d50-3273f72ed0b1","html_url":"https://github.com/Alpha-VLLM/Lumina-T2X","commit_stats":null,"previous_names":["alpha-vllm/lumina-t2x"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alpha-VLLM%2FLumina-T2X","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alpha-VLLM%2FLumina-T2X/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alpha-VLLM%2FLumina-T2X/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/Alpha-VLLM%2FLumina-T2X/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Alpha-VLLM","download_url":"https://codeload.github.com/Alpha-VLLM/Lumina-T2X/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245252529,"owners_count":20585085,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aigc","diffusion","diffusion-model","diffusion-models","diffusion-transformer","generation-models","transformer","transformers"],"created_at":"2024-07-31T08:01:23.343Z","updated_at":"2025-03-24T10:32:51.152Z","avatar_url":"https://github.com/Alpha-VLLM.png","language":"Python","funding_links":[],"categories":["Python","Project List","Repos"],"sub_categories":["\u003cspan id=\"tool\"\u003eLLM (LLM \u0026 Tool)\u003c/span\u003e"],"readme":"\u003c!-- \u003cp align=\"center\"\u003e\n \u003cimg src=\"./assets/lumina-logo.png\" width=\"40%\"/\u003e\n \u003cbr\u003e\n\u003c/p\u003e --\u003e\n\n# $\\textbf{Lumina-T2X}$: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers \n\n### \u003cdiv align=\"center\"\u003e ICLR 2025 Spotlight \u0026 NeurIPS 2024 \u003cdiv\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n\u003c!--[![GitHub repo contributors](https://img.shields.io/github/contributors-anon/Alpha-VLLM/Lumina-T2X?style=flat\u0026label=Contributors)](https://github.com/Alpha-VLLM/Lumina-T2X/graphs/contributors)--\u003e\n\n\u003c!--[![GitHub 
Commit](https://img.shields.io/github/commit-activity/m/Alpha-VLLM/Lumina-T2X?label=Commit)](https://github.com/Alpha-VLLM/Lumina-T2X/commits/main/)--\u003e\n\n\u003c!--[![Pr](https://img.shields.io/github/issues-pr-closed-raw/Alpha-VLLM/Lumina-T2X.svg?label=Merged+PRs\u0026color=green)](https://github.com/Alpha-VLLM/Lumina-T2X/pulls) \u003cbr\u003e--\u003e\n\n\u003c!--[![GitHub repo stars](https://img.shields.io/github/stars/Alpha-VLLM/Lumina-T2X?style=flat\u0026logo=github\u0026logoColor=whitesmoke\u0026label=Stars)](https://github.com/Alpha-VLLM/Lumina-T2X/stargazers) --\u003e\n\n\u003c!--[![GitHub repo watchers](https://img.shields.io/github/watchers/Alpha-VLLM/Lumina-T2X?style=flat\u0026logo=github\u0026logoColor=whitesmoke\u0026label=Watchers)](https://github.com/Alpha-VLLM/Lumina-T2X/watchers) --\u003e\n\n\u003c!--[![GitHub repo size](https://img.shields.io/github/repo-size/Alpha-VLLM/Lumina-T2X?style=flat\u0026logo=github\u0026logoColor=whitesmoke\u0026label=Repo%20Size)](https://github.com/Alpha-VLLM/Lumina-T2X/archive/refs/heads/main.zip) 
--\u003e\n\n[![Lumina-Next](https://img.shields.io/badge/Paper-Lumina--Next-2b9348.svg?logo=arXiv)](https://arxiv.org/abs/2406.18583)\u0026#160;\n[![Lumina-T2X](https://img.shields.io/badge/Paper-Lumina--T2X-2b9348.svg?logo=arXiv)](https://arxiv.org/abs/2405.05945)\u0026#160;\n[![Lumina-mGPT](https://img.shields.io/badge/Paper-Lumina--mGPT-2b9348.svg?logo=arXiv)](https://arxiv.org/abs/2408.02657)\u0026#160;\n\n[![Badge](https://img.shields.io/badge/-WeChat@Group-000000?logo=wechat\u0026logoColor=07C160)](http://imagebind-llm.opengvlab.com/qrcode/)\u0026#160;\n[![weixin](https://img.shields.io/badge/-WeChat@机器之心-000000?logo=wechat\u0026logoColor=07C160)](https://mp.weixin.qq.com/s/NwwbaeRujh-02V6LRs5zMg)\u0026#160;\n[![zhihu](https://img.shields.io/badge/-知乎-000000?logo=zhihu\u0026logoColor=0084FF)](https://www.zhihu.com/org/opengvlab)\u0026#160;\n[![zhihu](https://img.shields.io/badge/-Twitter@OpenGVLab-black?logo=twitter\u0026logoColor=1D9BF0)](https://twitter.com/opengvlab/status/1788949243383910804)\u0026#160;\n![Static Badge](https://img.shields.io/badge/-MIT-MIT?logoColor=%231082c3\u0026label=Code%20License\u0026link=https%3A%2F%2Fgithub.com%2FAlpha-VLLM%2FLumina-T2X%2Fblob%2Fmain%2FLICENSE)\n\n[![Static Badge](https://img.shields.io/badge/Video%20Introduction%20of%20Lumina--Next-red?logo=youtube)](https://www.youtube.com/watch?v=K0-AJa33Rw4)\n[![Static Badge](https://img.shields.io/badge/Video%20Introduction%20of%20Lumina--T2X-pink?logo=youtube)](https://www.youtube.com/watch?v=KFtHmS5eUCM)\n\n[![Static Badge](https://img.shields.io/badge/Official(node1)-6B88E3?logo=youtubegaming\u0026label=Demo%20Lumina-Next-SFT)](http://106.14.2.150:10020/)\u0026#160;\n[![Static Badge](https://img.shields.io/badge/Official(node2)-6B88E3?logo=youtubegaming\u0026label=Demo%20Lumina-Next-SFT)](http://106.14.2.150:10021/)\u0026#160;\n[![Static 
Badge](https://img.shields.io/badge/Official(node3)-6B88E3?logo=youtubegaming\u0026label=Demo%20Lumina-Next-SFT)](http://106.14.2.150:10022/)\u0026#160;\n[![Static Badge](https://img.shields.io/badge/Official(compositional)-6B88E3?logo=youtubegaming\u0026label=Demo%20Lumina-Next-T2I)](http://106.14.2.150:10023/)\u0026#160;\n[![Static Badge](https://img.shields.io/badge/Official(node1)-violet?logo=youtubegaming\u0026label=Demo%20Lumina-Text2Music)](http://139.196.83.164:8000/)\u0026#160;\n[![Static Badge](https://img.shields.io/badge/Lumina--Next--SFT-HF_Space-yellow?logoColor=violet\u0026label=%F0%9F%A4%97%20Demo%20Lumina-Next-SFT)](https://huggingface.co/spaces/Alpha-VLLM/Lumina-Next-T2I)\n\n[![Static Badge](https://img.shields.io/badge/Lumina--Next--SFT%20checkpoints-Model(2B)-purple?logoColor=#571482\u0026label=%F0%9F%A4%97%20Lumina-Next-SFT%20checkpoints)](https://wisemodel.cn/models/Alpha-VLLM/Lumina-Next-SFT)\n[![Static Badge](https://img.shields.io/badge/Lumina--Next--T2I%20checkpoints-Model(2B)-purple?logoColor=#571482\u0026label=%F0%9F%A4%97%20Lumina-Next-SFT%20checkpoints)](https://wisemodel.cn/models/Alpha-VLLM/Lumina-Next-T2I)\n\n[![Static Badge](https://img.shields.io/badge/Lumina--Next--SFT%20checkpoints-Model(2B)-yellow?logoColor=violet\u0026label=%F0%9F%A4%97%20Lumina-Next-Diffusers%20checkpoints)](https://huggingface.co/Alpha-VLLM/Lumina-Next-SFT-diffusers)\n[![Static Badge](https://img.shields.io/badge/Lumina--Next--SFT%20checkpoints-Model(2B)-yellow?logoColor=violet\u0026label=%F0%9F%A4%97%20Lumina-Next-SFT%20checkpoints)](https://huggingface.co/Alpha-VLLM/Lumina-Next-SFT)\n[![Static Badge](https://img.shields.io/badge/Lumina--Next--T2I%20checkpoints-Model(2B)-yellow?logoColor=violet\u0026label=%F0%9F%A4%97%20Lumina-Next-T2I%20checkpoints)](https://huggingface.co/Alpha-VLLM/Lumina-Next-T2I)\n[![Static 
Badge](https://img.shields.io/badge/Lumina--T2I%20checkpoints-Model(5B)-yellow?logoColor=violet\u0026label=%F0%9F%A4%97%20Lumina-T2I%20checkpoints)](https://huggingface.co/Alpha-VLLM/Lumina-T2I)\n\n\u003c!-- [![GitHub issues](https://img.shields.io/github/issues/Alpha-VLLM/Lumina-T2X?color=critical\u0026label=Issues)]() --\u003e\n\n\u003c!-- [![GitHub closed issues](https://img.shields.io/github/issues-closed/Alpha-VLLM/Lumina-T2X?color=success\u0026label=Issues)]() \u003cbr\u003e --\u003e\n\n\u003c!-- [![GitHub repo forks](https://img.shields.io/github/forks/Alpha-VLLM/Lumina-T2X?style=flat\u0026logo=github\u0026logoColor=whitesmoke\u0026label=Forks)](https://github.com/Alpha-VLLM/Lumina-T2X/network)  --\u003e\n\n\u003c!--\n[[📄 Lumina-T2X arXiv](https://arxiv.org/abs/2405.05945)]\n[[📽️ Video Introduction of Lumina-T2X](https://www.youtube.com/watch?v=KFtHmS5eUCM)]\n[👋 join our \u003ca href=\"http://imagebind-llm.opengvlab.com/qrcode/\" target=\"_blank\"\u003eWeChat\u003c/a\u003e]\n\n--\u003e\n\n\u003c!-- [[📺 Website](https://lumina-t2-x-web.vercel.app/)] --\u003e\n\n\u003c/div\u003e\n\n![intro_large](https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/9f52eabb-07dc-4881-8257-6d8a5f2a0a5a)\n\n\u003c!-- [[中文版本]](./README_cn.md) --\u003e\n\n## 📰 News\n\n- **[2024-08-06] 🎉🎉🎉 We have released [Lumina-mGPT](https://arxiv.org/abs/2408.02657), the next-generation of generative models in our Lumina family! Lumina-mGPT is an autoregressive transformer capable of photorealistic image generation and other vision-language tasks, e.g., controllable generation, multi-turn dialog, depth/normal/segmentation map estimation.**\n- **[2024-07-08] 🎉🎉🎉 Lumina-Next is now supported in the [diffusers](https://github.com/huggingface/diffusers)! Thanks to [@yiyixuxu](https://github.com/yiyixuxu) and [@sayakpaul](https://github.com/sayakpaul)! 
[HF Model Repo](https://huggingface.co/Alpha-VLLM/Lumina-Next-SFT-diffusers).**\n- [2024-06-26] We have released the inference code for img2img translation using `Lumina-Next-T2I`. [CODE](https://github.com/Alpha-VLLM/Lumina-T2X/tree/main/lumina_next_t2i_mini/scripts/sample_img2img.sh) [ComfyUI](https://github.com/kijai/ComfyUI-LuminaWrapper)\n- [2024-06-21] 🥰🥰🥰 Lumina-Next has a Jupyter notebook for inference, thanks to [camenduru](https://github.com/camenduru)! [LINK](https://github.com/camenduru/Lumina-Next-jupyter)\n- [2024-06-21] We have uploaded the `Lumina-Next-SFT` and `Lumina-Next-T2I` to [wisemodel.cn](https://wisemodel.cn/models). [wisemodel repo](https://wisemodel.cn/models/Alpha-VLLM/Lumina-Next-SFT)\n- [2024-06-19] We have released the `Lumina-T2Audio` (Text-to-Audio) code and model for audio generation. [MODEL](https://huggingface.co/Alpha-VLLM/Lumina-T2Audio)\n- [2024-06-17] 🚀🚀🚀 We now support both inference and training (including Dreambooth) of SD3, implemented in our Lumina framework! [CODE](https://github.com/Alpha-VLLM/Lumina-T2X/tree/main/lumina_next_t2i_mini)\n- **[2024-06-17] 🥰🥰🥰 Lumina-Next supports ComfyUI now, thanks to [Kijai](https://github.com/kijai)! [LINK](https://github.com/kijai/ComfyUI-LuminaWrapper)**\n- **[2024-06-08] 🚀🚀🚀 We have released the `Lumina-Next-SFT` model, demonstrating better visual quality! [MODEL](https://huggingface.co/Alpha-VLLM/Lumina-Next-SFT)**\n- [2024-06-07] We have released the `Lumina-T2Music` (Text-to-Music) code and model for music generation. [MODEL](https://huggingface.co/Alpha-VLLM/Lumina-T2Music) [DEMO](http://139.196.83.164:8000/)\n- [2024-06-03] We have released the `Compositional Generation` version of `Lumina-Next-T2I`, which enables compositional generation with multiple captions for different regions. [model](https://huggingface.co/Alpha-VLLM/Lumina-Next-T2I). 
[DEMO](http://106.14.2.150:10023/)\n- [2024-05-29] We updated the `Lumina-Next-T2I` [Code](https://github.com/Alpha-VLLM/Lumina-T2X/tree/main/lumina_next_t2i) and [HF Model](https://huggingface.co/Alpha-VLLM/Lumina-Next-T2I), which now support 2K-resolution image generation and Time-aware Scaled RoPE.\n- [2024-05-25] We released training scripts for Flag-DiT and Next-DiT, and we have reported the comparison results between Next-DiT and Flag-DiT. [Comparison Results](https://github.com/Alpha-VLLM/Lumina-T2X/blob/main/Next-DiT-ImageNet/README.md#results)\n- [2024-05-21] Lumina-Next-T2I supports a higher-order solver. It can generate images in just 10 steps without any distillation. Try our [DEMO](http://106.14.2.150:10021/).\n- [2024-05-18] We released training scripts for Lumina-T2I 5B. [README](https://github.com/Alpha-VLLM/Lumina-T2X/tree/main/lumina_t2i#training)\n- [2024-05-16] ❗❗❗ We have converted the `.pth` weights to `.safetensors` weights. Please pull the latest code and use `demo.py` for inference.\n- [2024-05-14] Lumina-Next now supports simple **text-to-music** generation ([examples](#text-to-music-generation)), **high-resolution (1024*4096) Panorama** generation conditioned on text ([examples](#panorama-generation)), and **3D point cloud** generation conditioned on labels ([examples](#point-cloud-generation)).\n- [2024-05-13] We added [examples](#multilingual-generation) demonstrating that Lumina-T2X supports **multilingual prompts**, including prompts containing **emojis**.\n- **[2024-05-12] We are excited to release our `Lumina-Next-T2I` model ([checkpoint](https://huggingface.co/Alpha-VLLM/Lumina-Next-T2I)), which uses a 2B Next-DiT model as the backbone and Gemma-2B as the text encoder. Try it out at [demo1](http://106.14.2.150:10020/) \u0026 [demo2](http://106.14.2.150:10021/) \u0026 [demo3](http://106.14.2.150:10022/). 
Please refer to the paper [Lumina-Next](assets/lumina-next.pdf) for more details.**\n- [2024-05-10] We released the technical report on [arXiv](https://arxiv.org/abs/2405.05945).\n- [2024-05-09] We released `Lumina-T2A` (Text-to-Audio) Demos. [Examples](#text-to-audio-generation)\n- [2024-04-29] We released the 5B model [checkpoint](https://huggingface.co/Alpha-VLLM/Lumina-T2I) and demo built upon it for text-to-image generation.\n- [2024-04-25] Support 720P video generation with arbitrary aspect ratio. [Examples](#text-to-video-generation)\n- [2024-04-19] Demo examples released.\n- [2024-04-05] Code released for `Lumina-T2I`.\n- [2024-04-01] We released the initial version of `Lumina-T2I` for text-to-image generation.\n\n## 🚀 Quick Start\n\n\u003e [!Warning]\n\u003e **Since we are updating the code frequently, please pull the latest code:**\n\u003e\n\u003e ```bash\n\u003e git pull origin main\n\u003e ```\n\n### Fast Demo\n\nLumina-Next is supported in [diffusers](https://github.com/huggingface/diffusers).\n\n\u003e [!Note]\n\u003e Until a diffusers release includes Lumina-Next, install the development version of diffusers (`main` branch):\n\u003e ```bash\n\u003e pip install git+https://github.com/huggingface/diffusers\n\u003e ```\n\nThen try the code below:\n\n```python\nfrom diffusers import LuminaText2ImgPipeline\nimport torch\n\n# Hugging Face repo ID; a local path to the downloaded checkpoint also works.\npipeline = LuminaText2ImgPipeline.from_pretrained(\n    \"Alpha-VLLM/Lumina-Next-SFT-diffusers\", torch_dtype=torch.bfloat16\n).to(\"cuda\")\n\nimage = pipeline(prompt=\"Upper body of a young woman in a Victorian-era outfit with brass goggles and leather straps. 
Background shows an industrial revolution cityscape with smoky skies and tall, metal structures\", height=1024, width=768).images[0]\n```\n\nFor more details about training and inference of the Lumina framework, please refer to [Lumina-T2I](./lumina_t2i/README.md#Installation), [Lumina-Next-T2I](./lumina_next_t2i/README.md#Installation), and [Lumina-Next-T2I-Mini](./lumina_next_t2i_mini/README.md#Installation). We highly recommend using **[Lumina-Next-T2I-Mini](./lumina_next_t2i_mini/README.md#Installation)** for training and inference; it is an extremely simplified version of Lumina-Next-T2I with full functionality.\n\n### GUI Demo\n\nTo help you start using our models quickly, we have built several GUI demo sites.\n\n#### Lumina-Next-T2I model demo:\n\nImage Generation: [[node1](http://106.14.2.150:10020/)] [[node2](http://106.14.2.150:10021/)] [[node3](http://106.14.2.150:10022/)]\n\nImage Compositional Generation: [[node1](http://106.14.2.150:10023/)]\n\nMusic Generation: [[node1](http://139.196.83.164:8000)]\n\n\u003c!-- \u003e [!Warning] --\u003e\n\u003c!-- \u003e **Lumina-T2X employs FSDP for training large diffusion models. FSDP shards parameters, optimizer states, and gradients across GPUs. Thus, at least 8 GPUs are required for full fine-tuning of the Lumina-T2X 5B model. 
Parameter-efficient Finetuning of Lumina-T2X shall be released soon.** --\u003e\n\n### Installation\nTo use `Lumina-T2X` as a library, install it in your environment:\n\n```bash\npip install git+https://github.com/Alpha-VLLM/Lumina-T2X\n```\n\n### Development\nIf you want to contribute to the code, run the commands below to install the project in editable mode and set up the `pre-commit` hooks:\n\n```bash\ngit clone https://github.com/Alpha-VLLM/Lumina-T2X\n\ncd Lumina-T2X\npip install -e \".[dev]\"\npre-commit install\npre-commit\n```\n\n## 📑 Open-source Plan\n\n- [X] Lumina-Text2Image (Demos✅, Training✅, Inference✅, Checkpoints✅, Diffusers✅)\n- [ ] Lumina-Text2Video (Demos✅)\n- [X] Lumina-Text2Music (Demos✅, Inference✅, Checkpoints✅)\n- [X] Lumina-Text2Audio (Demos✅, Inference✅, Checkpoints✅)\n\n## 📜 Index of Content\n\n- [$\\\\textbf{Lumina-T2X}$: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers](#textbflumina-t2x-transforming-text-into-any-modality-resolution-and-duration-via-flow-based-large-diffusion-transformers)\n  - [📰 News](#-news)\n  - [🚀 Quick Start](#-quick-start)\n    - [GUI Demo](#gui-demo)\n      - [Lumina-Next-T2I model demo:](#lumina-next-t2i-model-demo)\n    - [Installation](#installation)\n    - [Development](#development)\n  - [📑 Open-source Plan](#-open-source-plan)\n  - [📜 Index of Content](#-index-of-content)\n  - [Introduction](#introduction)\n  - [📽️ Demo Examples](#️-demo-examples)\n    - [Demos of Lumina-Next-SFT](#demos-of-lumina-next-sft)\n    - [Demos of Lumina-T2I](#demos-of-lumina-t2i)\n      - [Panorama Generation](#panorama-generation)\n    - [Text-to-Video Generation](#text-to-video-generation)\n    - [Text-to-3D Generation](#text-to-3d-generation)\n      - [Point Cloud Generation](#point-cloud-generation)\n    - [Text-to-Audio Generation](#text-to-audio-generation)\n    - [Text-to-music Generation](#text-to-music-generation)\n    - [Multilingual Generation](#multilingual-generation)\n  - [⚙️ 
Diverse Configurations](#️-diverse-configurations)\n  - [Contributors](#contributors)\n  - [📄 Citation](#-citation)\n\n## Introduction\n\nWe introduce the $\\textbf{Lumina-T2X}$ family, a series of text-conditioned Diffusion Transformers (DiT) capable of transforming textual descriptions into vivid images, dynamic videos, detailed multi-view 3D images, and synthesized speech. At the core of Lumina-T2X lies the **Flow-based Large Diffusion Transformer (Flag-DiT)**, a robust engine that supports up to **7 billion parameters** and extends sequence lengths to **128,000** tokens. Drawing inspiration from Sora, Lumina-T2X integrates images, videos, multi-views of 3D objects, and speech spectrograms within a spatial-temporal latent token space, and can generate outputs at **any resolution, aspect ratio, and duration**.\n\n🌟 **Features**:\n\n- **Flow-based Large Diffusion Transformer (Flag-DiT)**: Lumina-T2X adopts the **flow matching** formulation and is equipped with many advanced techniques, such as RoPE, RMSNorm, and KQ-norm, **demonstrating faster training convergence, stable training dynamics, and a simplified pipeline**.\n- **Any Modalities, Resolution, and Duration within One Framework**:\n  1. $\\textbf{Lumina-T2X}$ can **encode any modality, including images, videos, multi-views of 3D objects, and spectrograms, into a unified 1-D token sequence at any resolution, aspect ratio, and temporal duration.**\n  2. By introducing the `[nextline]` and `[nextframe]` tokens, our model can **support resolution extrapolation**, i.e., generating images/videos with out-of-domain resolutions **not encountered during training**, such as images from 768x768 to 1792x1792 pixels.\n- **Low Training Resources**: Our empirical observations indicate that employing larger models,\n  high-resolution images, and longer-duration video clips can **significantly accelerate the convergence**\n  **speed** of diffusion transformers. 
Moreover, by employing meticulously curated text-image and text-video pairs featuring high aesthetic quality frames and detailed captions, our $\\textbf{Lumina-T2X}$ model learns to generate high-resolution images and coherent videos with minimal computational demands. Remarkably, the default Lumina-T2I configuration, equipped with a 5B Flag-DiT and a 7B LLaMA as the text encoder, **requires only 35% of the computational resources of PixArt-**$\\alpha$.\n\n![framework](https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/60d2f248-67b1-43ef-a530-c75530cf26c5)\n\n## 📽️ Demo Examples\n\n### Demos of Lumina-Next-SFT\n\n![github_banner](https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/926adf8c-3f34-44ed-8ff6-5eb650b9712c)\n\n### Demos of Visual Anagrams\n\n![](https://github.com/user-attachments/assets/7a200023-6e85-4209-96f1-49e0ddadf021)\n\n![](https://github.com/user-attachments/assets/8006da1f-18be-45a0-b292-e1f2ef1e029a)\n\n### Demos of Lumina-T2I\n\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/27bd36a8-8411-47dd-a3a7-3607c1d5d644\" width=\"90%\"/\u003e\n \u003cbr\u003e\n\u003c/p\u003e\n\n#### Panorama Generation\n\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/86041420/88b75b4e-5e16-4ea3-aba8-134904dd3381\" width=\"90%\"/\u003e\n \u003cbr\u003e\n\u003c/p\u003e\n\n### Text-to-Video Generation\n\n**720P Videos:**\n\n**Prompt:** The majestic beauty of a waterfall cascading down a cliff into a serene lake.\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/17187de8-7a07-49a8-92f9-fdb8e2f5e64c\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/0a20bb39-f6f7-430f-aaa0-7193a71b256a\n\n**Prompt:** A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. 
She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/7bf9ce7e-f454-4430-babe-b14264e0f194\n\n**360P Videos:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/d7fec32c-3655-4fd1-aa14-c0cb3ace3845\n\n### Text-to-3D Generation\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/cd061b8d-c47b-4c0c-b775-2cbaf8014be9\n\n#### Point Cloud Generation\n\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/86041420/742237ad-be47-4a7d-aa11-b3aaba07a75a\" width=\"90%\"/\u003e\n \u003cbr\u003e\n\u003c/p\u003e\n\n### Text-to-Audio Generation\n\n\u003e [!Note]\n\u003e **Attention: Mouse over the playbar and click the audio button on the playbar to unmute it.**\n\n\u003c!-- \u003e 🌟🌟🌟 **We recommend visiting the Lumina website to try it out! [🌟 visit](https://lumina-t2-x-web.vercel.app/docs/demos/demo-of-audio)** --\u003e\n\n**Prompt:** Semiautomatic gunfire occurs with slight echo\n\n**Generated Audio:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/25f2a6a8-0386-41e8-ab10-d1303554b944\n\n**Groundtruth:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/6722a68a-1a5a-4a44-ba9c-405372dc27ef\n\n**Prompt:** A telephone bell rings\n\n**Generated Audio:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/7467dd6d-b163-4436-ac5b-36662d1f9ddf\n\n**Groundtruth:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/703ea405-6eb4-4161-b5ff-51a93f81d013\n\n**Prompt:** An engine running followed by the engine revving and tires screeching\n\n**Generated Audio:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/5d9dd431-b8b4-41a0-9e78-bb0a234a30b9\n\n**Groundtruth:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/9ca4af9e-cee3-4596-b826-d6c25761c3c1\n\n**Prompt:** Birds 
chirping with insects buzzing and outdoor ambiance\n\n**Generated Audio:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/b776aacb-783b-4f47-bf74-89671a17d38d\n\n**Groundtruth:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/a11333e4-695e-4a8c-8ea1-ee5b83e34682\n\n### Text-to-music Generation\n\n\u003e [!Note]\n\u003e **Attention: Mouse over the playbar and click the audio button on the playbar to unmute it.**\n\u003e For more details check out [this](./lumina_music/README.md)\n\n**Prompt:** An electrifying ska tune with prominent saxophone riffs, energetic e-guitar and acoustic drums, lively percussion, soulful keys, groovy e-bass, and a fast tempo that exudes uplifting energy.\n\n**Generated Music:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/86041420/fef8f6b9-1e77-457e-bf4b-fb0cccefa0ec\n\n**Prompt:** A high-energy synth rock/pop song with fast-paced acoustic drums, a triumphant brass/string section, and a thrilling synth lead sound that creates an adventurous atmosphere.\n\n**Generated Music:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/86041420/1f796046-64ab-44ed-a4d8-0ebc0cfc484f\n\n**Prompt:** An uptempo electronic pop song that incorporates digital drums, digital bass and synthpad sounds.\n\n**Generated Music:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/86041420/4768415e-436a-4d0e-af53-bf7882cb94cd\n\n**Prompt:** A medium-tempo digital keyboard song with a jazzy backing track featuring digital drums, piano, e-bass, trumpet, and acoustic guitar.\n\n**Generated Music:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/86041420/8994a573-e776-488b-a86c-4398a4362398\n\n**Prompt:** This low-quality folk song features groovy wooden percussion, bass, piano, and flute melodies, as well as sustained strings and shimmering shakers that create a passionate, happy, and joyful atmosphere.\n\n**Generated Music:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/86041420/e0b5d197-589c-47d6-954b-b9c1d54feebb\n\n### 
Multilingual Generation\n\nWe present three multilingual capabilities of Lumina-Next-2B.\n\n**Generating Images conditioned on Chinese poems:**\n\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/86041420/9aa79d67-e304-4867-81f3-cfc934c625d9\" width=\"90%\"/\u003e\n \u003cbr\u003e\n\u003c/p\u003e\n\n**Generating Images with multilingual prompts:**\n\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/86041420/7c62bb94-42e4-4525-a298-9e25475b511d\" width=\"90%\"/\u003e\n \u003cbr\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/86041420/07fc8138-e67c-4c9f-bc01-e749a6507ada\" width=\"90%\"/\u003e\n \u003cbr\u003e\n\u003c/p\u003e\n\n**Generating Images with emojis:**\n\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/86041420/980b4999-9d1c-4fbd-a695-88b6b675f34b\" width=\"90%\"/\u003e\n \u003cbr\u003e\n\u003c/p\u003e\n\n\u003c!--\n**Prompt:** Water trickling rapidly and draining\n\n**Generated Audio:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/88fcf0e1-b71a-4e94-b9a6-138db6a670f0\n\n**Groundtruth:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/6fb9963f-46a5-4020-b160-f9a004528d7e\n\n**Prompt:** Thunderstorm sounds while raining\n\n**Generated Audio:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/fad8baf3-d80b-4915-ba31-aab13db5ce06\n\n**Groundtruth:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/c01a7e6e-3421-4a28-93c5-831523ec061d\n\n**Prompt:** Birds chirping repeatedly\n\n**Generated Audio:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/0fa673a3-f9de-487b-8812-1f96a335e913\n\n**Groundtruth:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/718289f9-a93e-4ea9-b7db-a14c2b209b28\n\n**Prompt:** Several large bells ring\n\n**Generated 
Audio:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/362fde84-e4ae-4152-aeb5-4355155c8719\n\n**Groundtruth:**\n\nhttps://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/da93e13d-6462-48d2-b6dc-af6ff0c4d07d\n\n--\u003e\n\n\u003c!-- For more audio demos visit [lumina website - audio demos](https://lumina-t2-x-web.vercel.app/docs/demos/demo-of-audio) --\u003e\n\n\u003c!-- ### More examples --\u003e\n\n\u003c!-- For more demos visit [this website](https://lumina-t2-x-web.vercel.app/docs/demos) --\u003e\n\n\u003c!-- ### High-res. Image Editing\n\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/55981976-c989-4f07-982a-1e567c7078ef\" width=\"90%\"/\u003e\n \u003cbr\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/a1ac7190-c49c-4d8b-965c-9ccf83a4f6a7\" width=\"90%\"/\u003e\n\u003c/p\u003e\n\n### Compositional Generation\n\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/8c8eb921-134c-4f55-918a-0ad07f9a47f4\" width=\"90%\"/\u003e\n \u003cbr\u003e\n\u003c/p\u003e\n\n### Resolution Extrapolation\n\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/e37e2db7-3ead-451e-ba18-b375eb773578\" width=\"90%\"/\u003e\n \u003cbr\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/9da47c34-5e09-48d3-9c48-78663fd01cc8\" width=\"100%\"/\u003e\n\u003c/p\u003e\n\n### Consistent-Style Generation\n\n\u003cp align=\"center\"\u003e\n \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/6403417a-42c6-4048-9419-375d211e14bb\" width=\"90%\"/\u003e\n \u003cbr\u003e\n\u003c/p\u003e --\u003e\n\n## ⚙️ Diverse Configurations\n\nWe support diverse configurations, including text encoders, DiTs of different parameter sizes, inference methods, and VAE encoders. Additionally, we offer features such as 1D-RoPE, image enhancement, and 
more.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/221de325-d9fb-4b7e-a97c-4b24cd2df0fc\" width=\"100%\"/\u003e\n \u003cbr\u003e\n\u003c/p\u003e\n\n## Contributors\n\nCore members for code development and maintenance:\n\nDongyang Liu, Le Zhuo, Junlin Xie, Ruoyi Du, Peng Gao\n\n\u003ca href=\"https://github.com/Alpha-VLLM/Lumina-T2X/graphs/contributors\"\u003e\n  \u003cimg src=\"https://contrib.rocks/image?repo=Alpha-VLLM/Lumina-T2X\" /\u003e\n\u003c/a\u003e\n\n## 📄 Citation\n\n```\n@article{gao2024lumina-next,\n  title={Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT},\n  author={Zhuo, Le and Du, Ruoyi and Han, Xiao and Li, Yangguang and Liu, Dongyang and Huang, Rongjie and Liu, Wenze and others},\n  journal={arXiv preprint arXiv:2406.18583},\n  year={2024}\n}\n```\n\n```\n@article{gao2024lumin-t2x,\n  title={Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers},\n  author={Gao, Peng and Zhuo, Le and Liu, Chris and Du, Ruoyi and Luo, Xu and Qiu, Longtian and Zhang, Yuhang and others},\n  journal={arXiv preprint arXiv:2405.05945},\n  year={2024}\n}\n\n```\n\n\u003c!--\n## Star History\n\n [![Star History Chart](https://api.star-history.com/svg?repos=Alpha-VLLM/Lumina-T2X\u0026type=Date)](https://star-history.com/#Alpha-VLLM/Lumina-T2X\u0026Date) --\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAlpha-VLLM%2FLumina-T2X","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAlpha-VLLM%2FLumina-T2X","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAlpha-VLLM%2FLumina-T2X/lists"}