https://github.com/modelscope/lite-sora
An initiative to replicate Sora
- Host: GitHub
- URL: https://github.com/modelscope/lite-sora
- Owner: modelscope
- License: apache-2.0
- Created: 2024-03-07T07:15:12.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-10T08:35:58.000Z (over 1 year ago)
- Last Synced: 2024-04-11T08:41:15.083Z (over 1 year ago)
- Language: Python
- Size: 9.62 MB
- Stars: 70
- Watchers: 2
- Forks: 4
- Open Issues: 2
Metadata Files:
- Readme: README-ZH.md
- License: LICENSE
# Lite-Sora
## Introduction
The lite-sora project is an initiative to reproduce the technology behind Sora. Jointly launched by East China Normal University and the ModelScope community, it explores a minimal, clean implementation of the video generation techniques underlying Sora. We hope to provide concise, readable code that makes it easy for everyone to improve the experiments together and keep pushing the upper bound of open-source video generation technology.
## Roadmap
* [x] Build the base architecture
* [ ] Models
  * [x] Text Encoder (based on the [Text Encoder](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/text_encoder_2/model.safetensors) from Stable Diffusion XL)
  * [x] VideoDiT (based on [Facebook DiT](https://github.com/facebookresearch/DiT))
  * [ ] VideoVAE
  * [x] Scheduler (based on [DDIM](https://arxiv.org/abs/2010.02502))
  * [x] Trainer (based on [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/))
* [x] Validate on a small-scale dataset
  * [x] [Pixabay100](https://github.com/ECNU-CILAB/Pixabay100)
* [ ] Train the Video Encoder & Decoder on large-scale datasets
* [ ] Train VideoDiT on large-scale datasets
## Usage
### Python Environment Setup
```
conda env create -f environment.yml
conda activate litesora
```
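Before downloading any models, it may be worth confirming that the environment exposes a CUDA-capable PyTorch, since the training and inference steps below assume a GPU. A generic sanity check, not part of this repo:
```python
# Generic sanity check (not from this repo): confirm that the conda
# environment provides a CUDA-enabled PyTorch before starting GPU work.
import torch
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```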
### Download Models
* `models/text_encoder/model.safetensors`: the Text Encoder from Stable Diffusion XL. [Download link](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/text_encoder_2/model.safetensors)
* `models/denoising_model/model.safetensors`: a model we trained on the [Pixabay100](https://github.com/ECNU-CILAB/Pixabay100) dataset at a resolution of 64x64. It demonstrates that our training code fits the training data correctly. **This model is clearly overfitted: with so little training data it has no ability to generalize, and it serves only to verify the correctness of the training algorithm.** [Download link](https://huggingface.co/ECNU-CILab/lite-sora-v1-pixabay100/resolve/main/denoising_model/model.safetensors)
* `models/vae/model.safetensors`: the VAE from Stable Video Diffusion. [Download link](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/resolve/main/vae/diffusion_pytorch_model.fp16.safetensors)
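These files can be fetched by hand from the links above, or programmatically. Below is a minimal download sketch using the `huggingface_hub` package (an assumption: it is not listed as a project dependency); the repo IDs and filenames are taken directly from the download links above.
```python
# Download sketch using huggingface_hub (assumed installed, e.g. via
# `pip install huggingface_hub`); repo IDs/filenames match the links above.
import os, shutil
from huggingface_hub import hf_hub_download

targets = [
    ("stabilityai/stable-diffusion-xl-base-1.0",
     "text_encoder_2/model.safetensors", "models/text_encoder/model.safetensors"),
    ("ECNU-CILab/lite-sora-v1-pixabay100",
     "denoising_model/model.safetensors", "models/denoising_model/model.safetensors"),
    ("stabilityai/stable-video-diffusion-img2vid-xt",
     "vae/diffusion_pytorch_model.fp16.safetensors", "models/vae/model.safetensors"),
]
for repo_id, filename, dst in targets:
    cached = hf_hub_download(repo_id=repo_id, filename=filename)  # lands in the HF cache
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy(cached, dst)  # place it where the scripts below expect it
```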
### Training
```python
from litesora.data import TextVideoDataset
from litesora.models import SDXLTextEncoder2
from litesora.trainers.v1 import LightningVideoDiT
import lightning as pl
import torch
if __name__ == '__main__':
    # dataset and data loader
    dataset = TextVideoDataset("data/pixabay100", "data/pixabay100/metadata.json",
                               num_frames=64, height=64, width=64)
    train_loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=1, num_workers=8)

    # model
    model = LightningVideoDiT(learning_rate=1e-5)
    model.text_encoder.load_state_dict_from_diffusers("models/text_encoder/model.safetensors")

    # train
    trainer = pl.Trainer(max_epochs=100000, accelerator="gpu", devices="auto", callbacks=[
        pl.pytorch.callbacks.ModelCheckpoint(save_top_k=-1)
    ])
    trainer.fit(model=model, train_dataloaders=train_loader)
```
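Because `ModelCheckpoint(save_top_k=-1)` keeps every checkpoint, an interrupted run can be resumed with Lightning's standard `ckpt_path` argument. A minimal sketch; the checkpoint path below is hypothetical and depends on your run:
```python
# Resume from a saved checkpoint (standard Lightning feature). By default
# Lightning writes checkpoints under lightning_logs/version_*/checkpoints/;
# the exact filename below is hypothetical.
trainer.fit(model=model, train_dataloaders=train_loader,
            ckpt_path="lightning_logs/version_0/checkpoints/epoch=99-step=9900.ckpt")
```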
Once the training job has started, you can launch `tensorboard` to monitor training progress:
```
tensorboard --logdir .
```
### Inference
* Generate a video in pixel space.
```python
from litesora.models import SDXLTextEncoder2, VideoDiT
from litesora.pipelines import PixelVideoDiTPipeline
from litesora.data import save_video
import torch
# models
text_encoder = SDXLTextEncoder2.from_diffusers("models/text_encoder/model.safetensors")
denoising_model = VideoDiT.from_pretrained("models/denoising_model/model.safetensors")
# pipeline
pipe = PixelVideoDiTPipeline(torch_dtype=torch.float16, device="cuda")
pipe.fetch_models(text_encoder, denoising_model)
# generate a video
prompt = "woman, flowers, plants, field, garden"
video = pipe(prompt=prompt, num_inference_steps=100)
# save the video (the resolution is 64*64, we enlarge it to 512*512 here)
save_video(video, "output.mp4", upscale=8)
```
* Encode a video from pixel space into latent space, then decode it back.
```python
from litesora.models import SDVAEEncoder, SVDVAEDecoder
from litesora.data import load_video, tensor2video, concat_video, save_video
import torch
from tqdm import tqdm
frames = load_video("data/pixabay100/videos/168572 (Original).mp4",
                    num_frames=1024, height=1024, width=1024, random_crop=False)
frames = frames.to(dtype=torch.float16, device="cpu")

encoder = SDVAEEncoder.from_diffusers("models/vae/model.safetensors").to(dtype=torch.float16, device="cuda")
decoder = SVDVAEDecoder.from_diffusers("models/vae/model.safetensors").to(dtype=torch.float16, device="cuda")

with torch.no_grad():
    print(frames.shape)
    latents = encoder.encode_video(frames, progress_bar=tqdm)
    print(latents.shape)
    decoded_frames = decoder.decode_video(latents, progress_bar=tqdm)

video = tensor2video(concat_video([frames, decoded_frames]))
save_video(video, "video.mp4", fps=24)
```
### Current Results
The model showcased below was trained on the [Pixabay100](https://github.com/ECNU-CILAB/Pixabay100) dataset at a resolution of 64x64; it demonstrates that our training code fits the training data correctly. **This model is clearly overfitted: with so little training data it has no ability to generalize, and it serves only to verify the correctness of the training algorithm.** [Download link](https://huggingface.co/ECNU-CILab/lite-sora-v1-pixabay100/resolve/main/denoising_model/model.safetensors)
The prompts for the twelve sample clips are listed below; the corresponding videos are embedded in the original README and are omitted here.

|airport, people, crowd, busy|beach, ocean, waves, water, sand|bee, honey, insect, beehive, nature|coffee, beans, caffeine, coffee shop|
|-|-|-|-|
|fish, underwater, aquarium, swim|forest, woods, mystical, morning|ocean, beach, sunset, sea, atmosphere|hair, wind, girl, woman, people|
|reeds, grass, wind, golden, sunshine|sea, ocean, seagulls, birds, sunset|woman, flowers, plants, field, garden|wood, anemones, wildflower, flower|
We use the VAE from [Stable-Video-Diffusion](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) to encode and decode videos. Our code supports very long, high-resolution videos!
https://github.com/modelscope/lite-sora/assets/35051019/dc205719-d0bc-4bca-b117-ff5aa19ebd86