{"id":13405503,"url":"https://github.com/Stability-AI/generative-models","last_synced_at":"2025-03-14T10:30:50.581Z","repository":{"id":176403772,"uuid":"656933936","full_name":"Stability-AI/generative-models","owner":"Stability-AI","description":"Generative Models by Stability AI","archived":false,"fork":false,"pushed_at":"2024-09-04T22:00:56.000Z","size":54884,"stargazers_count":24343,"open_issues_count":299,"forks_count":2709,"subscribers_count":258,"default_branch":"main","last_synced_at":"2024-10-14T10:41:25.561Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Stability-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-CODE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-22T00:36:35.000Z","updated_at":"2024-10-14T10:04:15.000Z","dependencies_parsed_at":"2024-09-20T19:01:09.927Z","dependency_job_id":"de65f41b-3d66-44b4-a805-1b386bd195c3","html_url":"https://github.com/Stability-AI/generative-models","commit_stats":null,"previous_names":["stability-ai/generative-models"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stability-AI%2Fgenerative-models","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stability-AI%2Fgenerative-models/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stability-AI%2Fgenerative-models/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Stability-AI%2Fgenerative-models/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Stability-AI","download_url":"https://codeload.github.com/Stability-AI/generative-models/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243561874,"owners_count":20311191,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-30T19:02:04.053Z","updated_at":"2025-03-14T10:30:50.574Z","avatar_url":"https://github.com/Stability-AI.png","language":"Python","funding_links":[],"categories":["Python","Accelerate","Table of Contents \u003c!-- omit in toc --\u003e","\u003cspan id=\"video\"\u003eVideo\u003c/span\u003e","🤖 AI \u0026 Machine Learning","HarmonyOS","其他_机器视觉","Repos","2. Open Foundation Models","Classic Papers","App","Research \u0026 Data Analysis","AI API"],"sub_categories":["Open-source Toolboxes and Foundation Models","\u003cspan id=\"tool\"\u003eLLM (LLM \u0026 Tool)\u003c/span\u003e","Windows Manager","网络服务_其他","World Models \u0026 Simulation"],"readme":"# Generative Models by Stability AI\n\n![sample1](assets/000.jpg)\n\n## News\n\n\n**July 24, 2024**\n- We are releasing **[Stable Video 4D (SV4D)](https://huggingface.co/stabilityai/sv4d)**, a video-to-4D diffusion model for novel-view video synthesis. For research purposes:\n    - **SV4D** was trained to generate 40 frames (5 video frames x 8 camera views) at 576x576 resolution, given 5 context frames (the input video), and 8 reference views (synthesised from the first frame of the input video, using a multi-view diffusion model like SV3D) of the same size, ideally white-background images with one object.\n    - To generate longer novel-view videos (21 frames), we propose a novel sampling method using SV4D, by first sampling 5 anchor frames and then densely sampling the remaining frames while maintaining temporal consistency.\n    - To run the community-build gradio demo locally, run `python -m scripts.demo.gradio_app_sv4d`.\n    - Please check our [project page](https://sv4d.github.io), [tech report](https://sv4d.github.io/static/sv4d_technical_report.pdf) and [video summary](https://www.youtube.com/watch?v=RBP8vdAWTgk) for more details.\n\n**QUICKSTART** : `python scripts/sampling/simple_video_sample_4d.py --input_path assets/sv4d_videos/test_video1.mp4 --output_folder outputs/sv4d` (after downloading [sv4d.safetensors](https://huggingface.co/stabilityai/sv4d) and [sv3d_u.safetensors](https://huggingface.co/stabilityai/sv3d) from HuggingFace into `checkpoints/`)\n\nTo run **SV4D** on a single input video of 21 frames:\n- Download SV3D models (`sv3d_u.safetensors` and `sv3d_p.safetensors`) from [here](https://huggingface.co/stabilityai/sv3d) and SV4D model (`sv4d.safetensors`) from [here](https://huggingface.co/stabilityai/sv4d) to `checkpoints/`\n- Run `python scripts/sampling/simple_video_sample_4d.py --input_path \u003cpath/to/video\u003e`\n    - `input_path` : The input video `\u003cpath/to/video\u003e` can be\n      - a single video file in `gif` or `mp4` format, such as `assets/sv4d_videos/test_video1.mp4`, or\n      - a folder containing images of video frames in `.jpg`, `.jpeg`, or `.png` format, or\n      - a file name pattern matching images of video frames.\n    - `num_steps` : default is 20, can increase to 50 for better quality but longer sampling time.\n    - `sv3d_version` : To specify the SV3D model to generate reference multi-views, set `--sv3d_version=sv3d_u` for SV3D_u or `--sv3d_version=sv3d_p` for SV3D_p.\n    - `elevations_deg` : To generate novel-view videos at a specified elevation (default elevation is 10) using SV3D_p (default is SV3D_u), run `python scripts/sampling/simple_video_sample_4d.py --input_path assets/sv4d_videos/test_video1.mp4 --sv3d_version sv3d_p --elevations_deg 30.0`\n    - **Background removal** : For input videos with plain background, (optionally) use [rembg](https://github.com/danielgatis/rembg) to remove background and crop video frames by setting `--remove_bg=True`. To obtain higher quality outputs on real-world input videos with noisy background, try segmenting the foreground object using [Clipdrop](https://clipdrop.co/) or [SAM2](https://github.com/facebookresearch/segment-anything-2) before running SV4D.\n    - **Low VRAM environment** : To run on GPUs with low VRAM, try setting `--encoding_t=1` (of frames encoded at a time) and `--decoding_t=1` (of frames decoded at a time) or lower video resolution like `--img_size=512`.\n\n  ![tile](assets/sv4d.gif)\n\n\n**March 18, 2024**\n- We are releasing **[SV3D](https://huggingface.co/stabilityai/sv3d)**, an image-to-video model for novel multi-view synthesis, for research purposes:\n    - **SV3D** was trained to generate 21 frames at resolution 576x576, given 1 context frame of the same size, ideally a white-background image with one object.\n    - **SV3D_u**: This variant generates orbital videos based on single image inputs without camera conditioning..\n    - **SV3D_p**: Extending the capability of **SVD3_u**, this variant accommodates both single images and orbital views allowing for the creation of 3D video along specified camera paths.\n    - We extend the streamlit demo `scripts/demo/video_sampling.py` and the standalone python script `scripts/sampling/simple_video_sample.py` for inference of both models.\n    - Please check our [project page](https://sv3d.github.io), [tech report](https://sv3d.github.io/static/paper.pdf) and [video summary](https://youtu.be/Zqw4-1LcfWg) for more details.\n\nTo run **SV3D_u** on a single image:\n- Download `sv3d_u.safetensors` from https://huggingface.co/stabilityai/sv3d to `checkpoints/sv3d_u.safetensors`\n- Run `python scripts/sampling/simple_video_sample.py --input_path \u003cpath/to/image.png\u003e --version sv3d_u`\n\nTo run **SV3D_p** on a single image:\n- Download `sv3d_p.safetensors` from https://huggingface.co/stabilityai/sv3d to `checkpoints/sv3d_p.safetensors`\n1. Generate static orbit at a specified elevation eg. 10.0 : `python scripts/sampling/simple_video_sample.py --input_path \u003cpath/to/image.png\u003e --version sv3d_p --elevations_deg 10.0`\n2. Generate dynamic orbit at a specified elevations and azimuths: specify sequences of 21 elevations (in degrees) to `elevations_deg` ([-90, 90]), and 21 azimuths (in degrees) to `azimuths_deg` [0, 360] in sorted order from 0 to 360. For example: `python scripts/sampling/simple_video_sample.py --input_path \u003cpath/to/image.png\u003e --version sv3d_p --elevations_deg [\u003clist of 21 elevations in degrees\u003e] --azimuths_deg [\u003clist of 21 azimuths in degrees\u003e]`\n\nTo run SVD or SV3D on a streamlit server:\n`streamlit run scripts/demo/video_sampling.py`\n\n  ![tile](assets/sv3d.gif)\n\n\n**November 30, 2023**\n- Following the launch of SDXL-Turbo, we are releasing [SD-Turbo](https://huggingface.co/stabilityai/sd-turbo).\n\n**November 28, 2023**\n- We are releasing SDXL-Turbo, a lightning fast text-to image model.\n  Alongside the model, we release a [technical report](https://stability.ai/research/adversarial-diffusion-distillation)\n    - Usage:\n        - Follow the installation instructions or update the existing environment with `pip install streamlit-keyup`.\n        - Download the [weights](https://huggingface.co/stabilityai/sdxl-turbo) and place them in the `checkpoints/` directory.\n        - Run `streamlit run scripts/demo/turbo.py`.\n\n  ![tile](assets/turbo_tile.png)\n\n\n**November 21, 2023**\n- We are releasing Stable Video Diffusion, an image-to-video model, for research purposes:\n    - [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid): This model was trained to generate 14\n      frames at resolution 576x1024 given a context frame of the same size.\n      We use the standard image encoder from SD 2.1, but replace the decoder with a temporally-aware `deflickering decoder`.\n    - [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt): Same architecture as `SVD` but finetuned\n      for 25 frame generation.\n    - You can run the community-build gradio demo locally by running `python -m scripts.demo.gradio_app`.\n    - We provide a streamlit demo `scripts/demo/video_sampling.py` and a standalone python script `scripts/sampling/simple_video_sample.py` for inference of both models.\n    - Alongside the model, we release a [technical report](https://stability.ai/research/stable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets).\n\n  ![tile](assets/tile.gif)\n\n**July 26, 2023**\n\n- We are releasing two new open models with a\n  permissive [`CreativeML Open RAIL++-M` license](model_licenses/LICENSE-SDXL1.0) (see [Inference](#inference) for file\n  hashes):\n    - [SDXL-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0): An improved version\n      over `SDXL-base-0.9`.\n    - [SDXL-refiner-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0): An improved version\n      over `SDXL-refiner-0.9`.\n\n![sample2](assets/001_with_eval.png)\n\n**July 4, 2023**\n\n- A technical report on SDXL is now available [here](https://arxiv.org/abs/2307.01952).\n\n**June 22, 2023**\n\n- We are releasing two new diffusion models for research purposes:\n    - `SDXL-base-0.9`: The base model was trained on a variety of aspect ratios on images with resolution 1024^2. The\n      base model uses [OpenCLIP-ViT/G](https://github.com/mlfoundations/open_clip)\n      and [CLIP-ViT/L](https://github.com/openai/CLIP/tree/main) for text encoding whereas the refiner model only uses\n      the OpenCLIP model.\n    - `SDXL-refiner-0.9`: The refiner has been trained to denoise small noise levels of high quality data and as such is\n      not expected to work as a text-to-image model; instead, it should only be used as an image-to-image model.\n\nIf you would like to access these models for your research, please apply using one of the following links:\n[SDXL-0.9-Base model](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9),\nand [SDXL-0.9-Refiner](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9).\nThis means that you can apply for any of the two links - and if you are granted - you can access both.\nPlease log in to your Hugging Face Account with your organization email to request access.\n**We plan to do a full release soon (July).**\n\n## The codebase\n\n### General Philosophy\n\nModularity is king. This repo implements a config-driven approach where we build and combine submodules by\ncalling `instantiate_from_config()` on objects defined in yaml configs. See `configs/` for many examples.\n\n### Changelog from the old `ldm` codebase\n\nFor training, we use [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/), but it should be easy to use other\ntraining wrappers around the base modules. The core diffusion model class (formerly `LatentDiffusion`,\nnow `DiffusionEngine`) has been cleaned up:\n\n- No more extensive subclassing! We now handle all types of conditioning inputs (vectors, sequences and spatial\n  conditionings, and all combinations thereof) in a single class: `GeneralConditioner`,\n  see `sgm/modules/encoders/modules.py`.\n- We separate guiders (such as classifier-free guidance, see `sgm/modules/diffusionmodules/guiders.py`) from the\n  samplers (`sgm/modules/diffusionmodules/sampling.py`), and the samplers are independent of the model.\n- We adopt the [\"denoiser framework\"](https://arxiv.org/abs/2206.00364) for both training and inference (most notable\n  change is probably now the option to train continuous time models):\n    * Discrete times models (denoisers) are simply a special case of continuous time models (denoisers);\n      see `sgm/modules/diffusionmodules/denoiser.py`.\n    * The following features are now independent: weighting of the diffusion loss\n      function (`sgm/modules/diffusionmodules/denoiser_weighting.py`), preconditioning of the\n      network (`sgm/modules/diffusionmodules/denoiser_scaling.py`), and sampling of noise levels during\n      training (`sgm/modules/diffusionmodules/sigma_sampling.py`).\n- Autoencoding models have also been cleaned up.\n\n## Installation:\n\n\u003ca name=\"installation\"\u003e\u003c/a\u003e\n\n#### 1. Clone the repo\n\n```shell\ngit clone https://github.com/Stability-AI/generative-models.git\ncd generative-models\n```\n\n#### 2. Setting up the virtualenv\n\nThis is assuming you have navigated to the `generative-models` root after cloning it.\n\n**NOTE:** This is tested under `python3.10`. For other python versions, you might encounter version conflicts.\n\n**PyTorch 2.0**\n\n```shell\n# install required packages from pypi\npython3 -m venv .pt2\nsource .pt2/bin/activate\npip3 install -r requirements/pt2.txt\n```\n\n#### 3. Install `sgm`\n\n```shell\npip3 install .\n```\n\n#### 4. Install `sdata` for training\n\n```shell\npip3 install -e git+https://github.com/Stability-AI/datapipelines.git@main#egg=sdata\n```\n\n## Packaging\n\nThis repository uses PEP 517 compliant packaging using [Hatch](https://hatch.pypa.io/latest/).\n\nTo build a distributable wheel, install `hatch` and run `hatch build`\n(specifying `-t wheel` will skip building a sdist, which is not necessary).\n\n```\npip install hatch\nhatch build -t wheel\n```\n\nYou will find the built package in `dist/`. You can install the wheel with `pip install dist/*.whl`.\n\nNote that the package does **not** currently specify dependencies; you will need to install the required packages,\ndepending on your use case and PyTorch version, manually.\n\n## Inference\n\nWe provide a [streamlit](https://streamlit.io/) demo for text-to-image and image-to-image sampling\nin `scripts/demo/sampling.py`.\nWe provide file hashes for the complete file as well as for only the saved tensors in the file (\nsee [Model Spec](https://github.com/Stability-AI/ModelSpec) for a script to evaluate that).\nThe following models are currently supported:\n\n- [SDXL-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)\n  ```\n  File Hash (sha256): 31e35c80fc4829d14f90153f4c74cd59c90b779f6afe05a74cd6120b893f7e5b\n  Tensordata Hash (sha256): 0xd7a9105a900fd52748f20725fe52fe52b507fd36bee4fc107b1550a26e6ee1d7\n  ```\n- [SDXL-refiner-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0)\n  ```\n  File Hash (sha256): 7440042bbdc8a24813002c09b6b69b64dc90fded4472613437b7f55f9b7d9c5f\n  Tensordata Hash (sha256): 0x1a77d21bebc4b4de78c474a90cb74dc0d2217caf4061971dbfa75ad406b75d81\n  ```\n- [SDXL-base-0.9](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9)\n- [SDXL-refiner-0.9](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9)\n- [SD-2.1-512](https://huggingface.co/stabilityai/stable-diffusion-2-1-base/blob/main/v2-1_512-ema-pruned.safetensors)\n- [SD-2.1-768](https://huggingface.co/stabilityai/stable-diffusion-2-1/blob/main/v2-1_768-ema-pruned.safetensors)\n\n**Weights for SDXL**:\n\n**SDXL-1.0:**\nThe weights of SDXL-1.0 are available (subject to\na [`CreativeML Open RAIL++-M` license](model_licenses/LICENSE-SDXL1.0)) here:\n\n- base model: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/\n- refiner model: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/\n\n**SDXL-0.9:**\nThe weights of SDXL-0.9 are available and subject to a [research license](model_licenses/LICENSE-SDXL0.9).\nIf you would like to access these models for your research, please apply using one of the following links:\n[SDXL-base-0.9 model](https://huggingface.co/stabilityai/stable-diffusion-xl-base-0.9),\nand [SDXL-refiner-0.9](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-0.9).\nThis means that you can apply for any of the two links - and if you are granted - you can access both.\nPlease log in to your Hugging Face Account with your organization email to request access.\n\nAfter obtaining the weights, place them into `checkpoints/`.\nNext, start the demo using\n\n```\nstreamlit run scripts/demo/sampling.py --server.port \u003cyour_port\u003e\n```\n\n### Invisible Watermark Detection\n\nImages generated with our code use the\n[invisible-watermark](https://github.com/ShieldMnt/invisible-watermark/)\nlibrary to embed an invisible watermark into the model output. We also provide\na script to easily detect that watermark. Please note that this watermark is\nnot the same as in previous Stable Diffusion 1.x/2.x versions.\n\nTo run the script you need to either have a working installation as above or\ntry an _experimental_ import using only a minimal amount of packages:\n\n```bash\npython -m venv .detect\nsource .detect/bin/activate\n\npip install \"numpy\u003e=1.17\" \"PyWavelets\u003e=1.1.1\" \"opencv-python\u003e=4.1.0.25\"\npip install --no-deps invisible-watermark\n```\n\nTo run the script you need to have a working installation as above. The script\nis then useable in the following ways (don't forget to activate your\nvirtual environment beforehand, e.g. `source .pt1/bin/activate`):\n\n```bash\n# test a single file\npython scripts/demo/detect.py \u003cyour filename here\u003e\n# test multiple files at once\npython scripts/demo/detect.py \u003cfilename 1\u003e \u003cfilename 2\u003e ... \u003cfilename n\u003e\n# test all files in a specific folder\npython scripts/demo/detect.py \u003cyour folder name here\u003e/*\n```\n\n## Training:\n\nWe are providing example training configs in `configs/example_training`. To launch a training, run\n\n```\npython main.py --base configs/\u003cconfig1.yaml\u003e configs/\u003cconfig2.yaml\u003e\n```\n\nwhere configs are merged from left to right (later configs overwrite the same values).\nThis can be used to combine model, training and data configs. However, all of them can also be\ndefined in a single config. For example, to run a class-conditional pixel-based diffusion model training on MNIST,\nrun\n\n```bash\npython main.py --base configs/example_training/toy/mnist_cond.yaml\n```\n\n**NOTE 1:** Using the non-toy-dataset\nconfigs `configs/example_training/imagenet-f8_cond.yaml`, `configs/example_training/txt2img-clipl.yaml`\nand `configs/example_training/txt2img-clipl-legacy-ucg-training.yaml` for training will require edits depending on the\nused dataset (which is expected to stored in tar-file in\nthe [webdataset-format](https://github.com/webdataset/webdataset)). To find the parts which have to be adapted, search\nfor comments containing `USER:` in the respective config.\n\n**NOTE 2:** This repository supports both `pytorch1.13` and `pytorch2`for training generative models. However for\nautoencoder training as e.g. in `configs/example_training/autoencoder/kl-f4/imagenet-attnfree-logvar.yaml`,\nonly `pytorch1.13` is supported.\n\n**NOTE 3:** Training latent generative models (as e.g. in `configs/example_training/imagenet-f8_cond.yaml`) requires\nretrieving the checkpoint from [Hugging Face](https://huggingface.co/stabilityai/sdxl-vae/tree/main) and replacing\nthe `CKPT_PATH` placeholder in [this line](configs/example_training/imagenet-f8_cond.yaml#81). The same is to be done\nfor the provided text-to-image configs.\n\n### Building New Diffusion Models\n\n#### Conditioner\n\nThe `GeneralConditioner` is configured through the `conditioner_config`. Its only attribute is `emb_models`, a list of\ndifferent embedders (all inherited from `AbstractEmbModel`) that are used to condition the generative model.\nAll embedders should define whether or not they are trainable (`is_trainable`, default `False`), a classifier-free\nguidance dropout rate is used (`ucg_rate`, default `0`), and an input key (`input_key`), for example, `txt` for\ntext-conditioning or `cls` for class-conditioning.\nWhen computing conditionings, the embedder will get `batch[input_key]` as input.\nWe currently support two to four dimensional conditionings and conditionings of different embedders are concatenated\nappropriately.\nNote that the order of the embedders in the `conditioner_config` is important.\n\n#### Network\n\nThe neural network is set through the `network_config`. This used to be called `unet_config`, which is not general\nenough as we plan to experiment with transformer-based diffusion backbones.\n\n#### Loss\n\nThe loss is configured through `loss_config`. For standard diffusion model training, you will have to\nset `sigma_sampler_config`.\n\n#### Sampler config\n\nAs discussed above, the sampler is independent of the model. In the `sampler_config`, we set the type of numerical\nsolver, number of steps, type of discretization, as well as, for example, guidance wrappers for classifier-free\nguidance.\n\n### Dataset Handling\n\nFor large scale training we recommend using the data pipelines from\nour [data pipelines](https://github.com/Stability-AI/datapipelines) project. The project is contained in the requirement\nand automatically included when following the steps from the [Installation section](#installation).\nSmall map-style datasets should be defined here in the repository (e.g., MNIST, CIFAR-10, ...), and return a dict of\ndata keys/values,\ne.g.,\n\n```python\nexample = {\"jpg\": x,  # this is a tensor -1...1 chw\n           \"txt\": \"a beautiful image\"}\n```\n\nwhere we expect images in -1...1, channel-first format.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FStability-AI%2Fgenerative-models","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FStability-AI%2Fgenerative-models","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FStability-AI%2Fgenerative-models/lists"}