{"id":19845984,"url":"https://github.com/vchitect/litegen","last_synced_at":"2025-05-01T21:30:55.496Z","repository":{"id":257062619,"uuid":"848394396","full_name":"Vchitect/LiteGen","owner":"Vchitect","description":" A light-weight and high-efficient training framework for accelerating diffusion tasks.","archived":false,"fork":false,"pushed_at":"2024-09-14T06:54:28.000Z","size":160,"stargazers_count":35,"open_issues_count":1,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-09-25T15:03:14.891Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Vchitect.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-27T17:20:25.000Z","updated_at":"2024-09-25T03:08:53.000Z","dependencies_parsed_at":"2024-09-14T17:18:26.792Z","dependency_job_id":"04aad172-0455-4c0a-9cd3-13870482735a","html_url":"https://github.com/Vchitect/LiteGen","commit_stats":null,"previous_names":["vchitect/litegen"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vchitect%2FLiteGen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vchitect%2FLiteGen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vchitect%2FLiteGen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vchitect%2FLiteGen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Vchitect","dow
nload_url":"https://codeload.github.com/Vchitect/LiteGen/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224278457,"owners_count":17285080,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T13:10:00.034Z","updated_at":"2024-11-12T13:10:00.168Z","avatar_url":"https://github.com/Vchitect.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LiteGen\nA lightweight and highly efficient training framework for accelerating diffusion tasks.\n\n## 📜 About\n\nLiteGen is a lightweight and highly efficient training acceleration framework specifically designed for diffusion tasks. It has been applied and validated on the video generation project [Vchitect-2.0](https://github.com/Vchitect/Vchitect-2.0). The framework integrates multiple training optimization techniques and offers a user-friendly interface, allowing researchers and developers to easily scale from single-GPU setups to multi-node, multi-GPU environments.\n\n## ✨ Supported Features\n\n- vae:\n  - DP vae\n  - Sliced vae\n  - vae.encode compile\n- ema model:\n  - sharded EMA (Exponential Moving Average)\n- text encoder:\n  - sharded text encoder\n- distributed optimization:\n  - DDP\n  - ZeRO1,2,3\n  - Sequence Parallel (Ulysses implementation for Vchitect-2.0 Model)\n- memory optimization:\n  - Grad activation checkpointing\n  - selective checkpointing\n\nWe also provide easy-to-use interfaces for common operations such as model loading and saving. 
LiteGen allows users to focus on generative algorithm development and training without getting bogged down in implementation details.\n\n## 🔨 Usage\n\nImplementing LiteGen's optimizations involves two straightforward steps:\n\n1. **Configuration**: Adjust the relevant fields in your config file to enable the desired optimizations.\n2. **Integration**: Use the API of the `LiteGen` instance in your codebase. These two steps let you integrate the optimizations seamlessly into your existing workflow.\n\n### Quick Start Guide\n\nFollow these steps to integrate LiteGen into your project:\n\n1. **Create LiteGen**\n\nCreate a `LiteGen` instance using your configuration file:\n\n```python\nfrom litegen import LiteGen\ngen = LiteGen(config)\n```\n\n2. **Initialize Components**\n\nUse the `initialize` function to set up your training environment. This versatile function accepts various components and returns their optimized versions in the same order as the input arguments.\n\n```python\nmodel, optimizer, text_encoder, dataloader, vae_encode = gen.initialize(\n    model,          # Your trainable model (only one is accepted)\n    optimizer,      # Optimizer for the model\n    text_encoder,   # Untrainable models (e.g., encoders in diffusion tasks)\n    dataset,        # Your dataset\n    vae.encode      # Computing functions (e.g., VAE encoding)\n)\n```\n\nThe two steps described above constitute the minimal code changes required to implement LiteGen's optimizations. This approach allows for quick integration while leveraging LiteGen's performance enhancements.\n\nIn the following sections, we provide a detailed explanation of the specific optimizations LiteGen offers and how to configure the corresponding key-value pairs in the config file.\n\n### Optimizations\n\n#### DDP or ZeRO for the Trainable Model\n\nLiteGen offers flexibility in choosing between Distributed Data Parallel (DDP) and different stages of ZeRO (Zero Redundancy Optimizer) for your trainable model. 
This choice is controlled by the `zero_degree` field in the configuration file:\n\n- `zero_degree = 0 or None`: Uses DDP\n- `zero_degree = 1`: Implements ZeRO Stage 1\n- `zero_degree = 2`: Implements ZeRO Stage 2\n- `zero_degree = 3`: Implements ZeRO Stage 3\n\nWhen using ZeRO Stage 3, you can enable grouped ZeRO3 by setting `group_zero` to `True`. This option limits communication within a single node, potentially reducing inter-node communication overhead and enhancing training performance.\n\nExample configuration:\n\n```yaml\nzero_degree: 3\ngroup_zero: True\n```\n\n**Note**: While the `initialize` interface supports optimizing multiple models passed in any order, it only supports one trainable model. The function determines if a model is trainable by checking if any of its parameters have `requires_grad=True`. For all non-trainable models passed to the function, ensure you set `.requires_grad_(False)` beforehand.\n\n#### Selective Activation Checkpointing\n\nLiteGen incorporates activation checkpointing, a common optimization technique for reducing memory usage, and simplifies its usage. Furthermore, when sufficient memory is available, we allow for selective application of activation checkpointing to specific modules, thereby reducing performance overhead.\n\nExample configuration:\n\n```yaml\nselective_ratio: 0.2    # Ratio of modules that do NOT use activation checkpointing.\n                        # 0: All blocks in the model use activation checkpointing.\n                        # 1: No blocks in the model use activation checkpointing.\n```\n\n**Note:**\n\n1. Activation checkpointing only applies to the trainable model.\n2. Implement `get_fsdp_wrap_module_list()` in your model class to specify modules for checkpointing.\n3. 
If not implemented, LiteGen automatically detects and applies checkpointing to repetitive module structures in the model (e.g., repeated transformer blocks in DiT models).\n\n#### Activation Offload\n\nTo further conserve GPU memory, we have implemented CPU offloading for activations in our system. This technique effectively overlaps computation and communication, significantly reducing memory usage during training with minimal additional performance overhead.\n\nYou can enable this feature using the following configuration:\n\n```yaml\nac_offload: True\n```\n\n**Note:** This feature, like selective activation checkpointing, applies to the trainable model and relies on the `get_fsdp_wrap_module_list()` method or automatic detection of repetitive modules.\n\n#### Sequence Parallel\n\nLiteGen supports the DeepSpeed Ulysses implementation of Sequence Parallel. Through module conversion mechanisms and PyTorch hooks, we enable the transformation from a serial model to a Sequence Parallel model with minimal modifications.\n\nYou can configure this feature as follows:\n\n```yaml\nsp_size: 8  # Sequence parallel degree\n```\n\n**Note:**\n\nSequence Parallel inherently requires scatter and gather operations on tensors within certain modules. Therefore, LiteGen's implementation necessitates a Sequence Parallel version of the AttentionProcessor for the Attention class, as well as a conversion mapping in the ModuleConverter from serial AttentionProcessor to its Sequence Parallel counterpart.\n\nWe have successfully implemented Sequence Parallel support for the Vchitect-2.0 model using LiteGen. For reference, you can find the relevant code in the [Vchitect-2.0](https://github.com/Vchitect/Vchitect-2.0) repository.\n\n\n#### Sharded Encoder\n\nIn addition to optimizing the trainable model, LiteGen also supports parameter sharding for inference-only (untrainable) models, such as text encoders in diffusion tasks. 
This feature further reduces memory usage across your entire pipeline.\n\nConfigure this feature as follows:\n\n```yaml\nencoder:\n  fsdp: True        # Enable parameter sharding\n  group: False      # When True, sharding is limited to within a node, reducing inter-node communication overhead\n```\n\n**Note:** The system identifies models for sharded encoder optimization by checking for the absence of parameters with `requires_grad=True`. To utilize this optimization, set `requires_grad_(False)` on the relevant model before calling LiteGen's `initialize()` method.\n\nHere's an example of optimizing both a trainable model and an untrainable model simultaneously:\n\n```python\ndit = load_dit_model()\ntext_encoder = load_text_encoders()\ndit.train()\ntext_encoder.requires_grad_(False)\ndit, text_encoder = gen.initialize(dit, text_encoder)\n```\n\n#### EMA Model\n\nThe Exponential Moving Average (EMA) model is a common technique used to smooth parameter updates and achieve better training results. LiteGen integrates EMA Model functionality with support for parameter sharding to conserve GPU memory, while providing an easy-to-use interface.\n\nExample configuration:\n\n```yaml\nema:\n  enable: True   # Enable EMA\n  sharded: True  # Use parameter sharding for the EMA model\n```\n\n**User Interface:** LiteGen provides a simple method to update the EMA model:\n\n```python\ngen.update_ema(model, decay=0.9999)\n```\n\n#### Checkpoint saving and loading\n\nLiteGen provides convenient interfaces for saving and loading checkpoints, enabling easy resumption of training without dealing with the intricacies of distributed checkpoints.\n\n**Saving Checkpoints**\n\nWe offer three separate interfaces for saving model, optimizer, and EMA model states:\n\n```python\ngen.save_model(output_folder=None, filename=None, step=None)\ngen.save_optimizer(output_folder=None, filename=None, step=None)\ngen.save_ema(output_folder=None, filename=None, step=None)\n```\n\n1. 
Specify `output_folder` and `filename` to determine the checkpoint file location.\n2. If `output_folder` is unspecified but `filename` is provided, the system uses the `result_dir` defined in the config.\n3. Without a specified `filename`, the system uses the `exp_name` from the config as the checkpoint prefix:\n   * Model: `[exp_name].pth`\n   * Optimizer: `[exp_name].optim_state.pth`\n   * EMA model: `[exp_name].ema.pth`\n4. We recommend specifying the current `step` when saving checkpoints. If `step` is provided without a `filename`, the system appends the step information to the `exp_name` prefix: `[exp_name]_step[StepNum]`.\n\n**Loading Checkpoints**\n\nLiteGen supports various checkpoint loading modes, configurable in the config file. Listed in order of increasing priority:\n\n1. `init_from`:\n\n   * Used for loading initial model weights.\n   * Suitable for starting a new fine-tuning process (step=0).\n   * Loads only the model state dict, not optimizer state or EMA weights.\n\n   Example:\n\n   ```yaml\n   init_from: 'path_to_the_init_model/model.pth'\n   ```\n\n2. `resume_from`:\n\n   * Used to resume training from a specific checkpoint.\n   * Automatically loads corresponding optimizer state.\n   * For EMA model resumption, specify the EMA checkpoint path separately.\n\n   Example:\n\n   ```yaml\n   resume_from: 'path_to_the_resumed_model/model_10.pth'\n   ema:\n     enable: True\n     resume_from: 'path_to_the_resumed_ema/ema_10.pth'\n   ```\n\n   Note: `resume_from` takes precedence over `init_from` if both are specified.\n\n3. 
`auto_resume`:\n\n   * Automatically finds and resumes from the latest saved checkpoint.\n   * Enable by setting `auto_resume: True` in the config.\n   * Works best with `save_model()`, `save_optimizer()`, and `save_ema()` calls that only specify the `step`.\n   * Searches for the most recent checkpoint (based on step number) in the `result_dir` with the `exp_name` prefix.\n\n   Example:\n\n   ```yaml\n   auto_resume: True\n   ```\n\n   Note: `auto_resume` has higher priority than `resume_from`. It only activates when `resume_from` is empty or unspecified to avoid confusion.\n\n\n### Config file\n\n\nAs a lightweight training framework, `LiteGen` does not impose restrictions on the type of configuration files, nor does it provide built-in config file parsing. The only requirement for config files is that, once parsed, they should allow access to the necessary system fields using the `config.key` syntax.\n\nWe recommend using YAML format for configuration files and parsing them as follows:\n\n1. Define a YAML file. Example:\n\n```yaml\nexp_name: 'video_generation_exp1'\nresults_dir: 'path_to_the_results_dir'\n\n# Checkpoint loading\ninit_from: 'path_to_the_init_model/model.pth'\nauto_resume: False\n\n# Model optimization\nselective_ratio: 0\nac_offload: True\nzero_degree: 3\ngroup_zero: True\n\n... # Other arguments\n```\n\n2. Parse the config file in Python:\n\n```python\nimport argparse\nimport yaml\nfrom easydict import EasyDict\n\nparser = argparse.ArgumentParser()\nparser.add_argument('--config', type=str, default='configs/config.yaml', help='config file')\nargs = parser.parse_args()\nwith open(args.config) as f:\n    # safe_load parses plain YAML without constructing arbitrary Python objects\n    cfg = yaml.safe_load(f)\n# EasyDict exposes the parsed fields through the config.key syntax\nconfig = EasyDict(cfg)\n```\n\n3. Initialize the LiteGen instance using the config:\n\n```python\ngen = LiteGen(config)\n```\n\nHere we outline the essential configuration fields for LiteGen. 
While default values are available for some fields, explicit configuration is recommended to prevent errors and ambiguity.\n\n```yaml\n# experiment and filepath\nexp_name: experiment_name\ncheckpoint_dir: path_to_load_checkpoint\ninit_from: filepath to the pretrained model checkpoint\nresume_from: filepath to the model checkpoint\nauto_resume: whether to auto resume the checkpoint, True or False\n\n# precision\nprecision: parameter precision, one of ['tf32', 'fp32', 'bf16', 'fp16']\ngrad_precision: gradient precision to reduce in FSDP, one of ['tf32', 'fp32', 'bf16', 'fp16']\nallow_tf32: whether to enable tf32. True or False.\n\n# optimizer\nlr: learning rate\nweight_decay: weight decay\nfused_optimizer: whether to use fused optimizer implementation. True or False.\n\n# ddp strategy\nzero_degree: degree of ZeRO, one of [0,1,2,3]\ngroup_zero: whether to do zero communication within the node. True or False\n\n# sequence parallel\nsp_size: sequence parallel degree\n\n# module convert\nfused_layernorm: whether to use fused layernorm implementation. True or False.\n\n# activation optimization\nac_offload: whether to enable activation offload to reduce GPU memory. True or False.\nselective_ratio: ratio of modules that do NOT use activation checkpointing, a float in the range 0~1\n\n# ema\nema:\n  enable: whether to enable ema. True or False.\n  sharded: whether to use sharded ema to reduce GPU memory. True or False.\n  resume_from: path to resume the ema checkpoint. a filepath str or None.\n\n# encoder\nencoder:\n  fsdp: whether to use fsdp to optimize the encoder memory usage. True or False.\n  group: whether to use fsdp within the node for encoder. 
True or False.\n\n# training settings\nglobal_seed: global random seed\nmax_steps: max steps number\nnum_workers: number of workers of dataloader\npin_memory: whether to enable pin_memory for dataloader\nglobal_batch_size: total samples used across all ranks in one optimizer step.\n```\n\nAdditionally, users can define custom configuration fields to meet specific requirements for algorithm construction and training script needs.\n\n\n## 🚀 Performance\n\nLiteGen implements Sequence Parallel and Activation Offload techniques, which effectively reduce memory usage and enable training on long sequences for Diffusion tasks. We conducted tests on NVIDIA A100 GPUs to determine the maximum supported sequence length when training Vchitect-2.0. All other optimizations remained the same. The results are as follows:\n\n![Sequence_length](assets/imgs/sequence_length.jpg)\n(AO: Activation Offload, SP: Sequence Parallel)\n\nResults demonstrate that with all memory optimizations enabled, LiteGen supports training on sequences up to 1.63 million tokens in length using 8x NVIDIA A100 GPUs. This corresponds to approximately 150 seconds of video at 760x460 resolution.\n\n## 🔑 License\n\nThis code is licensed under Apache-2.0. The framework is fully open for academic research and also allows free commercial usage. To apply for a commercial license or for other questions or collaborations, please contact yangzhenyu@pjlab.org.cn.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvchitect%2Flitegen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvchitect%2Flitegen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvchitect%2Flitegen/lists"}