{"id":50510422,"url":"https://github.com/songmzhang/KDFlow","last_synced_at":"2026-06-19T14:00:35.563Z","repository":{"id":341013171,"uuid":"1168545491","full_name":"songmzhang/KDFlow","owner":"songmzhang","description":"A user-friendly \u0026 efficient knowledge distillation framework for LLMs, supporting off-policy, on-policy (OPD), cross-tokenizer, multimodal, and on-policy self-distillation.","archived":false,"fork":false,"pushed_at":"2026-06-04T07:44:40.000Z","size":4235,"stargazers_count":187,"open_issues_count":1,"forks_count":15,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-06-04T09:17:35.588Z","etag":null,"topics":["cross-tokenizer-distillation","distillation","knowledge-distillation","large-languge-models","on-policy-distillation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/songmzhang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-27T14:13:42.000Z","updated_at":"2026-06-04T07:44:43.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/songmzhang/KDFlow","commit_stats":null,"previous_names":["songmzhang/kdflow"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/songmzhang/KDFlow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songmzhang%2FKDFlow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songmzhang%2FKDFlow/t
ags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songmzhang%2FKDFlow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songmzhang%2FKDFlow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/songmzhang","download_url":"https://codeload.github.com/songmzhang/KDFlow/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/songmzhang%2FKDFlow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34534278,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-19T02:00:06.005Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cross-tokenizer-distillation","distillation","knowledge-distillation","large-languge-models","on-policy-distillation"],"created_at":"2026-06-02T20:00:26.252Z","updated_at":"2026-06-19T14:00:35.541Z","avatar_url":"https://github.com/songmzhang.png","language":"Python","funding_links":[],"categories":["🛠️ Frameworks \u0026 Toolkits"],"sub_categories":["🔁 Iterative Self-Bootstrapping"],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"figures/logo.png\" alt=\"KDFlow Logo\" width=\"60%\"\u003e\n\n  ### **A User-friendly and Efficient Framework for LLM Knowledge Distillation**\n\n  
[![Release](https://img.shields.io/github/v/release/songmzhang/KDFlow)](https://github.com/songmzhang/KDFlow/releases)\n  [![Documentation](https://img.shields.io/badge/docs-readthedocs-blue?logo=readthedocs\u0026logoColor=white)](https://kdflow.readthedocs.io/)\n  [![Docker](https://img.shields.io/badge/Docker-Hub-2496ED?logo=docker\u0026logoColor=white)](https://hub.docker.com/repository/docker/songmzhang/kdflow/tags)\n  [![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)\n  [![arXiv](https://img.shields.io/badge/arXiv-2603.01875-b31b1b?logo=arxiv)](https://arxiv.org/abs/2603.01875)\n  [![WeChat](https://img.shields.io/badge/WeChat-Group-07C160?logo=wechat\u0026logoColor=white)](#-wechat-group)\n  [![Stars](https://img.shields.io/github/stars/songmzhang/KDFlow?style=social)](https://github.com/songmzhang/KDFlow)\n\n\u003c/div\u003e\n\n---\n\n## 🔥 News\n\n- **[2026/06]** 🐳 New Docker image based on **sglang 0.5.12 + CUDA 12.9** is now available on [Docker Hub](https://hub.docker.com/repository/docker/songmzhang/kdflow/tags) — **recommended** going forward.\n- **[2026/05]** 🪄 Support **EMA teacher update** for on-policy self-distillation, enabled via `--use_ema_teacher True` and `--teacher_ema_decay \u003cfloat\u003e` (default `0.999`).\n- **[2026/04]** ⚡ Support dynamic batch size (enabled via `--use_dynamic_bsz True` and `--max_token_len_per_gpu \u003cN\u003e`), which accelerates training by almost **60% to 100%**.\n- **[2026/04]** 🎉 KDFlow v0.1.3 has been released, now supporting weight synchronization from student to teacher in on-policy self-distillation (controlled by `--teacher_update_freq`, defaults to `1` meaning the teacher is synced every global step when student and teacher share the same model path).\n- **[2026/04]** 🐳 The docker image for KDFlow is now available on [Docker Hub](https://hub.docker.com/repository/docker/songmzhang/kdflow/tags), and the corresponding Dockerfile is also provided in `docker/`.\n- **[2026/03]** 🎉 KDFlow 
v0.1.2 has been released, supporting multi-node TP/PP for extremely large teacher models.\n- **[2026/03]** 💬 We have created a KDFlow WeChat group! Welcome to [join us](#-wechat-group) for discussion and communication!\n- **[2026/03]** 🎉 KDFlow v0.1.1 released! Now supports **vision-language (multimodal) models** and **Qwen3.5 series**.\n\n---\n\n## 📑 Table of Contents\n\n- [🔥 News](#-news)\n- [✨ Key Features](#-key-features)\n- [🚀 Quick Start](#-quick-start)\n  - [Installation](#installation)\n  - [Off-Policy Knowledge Distillation](#off-policy-knowledge-distillation)\n  - [On-Policy Knowledge Distillation](#on-policy-knowledge-distillation)\n  - [Cross-Tokenizer Knowledge Distillation](#cross-tokenizer-knowledge-distillation)\n  - [Supervised Fine-Tuning (SFT)](#supervised-fine-tuning-sft)\n- [⚙️ Arguments](#️-arguments)\n- [🧩 Extending KDFlow](#-extending-kdflow)\n  - [Adding a Custom KD Algorithm](#adding-a-custom-kd-algorithm)\n  - [Adding a Custom KD Loss](#adding-a-custom-kd-loss)\n- [🔑 Design Highlights](#-design-highlights)\n- [🙏 Acknowledgement](#-acknowledgement)\n- [📖 Citation](#-citation)\n- [📄 License](#-license)\n- [💬 WeChat Group](#-wechat-group)\n- [⭐ Star History](#-star-history)\n\n---\n\n## ✨ Key Features\n\n- **Decoupled Infrastructure** - Using SGLang \u0026 FSDP2 for teacher inference and student training respectively.\n- **Off-Policy Knowledge Distillation** — Distill from pre-collected teacher hidden states on static datasets.\n- **On-Policy Knowledge Distillation** — Student-generated rollout responses are used for teacher forward and distillation training in a closed loop.\n- **Cross-Tokenizer Distillation** — Native support for distilling between models with different tokenizers (e.g., Llama → Qwen).\n- **SFT Training (Black-box KD)** — Supervised fine-tuning on collected dataset.\n- **MultiModal Support** — Support distillation with vision-language models (e.g., Qwen3-VL).\n- **Colocate Mode** — Teacher and student models **share the 
same GPUs** via a sleep/wakeup mechanism, maximizing GPU utilization.\n- **Teacher on SGLang** — Teacher inference is powered by SGLang Engine, enabling high-throughput prefilling and flexible parallel strategies.\n- **Pluggable KD Algorithms** — Built-in support for Vanilla KD and DSKD (Dual-Space Knowledge Distillation), with easy registration of custom algorithms.\n- **Multiple Loss Functions** — Torch-compiled KL divergence, Reverse KL divergence, JS divergence, Adaptive KL (AKL), etc.\n- **LoRA Support** — Optional LoRA fine-tuning for the student model.\n- **W\u0026B Integration** — Built-in W\u0026B logging for experiment tracking.\n- **High Training Efficiency** — Achieves **1.4x to 6x** faster distillation compared to mainstream knowledge distillation frameworks.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/architecture.png\" alt=\"KDFlow Architecture\" width=\"80%\"\u003e\n\u003c/p\u003e\n\n---\n\n## 🚀 Quick Start\n\n### Installation\n\nInstall from source:\n\n```bash\ngit clone https://github.com/songmzhang/KDFlow.git\ncd KDFlow\npip install -e ./\n# install flash attention after torch installation\npip install flash_attn==2.8.3 --no-build-isolation\n```\n\nUse the prebuilt Docker image from Docker Hub (**recommended**):\n\n```bash\ndocker pull songmzhang/kdflow:sgl0512-torch211-cu129\n```\n\n\u003e To support Qwen3.5, please use the latest version of SGLang, which supports transformers v5.3.0.\n\n\u003e Older `sgl059-torch291-cu128` images are kept as legacy.\n\u003e\n\u003e ⚠️ **VLM users:** `sglang==0.5.9` has a known VLM compatibility bug tracked in [sglang#19335](https://github.com/sgl-project/sglang/issues/19335) and [kdflow#9](https://github.com/songmzhang/KDFlow/issues/9). 
For source installs, please pin `sglang\u003e=0.5.10`.\n\n### Off-Policy Knowledge Distillation\nLLMs:\n```bash\nbash ./examples/off_policy_kd/run_qwen3_30b_a3b_to_4b.sh\n```\nVLMs:\n```bash\nbash ./examples/off_policy_kd/run_qwen3_vl_30b_a3b_to_4b.sh\n```\n\n### On-Policy Knowledge Distillation\nLLMs:\n```bash\nbash ./examples/on_policy_kd/run_qwen3_30b_a3b_to_4b.sh\n```\nVLMs:\n```bash\nbash ./examples/on_policy_kd/run_qwen3_vl_30b_a3b_to_4b.sh\n```\n\n### Cross-Tokenizer Knowledge Distillation\n\n#### Off-Policy\n\nUse SimpleCrossTokenizerKD (suggested):\n```bash\nbash ./examples/cross_tokenizer_kd/run_qwen3_30b_a3b_to_llama3_2_3b_offpolicy_simple_ctkd.sh\n```\n\nor DSKD:\n\n```bash\nbash ./examples/cross_tokenizer_kd/run_qwen3_30b_a3b_to_llama3_2_3b_offpolicy.sh\n```\n\n#### On-Policy\n\nUse SimpleCrossTokenizerKD (suggested):\n```bash\nbash ./examples/cross_tokenizer_kd/run_qwen3_30b_a3b_to_llama3_2_3b_onpolicy_simple_ctkd.sh\n```\n\nor DSKD:\n\n```bash\nbash ./examples/cross_tokenizer_kd/run_qwen3_30b_a3b_to_llama3_2_3b_onpolicy.sh\n```\n\n### Supervised Fine-Tuning (SFT)\n\n```bash\nbash ./examples/sft/run_qwen3_4b.sh\n```\n\n---\n\n## ⚙️ Arguments\n\n### Model Arguments\n\n| Argument | Default | Description |\n|---|---|---|\n| `--student_name_or_path` | `None` | Student model name or path |\n| `--teacher_name_or_path` | `None` | Teacher model name or path |\n| `--attn_implementation` | `flash_attention_2` | Attention implementation |\n| `--use_liger_kernel` | `False` | Use Liger Kernel for student model |\n| `--lora_rank` | `0` | LoRA rank (0 = disabled) |\n| `--lora_alpha` | `16` | LoRA alpha |\n| `--target_modules` | `all-linear` | LoRA target modules |\n| `--lora_dropout` | `0.0` | LoRA dropout |\n| `--ring_attn_size` | `1` | Ring attention group size for context parallelism |\n| `--enable_thinking` | `False` | Enable thinking mode |\n| `--disable_fast_tokenizer` | `False` | Disable fast tokenizer |\n\n### Training Arguments\n\n| Argument | Default | 
Description |\n|---|---|---|\n| `--num_nodes` | `1` | Number of training nodes |\n| `--num_gpus_per_node` | `8` | GPUs per node |\n| `--num_epochs` | `1` | Number of training epochs |\n| `--train_batch_size` | `128` | Global training batch size |\n| `--micro_train_batch_size` | `1` | Per-GPU micro batch size |\n| `--learning_rate` | `1e-6` | Learning rate |\n| `--lr_scheduler` | `cosine_with_min_lr` | LR scheduler type |\n| `--lr_warmup_ratio` | `0.05` | Warmup ratio |\n| `--min_lr` | `1e-8` | Minimum learning rate |\n| `--max_norm` | `1.0` | Gradient clipping max norm |\n| `--weight_decay` | `0.0` | Weight decay |\n| `--adam_betas` | `(0.9, 0.98)` | Adam optimizer betas |\n| `--backend` | `fsdp2` | Training backend |\n| `--gradient_checkpointing` | `False` | Enable gradient checkpointing |\n| `--enable_sleep` | `False` | Enable sleep mode for all components (student, teacher, rollout) |\n| `--eval_steps` | `-1` | Evaluate every N steps (-1 = disabled) |\n| `--save_steps` | `-1` | Save checkpoint every N steps (-1 = disabled) |\n| `--save_path` | `./ckpt/` | Model save path |\n| `--ckpt_path` | `./ckpt/checkpoints_distill` | Checkpoint save path |\n| `--seed` | `42` | Random seed |\n| `--bf16` | `False` | Enable bfloat16 training |\n| `--use_dynamic_bsz` | `False` | Enable dynamic batch size based on token count per GPU |\n| `--max_token_len_per_gpu` | `0` | Maximum total token count per micro-batch when `use_dynamic_bsz` is True |\n\n### FSDP Arguments\n\n| Argument | Default | Description |\n|---|---|---|\n| `--fsdp_size` | `-1` | FSDP shard size for HSDP (-1 = full sharding) |\n| `--cpu_offload` | `False` | Offload Adam optimizer states to CPU |\n\n### Distillation Arguments\n\n| Argument | Default | Description |\n|---|---|---|\n| `--kd_ratio` | `0.5` | KD loss weight: `loss = (1 - kd_ratio) * CE + kd_ratio * KD` |\n| `--kd_temperature` | `1.0` | Temperature for softmax in KD |\n| `--kd_algorithm` | `vanilla_kd` | KD algorithm (`vanilla_kd` / `dskd`) |\n| 
`--kd_loss_fn` | `kl` | Divergence function (`kl` / `rkl` / `jsd` / `akl`) |\n| `--teacher_tp_size` | `8` | Teacher tensor parallel size |\n| `--teacher_ep_size` | `1` | Teacher expert parallel size (MoE models) |\n| `--teacher_pp_size` | `1` | Teacher pipeline parallel size |\n| `--teacher_dp_size` | `1` | Teacher data parallel size |\n| `--teacher_forward_n_batches` | `1` | Teacher forward N batches at once |\n| `--teacher_mem_fraction_static` | `0.4` | SGLang static memory fraction for teacher |\n| `--teacher_offload_tags` | `all` | Offload tags for SGLang |\n| `--teacher_quantization` | `None` | Teacher model quantization |\n| `--dskd_token_align` | `eta` | Token alignment strategy for DSKD (`eta` / `cma`) |\n| `--dskd_topk_vocab` | `-1` | Top-k vocab tokens for DSKD projector init (-1 = all) |\n| `--dskd_projector_lr` | `1e-4` | Learning rate for DSKD projectors |\n| `--jsd_beta` | `0.5` | Beta for Jensen-Shannon Divergence |\n| `--skew_lambda` | `0.1` | Lambda for Skewed KL/RKL |\n| `--adaptive_alpha` | `0.5` | Alpha for Adaptive KL Divergence |\n| `--hrl_topk` | `5` | Top-k for Hierarchical Ranking Loss |\n| `--teacher_update_freq` | `1` | Teacher weight update frequency (in global steps) for on-policy self-distillation |\n\n### Rollout Arguments (On-Policy)\n\n| Argument | Default | Description |\n|---|---|---|\n| `--rollout_num_engines` | `0` | Number of SGLang rollout engines (0 = off-policy) |\n| `--rollout_tp_size` | `1` | Tensor parallel per rollout engine |\n| `--rollout_batch_size` | `32` | Prompts per rollout iteration |\n| `--n_samples_per_prompt` | `1` | Number of responses per prompt |\n| `--generate_max_len` | `2048` | Max generation length |\n| `--temperature` | `1.0` | Sampling temperature |\n| `--top_p` | `1.0` | Top-p sampling |\n| `--rollout_mem_fraction_static` | `0.6` | GPU memory utilization per rollout engine |\n| `--print_rollout_sample` | `False` | Print a rollout sample after each rollout |\n\n### Data Arguments\n\n| Argument | 
Default | Description |\n|---|---|---|\n| `--train_dataset_path` | `None` | Training dataset path |\n| `--train_dataset_probs` | `None` | Sampling probabilities for multiple datasets |\n| `--train_split` | `train` | Train split name |\n| `--eval_dataset_path` | `None` | Evaluation dataset path |\n| `--eval_split` | `eval` | Eval split name |\n| `--input_key` | `messages` | Dataset input key |\n| `--output_key` | `None` | Dataset output key |\n| `--image_key` | `None` | Image key for multimodal datasets |\n| `--teacher_input_key` | `None` | Input key for teacher prompt (for self-distillation/context distillation) |\n| `--label_key` | `None` | Label key in dataset |\n| `--apply_chat_template` | `True` | Apply tokenizer chat template |\n| `--max_len` | `4096` | Max sequence length |\n| `--prompt_max_len` | `2048` | Max prompt length |\n| `--max_samples` | `1e8` | Max number of samples to load |\n| `--packing_samples` | `False` | Pack sequences for efficiency |\n| `--preprocess_num_workers` | `8` | Number of workers for data preprocessing |\n\n### Logging Arguments\n\n| Argument | Default | Description |\n|---|---|---|\n| `--logging_steps` | `10` | Log every N steps |\n| `--use_wandb` | `False` | Enable W\u0026B logging |\n| `--wandb_org` | `None` | W\u0026B organization name |\n| `--wandb_project` | `None` | W\u0026B project name |\n| `--wandb_group` | `None` | W\u0026B group name |\n| `--wandb_run_name` | `None` | W\u0026B run name |\n| `--wandb_mode` | `online` | W\u0026B mode (`online` / `offline` / `disabled`) |\n| `--wandb_dir` | `None` | Directory to store W\u0026B offline logs |\n\n---\n\n## 🧩 Extending KDFlow\n\n### Adding a Custom KD Algorithm\n\nCreate a new file in `kdflow/algorithms/` and register it:\n\n```python\nimport torch\nfrom kdflow.loss import LOSS_DICT\nfrom kdflow.algorithms import register_algorithm\n\n\n@register_algorithm(\"my_custom_kd\")\nclass MyCustomKD:\n    def __init__(self, strategy, student_model, teacher_lm_head, **kwargs):\n        
self.strategy = strategy\n        self.student = student_model\n        self.teacher_lm_head = teacher_lm_head\n        self.loss_fn = LOSS_DICT[strategy.args.kd.loss_fn]\n\n    def training_step(self, micro_batch):\n        # Access student inputs\n        student_input_ids = micro_batch[\"stu_input_ids\"]\n        student_attn_mask = micro_batch[\"stu_attn_mask\"]\n        student_loss_mask = micro_batch[\"stu_loss_mask\"].bool()\n        teacher_hiddens = micro_batch[\"teacher_hiddens\"]\n        avg_token_num = micro_batch[\"avg_micro_batch_token_num\"]\n\n        # Student forward\n        output = self.student(student_input_ids, attention_mask=student_attn_mask, return_output=True)\n        student_logits = output[\"logits\"][student_loss_mask]\n\n        # Teacher logits from hidden states + lm_head\n        teacher_logits = self.teacher_lm_head(teacher_hiddens.to(self.teacher_lm_head.weight))\n\n        # Compute your custom loss\n        kd_loss = self.loss_fn(student_logits, teacher_logits, temperature=1.0)\n        kd_loss = kd_loss.sum() / avg_token_num\n\n        return {\"loss\": kd_loss, \"kd_loss\": kd_loss}\n```\n\nThen use it with `--kd_algorithm my_custom_kd`.\n\n### Adding a Custom KD Loss\n\nCreate a new file in `kdflow/loss/` and register it:\n\n```python\nimport torch\nimport torch.nn.functional as F \n\nfrom kdflow.loss import register_loss\n\n\n@register_loss(\"my_custom_loss\")\n@torch.compile()\ndef compute_kl_div(\n    student_logits,\n    teacher_logits, \n    temperature=1.0,\n    reduction=\"none\",\n    **kwargs\n):\n    student_logits = student_logits / temperature\n    teacher_logits = teacher_logits / temperature\n    log_probs = torch.log_softmax(student_logits, -1, dtype=torch.float32)\n    target_probs = torch.softmax(teacher_logits, -1, dtype=torch.float32)\n    kl_div = F.kl_div(log_probs, target_probs, reduction=reduction).sum(-1)\n    \n    return kl_div\n```\n\nThen use it with `--kd_loss_fn my_custom_loss`.\n\n---\n\n## 🔑 
Design Highlights\n\n### GPU Co-location via Sleep/Wakeup\n\nKDFlow lets the teacher and student **share the same GPUs** through a sleep/wakeup mechanism:\n\n1. **Teacher phase**: Teacher model weights are loaded onto the GPU while student optimizer states are offloaded to CPU.\n2. **Student phase**: Student optimizer states are reloaded onto the GPU while teacher model weights are offloaded to CPU.\n\nThis allows running large teacher models (e.g., 200B+ parameters) on the same hardware as the student without requiring separate GPU pools.\n\n### Hidden States Transfer via Shared Memory\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/cost.png\" alt=\"Knowledge transfer cost\" width=\"80%\"\u003e\n\u003c/p\u003e\n\nInstead of transferring full teacher logits (which can be enormous for large vocabularies), KDFlow:\n\n1. Extracts **hidden states** from the teacher's last layer via SGLang.\n2. Transfers them to the student via **shared memory** (zero-copy).\n3. Computes teacher logits **on the student side** using only the teacher's `lm_head` weights.\n\nBecause a hidden state is typically far smaller per token than a full-vocabulary logit vector, this substantially reduces memory and communication overhead.\n\n### Token-Based Teacher Load Balancing\n\nThe `TeacherActorGroup` uses a **greedy token-based load balancing** strategy to distribute micro-batches across teacher actors, ensuring even workload distribution when sequence lengths vary.\n\n---\n\n## 🙏 Acknowledgement\n\nKDFlow stands on the shoulders of outstanding open-source projects. 
We sincerely thank:\n\n- [SGLang](https://github.com/sgl-project/sglang) — We deeply appreciate its support for extracting hidden states from model inference and its exceptional inference efficiency, which are critical to KDFlow's teacher inference pipeline.\n- [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) — We gratefully adopt its well-designed abstractions for model wrapping and distributed training strategy, which form the foundation of our training infrastructure.\n- [slime](https://github.com/THUDM/slime) — We appreciate its elegant implementation of Ray placement group initialization and the weight update mechanism for SGLang, which greatly inspired our design of on-policy distillation.\n\n---\n\n## 📖 Citation\n\nIf you find KDFlow useful in your research or work, please consider citing our paper:\n\n```bibtex\n@article{zhang2026kdflow,\n  title={KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models},\n  author={Zhang, Songming and Zhang, Xue and Zhang, Tong and Hu, Bojie and Chen, Yufeng and Xu, Jinan},\n  journal={arXiv preprint arXiv:2603.01875},\n  year={2026}\n}\n```\n\n---\n\n## 📄 License\n\nThis project is licensed under the [MIT License](LICENSE).\n\n---\n\n## 💬 WeChat Group\n\nWelcome to join our WeChat group for discussion and communication!\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/wechat.jpg\" alt=\"WeChat Group QR Code\" width=\"300\"\u003e\n\u003c/p\u003e\n\n---\n\n## ⭐ Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=songmzhang/KDFlow\u0026type=date\u0026legend=top-left)](https://www.star-history.com/#songmzhang/KDFlow\u0026type=date\u0026legend=top-left)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsongmzhang%2FKDFlow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsongmzhang%2FKDFlow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsongmzhang%2FKDFlow/lists"}