{"id":28748689,"url":"https://github.com/lapp0/distily","last_synced_at":"2025-06-16T19:09:35.939Z","repository":{"id":251560545,"uuid":"837744364","full_name":"lapp0/distily","owner":"lapp0","description":"Distily: Language Model Distillation Toolkit and Library","archived":false,"fork":false,"pushed_at":"2024-09-25T11:07:25.000Z","size":350,"stargazers_count":7,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-05T15:45:57.739Z","etag":null,"topics":["bitnet","distillation","knowledge-distillation","language-model","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lapp0.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-03T22:30:33.000Z","updated_at":"2025-01-31T22:04:09.000Z","dependencies_parsed_at":"2024-08-09T23:30:36.053Z","dependency_job_id":"dce9153f-a8cc-4fe6-938a-e6f492c48f96","html_url":"https://github.com/lapp0/distily","commit_stats":null,"previous_names":["lapp0/distily"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lapp0/distily","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lapp0%2Fdistily","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lapp0%2Fdistily/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lapp0%2Fdistily/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lapp0%2Fdistily/manifests","owner_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/owners/lapp0","download_url":"https://codeload.github.com/lapp0/distily/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lapp0%2Fdistily/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260221558,"owners_count":22976867,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bitnet","distillation","knowledge-distillation","language-model","transformer"],"created_at":"2025-06-16T19:09:35.373Z","updated_at":"2025-06-16T19:09:35.922Z","avatar_url":"https://github.com/lapp0.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Distily\n\n\n#### In one command, distill an existing LLM into a smaller or different architecture.\n\n\n## Install\n\n```\npip install -U \"git+https://github.com/lapp0/distily.git\"\n```\n\n## Features\nDistily allows you to distill a model with\n- Quantized weights: e.g. 
TriLM, bitnet\n- Distinct architecture: State-Space models such as Mamba, Mixture-of-Experts (MoE)\n- Modified architecture: Decrease (or increase) the\n  - number of layers\n  - width and depth of attention heads and dense layers\n  - number of attention and KV heads\n\n## Usage\n\n**Minimal Example: `distily_gpt2`**\n\nCommand to create a distilled `gpt2` with only 6 layers:\n```\npython3 -m distily.run \\\n    --teacher_model_name_or_path gpt2 \\\n    --output_dir distily_gpt2 \\\n    --hub_model_id \"distily/distily_gpt2\" \\\n    --push_to_hub True \\\n    --student_model_config '{\"n_layers\": 6}' \\\n    --student_model_as_bitnet True\n```\n\nThe [Resulting `distily_gpt2` Model](https://huggingface.co/distily/distily_gpt2) has (TODO: explain metrics).\n\nFor more examples, review the [Examples](./docs/examples.md) documentation.\n\n#### Note on Hub Credentials\nTo push to the hub, you must first save your hub token:\n```\nHF_WRITE=\u003cyour hub token\u003e python3 -c \"import os; from huggingface_hub.hf_api import HfFolder; HfFolder.save_token(os.environ['HF_WRITE'])\"\n```\n\n## Further Reading\n\nTODO: commit the linked docs once complete\n\n**Using Distily**\n- How Distillation Works: [The Distily Recipe](./docs/recipe.md)\n- [Quickstart / Examples](./docs/using.md)\n- [Parameter Selection](./docs/params.md)\n\n**Available Models**\n- [Official Distily Models](./docs/official_models.md)\n- [All HF Models Created With Distily](https://huggingface.co/models?library=Distily)\n\n\n**Contributing**\n- [Contributing Guidelines](./docs/contributing.md)\n\n## Roadmap\n\n#### Improved performance / sampling efficiency:\n- [x] Standard knowledge distillation using logits.\n- [x] Distill using intermediate features including hidden states and attentions.\n- [ ] Implement [Value-Transfer](https://arxiv.org/pdf/2002.10957) (simply a distillation loss on the v of q, k, v)\n- [ ] Improve sampling efficiency through synthetic data generation.\n- [ ] Implement cross-entropy classification loss 
(traditional LLM loss function)\n- [ ] Apply a projector to logits (https://arxiv.org/pdf/2310.17183)\n- [ ] Apply \"teacher recording\": run teacher inference once, then reuse the recorded features dataset any number of times.\n\n#### Distill to a different model shape / size:\n- [x] Distill to a model with fewer `num_hidden_layers` by implementing layer mappers.\n- [x] Distill to a model with modified module dimensions and behaviors (e.g., `intermediate_size`, `hidden_act`) by employing projectors.\n- [x] Distill to a model with modified `num_attention_heads` and `num_key_value_heads`.\n\n#### Distill to a different architecture:\n- [x] Distill to Bitnet (b1.58)\n- [ ] Distill to State-Space / Mamba\n- [ ] Distill to MoE\n- [ ] Distill to Parameter Sharing (ALBERT-style) Model\n\n#### Additional Techniques:\n- [ ] [Distill from multiple models at once](https://arxiv.org/pdf/2106.01023)\n- [ ] [Pruning](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flapp0%2Fdistily","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flapp0%2Fdistily","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flapp0%2Fdistily/lists"}