Distily: Language Model Distillation Toolkit and Library
- Host: GitHub
- URL: https://github.com/lapp0/distily
- Owner: lapp0
- License: agpl-3.0
- Created: 2024-08-03T22:30:33.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-09-25T11:07:25.000Z (about 1 year ago)
- Last Synced: 2025-04-05T15:45:57.739Z (6 months ago)
- Topics: bitnet, distillation, knowledge-distillation, language-model, transformer
- Language: Python
- Homepage:
- Size: 342 KB
- Stars: 7
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: readme.md
- License: LICENSE
README
# Distily
#### In one command, distill an existing LLM into a smaller or different architecture.
## Install
```
pip install -U "git+https://github.com/lapp0/distily.git"
```

## Features
Distily allows you to distill a model with:
- Quantized weights: e.g. TriLM, BitNet
- Distinct architecture: state-space models such as Mamba, Mixture-of-Experts (MoE)
- Modified architecture (see the config sketch below): decrease (or increase) the
  - number of layers
  - width and depth of the attention heads and dense layers
  - number of attention and KV heads
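Concretely, "modified architecture" means changing the standard `transformers` config fields of the student. A purely illustrative sketch using plain `transformers` (the attribute names are GPT-2 specific and are not Distily's CLI interface):

```
from transformers import AutoConfig

# Shrink gpt2: fewer layers, fewer heads, narrower hidden size.
# Other architectures use different names, e.g. num_hidden_layers.
student_config = AutoConfig.from_pretrained("gpt2")
student_config.n_layer = 6     # number of transformer layers
student_config.n_head = 6      # number of attention heads
student_config.n_embd = 384    # hidden width; must stay divisible by n_head
```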
## Usage

**Minimal Example: `distily_gpt2`**
Command to create a distilled `gpt2` with only 6 layers:
```
python3 -m distily.run \
--teacher_model_name_or_path gpt2 \
--output_dir distily_gpt2 \
--hub_model_id "distily/distily_gpt2" \
--push_to_hub True \
--student_model_config '{"n_layers": 6}' \
--student_model_as_bitnet True
```

The [Resulting `distily_gpt2` Model](https://huggingface.co/distily/distily_gpt2) has (TODO: explain metrics).
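Assuming the pushed student is saved as a standard `transformers` checkpoint, it can be loaded like any other causal LM:

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("distily/distily_gpt2")
tokenizer = AutoTokenizer.from_pretrained("distily/distily_gpt2")

inputs = tokenizer("Knowledge distillation is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```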
For more examples, review the [Examples](./docs/examples.md) documentation.
#### Note on Hub Credentials
To push to the Hugging Face Hub, first save your Hub write token:
```
export HF_WRITE=<your Hugging Face write token>
python3 -c "from huggingface_hub.hf_api import HfFolder; HfFolder.save_token('${HF_WRITE}')"
```
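Alternatively, `huggingface-cli login` (or `huggingface_hub.login()` from Python) stores the same token interactively.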
## Further Reading

TODO: commit the linked docs once complete
**Using Distily**
- How Distillation Works: [The Distily Recipe](./docs/recipe.md)
- [Quickstart / Examples](./docs/using.md)
- [Parameter Selection](./docs/params.md)

**Available Models**
- [Official Distily Models](./docs/official_models.md)
- [All HF Models Created With Distily](https://huggingface.co/models?library=Distily)

**Contributing**
- [Contributing Guidelines](./docs/contributing.md)

## Roadmap
#### Improved performance / sampling efficiency:
- [x] Standard knowledge distillation using logits (see the sketch after this list).
- [x] Distill using intermediate features including hidden states and attentions.
- [ ] Implement [Value-Transfer](https://arxiv.org/pdf/2002.10957) (apply the distillation loss only to the value (v) of q, k, v)
- [ ] Improve sampling efficiency through synthetic data generation.
- [ ] Implement cross-entropy classification loss (traditional LLM loss function)
- [ ] Apply projector to logits (https://arxiv.org/pdf/2310.17183)
- [ ] Apply "teacher recording", run teacher inference once, use features dataset any number of times.#### Distill to a different model shape / size:
#### Distill to a different model shape / size:
- [x] Distill to a model with fewer `num_hidden_layers` by implementing layer mappers.
- [x] Distill to a model with modified module dimensions and behaviors (e.g., `intermediate_size`, `hidden_act`) by employing projectors (see the sketch after this list).
- [x] Distill to a model with modified `num_attention_heads` and `num_key_value_heads`.
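In this style of distillation, a projector is usually just a learned linear map from the student's hidden size to the teacher's, trained with a feature-matching loss. A minimal sketch (assumed shapes and loss choice, not Distily's actual module):

```
import torch.nn as nn
import torch.nn.functional as F

class FeatureProjector(nn.Module):
    """Maps student hidden states into the teacher's hidden dimension."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden):
        return self.proj(student_hidden)

# Feature-matching loss between projected student states and detached teacher states:
# loss = F.mse_loss(projector(student_hidden), teacher_hidden.detach())
```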
#### Distill to a different architecture:
- [x] Distill to BitNet (b1.58)
- [ ] Distill to State-Space / Mamba
- [ ] Distill to MoE
- [ ] Distill to a parameter-sharing (ALBERT-style) model

#### Additional Techniques:
- [ ] [Distill from multiple models at once](https://arxiv.org/pdf/2106.01023)
- [ ] [Pruning](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/)