{"id":19631047,"url":"https://github.com/fasterdecoding/bitdelta","last_synced_at":"2025-04-04T20:09:04.786Z","repository":{"id":222762105,"uuid":"757253898","full_name":"FasterDecoding/BitDelta","owner":"FasterDecoding","description":null,"archived":false,"fork":false,"pushed_at":"2024-12-05T04:09:06.000Z","size":7559,"stargazers_count":195,"open_issues_count":4,"forks_count":15,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-28T19:07:41.021Z","etag":null,"topics":["llm","quantization","serving"],"latest_commit_sha":null,"homepage":"https://fasterdecoding.github.io/BitDelta/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FasterDecoding.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-14T05:29:56.000Z","updated_at":"2025-03-28T02:19:18.000Z","dependencies_parsed_at":"2025-03-28T19:05:46.584Z","dependency_job_id":"93fa240a-abe1-4603-9473-2660691e2aa6","html_url":"https://github.com/FasterDecoding/BitDelta","commit_stats":null,"previous_names":["fasterdecoding/bitdelta"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FasterDecoding%2FBitDelta","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FasterDecoding%2FBitDelta/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FasterDecoding%2FBitDelta/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FasterDecoding%2FBitDelta/manifests","owner_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FasterDecoding","download_url":"https://codeload.github.com/FasterDecoding/BitDelta/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247242678,"owners_count":20907134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","quantization","serving"],"created_at":"2024-11-11T12:07:39.789Z","updated_at":"2025-04-04T20:09:04.767Z","avatar_url":"https://github.com/FasterDecoding.png","language":"Jupyter Notebook","readme":"# BitDelta: Your Fine-Tune May Only Be Worth One Bit\n\n[[Paper](https://arxiv.org/abs/2402.10193)][[Blog](https://fasterdecoding.github.io/BitDelta/)]\n\nBitDelta compresses the weight delta between a fine-tuned LLM and its base model to 1 bit, enabling accurate and efficient multi-tenant serving.\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"figures/BitDelta.png\" width=\"700\" height=\"auto\"/\u003e\n      \u003ca href=\"https://github.com/FasterDecoding/BitDelta/assets/51351043/b7840fab-0d75-4829-8993-1e5d586698a0\"\u003e\n  \u003c/a\u003e\n\u003c/div\u003e\n\nThe current release supports:\n\n- Llama-2- and Mistral-based models.\n- Memory-efficient 16-bit + 1-bit Δ Linear in PyTorch\n- Triton kernel for fast inference (TODO: Update repo with faster [BitBLAS](https://github.com/microsoft/BitBLAS) W1A16 kernel)\n- Gradio demo showcasing batched inference over 6 Mistral-7B-based models, using only **30 GB** of GPU memory!\n\n## News\n\n- [10/2024] 🔥 BitDelta is accepted to NeurIPS 2024!\n- [02/2024] 🔥 arXiv release!\n\n\n## Abstract\n\nLarge 
Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it's intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This interesting finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x, which also translates to improved generation latency in multi-tenant settings. We validate BitDelta through experiments across the Llama-2 and Mistral model families, on models up to 70B parameters, showcasing minimal performance degradation across all tested settings.\n\n## Contents\n\n- [Install](#install)\n- [Demo](#demo)\n- [Usage](#usage)\n- [Citation](#citation)\n\n## Install\n\n1. Clone the repo and navigate to BitDelta:\n\n```bash\ngit clone https://github.com/FasterDecoding/BitDelta\ncd BitDelta\n```\n\n2. Set up the environment:\n\n```bash\nconda create -yn bitdelta python=3.9\nconda activate bitdelta\n\npip install -e .\n```\n\n## Demo\n\nSee [`demo/README.md`](https://github.com/FasterDecoding/BitDelta/blob/main/demo/README.md) for instructions on how to set up the demo.\n\n[BitDelta Demo.webm](https://github.com/FasterDecoding/BitDelta/assets/51351043/b56747df-1108-42f2-ae6f-05e1c460080c)\n\n## Usage\n\nWe provide scripts in `./scripts` so you can compress your own models! 
As an example, we will compress `lmsys/vicuna-7b-v1.5` with base model `meta-llama/Llama-2-7b-hf`.\n\n### Compress Model\n\nCompress the weight delta and perform scale distillation:\n\n```bash\nCUDA_VISIBLE_DEVICES=0,1 python \\\n    bitdelta/train.py \\\n    --base_model meta-llama/Llama-2-7b-hf \\\n    --finetuned_model lmsys/vicuna-7b-v1.5 \\\n    --save_dir $MODEL_SAVE_DIR \\\n    --batch_size 4 \\\n    --num_steps 200 \\\n    --save_full_model True\n```\n\nwhere `$MODEL_SAVE_DIR` is a directory of your choice.\n\nIf `--save_full_model` is specified, the compressed model will also be saved in HuggingFace format at `$MODEL_SAVE_DIR/calibrated_model`. Otherwise, only the delta will be saved.\n\n### Perplexity Check\n\nDouble-check the perplexity of the compressed model:\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python \\\n    bitdelta/eval_ppl.py \\\n    --base_model meta-llama/Llama-2-7b-hf \\\n    --dataset_name wikitext \\\n    --subset wikitext-2-raw-v1 \\\n    --save_dir $PPL_SAVE_DIR \\\n    --num_eval_samples 100 \\\n    --model_diff $MODEL_SAVE_DIR/diff.pt\n```\n\n### Replicate Results\n\nTo replicate our other results, please use `--save_full_model` to run the model in Llama format for compatibility with eval harnesses.\n\n## Citation\n\nIf you find BitDelta useful, please consider citing:\n\n```bibtex\n@misc{liu2024bitdelta,\n      title={BitDelta: Your Fine-Tune May Only Be Worth One Bit},\n      author={James Liu and Guangxuan Xiao and Kai Li and Jason D. 
Lee and Song Han and Tri Dao and Tianle Cai},\n      year={2024},\n      eprint={2402.10193},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG}\n}\n```\n\n[# Compressing Model Diffs for High-Throughput Multi-Model Serving]: #\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffasterdecoding%2Fbitdelta","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffasterdecoding%2Fbitdelta","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffasterdecoding%2Fbitdelta/lists"}