{"id":13754108,"url":"https://github.com/vahe1994/AQLM","last_synced_at":"2025-05-09T22:30:53.202Z","repository":{"id":216786374,"uuid":"742315659","full_name":"Vahe1994/AQLM","owner":"Vahe1994","description":"Official Pytorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.pdf and PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression https://arxiv.org/abs/2405.14852","archived":false,"fork":false,"pushed_at":"2025-04-16T14:30:10.000Z","size":40015,"stargazers_count":1252,"open_issues_count":8,"forks_count":181,"subscribers_count":17,"default_branch":"main","last_synced_at":"2025-04-30T06:35:15.617Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Vahe1994.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-01-12T07:51:15.000Z","updated_at":"2025-04-28T18:39:16.000Z","dependencies_parsed_at":"2024-12-18T14:11:56.718Z","dependency_job_id":"a900bbb3-30ad-42a9-8343-e0e70fd98b0e","html_url":"https://github.com/Vahe1994/AQLM","commit_stats":{"total_commits":307,"total_committers":14,"mean_commits":"21.928571428571427","dds":0.6905537459283387,"last_synced_commit":"a441a3f0ece4cbaa2a91a3421c95a8b7432e4d99"},"previous_names":["vahe1994/aqlm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vahe1994%2FAQLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/Vahe1994%2FAQLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vahe1994%2FAQLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Vahe1994%2FAQLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Vahe1994","download_url":"https://codeload.github.com/Vahe1994/AQLM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335332,"owners_count":21892655,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:40.570Z","updated_at":"2025-05-09T22:30:53.189Z","avatar_url":"https://github.com/Vahe1994.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# AQLM\n\nOfficial PyTorch implementation for [Extreme Compression of Large Language Models via Additive Quantization](https://arxiv.org/pdf/2401.06118.pdf)\n\n**[2025.04]** Released aqlm v1.1.7. Added support for arbitrary 8-dimensional codebooks on GPU, improved accuracy for 1-bit models, e.g. [ISTA-DASLab/Llama-2-7b-AQLM-1Bit-1x8-hf](https://huggingface.co/ISTA-DASLab/Llama-2-7b-AQLM-1Bit-1x8-hf) at ~1 bit achieves WikiText 2 PPL 7.85. 
To quantize your own models this way, use `num_codebooks=1, nbits_per_codebook=8` as per the tutorial below.\n\n**[2024.11]** [PV-tuning](https://proceedings.neurips.cc/paper_files/paper/2024/hash/091166620a04a289c555f411d8899049-Abstract-Conference.html) was accepted to [NeurIPS'2024](https://neurips.cc/Conferences/2024) for oral presentation!\n\n**[2024.05]** AQLM was accepted to [ICML'2024](https://icml.cc/Conferences/2024)! If you're attending, meet us around [this poster](https://icml.cc/virtual/2024/poster/34964).\n\n**[2024.06]** We released a new paper that extends AQLM with a new finetuning algorithm called [PV-tuning](https://arxiv.org/abs/2405.14852).\nWe're also releasing PV-tuned AQLM models [**in this collection**](https://huggingface.co/collections/ISTA-DASLab/aqlmpv-66564dff5d84f00a893ba93f).\n\n**[2024.08]** We have [merged](https://github.com/Vahe1994/AQLM/commit/a441a3f0ece4cbaa2a91a3421c95a8b7432e4d99) the PV-Tuning branch into the main branch.\nTo reproduce results with the old finetuning (before Aug 21), use commit [559a366](https://github.com/Vahe1994/AQLM/commit/559a36681398d7189297fccf3b1e59e8e030e942).\n\n## Inference\n\n### Demo\n\nLearn how to run the prequantized models using these Google Colab examples:\n\n| Basic AQLM \u003cbr\u003e generation | Streaming with \u003cbr\u003e GPU/CPU | Inference with CUDA \u003cbr\u003e graphs (3x speedup) | Fine-tuning \u003cbr\u003e with PEFT | Serving with \u003cbr\u003e `vLLM` |\n|:-----------:|:-------:|:---------------:|:----------:|:--------:|\n| \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/colab_example.ipynb\"\u003e\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"AQLM In Colab\"/\u003e\u003c/a\u003e         | \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/streaming_example.ipynb\"\u003e\u003cimg 
src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"AQLM In Colab\"/\u003e\u003c/a\u003e | \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_cuda_graph.ipynb\"\u003e\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/\u003e\u003c/a\u003e | \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_2bit_training.ipynb\"\u003e\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/\u003e\u003c/a\u003e  | \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_vllm.ipynb\"\u003e\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/\u003e\u003c/a\u003e |\n\n\n### Models\n\nThis repository is currently designed to work with models of `LLaMA`, `Mistral` and `Mixtral` families.\nThe models reported below use **full model fine-tuning** as described in appendix A, with cross-entropy objective with teacher logits.\n\nWe provide a number of prequantized AQLM models without PV-Tuning (scroll down for PV-Tuned models):\n\n| Model      | AQLM scheme | WikiText-2 PPL | MMLU (5-shot) FP16→AQLM | Model size, Gb | Hub link                                                                 |\n|------------|-------------|----------------|---------------|----------------|--------------------------------------------------------------------------|\n| Llama-3-8b | 1x16        | -          | 0.65→0.56 | 4.1            | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-8B-AQLM-2Bit-1x16) |\n| Llama-3-8b-Instruct | 1x16        | -          | 0.66→0.59 | 4.1            | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16) |\n| Llama-3-70b | 1x16        | -          | 0.79→0.75 | 21.9            | 
[Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16) |\n| Llama-3-70b-Instruct | 1x16        | -          | 0.80→0.76 | 21.9            | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16) |\n| Command-R | 1x16      | -           | 0.68→0.57 | 12.7            | [Link](https://huggingface.co/ISTA-DASLab/c4ai-command-r-v01-AQLM-2Bit-1x16)|\n| Command-R+ | 1x16      | -           | 0.74→0.68 | 31.9            | [Link](https://huggingface.co/ISTA-DASLab/c4ai-command-r-plus-AQLM-2Bit-1x16)|\n| Mistral-7b| 1x16       | 5.40           | - | 2.5            | [Link](https://huggingface.co/ISTA-DASLab/Mistral-7B-v0.1-AQLM-2Bit-1x16-hf)|\n| Mistral-7B-Instruct-v0.2 | 2x8       | -           | 0.59→0.44 | 2.5            | [Link](https://huggingface.co/ISTA-DASLab/Mistral-7B-Instruct-v0.2-AQLM-2Bit-2x8)|\n| Mixtral-8x7b| 1x16       | 3.35           | -| 12.6            | [Link](https://huggingface.co/ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf)|\n| Mixtral-8x7b-Instruct| 1x16       | -           | -| 12.6            | [Link](https://huggingface.co/ISTA-DASLab/Mixtral-8x7B-Instruct-v0_1-AQLM-2Bit-1x16-hf)|\n| Llama-2-7b | 1x16        | 5.92          | 0.46→0.39 | 2.4            | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf) |\n| Llama-2-7b | 2x8         | 6.69          | - | 2.2            | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-7b-AQLM-2Bit-2x8-hf)  |\n| Llama-2-7b | 8x8         | 6.61          | - | 2.2            | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-7b-AQLM-2Bit-8x8-hf)  |\n| Llama-2-13b| 1x16        | 5.22           | 0.55→0.49 | 4.1            | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-13b-AQLM-2Bit-1x16-hf)|\n| Llama-2-13b| 2x8        |  5.63          | - | 3.8            | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-13b-AQLM-2Bit-2x8-hf)|\n| Llama-2-70b| 1x16        | 3.83           | 0.69→0.65 | 18.8           | 
[Link](https://huggingface.co/ISTA-DASLab/Llama-2-70b-AQLM-2Bit-1x16-hf)|\n| Llama-2-70b| 2x8         | 4.21           | - | 18.2           | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-70b-AQLM-2Bit-2x8-hf) |\n| gemma-2b | 1x16      | -           | - | 1.7            | [Link](https://huggingface.co/ISTA-DASLab/gemma-2b-AQLM-2Bit-1x16-hf)|\n| gemma-2b | 2x8      | -           | - | 1.6            | [Link](https://huggingface.co/ISTA-DASLab/gemma-2b-AQLM-2Bit-2x8-hf)|\n\nYou can also download AQLM models tuned via PV-tuning:\n\n| Model      | AQLM scheme | WikiText-2 PPL | Model size, Gb | Hub link                                                                 |\n|------------|-------------|----------------|----------------|--------------------------------------------------------------------------|\n| Llama-2-7b | 1x16g8        | 5.68          | 2.4            | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-7b-AQLM-PV-2Bit-1x16-hf) |\n| Llama-2-7b | 2x8g8         | 5.90          | 2.2            | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-7b-AQLM-PV-2Bit-2x8-hf)  |\n| Llama-2-7b | 1x16g16     | 9.21          | 1.7            | [Link](https://huggingface.co/justheuristic/Llama-2-7b-AQLM-PV-1Bit-1x16-hf)  |\n| Llama-2-7b | 1x8g8 (**New!**)     | 7.85          | 1.34            | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-7b-AQLM-1Bit-1x8-hf)  |\n| Llama-2-13b| 1x16g8        | 5.05           | 4.1            | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-13b-AQLM-PV-2Bit-1x16-hf)|\n| Llama-2-70b| 1x16g8        | 3.78           | 18.8           | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-70b-AQLM-PV-2Bit-1x16-hf)|\n| Meta-Llama-3-8B | 1x16g8        | 6.99          | 4.1            | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-8B-AQLM-PV-2Bit-1x16) |\n| Meta-Llama-3-8B  | 1x16g16        | 9.43          | 3.9            | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-8B-AQLM-PV-1Bit-1x16) |\n| Meta-Llama-3-70B | 
1x16g8        | 4.57           | 21.9           | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-2Bit-1x16)|\n| Meta-Llama-3-70B | 1x16g16        | 8.67           | 13           | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-1Bit-1x16)|\n| Mistral-7B-v0.1 | 1x16g8  | 5.22 | 2.51 | [Link](https://huggingface.co/ISTA-DASLab/Mistral-7B-v0.1-AQLM-PV-2Bit-1x16-hf) |\n| Phi-3-mini-4k-instruct | 1x16g8 | 6.63 | 1.4 | [Link](https://huggingface.co/ISTA-DASLab/Phi-3-mini-4k-instruct-AQLM-PV-2Bit-1x16-hf) |\n\n\n\nNote that models with \"g16\" in their scheme require the aqlm inference library v1.1.6 or newer: \n```bash\npip install \"aqlm[gpu,cpu]\u003e=1.1.6\"\n```\n\nThe perplexity above is evaluated at **4k** context length for Llama 2 models and **8k** for Mistral/Mixtral and Llama 3. \nPlease also note that token-level perplexity can only be compared within the same model family and should not be compared between models that use different vocabularies.\nAlthough Mistral has a lower perplexity than Llama 3 8B, this does not mean that Mistral is better: Llama 3's perplexity is computed over a much larger vocabulary, which results in a higher per-token perplexity.\n\nFor more evaluation results and detailed explanations, please see our papers: [Egiazarian et al. (2024)](https://arxiv.org/abs/2401.06118) for pure AQLM and [Malinovskii et al. (2024)](https://arxiv.org/abs/2405.14852) for PV-Tuned models.\n\n### Inference kernels\n\nAQLM quantization setups vary mainly in the number of codebooks used and the codebook size in bits. 
The most popular setups, and the inference kernels that support them, are:\n \n| Kernel | Number of codebooks | Codebook size, bits | Scheme Notation | Accuracy | Speedup     | Fast GPU inference | Fast CPU inference |\n|---|---------------------|---------------------|----------|-------------|-------------|--------------------|--------------------|\n| Triton | K                   | N                  | KxN     | -        | Up to ~0.7x | ✅                  | ❌                  |\n| CUDA | 1                   | 16                  | 1x16     | Best        | Up to ~1.3x | ✅                  | ❌                  |\n| CUDA | 2                   | 8                   | 2x8      | OK          | Up to ~3.0x | ✅                  | ❌                  |\n| Numba | K                   | 8                   | Kx8      | Good        | Up to ~4.0x | ❌                  | ✅                  |\n\n### Installation\n\n\n\nTo run the models, install the inference library, specifying `gpu`, `cpu`, or both depending on your inference setting:\n```bash\npip install aqlm[gpu,cpu]\n```\n\n\nThen, one can use the familiar `.from_pretrained` method provided by the [transformers](https://github.com/huggingface/transformers) library:\n```python\nfrom transformers import AutoModelForCausalLM\n\nquantized_model = AutoModelForCausalLM.from_pretrained(\n    \"ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf\",\n    trust_remote_code=True, torch_dtype=\"auto\"\n).cuda()\n```\nNotice that `torch_dtype` should be set to either `torch.float16` or `\"auto\"` on GPU and `torch.float32` on CPU. After that, the model can be used exactly as one would use an unquantized model. \n\n\n\n## Quantization\n\n### Dependencies\n\nInstall packages from `requirements.txt`:\n```bash\npip install -r requirements.txt\n```\n\n### Loading / caching datasets and tokenizer\n\nThe script downloads and locally caches the relevant tokenizer and datasets. 
\nThey will be saved in the default Hugging Face Datasets directory unless an alternative location is provided via environment variables.\nSee the [relevant Datasets documentation section](https://huggingface.co/docs/datasets/main/en/cache#cache-directory).\n\n### Data\n\nWhen quantizing models with AQLM, we recommend that you use a subset of the original data the model was trained on.\n\nFor Llama-2 models, the closest available dataset is [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample). To load a subset of RedPajama, provide \"pajama\" in the --dataset argument.\nThis will process nsamples sequences and tokenize them using the provided model's tokenizer.\n\nAdditionally, we provide tokenized RedPajama for Llama and Solar/Mistral models at a context length of 4096, stored on [Hugging Face](https://huggingface.co/datasets/Vahe1994/AQLM).\nTo load it, use:\n\n```python\nfrom huggingface_hub import hf_hub_download\n\nhf_hub_download(repo_id=\"Vahe1994/AQLM\", filename=\"data/name.pth\", repo_type=\"dataset\")\n```\n\nTo use the data downloaded from HF, place it in the data folder (optional) and set the correct path to it in the \"--dataset\" argument of main.py.\n\n**Warning:** These subsets are already processed with the corresponding model tokenizer. If you want to quantize another model (e.g. mistral/mixtral), please re-tokenize the data with the provided script in src/datautils.\n\n### WandB logging\n\nOne can optionally log the data to the `Weights and Biases` service (wandb).\nRun `pip install wandb` for W\u0026B logging.\nSpecify the `$WANDB_ENTITY`, `$WANDB_PROJECT`, `$WANDB_NAME` environment variables prior to running experiments. Use the `--wandb` argument to enable logging.\n\n### GPU and RAM requirements\nThis code was developed and tested using several A100 GPUs with 80GB of GPU RAM. 
\nYou can use the `--offload_activations` option to reduce VRAM usage.\nFor `Language Model Evaluation Harness` evaluation, one needs enough memory to load the whole model plus activation tensors\non one or several devices.\n\n### Quantization time\n\nAQLM quantization takes considerably longer to calibrate than simpler quantization methods such as GPTQ. This only impacts quantization time, not inference time.\n\nFor instance, quantizing a 7B model with the default configuration takes about 1 day on a single A100 GPU. Similarly, quantizing a 70B model on a single GPU would take 10-14 days. If you have multiple GPUs with a fast interconnect, you can run AQLM on multiple GPUs to speed up quantization: simply set CUDA_VISIBLE_DEVICES to multiple GPUs. Quantizing a 7B model on two GPUs reduces quantization time to ~14.5 hours. Similarly, quantizing a 70B model on 8 x A100 GPUs takes 3 days 18 hours.\n\nIf you need to speed up quantization without adding more GPUs, you may also increase `--relative_mse_tolerance`, set `--init_max_points_per_centroid`, or limit `--finetune_max_epochs`. \nHowever, that usually comes at the cost of reduced model accuracy.\n\n### Model downloading\nThe code requires the LLaMA model to be downloaded in Hugging Face format and saved locally. The scripts below assume that the `$TRANSFORMERS_CACHE` variable points to the Hugging Face Transformers cache folder.\nTo download and cache the models, run this in the same environment:\n\n```python\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nmodel_name = \"meta-llama/Llama-2-7b-hf\"  # or whatever else you wish to download\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=\"auto\")\n```\n\n\n### How to quantize a model with AQLM\nThe `main.py` script compresses the model and then tests its performance in terms of perplexity using the WikiText2, C4, and Penn Treebank datasets. 
\n\nThe command to launch the script should look like this: \n\n```bash\nexport CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3\nexport MODEL_PATH=\u003cPATH_TO_MODEL_ON_HUB\u003e\nexport DATASET_PATH=\u003cINSERT DATASET NAME OR PATH TO CUSTOM DATA\u003e\nexport SAVE_PATH=/path/to/save/quantized/model/\nexport WANDB_PROJECT=MY_AQ_EXPS\nexport WANDB_NAME=COOL_EXP_NAME\n\npython main.py $MODEL_PATH $DATASET_PATH \\\n --nsamples=1024 \\\n --val_size=128 \\\n --num_codebooks=1 \\\n --nbits_per_codebook=16 \\\n --in_group_size=8 \\\n --relative_mse_tolerance=0.01 \\\n --finetune_batch_size=32 \\\n --finetune_max_epochs=10 \\\n --finetune_early_stop=3 \\\n --finetune_keep_best \\\n --local_batch_size=1 \\\n --offload_activations \\\n --wandb \\\n --resume \\\n --save $SAVE_PATH\n```\n\nMain CLI arguments:\n- `CUDA_VISIBLE_DEVICES` - by default, the code will use all available GPUs. If you want to use specific GPUs (or one GPU), use this variable.\n- `MODEL_PATH` - either a model name on the Hugging Face hub (e.g. meta-llama/Llama-2-7b-hf) or a path to a local folder with a transformers model and a tokenizer.\n- `DATASET_PATH` - either a path to calibration data (see above) or a standard dataset `[c4, ptb, wikitext2]`\n   - for llama-2 models, you can use `DATASET_PATH=./data/red_pajama_n=1024_4096_context_length.pth` for a slice of RedPajama (up to 1024 samples)\n- `--nsamples` - the number of calibration data _sequences_ (train + validation). If this parameter is not set, all available calibration data is used.\n- `--val_size` - the number of validation sequences for early stopping on block finetuning. Defaults to 0. 
Must be smaller than `--nsamples`.\n- `--num_codebooks` - number of codebooks per layer\n- `--nbits_per_codebook` - each codebook will contain 2 ** nbits_per_codebook vectors\n- `--in_group_size` - how many weights are quantized together (aka \"g\" in the arXiv paper)\n- `--finetune_batch_size` - (for fine-tuning only) the total number of sequences used for each optimization step\n- `--local_batch_size` - when accumulating finetune_batch_size, process this many samples per GPU per forward pass (affects GPU RAM usage)\n- `--relative_mse_tolerance` - (for initial calibration) stop training when (current_epoch_mse / previous_epoch_mse) \u003e (1 - relative_mse_tolerance)\n- `--finetune_max_epochs` - maximum number of passes through calibration data on block tuning.\n- `--finetune_early_stop` - maximum number of passes through calibration data without improvement on validation.\n- `--offload_activations` - during calibration, move activations from GPU memory to RAM. This reduces VRAM usage while slowing calibration by ~10% (depending on your hardware). \n- `--save` - path to save/load quantized model. (see also: `--load`)\n- `--wandb` - if this parameter is set, the code will log results to wandb\n- `--attn_implementation` - specify attention (for transformers \u003e= `4.38`). SDPA attention sometimes causes issues, so the `eager` implementation is recommended.\n\nThere are additional hyperparameters available. 
Run `python main.py --help` for more details on command line arguments, including compression parameters.\n\n\n### Preparing fine-tuning dataset\n\nThe following script pre-tokenizes a subset of RedPajama data for future fine-tuning.\n\n```sh\nTARGET_MODEL=meta-llama/Llama-2-7b-hf  # used for tokenization\nSEQLEN=4096\nDATASET=togethercomputer/RedPajama-Data-1T-Sample\nOUTPUT_PATH=./redpajama_tokenized_llama2\n\nCUDA_VISIBLE_DEVICES=0 HF_HOME=/mnt/LLM OMP_NUM_THREADS=16 torchrun --master-port 3456 --nproc-per-node=1 finetune.py --base_model $TARGET_MODEL --quantized_model ./doesnt_matter --dtype bfloat16 --block_type LlamaDecoderLayer --dataset_name=$DATASET --split train --dataset_config_name plain_text --cache_dir=./cache_dir --trust_remote_code --model_seqlen=$SEQLEN --preprocessing_num_workers=64 --preprocessing_chunk_length 100000 --save_dataset_and_exit $OUTPUT_PATH\n\ntar -cvf tokenized_data_llama2.tar $OUTPUT_PATH   # optionally pack for distribution\n```\n\nThe tokenized dataset is specific to the model family (or more specifically, its tokenizer). For instance, data tokenized for Llama-3 8B is compatible with Llama-3 70B, but not with Llama-2, which uses a different tokenizer.\nTo tokenize the data for another model, you need to set 1) `--base_model`, 2) `--model_seqlen` and 3) the path passed to `--save_dataset_and_exit`.\n\nYou can also set `--preprocessing_num_workers` to something hardware-appropriate. Note that setting `--download_num_workers` \u003e 1 may cause download errors, possibly due to rate limits. These and other parameters are explained in the script's `--help`.\nThe job requires 150-200 GiB of disk space to store the dataset sample and preprocessing cache. 
Both are stored in ./cache_dir and can be deleted afterwards.\n\n### Finetuning\n\n**Note** to reproduce results with old finetuning (before Aug 21), use commit [559a366](https://github.com/Vahe1994/AQLM/commit/559a36681398d7189297fccf3b1e59e8e030e942).\nOld version of finetuning produced worse results than new one even without PV-tuning, but was faster.\n\nThe accuracy of the quantized model can be further improved via finetuning.\n\nTo use our new PV-Tuning algorithm, the command to launch the script should look like this: \n\n```bash\ntorchrun --nproc-per-node=$NUM_GPUS finetune.py \\\n    --base_model $MODEL_PATH \\\n    --quantized_model $QUANTIZED_WEIGHTS_PATH \\\n    --model_seqlen=$SEQLEN \\\n    --block_type LlamaDecoderLayer \\\n    --load_dtype bfloat16 \\\n    --amp_dtype bfloat16 \\\n    --code_dtype uint16 \\\n    --dataset_name=$TOKENIZED_DATASET_PATH \\\n    --split none \\\n    --seed 42 \\\n    --preprocessing_chunk_length 100000 \\\n    --cache_dir=$CACHE_DIR \\\n    --trust_remote_code \\\n    --update_codes \\\n    --update_codebooks_and_scales \\\n    --update_non_quantized_parameters \\\n    --lamb \\\n    --debias \\\n    --lr 3e-4 \\\n    --adam_beta1 0.90 \\\n    --adam_beta2 0.95 \\\n    --max_code_change_per_step 1e-2 \\\n    --code_lr 1e-2 \\\n    --code_beta1 0.0 \\\n    --code_beta2 0.95 \\\n    --beam_size 5 \\\n    --delta_decay 0 \\\n    --batch_size=128 \\\n    --microbatch_size=1 \\\n    --max_epochs 1 \\\n    --gradient_checkpointing \\\n    --print_every_steps=1 \\\n    --verbose_optimizer \\\n    --wandb \\\n    --eval_every_steps=10 \\\n    --keep_best_model \\\n    --save $SAVE_PATH \\\n    --save_every_steps 100 \\\n    --attn_implementation flash_attention_2\n```\n\n### Zero-shot benchmarks via LM Evaluation Harness\n\nTo perform zero-shot evaluation, we adopt [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) framework. 
Our code works with models in the standard `transformers` format and may (optionally) load\nthe weights of a quantized model via the `--aqlm_checkpoint_path` argument.\n\nThe evaluation results in PV-Tuning were produced with `lm-eval==0.4.0`. \n\nTo run the evaluation, make sure that the proper version is installed, or install it via:\n`pip install lm-eval==0.4.0`. \n\nThe main script for launching the evaluation procedure is `lmeval.py`.\n\n```bash\nexport CUDA_VISIBLE_DEVICES=0,1,2,3  # optional: select GPUs\nexport QUANTIZED_MODEL=\u003cPATH_TO_SAVED_QUANTIZED_MODEL_FROM_MAIN.py\u003e\nexport MODEL_PATH=\u003cINSERT_PATH_TO_ORIGINAL_MODEL_ON_HUB\u003e\nexport DATASET=\u003cINSERT DATASET NAME OR PATH TO CUSTOM DATA\u003e\nexport WANDB_PROJECT=MY_AQLM_EVAL\nexport WANDB_NAME=COOL_EVAL_NAME\n\n# for 0-shot evals\npython lmeval.py \\\n    --model hf \\\n    --model_args pretrained=$MODEL_PATH,dtype=float16,parallelize=True \\\n    --tasks winogrande,piqa,hellaswag,arc_easy,arc_challenge \\\n    --batch_size \u003cEVAL_BATCH_SIZE\u003e \\\n    --aqlm_checkpoint_path $QUANTIZED_MODEL # if evaluating quantized model\n\n# for 5-shot MMLU\npython lmeval.py \\\n    --model hf \\\n    --model_args pretrained=$MODEL_PATH,dtype=float16,parallelize=True \\\n    --tasks mmlu \\\n    --batch_size \u003cEVAL_BATCH_SIZE\u003e \\\n    --num_fewshot 5 \\\n    --aqlm_checkpoint_path $QUANTIZED_MODEL # if evaluating quantized model\n```\n\n### Preparing models for inference\n\nTo convert a model into a _Hugging Face_ compatible format, use `convert_to_hf.py model in_path out_path` with the corresponding arguments:\n - `model` - the original pretrained model (corresponds to `MODEL_PATH` of `main.py`, e.g. 
`meta-llama/Llama-2-7b-hf`).\n - `in_path` - the folder containing an initially quantized model (corresponds to `--save` of `main.py`).\n - `out_path` - the folder to save `transformers` model to.\n\nYou may also specify flags such as `--save_safetensors` to control the saved model format (see `--help` for details).\n\nExample command: `python convert_to_hf.py meta-llama/Llama-2-7b-hf ./path/to/saved/quantization ./converted-llama2-7b-hf  --save_safetensors`\n\n# Instructions for QuIP# finetuning\nInstructions for QuIP# finetuning can be found [here](https://github.com/Vahe1994/AQLM/blob/quip-sharp-patch/QUIP_SHARP_INSTRUCTIONS.md).\n\n## Contributing\n\nIf you want to contribute something substantial (more than a typo), please open an issue first.\nWe use black and isort for all pull requests. Before committing your code run `black . \u0026\u0026 isort .`\n\n## Cite\n\nIf you found this work useful, please consider citing:\n\n```\n@misc{egiazarian2024extreme,\n      title={Extreme Compression of Large Language Models via Additive Quantization}, \n      author={Vage Egiazarian and Andrei Panferov and Denis Kuznedelev and Elias Frantar and Artem Babenko and Dan Alistarh},\n      year={2024},\n      eprint={2401.06118},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG}\n}\n@misc{malinovskii2024pvtuning,\n      title={PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression}, \n      author={Vladimir Malinovskii and Denis Mazur and Ivan Ilin and Denis Kuznedelev and Konstantin Burlachenko and Kai Yi and Dan Alistarh and Peter Richtarik},\n      year={2024},\n      eprint={2405.14852},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvahe1994%2FAQLM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvahe1994%2FAQLM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvahe1994%2FAQLM/lists"}