{"id":20298149,"url":"https://github.com/Qcompiler/MIXQ","last_synced_at":"2025-05-07T20:34:31.561Z","repository":{"id":258908174,"uuid":"824211967","full_name":"Qcompiler/MIXQ","owner":"Qcompiler","description":" MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction","archived":false,"fork":false,"pushed_at":"2024-10-23T04:11:51.000Z","size":41739,"stargazers_count":64,"open_issues_count":0,"forks_count":12,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-10-24T12:41:33.463Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Qcompiler.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-04T15:39:42.000Z","updated_at":"2024-10-24T12:12:32.000Z","dependencies_parsed_at":"2024-10-22T12:32:32.532Z","dependency_job_id":null,"html_url":"https://github.com/Qcompiler/MIXQ","commit_stats":null,"previous_names":["qcompiler/mixq"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qcompiler%2FMIXQ","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qcompiler%2FMIXQ/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qcompiler%2FMIXQ/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Qcompiler%2FMIXQ/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Qcompiler","download_url":"https://codeload.github.com/Qcompiler/MIXQ/tar.gz/r
efs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252953717,"owners_count":21830890,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T16:02:17.583Z","updated_at":"2025-05-07T20:34:31.516Z","avatar_url":"https://github.com/Qcompiler.png","language":"HTML","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# MixQ\n\n\n\nMixQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction\n\nWe use mixed-precision GEMM to improve throughput.\n\nPlease refer to https://github.com/Qcompiler/vllm-mixed-precision for end-to-end text generation.\n\n## Comparison with AWQ\n\nSuppose the task is to compute the perplexity (PPL) of Wikitext2. 
\nThe wikitext dataset contains 333088 validation data.\n\nFor ```batch size  = 32```, the task is divided into 10409 parts.\n\nAWQ finished the task in 10 minutes with 16.71 it/s.\n\n\u003cimg src=\"figures/awq32.gif\"\u003e\n\nMixQ (W8A8O16) finished the task in 4.50 minutes with 35.02 it/s.\n\n\u003cimg src=\"figures/mixq32.gif\"\u003e\n\nFor ```batch size  = 512```, the task is divided into 655 parts.\n\nAWQ finished the task in 127 seconds with 5.2 it/s.\n\n\u003cimg src=\"figures/awq512.gif\"\u003e\n\nMixQ (W8A8O16) finished the task in 30 seconds with 21.34 it/s.\n\n\u003cimg src=\"figures/mixq512.gif\"\u003e\n\n\n## Setup\n\nPlease download the mixlib kernel from https://github.com/Qcompiler/QComplier:\n\n```\ngit clone git@github.com:Qcompiler/QComplier.git\ncd EETQ \npython setup.py install\n```\n```\ncd quantkernel\npython setup.py install \n```\n\n## Benchmarking the throughput\n\n\n\nIt is easy to quantize an LLM and run it with the MIXQ 4-bit or 8-bit kernel.\n\nRun the following command to quantize the LLM with the W8A8O16 kernel:\n\n```\npython examples/basic_quant_mix.py --model_path /mnt/data/checkpoint/Llama-2-7b --quant_file /home/dataset/quant/quant8/Llama-2-7b --w_bit 8\n```\n\nBenchmark the throughput of MIXQ by:\n\n```\npython benchflops.py --model_type mix --model_path /home/dataset/quant/quant8/Llama-2-7b --quant_file /home/dataset/quant/quant8/Llama-2-7b --batch_size 512 --bit 8 \n```\n\nOn an NVIDIA A100-PCIE-40GB, the output is:\n\n```\nVersion: mix 8bit \n|   Batch Size |   Decode Length |   Decode tokens/s | Memory (VRAM)    |\n|-------------:|----------------:|------------------:|:-----------------|\n|          512 |            1024 |           10609.8 | 7.86 GB (19.97%) |\n```\n\n\n\n# News !!\n\nWe have integrated the MixedQLinear designed by QUIK into our framework! 
QUIK now supports a wide range of LLMs, including:\n\n\n- Llama-2 7B/13B/70B\n- Llama-3 8B\n- Falcon 7B/40B\n- ChatGLM 7B\n- QWen2 7B\n\n\n## How to Run\n\nIt is easy to quantize an LLM and run it with the QUIK 4-bit kernel.\n\nRun the following command to quantize the LLM:\n\n```\npython examples/basic_quant_quik.py --model_path /mnt/data/checkpoint/Llama-2-7b --quant_file /home/dataset/quant/quantquik4/Llama-2-7b --w_bit 4\n```\n\nBenchmark the throughput of QUIK by:\n\n```\npython  benchflops.py  --model_type quik --model_path   /home/dataset/quant/quantquik4/Llama-2-7b \\\n             --quant_file /home/dataset/quant/quantquik4/quik4/Llama-2-7b \\\n             --batch_size 512 --bit 4\n```\n\nOn an NVIDIA A100-PCIE-40GB, the output is:\n\n```\nVersion: quik 4bit\n|   Batch Size |   Decode Length |   Decode tokens/s | Memory (VRAM)    |\n|-------------:|----------------:|------------------:|:-----------------|\n|          512 |            1024 |           8981.17 | 4.88 GB (12.40%) |\n```\n\n\n\n# TensorRT-LLM implementation of QUIK and MIXQ\n\nWe now support end-to-end text generation in TRT-LLM and vLLM!\n\nFor TRT-LLM, please download the NVIDIA [TensorRT docker](https://github.com/NVIDIA/TensorRT-LLM). 
DO NOT USE your local environment!\n\nPlease enter the e2eTRTLLM folder https://github.com/Qcompiler/MixQ_Tensorrt_LLM\n\n```\ngit clone https://github.com/Qcompiler/MixQ_Tensorrt_LLM.git\ndocker pull registry.cn-hangzhou.aliyuncs.com/dongdongchen/dongdong:v1\n```\n\n\n\nPlease run the docker container:\n\n```\nexport name=myname\nbash -c \" nvidia-smi; docker run --rm -it --ipc=host -p 6789:22 \\\n-v /home/${name}/lianxiang/lianxiangTRT/:/code/tensorrt_llm   \\\n-v  /mnt/octave/data/${name}/checkpoint:/dataset    \\\n-v /home/${name}/checkpoint:/code/checkpoint \\\n-v /mnt/octave/data/${name}/lianxiang/checkpoint:/octave/checkpoint \\\n               --ulimit memlock=-1 --ulimit    stack=67108864             \\\n                           --gpus=all       \\\n                       --env 'CCACHE_DIR=/code/tensorrt_llm/cpp/.ccache'            \\\n                            --env 'CCACHE_BASEDIR=/code/tensorrt_llm'              \\\n                                                    --workdir /app/tensorrt_llm     \\\n                                                            --hostname hpc-release \\\n                  --name tensorrt_llm-release-zhanghy                             \\\n                                                           --tmpfs /tmp:exec      \\\n              registry.cn-hangzhou.aliyuncs.com/dongdongchen/dongdong:v1     \"\n \n```\n\n\nAfter starting the container, set the environment variables:\n\n```\nmodel=Llama-2-7b\nngpu=1\nexport model_dir=/code/tensorrt_llm/checkpoint/${model}\nexport quant_dir=/code/tensorrt_llm/checkpoint/checkpoinmix/tllm_checkpoint_${ngpu}gpu_fp16${model}\nexport out_dir=/code/tensorrt_llm/checkpoint/trt_enginesmix/tllm_checkpoint_${ngpu}gpu_fp16${model}\n```\n\nPlease quantize the model by:\n\n```\nCUDA_VISIBLE_DEVICES=0    python  quantize.py --model_dir  ${model_dir} \\\n--output_dir  ${quant_dir}  --dtype float16 --device  cpu \\\n                               --qformat int8_mix  --calib_size 32 \n```\n\nPlease build the MIXQ 
model by:\n\n```\nCUDA_VISIBLE_DEVICES=0 trtllm-build --checkpoint_dir ${quant_dir} \\\n   --output_dir ${out_dir} \\\n        --gemm_plugin float16 --mix_precision int8 \n```\n\n\nGenerate text with MIXQ by:\n\n```\nCUDA_VISIBLE_DEVICES=0  python  summarize.py --test_trt_llm \\\n                    --hf_model_dir ${model_dir} \\\n                    --data_type fp16 \\\n                    --engine_dir ${out_dir}\n```\n\n\n## Building the TRT-LLM MIXQ plugin with a 4-stage pipeline for Llama-2-70B\n\n\n\n```\nmodel=Llama-2-70b\nngpu=4\nexport model_dir=/code/tensorrt_llm/checkpoint/${model}\nexport quant_dir=/code/tensorrt_llm/checkpoint/checkpoinmix/tllm_checkpoint_${ngpu}gpu_fp16${model}\nexport out_dir=/code/tensorrt_llm/checkpoint/trt_enginesmix/tllm_checkpoint_${ngpu}gpu_fp16${model}\n```\n\nPlease quantize the model by:\n\n```\n CUDA_VISIBLE_DEVICES=0,1,2,3   python  quantize.py --model_dir  ${model_dir} \\\n     --output_dir  ${quant_dir}  --dtype float16 --device  cpu \\\n    --qformat int8_mix  --calib_size 32 --pp_size ${ngpu}\n```\n\nPlease build the MIXQ model by:\n\n```\nCUDA_VISIBLE_DEVICES=0,1,2,3 trtllm-build --checkpoint_dir ${quant_dir} \\\n       --output_dir ${out_dir} \\\n           --gemm_plugin float16 --mix_precision int8 \n```\n\n\nGenerate text with MIXQ by:\n\n```\nCUDA_VISIBLE_DEVICES=0,1,2,3   mpirun -np 4 --allow-run-as-root    python  summarize.py --test_trt_llm \\\n                       --hf_model_dir ${model_dir} \\\n                       --data_type fp16 \\\n                       --engine_dir ${out_dir}\n```\n\n\n## Text generation result\n\n# Llama-2-7B FP16 baseline\n\n\n\nWhen running the ```summarize.py``` of MIXQ (Llama-2-7B on an A100, 40GB, PCIE), we get:\n\n\n\u003cimg src=\"figures/textmixq.jpg\"  align = \"center\"  width=\"600\" /\u003e\n\n\n\n\n## Mixed-precision Inference in vLLM\n\nPlease follow https://github.com/Qcompiler/vllm-mixed-precision for mixed-precision inference.\n\nPlease install the 
vllm by\n```\npip install vllm==0.6.2\n```\n\n\nPlease install the mixed-precision source code by\n```\ngit clone git@github.com:Qcompiler/vllm-mixed-precision.git\n```\n\nThen copy the \".so\" files from the vllm project:\n\n```\ncp -r $PYTHON_PATH/lib/python3.11/site-packages/vllm/*.so  vllm-mixed-precision/vllm/\n```\n\nUninstall vllm==0.6.2:\n```\npip uninstall vllm\n```\n\n\n\n## Running 8-bit mixed-precision inference in vLLM\n\n```\nexport PYTHONPATH=$( pwd )\npython test8bit.py --quant 8\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FQcompiler%2FMIXQ","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FQcompiler%2FMIXQ","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FQcompiler%2FMIXQ/lists"}