{"id":19361035,"url":"https://github.com/fminference/dejavu","last_synced_at":"2025-07-25T15:09:53.619Z","repository":{"id":183832161,"uuid":"647739388","full_name":"FMInference/DejaVu","owner":"FMInference","description":null,"archived":false,"fork":false,"pushed_at":"2024-04-02T21:51:45.000Z","size":6517,"stargazers_count":311,"open_issues_count":27,"forks_count":41,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-30T12:07:52.577Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FMInference.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-31T12:23:16.000Z","updated_at":"2025-03-25T10:48:27.000Z","dependencies_parsed_at":"2024-04-02T22:44:07.625Z","dependency_job_id":null,"html_url":"https://github.com/FMInference/DejaVu","commit_stats":null,"previous_names":["fminference/dejavu"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FMInference%2FDejaVu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FMInference%2FDejaVu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FMInference%2FDejaVu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FMInference%2FDejaVu/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FMInference","download_url":"https://codeload.github.com/FMInference/DejaVu/tar.gz/refs/heads/master","host":{"name":"
GitHub","url":"https://github.com","kind":"github","repositories_count":247492557,"owners_count":20947545,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T07:20:17.362Z","updated_at":"2025-04-06T14:11:55.277Z","avatar_url":"https://github.com/FMInference.png","language":"Python","readme":"# Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time\n\nLarge language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, but existing methods either require costly retraining, have to forgo LLM’s in-context learning ability, or do not yield wall-clock time speedup on modern hardware. We hypothesize that contextual sparsity, which are small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output as the dense model for a given input, can address these issues. We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising LLM’s quality or in-context learning ability. Based on these insights, we propose DejaVu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference. 
We validate that DejaVu can reduce the inference latency of OPT-175B by over 2× compared to the state-of-the-art FasterTransformer, and over 6× compared to the widely used Hugging Face implementation, without compromising model quality. The code is available at https://github.com/FMInference/DejaVu.\n\nPaper Link: https://proceedings.mlr.press/v202/liu23am.html\n\n\nThis repo consists of three parts: (1) training the sparsity predictor, (2) the end-to-end accuracy benchmark, and (3) the generation latency benchmark.\n\n## Training sparsity predictor\nWe collect training data by running model inference using Decentralized_FM_alpha.\n\n**Requirements**\n\n\n```\n    pip3 install --pre torch==1.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html\n    pip3 install cupy-cuda11x==11.0.0\n    python3 -m cupyx.tools.install_library --cuda 11.x --library nccl\n    pip3 install transformers\n```\n\n**Collect the training data**\n\nTo get started, you first need to collect the training data by running model inference over c4:\n\n```\nDejaVu/Decentralized_FM_alpha/run_infer_opt_175b_collect_sp_data.sh\n```\nYou need to specify the model checkpoint and data path. To get the data, we provide a script at DejaVu/Decentralized_FM_alpha/c4_train/get_data.py. By default, we subsample 500 samples in the script. To convert the model checkpoint from Hugging Face, we provide a script at DejaVu/Decentralized_FM_alpha/convert_opt_checkpoint.py.\n\nYou can also specify where to store the training data inside DejaVu/Decentralized_FM_alpha/modules/hf_opt_module_save.py.\n\n**Training the sparsity classifier**\n\nAll code related to training the sparsity predictors is located in DejaVu/sparse_predictor.\n\nWe provide two scripts: one for training the attention sparsity predictor (DejaVu/sparse_predictor/run_c4_att.sh) and one for training the MLP sparsity predictor (DejaVu/sparse_predictor/trainer_mlp.py). 
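To make the training objective concrete, here is a rough NumPy sketch of what an MLP sparsity predictor learns. All shapes are made up, and the one-layer logistic predictor is a simplification for illustration only; the repo's trainer has its own architecture and uses real hidden states collected on c4. A neuron is labeled "active" when its pre-ReLU activation in the first fully connected layer is positive, and the predictor is trained to score active neurons highly so that a top-K cut recovers them at inference time:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_FFN = 32, 128  # hypothetical sizes, far smaller than OPT-175B

# Stand-in for the first FC layer of an MLP block; a neuron counts as
# "active" for an input when its pre-ReLU activation is positive.
W_fc1 = rng.standard_normal((D_FFN, D_MODEL)) / np.sqrt(D_MODEL)
X = rng.standard_normal((1024, D_MODEL))       # stand-in hidden states
Y = (X @ W_fc1.T > 0).astype(np.float64)       # per-neuron activity labels

# One-layer logistic predictor trained with full-batch gradient descent
# on binary cross-entropy (the real predictor is a small neural network).
W = np.zeros((D_FFN, D_MODEL))
lr = 0.5
for _ in range(300):
    P = 1.0 / (1.0 + np.exp(-(X @ W.T)))       # sigmoid scores per neuron
    W -= lr * (P - Y).T @ X / len(X)           # BCE gradient step

# At inference time, only the K highest-scoring neurons are computed.
K = 16
scores = X[:1] @ W.T
active = np.argsort(scores[0])[-K:]            # indices of predicted-active neurons
print(active.shape)
```

The point of the low-cost predictor is that scoring costs one small matrix multiply per layer, which is far cheaper than the dense fc1 it lets the system skip.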
\n\nFor detailed instructions, see DejaVu/sparse_predictor/README.md\n\n\n## Accuracy Benchmark\nOur accuracy benchmark is based on Decentralized_FM_alpha (https://github.com/DS3Lab/Decentralized_FM_alpha).\n\n**Requirements**\n\n```\n    pip3 install --pre torch==1.12.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html\n    pip3 install cupy-cuda11x==11.0.0\n    python3 -m cupyx.tools.install_library --cuda 11.x --library nccl\n    pip3 install transformers\n```\n\n**Perplexity on c4**\n\nTo run evaluation with the dense model, run\n```DejaVu/Decentralized_FM_alpha/run_infer_opt_175b_c4.sh```\n\nTo run evaluation with the DejaVu model, run\n```DejaVu/Decentralized_FM_alpha/run_infer_opt_175b_c4_sparse.sh```\n\nSimilar to collecting the data, you will need to specify\n(1) the model checkpoint path\n(2) the sparsity predictor checkpoint path\n(3) the c4 data path\n\n**Accuracy on downstream tasks**\n\nWe adopt lm-evaluation-harness for downstream task evaluation.\n\n1. Generate task data\n```\ncd lm-eval-harness-adapter\npython generate_task_data.py --output-file wsc.jsonl --task-name wsc --num-fewshot 0\n```\n\n2. Run evaluation\n```DejaVu/Decentralized_FM_alpha/run_infer_opt_175b_task_sparse.sh```\n\n3. Evaluate model output\n```\ncd lm-eval-harness-adapter\npython evaluate_task_result.py --result-file output_wsc.jsonl --task-name wsc --num-fewshot 0 --model-type opt\n```\n\n## Generation Latency\nWe provide a PyTorch-based implementation that exploits CUDA graphs.\n\n**Requirements**\n\nFor best performance, please use Docker. 
We provide a Dockerfile with all requirements at DejaVu/Dejavu/Dockerfile\n\n**Dense Model Latency Benchmark**\n\nTo benchmark latency with the dense model, run\n\n```torchrun --nproc_per_node=$NUM_GPUs benchmark_generation_opt.py --model-name $MODEL_NAME```\n\nPlease specify the model checkpoint in DejaVu/Dejavu/benchmarks/benchmark_generation_opt.py corresponding to $MODEL_NAME.\n\n\n**Sparse Model Latency Benchmark**\n\n*Sparse MLP Block*\n\nTo benchmark latency with the sparse-MLP-block model, run\n\n```torchrun --nproc_per_node=$NUM_GPUs benchmark_generation_opt_mlp_sparse.py --model-name $MODEL_NAME --mlp-K $NUM_ACTIVE_NEURONS```\n\nPlease specify the model checkpoint in DejaVu/Dejavu/benchmarks/benchmark_generation_opt.py corresponding to $MODEL_NAME.\n$NUM_ACTIVE_NEURONS indicates how many neurons to activate in the first fully connected layer of each MLP block.\n\nFor example, for OPT-175B, mlp-K is set to 49152 by default, which performs dense computation. Setting mlp-K to 4096 performs sparse computation. We recommend setting mlp-K to a multiple of 128.\n\n*Sparse MLP + Sparse Attention Block*\n\nTo benchmark latency with the sparse MLP + sparse attention model, run\n\n```torchrun --nproc_per_node=$NUM_GPUs benchmark_generation_opt_dejavu.py --model-name $MODEL_NAME --mlp-K $NUM_ACTIVE_NEURONS --att-K1 $NUM_ACTIVE_ATT_1 --att-K2 $NUM_ACTIVE_ATT_2```\n\nPlease specify the model checkpoint in DejaVu/Dejavu/benchmarks/benchmark_generation_opt.py corresponding to $MODEL_NAME.\n$NUM_ACTIVE_NEURONS indicates how many neurons to activate in the first fully connected layer of each MLP block.\n$NUM_ACTIVE_ATT_1 and $NUM_ACTIVE_ATT_2 indicate how many attention heads to activate. Our observation is that the first 1/3 and last 1/3 of the layers ($NUM_ACTIVE_ATT_1) are less sparse, while the middle layers ($NUM_ACTIVE_ATT_2) are more sparse, so we set up two thresholds for the different sparsity levels. 
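The arithmetic behind the mlp-K flag can be illustrated with a tiny NumPy sketch (shapes are made up, and an oracle plays the role of the learned predictor; the real implementation uses fused CUDA kernels). When the selected index set contains every neuron whose ReLU activation is nonzero, the sparse MLP reproduces the dense output while touching only the selected rows of fc1 and columns of fc2:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ffn = 64, 256  # hypothetical sizes; d_ffn would be 49152 for OPT-175B

W1 = rng.standard_normal((d_ffn, d_model)) / np.sqrt(d_model)   # fc1
W2 = rng.standard_normal((d_model, d_ffn)) / np.sqrt(d_ffn)     # fc2
x = rng.standard_normal(d_model)

# Dense MLP block: full fc1 -> ReLU -> full fc2.
h_dense = np.maximum(W1 @ x, 0.0)
y_dense = W2 @ h_dense

# Sparse version with an oracle index set: exactly the neurons whose
# pre-ReLU activation is positive (a trained predictor approximates this).
idx = np.flatnonzero(W1 @ x > 0.0)
h_sparse = np.maximum(W1[idx] @ x, 0.0)   # only |idx| rows of fc1 computed
y_sparse = W2[:, idx] @ h_sparse          # only |idx| columns of fc2 computed

# The skipped neurons were zeroed by ReLU anyway, so the outputs match.
print(np.allclose(y_dense, y_sparse))
```

In practice the predictor's top-K set is only an approximation of this oracle, which is why mlp-K trades latency against accuracy; rounding K to a multiple of 128 keeps the sliced matrix multiplies aligned for the GPU kernels.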
\n\n## Citation\n\n```\n@InProceedings{pmlr-v202-liu23am,\n  title = \t {Deja Vu: Contextual Sparsity for Efficient {LLM}s at Inference Time},\n  author =       {Liu, Zichang and Wang, Jue and Dao, Tri and Zhou, Tianyi and Yuan, Binhang and Song, Zhao and Shrivastava, Anshumali and Zhang, Ce and Tian, Yuandong and Re, Christopher and Chen, Beidi},\n  booktitle = \t {Proceedings of the 40th International Conference on Machine Learning},\n  pages = \t {22137--22176},\n  year = \t {2023},\n  editor = \t {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},\n  volume = \t {202},\n  series = \t {Proceedings of Machine Learning Research},\n  month = \t {23--29 Jul},\n  publisher =    {PMLR},\n  pdf = \t {https://proceedings.mlr.press/v202/liu23am/liu23am.pdf},\n  url = \t {https://proceedings.mlr.press/v202/liu23am.html},\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffminference%2Fdejavu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffminference%2Fdejavu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffminference%2Fdejavu/lists"}