{"id":23994471,"url":"https://github.com/hhhuang/CAG","last_synced_at":"2025-09-14T01:31:06.646Z","repository":{"id":270234729,"uuid":"901680517","full_name":"hhhuang/CAG","owner":"hhhuang","description":"Cache-Augmented Generation: A Simple, Efficient Alternative to RAG","archived":false,"fork":false,"pushed_at":"2025-05-26T21:00:42.000Z","size":3270,"stargazers_count":1298,"open_issues_count":7,"forks_count":185,"subscribers_count":22,"default_branch":"main","last_synced_at":"2025-05-26T22:18:13.096Z","etag":null,"topics":["cag","llm","rag"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hhhuang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-12-11T05:43:57.000Z","updated_at":"2025-05-26T21:00:48.000Z","dependencies_parsed_at":null,"dependency_job_id":"4b7cc02c-20eb-47c0-838c-37ebb635dc8d","html_url":"https://github.com/hhhuang/CAG","commit_stats":null,"previous_names":["hhhuang/cag"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hhhuang/CAG","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hhhuang%2FCAG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hhhuang%2FCAG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hhhuang%2FCAG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hhhuang%2FCAG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hhhuang","download_url":"https://codelo
ad.github.com/hhhuang/CAG/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hhhuang%2FCAG/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275051483,"owners_count":25396978,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-13T02:00:10.085Z","response_time":70,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cag","llm","rag"],"created_at":"2025-01-07T21:01:32.960Z","updated_at":"2025-09-14T01:31:06.626Z","avatar_url":"https://github.com/hhhuang.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","知识库 RAG","RAG Techniques and Methodologies"],"sub_categories":["大语言对话模型及数据","RAG Survey 2024"],"readme":"# Cache-Augmented Generation (CAG)\n\n\u003cimg src=\"https://github.com/hhhuang/CAG/blob/main/overview.png?raw=true\" width=600 alt=\"Overview of CAG\"\u003e\n\nRetrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing language models by integrating external knowledge sources. However, RAG also introduces several challenges, including:  \n- **Retrieval Latency** – Delays caused by real-time retrieval steps.  \n- **Retrieval Errors** – Inaccuracies in selecting relevant documents.  \n- **System Complexity** – Increased architectural and maintenance overhead.  
\n\nTo address these limitations, we propose **Cache-Augmented Generation (CAG)**—an alternative paradigm that bypasses real-time retrieval. CAG leverages the extended context windows of modern large language models (LLMs) by preloading all relevant resources into the model’s context and caching its runtime parameters. During inference, the preloaded KV-cache enables the model to generate responses directly, eliminating the need for retrieval.  \n\n**Advantages of CAG**  \n- **Reduced Latency** – Eliminates real-time retrieval, enabling faster inference.  \n- **Improved Reliability** – Minimizes retrieval errors while maintaining context relevance.  \n- **Simplified Design** – Provides a streamlined, retrieval-free alternative to RAG, achieving comparable or superior results with lower complexity.  \n\n**Limitations of CAG**  \n- **Limited Knowledge Size** – CAG requires the entire knowledge source to fit within the context window, making it less suitable for tasks involving extremely large datasets.  \n- **Context Length Constraints** – The performance of LLMs may degrade with very long contexts ([reference](https://arxiv.org/pdf/2404.02060v2)).  \n\nOur [paper](https://arxiv.org/abs/2412.15605), which will be presented at [the ACM Web Conference 2025](https://www2025.thewebconf.org/) as a short paper, investigates the relationship between model performance and context length, providing insights into scenarios where CAG excels.  \n\nThe limitations of CAG are rapidly being addressed by advancements in LLMs with longer context windows and improved capabilities for extracting relevant information from extended inputs. As these models continue to evolve, CAG is expected to handle increasingly complex applications, making it a practical and scalable alternative to traditional RAG.  
\n\n---\n\n## Installation\n```bash\npip install -r ./requirements.txt\n```\n\n## Preparation\n\u003e [!IMPORTANT]  \n\u003e Download the required `squad` and `hotpotqa` datasets with the provided curl script:\n\u003e ```bash\n\u003e sh ./downloads.sh\n\u003e ```\n\n\u003e [!IMPORTANT]\n\u003e Create a `.env` file from `.env.template` and fill in the required keys:\n\u003e ```bash\n\u003e cp ./.env.template ./.env\n\u003e ```\n\n## Usage\n- `rag.py` runs the RAG experiment\n- `kvcache.py` runs the CAG experiment\n\n## Parameter Usage -- kvcache.py\n- `--kvcache`: \"file\"\n- `--dataset`: \"hotpotqa-train\" or \"squad-train\"\n- `--similarity`: \"bertscore\"\n- `--modelname`: \"meta-llama/Llama-3.1-8B-Instruct\"\n- `--maxKnowledge`: \"\", int, number of documents to select from the dataset (see Note)\n- `--maxParagraph`: 100\n- `--maxQuestion`: int, maximum number of questions (see Note)\n- `--randomSeed`: \"\", int, a random seed number\n- `--output`: \"\", str, output file path\n- `--usePrompt`: add this flag when not using CAG knowledge cache acceleration\n\n### Example -- kvcache.py\n```bash\npython ./kvcache.py --kvcache file --dataset \"squad-train\" --similarity bertscore \\\n    --maxKnowledge 5 --maxParagraph 100 --maxQuestion 1000  \\\n    --modelname \"meta-llama/Llama-3.1-8B-Instruct\" --randomSeed 0 \\\n    --output \"./result_kvcache.txt\"\n```\n\n## Parameter Usage -- rag.py\n- `--index`: \"openai\" or \"bm25\"\n- `--dataset`: \"hotpotqa-train\" or \"squad-train\"\n- `--similarity`: \"bertscore\"\n- `--maxKnowledge`: \"\", int, number of documents to select from the dataset (see Note)\n- `--maxParagraph`: 100\n- `--maxQuestion`: int, maximum number of questions (see Note)\n- `--topk`: int, number of top-ranked documents to retrieve by similarity\n- `--modelname`: \"meta-llama/Llama-3.1-8B-Instruct\"\n- `--randomSeed`: \"\", int, a random seed number\n- `--output`: \"\", str, output file path\n\n### Example -- rag.py\n```bash\npython ./rag.py --index \"bm25\" --dataset \"hotpotqa-train\" 
--similarity bertscore \\\n    --maxKnowledge 80 --maxParagraph 100 --maxQuestion 80 --topk 3 \\\n    --modelname \"meta-llama/Llama-3.1-8B-Instruct\" --randomSeed 0 \\\n    --output \"./rag_results.txt\"\n```\n\n### Note:\n#### `--maxKnowledge` parameter notice:\n\u003e [!NOTE]\n\u003e Approximate token counts for different knowledge document sizes (k) in the \"squad-train\" and \"hotpotqa-train\" datasets.\n\n\u003e datasets=(\"squad-train\")\n\u003e - when k = 3, tokens = 21,000\n\u003e - when k = 4, tokens = 32,000\n\u003e - when k = 7, tokens = 50,000\n\u003e \n\u003e datasets=(\"hotpotqa-train\")\n\u003e - all k = 7405 articles, tokens = 10,038,084\n\u003e - when k = 1, tokens = 1,400\n\u003e - when k = 16, tokens = 22,400\n\u003e - when k = 24, tokens = 33,667\n\u003e - when k = 32, tokens = 44,800\n\u003e - when k = 48, tokens = 64,000\n\u003e - when k = 64, tokens = 85,000\n\u003e - when k = 80, tokens = 106,000\n\n#### `--maxQuestion` parameter notice:\n\u003e - when using the \"squad-train\" dataset, 1 knowledge document has 150 questions on average\n\u003e - when using the \"hotpotqa-train\" dataset, 1 knowledge document has 1 question\n\n\u003e [!TIP]\n\u003e Since 1 document in the \"hotpotqa-train\" dataset has only 1 question, it may not suffice for large-scale evaluation.\n\u003e Running multiple evaluations may be a better approach.\n\u003e \n\n#### `Docker`\n\nTo build the Docker image, run\n```bash\ndocker build -t my-cag-app .\n```\n\nTo run the container with GPU support, run\n\n```bash\ndocker run --gpus all -it --rm my-cag-app\n```\nOR\n```bash\ndocker run -it --rm my-cag-app\n```\nfor CPU users.\n\nIf the `.env` file was not populated when the image was built, you will get an error similar to the one below:\n\n```bash\nTraceback (most recent call last):\n  File \"/app/./kvcache.py\", line 35, in \u003cmodule\u003e\n    env = validate_env_variables()\n          ^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/app/./kvcache.py\", line 31, in validate_env_variables\n    raise ValueError(f\"Missing 
required environment variable: {key}\")\nValueError: Missing required environment variable: HF_TOKEN\n```\n\nMake sure to populate the `.env` file before building the Docker image.\n\nNote that the `CMD` directive in the Dockerfile runs the `kvcache.py` script by default. You can override this in the `docker run` command to execute other scripts such as `rag.py`. For example:\n\n```bash\ndocker run --gpus all -it --rm my-cag-app python ./rag.py --index \"bm25\" --dataset \"hotpotqa-train\" --similarity bertscore --maxKnowledge 80 --maxParagraph 100 --maxQuestion 80 --topk 3 --modelname \"meta-llama/Llama-3.1-8B-Instruct\" --randomSeed 0 --output \"./rag_results.txt\"\n```\n\n## Citation\n```\n@misc{chan2024dontragcacheaugmentedgeneration,\n      title={Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks}, \n      author={Brian J Chan and Chao-Ting Chen and Jui-Hung Cheng and Hen-Hsen Huang},\n      year={2024},\n      eprint={2412.15605},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2412.15605}, \n}\n```\n\n## Acknowledgments\nThis work was partially supported by the National Science and Technology Council (NSTC), Taiwan, under grant 112-2221-E-001-016-MY3, by Academia Sinica under grant 236d-1120205, and by the National Center for High-performance Computing (NCHC), National Applied Research Laboratories (NARLabs), and NSTC under the project \"Taiwan's 113th year endeavoring in the promotion of a trustworthy generative AI large language model and the cultivation of literacy capabilities (Trustworthy AI Dialog Engine, TAIDE)\".\nWe sincerely thank [Discover AI](https://www.youtube.com/watch?v=NaEf_uiFX6o) and the many individuals who have introduced, shared, and discussed our work, contributing to its broader visibility and 
impact.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhhhuang%2FCAG","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhhhuang%2FCAG","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhhhuang%2FCAG/lists"}