# KuiperLLama: A Hand-Built LLM Inference Framework Supporting LLama2/3 and Qwen2.5

> News: our new course, "Building an LLM Inference Framework by Hand," is out. Every CUDA kernel is handwritten, and the course framework supports LLama2, LLama3.x, and Qwen2.5 models.

Hi everyone! I am the author of KuiperInfer. As an open-source course, KuiperInfer has earned 2.5k stars on GitHub to date. Building on that course, **we now offer the brand-new "Building an LLM Inference Framework by Hand." The new course supports the Llama family of models (including the latest LLama3.2) and the Qwen2.5 family, with CUDA acceleration and Int8 quantization**, and it has been well received since launch.

## Course Syllabus

https://tvle9mq8jh.feishu.cn/docx/AGb0dpqwfohQ9oxx4QycqbCjnJh

## Why This Course

1. Code written to the modern C++20 standard, with a uniform, clean style and sound error handling;
2. Professional project management with CMake + Git, matching industry practice;
3. Teaches you how to design a modern C++ project, and how to validate it with unit tests and benchmarks;
4. Dual CPU and CUDA backends, with excellent support for recent models (LLama3 and the Qwen series).

**If you are interested in LLM inference, want to master the underlying techniques, and want to stand out in campus-hiring interviews, this course is not to be missed. Scan the QR code below or add WeChat `lyrry1997` to enroll.**

<img src="imgs/me.jpg" />

## Demo

> LLama1.1b fp32 model, video not sped up, running on an Nvidia 3060 laptop at 60.34 tokens/s

![](./imgs/do.gif)

## Third-Party Dependencies

> Production-grade libraries help you stand up an inference framework faster

1. google glog https://github.com/google/glog
2. google gtest https://github.com/google/googletest
3. sentencepiece https://github.com/google/sentencepiece
4. armadillo + openblas https://arma.sourceforge.net/download.html
5. CUDA Toolkit

## Model Downloads

1. LLama2 https://pan.baidu.com/s/1PF5KqvIvNFR8yDIY1HmTYA?pwd=ma8r or https://huggingface.co/fushenshen/lession_model/tree/main
2. Tiny LLama
   - TinyLLama model https://huggingface.co/karpathy/tinyllamas/tree/main
   - TinyLLama tokenizer https://huggingface.co/yahma/llama-7b-hf/blob/main/tokenizer.model
3. Qwen2.5/LLama

   See this project's companion course; enrollment details are at the top of this document.

## Model Export

```shell
python export.py llama2_7b.bin --meta-llama path/to/llama/model/7B
# use the --hf flag to load a model from Hugging Face; --version3 exports a quantized model
# see the command-line arguments in export.py for other options
```

## Building

```shell
mkdir build
cd build
# the third-party dependencies listed above must be installed
cmake ..
# or enable the USE_CPM option to download them automatically
cmake -DUSE_CPM=ON ..
make -j16
```

## Generating Text

```shell
./llama_infer llama2_7b.bin tokenizer.model
```

# LLama3.2 Inference

- Using meta-llama/Llama-3.2-1B as an example, download the model from Hugging Face:

```shell
export HF_ENDPOINT=https://hf-mirror.com
pip3 install -U huggingface_hub
huggingface-cli download --resume-download meta-llama/Llama-3.2-1B --local-dir meta-llama/Llama-3.2-1B --local-dir-use-symlinks False
```

- Export the model:

```shell
python3 tools/export.py Llama-3.2-1B.bin --hf=meta-llama/Llama-3.2-1B
```

- Build:

```shell
mkdir build
cd build
# the USE_CPM option downloads the third-party dependencies automatically (requires network access)
cmake -DUSE_CPM=ON -DLLAMA3_SUPPORT=ON ..
make -j16
```

- Run:

```shell
./build/demo/llama_infer Llama-3.2-1B.bin meta-llama/Llama-3.2-1B/tokenizer.json
# compare against Hugging Face inference results
python3 hf_infer/llama3_infer.py
```

# Qwen2.5 Inference

- Using Qwen2.5-0.5B as an example, download the model from Hugging Face:

```shell
export HF_ENDPOINT=https://hf-mirror.com
pip3 install -U huggingface_hub
huggingface-cli download --resume-download Qwen/Qwen2.5-0.5B --local-dir Qwen/Qwen2.5-0.5B --local-dir-use-symlinks False
```

- Export the model:

```shell
python3 tools/export_qwen2.py Qwen2.5-0.5B.bin --hf=Qwen/Qwen2.5-0.5B
```

- Build:

```shell
mkdir build
cd build
# the USE_CPM option downloads the third-party dependencies automatically (requires network access)
cmake -DUSE_CPM=ON -DQWEN2_SUPPORT=ON ..
make -j16
```

- Run:

```shell
./build/demo/qwen_infer Qwen2.5-0.5B.bin Qwen/Qwen2.5-0.5B/tokenizer.json
# compare against Hugging Face inference results
python3 hf_infer/qwen2_infer.py
```

# Qwen3 Inference

Same as above: first download the model locally from the Hugging Face hub.

1. Export a .pth checkpoint with tools/export_qwen3/load.py; the input model `model_name` and the output path `output_file` both need to be filled in;
2. After exporting the .pth model, use write_bin.py in the same folder to export qwen.bin;
3. Rebuild the project with the CMake option `QWEN3_SUPPORT`; every other step is the same.
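The export scripts above pack a checkpoint into a single `.bin` file that the C++ runtime loads directly. The exact layout is defined by `export.py`; purely as an illustration of how such a format can work, here is a minimal sketch in the llama2.c style (a fixed int32 header followed by raw fp32 tensors). `write_model_bin`, `read_model_bin`, and the header fields are hypothetical names for this sketch, not this project's API:

```python
import struct
from array import array

def write_model_bin(path, config, tensors):
    """Write a toy checkpoint: a 7-int header followed by raw fp32 tensor data."""
    with open(path, "wb") as f:
        # header: dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, seq_len
        f.write(struct.pack("7i", *config))
        for t in tensors:
            array("f", t).tofile(f)  # flattened row-major float32 values

def read_model_bin(path, tensor_lengths):
    """Read the header and each tensor back; lengths must match what was written."""
    with open(path, "rb") as f:
        config = struct.unpack("7i", f.read(7 * 4))
        tensors = []
        for n in tensor_lengths:
            a = array("f")
            a.fromfile(f, n)
            tensors.append(list(a))
    return config, tensors
```

A flat header-plus-tensors layout like this is what lets the C++ side read (or mmap) weights with no parsing beyond the fixed-size header.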
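The course framework also supports Int8 quantization (`--version3` exports a quantized model). As a sketch of the general technique rather than this project's exact scheme, symmetric per-group int8 quantization stores one fp32 scale per group of weights and maps each weight to `round(x / scale)` clamped to [-127, 127]:

```python
def quantize_q8(values, group_size=32):
    """Symmetric per-group int8 quantization: scale = max|x| / 127 per group."""
    assert len(values) % group_size == 0
    qvals, scales = [], []
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        scale = max(abs(v) for v in group) / 127.0 or 1.0  # avoid div-by-zero on all-zero groups
        scales.append(scale)
        qvals.extend(max(-127, min(127, round(v / scale))) for v in group)
    return qvals, scales

def dequantize_q8(qvals, scales, group_size=32):
    """Recover approximate fp32 values: x ≈ q * scale of the group q belongs to."""
    return [q * scales[i // group_size] for i, q in enumerate(qvals)]
```

Smaller groups cost more scale storage but bound the quantization error more tightly; the error per weight is at most half a quantization step, i.e. `scale / 2`.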
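The `llama_infer` / `qwen_infer` demos generate text token by token, and a throughput figure like the 60.34 tokens/s quoted above is simply generated tokens divided by wall-clock time. A minimal Python sketch of greedy decoding with throughput measurement (the real project implements this loop in C++ with CPU/CUDA kernels; `model` here is any stand-in callable returning next-token logits):

```python
import time

def generate(model, prompt_tokens, max_new_tokens):
    """Greedy decoding: run a forward pass, append the argmax token, repeat."""
    tokens = list(prompt_tokens)
    start = time.perf_counter()
    for _ in range(max_new_tokens):
        logits = model(tokens)  # logits over the vocabulary for the next position
        tokens.append(max(range(len(logits)), key=logits.__getitem__))  # argmax
    elapsed = max(time.perf_counter() - start, 1e-9)
    return tokens, max_new_tokens / elapsed  # sequence, tokens per second
```

Real engines avoid re-running the whole sequence each step by caching per-layer key/value tensors, so each iteration costs one token's worth of compute rather than the full sequence's.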