{"id":13753704,"url":"https://github.com/kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference","last_synced_at":"2025-05-09T21:35:35.926Z","repository":{"id":180496162,"uuid":"662891622","full_name":"kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference","owner":"kennethleungty","description":"Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q\u0026A","archived":false,"fork":false,"pushed_at":"2023-11-06T14:03:21.000Z","size":4738,"stargazers_count":960,"open_issues_count":16,"forks_count":213,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-04-30T03:36:14.587Z","etag":null,"topics":["c-transformers","chatgpt","cpu","cpu-inference","deep-learning","document-qa","faiss","langchain","language-models","large-language-models","llama","llama-2","llm","machine-learning","natural-language-processing","nlp","open-source-llm","python","sentence-transformers","transformers"],"latest_commit_sha":null,"homepage":"https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kennethleungty.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-07-06T05:42:43.000Z","updated_at":"2025-04-15T07:12:14.000Z","dependencies_parsed_at":"2023-11-06T23:09:45.032Z","dependency_job_id":null,"html_url":"https://github.com/kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference","commit_stats":null,"previous_names":["kennethleungty/open-source-llm-cpu-inference","kennethleungty/llama-2-open-source-llm-cpu-inference"],"tags_count":0,"template":false,"template_full_name":null,"repository_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kennethleungty%2FLlama-2-Open-Source-LLM-CPU-Inference","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kennethleungty%2FLlama-2-Open-Source-LLM-CPU-Inference/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kennethleungty%2FLlama-2-Open-Source-LLM-CPU-Inference/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kennethleungty%2FLlama-2-Open-Source-LLM-CPU-Inference/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kennethleungty","download_url":"https://codeload.github.com/kennethleungty/Llama-2-Open-Source-LLM-CPU-Inference/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253328999,"owners_count":21891562,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["c-transformers","chatgpt","cpu","cpu-inference","deep-learning","document-qa","faiss","langchain","language-models","large-language-models","llama","llama-2","llm","machine-learning","natural-language-processing","nlp","open-source-llm","python","sentence-transformers","transformers"],"created_at":"2024-08-03T09:01:27.859Z","updated_at":"2025-05-09T21:35:34.869Z","avatar_url":"https://github.com/kennethleungty.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","SDK, Libraries, Frameworks","GitHub projects","Python"],"sub_categories":["大语言对话模型及数据","Python"],"readme":"# Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document 
Q\u0026A\n\n### Clearly explained guide for running quantized open-source LLM applications on CPUs using Llama 2, C Transformers, GGML, and LangChain\n\n**Step-by-step guide on TowardsDataScience**: https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8\n___\n## Context\n- Third-party commercial large language models (LLMs) like OpenAI's GPT-4 have democratized LLM use via simple API calls.\n- However, there are instances where teams require self-managed or private model deployment for reasons like data privacy and residency rules.\n- The proliferation of open-source LLMs has opened up a vast range of options for us, thus reducing our reliance on these third-party providers.\n- When we host open-source LLMs on-premises or in the cloud, dedicated compute capacity becomes a key issue. While GPU instances may seem the obvious choice, the costs can easily skyrocket beyond budget.\n- In this project, we will discover how to run quantized versions of open-source LLMs locally on CPU for document question-and-answer (Q\u0026A).\n\u003cbr\u003e\u003cbr\u003e\n![Alt text](assets/diagram_flow.png)\n___\n\n## Quickstart\n- Ensure you have downloaded the GGML binary file from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML and placed it into the `models/` folder\n- To start passing user queries to the application, launch the terminal from the project directory and run the following command:\n`poetry run python main.py \"\u003cuser query\u003e\"`\n- For example, `poetry run python main.py \"What is the minimum guarantee payable by Adidas?\"`\n- Note: Omit the prepended `poetry run` if you are NOT using Poetry\n\u003cbr\u003e\u003cbr\u003e\n![Alt text](assets/qa_output.png)\n___\n## Tools\n- **LangChain**: Framework for developing applications powered by language models\n- **C Transformers**: Python bindings for Transformer models implemented in C/C++ using the GGML library\n- **FAISS**: Open-source 
library for efficient similarity search and clustering of dense vectors.\n- **Sentence-Transformers (all-MiniLM-L6-v2)**: Open-source pre-trained transformer model for embedding text to a 384-dimensional dense vector space for tasks like clustering or semantic search.\n- **Llama-2-7B-Chat**: Open-source fine-tuned Llama 2 model designed for chat dialogue. Leverages publicly available instruction datasets and over 1 million human annotations.\n- **Poetry**: Tool for dependency management and Python packaging\n\n___\n## Files and Content\n- `/assets`: Images relevant to the project\n- `/config`: Configuration files for the LLM application\n- `/data`: Dataset used for this project (i.e., Manchester United FC 2022 Annual Report - 177-page PDF document)\n- `/models`: Binary file of the GGML-quantized LLM (i.e., Llama-2-7B-Chat)\n- `/src`: Python code for the key components of the LLM application, namely `llm.py`, `utils.py`, and `prompts.py`\n- `/vectorstore`: FAISS vector store for documents\n- `db_build.py`: Python script to ingest the dataset and generate the FAISS vector store\n- `main.py`: Main Python script to launch the application and pass the user query via the command line\n- `pyproject.toml`: TOML file specifying the dependency versions used (Poetry)\n- `requirements.txt`: List of Python dependencies (and versions)\n___\n\n## References\n- https://github.com/marella/ctransformers\n- https://huggingface.co/TheBloke\n- https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML\n- https://python.langchain.com/en/latest/integrations/ctransformers.html\n- https://python.langchain.com/en/latest/modules/models/llms/integrations/ctransformers.html\n- https://python.langchain.com/docs/ecosystem/integrations/ctransformers\n- https://ggml.ai\n- https://github.com/rustformers/llm/blob/main/crates/ggml/README.md\n- 
https://www.mdpi.com/2189676\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkennethleungty%2FLlama-2-Open-Source-LLM-CPU-Inference","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkennethleungty%2FLlama-2-Open-Source-LLM-CPU-Inference","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkennethleungty%2FLlama-2-Open-Source-LLM-CPU-Inference/lists"}