https://github.com/kennethleungty/llama-2-open-source-llm-cpu-inference

Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A
https://github.com/kennethleungty/llama-2-open-source-llm-cpu-inference

c-transformers chatgpt cpu cpu-inference deep-learning document-qa faiss langchain language-models large-language-models llama llama-2 llm machine-learning natural-language-processing nlp open-source-llm python sentence-transformers transformers

Last synced: 5 months ago
JSON representation

Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A

Host: GitHub
URL: https://github.com/kennethleungty/llama-2-open-source-llm-cpu-inference
Owner: kennethleungty
License: mit
Created: 2023-07-06T05:42:43.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-11-06T14:03:21.000Z (almost 2 years ago)
Last Synced: 2025-04-09T08:04:59.955Z (6 months ago)
Topics: c-transformers, chatgpt, cpu, cpu-inference, deep-learning, document-qa, faiss, langchain, language-models, large-language-models, llama, llama-2, llm, machine-learning, natural-language-processing, nlp, open-source-llm, python, sentence-transformers, transformers
Language: Python
Homepage: https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8
Size: 4.52 MB
Stars: 959
Watchers: 14
Forks: 214
Open Issues: 16
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A

### Clearly explained guide for running quantized open-source LLM applications on CPUs using LLama 2, C Transformers, GGML, and LangChain

**Step-by-step guide on TowardsDataScience**: https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8
___
## Context
- Third-party commercial large language model (LLM) providers like OpenAI's GPT4 have democratized LLM use via simple API calls.
- However, there are instances where teams would require self-managed or private model deployment for reasons like data privacy and residency rules.
- The proliferation of open-source LLMs has opened up a vast range of options for us, thus reducing our reliance on these third-party providers.
- When we host open-source LLMs locally on-premise or in the cloud, the dedicated compute capacity becomes a key issue. While GPU instances may seem the obvious choice, the costs can easily skyrocket beyond budget.
- In this project, we will discover how to run quantized versions of open-source LLMs on local CPU inference for document question-and-answer (Q&A).

![Alt text](assets/diagram_flow.png)
___

## Quickstart
- Ensure you have downloaded the GGML binary file from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML and placed it into the `models/` folder
- To start parsing user queries into the application, launch the terminal from the project directory and run the following command:
`poetry run python main.py ""`
- For example, `poetry run python main.py "What is the minimum guarantee payable by Adidas?"`
- Note: Omit the prepended `poetry run` if you are NOT using Poetry

![Alt text](assets/qa_output.png)
___
## Tools
- **LangChain**: Framework for developing applications powered by language models
- **C Transformers**: Python bindings for the Transformer models implemented in C/C++ using GGML library
- **FAISS**: Open-source library for efficient similarity search and clustering of dense vectors.
- **Sentence-Transformers (all-MiniLM-L6-v2)**: Open-source pre-trained transformer model for embedding text to a 384-dimensional dense vector space for tasks like clustering or semantic search.
- **Llama-2-7B-Chat**: Open-source fine-tuned Llama 2 model designed for chat dialogue. Leverages publicly available instruction datasets and over 1 million human annotations.
- **Poetry**: Tool for dependency management and Python packaging

___
## Files and Content
- `/assets`: Images relevant to the project
- `/config`: Configuration files for LLM application
- `/data`: Dataset used for this project (i.e., Manchester United FC 2022 Annual Report - 177-page PDF document)
- `/models`: Binary file of GGML quantized LLM model (i.e., Llama-2-7B-Chat)
- `/src`: Python codes of key components of LLM application, namely `llm.py`, `utils.py`, and `prompts.py`
- `/vectorstore`: FAISS vector store for documents
- `db_build.py`: Python script to ingest dataset and generate FAISS vector store
- `main.py`: Main Python script to launch the application and to pass user query via command line
- `pyproject.toml`: TOML file to specify which versions of the dependencies used (Poetry)
- `requirements.txt`: List of Python dependencies (and version)
___

## References
- https://github.com/marella/ctransformers
- https://huggingface.co/TheBloke
- https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML
- https://python.langchain.com/en/latest/integrations/ctransformers.html
- https://python.langchain.com/en/latest/modules/models/llms/integrations/ctransformers.html
- https://python.langchain.com/docs/ecosystem/integrations/ctransformers
- https://ggml.ai
- https://github.com/rustformers/llm/blob/main/crates/ggml/README.md
- https://www.mdpi.com/2189676

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/kennethleungty/llama-2-open-source-llm-cpu-inference

Awesome Lists containing this project

README