Lightweight Llama 3 8B Inference Engine in CUDA C
- Host: GitHub
- URL: https://github.com/abhisheknair10/llama3.cu
- Owner: abhisheknair10
- License: MIT
- Created: 2024-09-04T21:48:54.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-01-18T23:18:26.000Z (12 days ago)
- Last Synced: 2025-01-19T00:20:22.652Z (12 days ago)
- Topics: cuda, llama, llm-inference
- Language: Cuda
- Size: 1.64 MB
- Stars: 42
- Watchers: 5
- Forks: 7
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Llama3.cu - A Llama 3 (8B) CUDA Inference Engine
Llama3.cu is a CUDA-native implementation of the Llama 3 architecture for causal language modeling. Core components of the transformer architecture from the papers [Attention is All You Need](https://arxiv.org/abs/1706.03762) and [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) are implemented as custom CUDA kernels, enabling scalable parallel processing on Nvidia GPUs.
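To make that concrete, here is a minimal sketch of what one such kernel might look like: a rotary position embedding (RoPE) kernel that rotates consecutive feature pairs by position-dependent angles, following the formulation in the RoPE paper. The kernel name, data layout, and FP32 element type are illustrative assumptions, not the repository's actual code.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Minimal RoPE sketch: rotate each consecutive (even, odd) feature pair of a
// head vector by a position-dependent angle, as described in the RoPE paper.
// Layout assumption: x is [seq_len, head_dim], row-major, FP32 for clarity.
__global__ void rope_kernel(float *x, int seq_len, int head_dim, float theta_base) {
    int pos  = blockIdx.x;                 // token position in the sequence
    int pair = threadIdx.x;                // which (even, odd) pair to rotate
    if (pos >= seq_len || pair >= head_dim / 2) return;

    int i = pos * head_dim + 2 * pair;
    // Per-pair frequency: theta_base^(-2*pair/head_dim), per the paper.
    float freq  = powf(theta_base, -2.0f * pair / (float)head_dim);
    float angle = pos * freq;
    float c = cosf(angle), s = sinf(angle);

    float x0 = x[i], x1 = x[i + 1];
    x[i]     = x0 * c - x1 * s;            // standard 2-D rotation
    x[i + 1] = x0 * s + x1 * c;
}

// Example launch: one block per position, one thread per rotated pair.
// Llama 3 uses a RoPE base of 500000.
// rope_kernel<<<seq_len, head_dim / 2>>>(d_x, seq_len, head_dim, 500000.0f);
```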
The model weights are expected to be downloaded from Hugging Face. They are distributed as BF16 parameters in .safetensors files; at load time they are converted to FP16 via an FP32 intermediate and copied to the CUDA device. Hence, a CUDA device with at least 24 GB of VRAM is required.
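A minimal sketch of that conversion step, assuming the BF16 weights are already resident on the device as a flat array (the kernel name and layout are illustrative):

```cuda
#include <cuda_bf16.h>
#include <cuda_fp16.h>

// Convert BF16 weights to FP16 through an FP32 intermediate, one element per
// thread. __bfloat162float and __float2half are standard CUDA intrinsics.
__global__ void bf16_to_fp16(const __nv_bfloat16 *src, __half *dst, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float proxy = __bfloat162float(src[i]);  // widen BF16 -> FP32
        dst[i]      = __float2half(proxy);       // narrow FP32 -> FP16
    }
}

// Example launch for n parameters already on the device:
// bf16_to_fp16<<<(n + 255) / 256, 256>>>(d_bf16, d_fp16, n);
```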
## Setup and Usage
### Minimum Requirements:
```bash
- 24GB+ VRAM CUDA Device
- HuggingFace account
- Operating System: UNIX or WSL
- CUDA Toolkit (7.5+)
```
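Before building, it may help to verify that the visible device meets the memory requirement above. A small sketch using the CUDA runtime API (the threshold and output format are illustrative, not part of the project):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Quick VRAM sanity check against the 24 GB requirement listed above.
// Cards marketed as "24 GB" typically report slightly under 24 GiB,
// so the threshold below is deliberately a little lower.
int main(void) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "No CUDA device found\n");
        return 1;
    }
    double gib = prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0);
    printf("%s: %.1f GiB VRAM\n", prop.name, gib);
    return gib >= 23.0 ? 0 : 1;
}
```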
### Run Inference
1. Run the **[setup-docker.sh](https://github.com/abhisheknair10/Llama3.cu/blob/main/setup-docker.sh)** script to set up your virtual/physical machine to run Docker with access to Nvidia GPUs. Once the script has finished executing, log out of the terminal and log back in, then run **[run-docker.sh](https://github.com/abhisheknair10/Llama3.cu/blob/main/run-docker.sh)**.
```bash
# Setup Docker
chmod +x setup-docker.sh
./setup-docker.sh
```

```bash
# Restart terminal and run
chmod +x run-docker.sh
./run-docker.sh
```

2. For this inference engine to work, SafeTensors-formatted file(s) of the Llama 3 8B model need to be stored in the ./model_weights/ folder. Head to the [HuggingFace - meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) repo to request access to the model. Additionally, [generate a Hugging Face token](https://huggingface.co/settings/tokens) so that the next step can download the weight files.
3. Once the Docker container has started, run the following command to store your Hugging Face token as an environment variable, replacing the placeholder with the token you generated.
```bash
export HF_TOKEN=<your_token_here>
```

4. Next, run the following command to download the model weights into the target directory.
```bash
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./model_weights/ --token $HF_TOKEN
```

5. Run Make 🎉.
```bash
make run
```
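For context on what the loader has to do at startup: a .safetensors file begins with an 8-byte little-endian header length, followed by that many bytes of JSON mapping tensor names to dtypes, shapes, and byte offsets. Below is a minimal sketch of reading that header with the cJSON library credited in the acknowledgments; the shard file name is illustrative, and this is not the repository's actual loader.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "cJSON.h"

// Sketch: read the JSON header of a .safetensors shard (assumes a
// little-endian host). The shard file name below is illustrative.
int main(void) {
    FILE *f = fopen("./model_weights/model-00001-of-00004.safetensors", "rb");
    if (!f) { perror("fopen"); return 1; }

    uint64_t header_len = 0;                        // 8-byte little-endian prefix
    if (fread(&header_len, sizeof header_len, 1, f) != 1) { fclose(f); return 1; }

    char *header = (char *)malloc(header_len + 1);  // JSON header follows the prefix
    if (!header || fread(header, 1, header_len, f) != header_len) { fclose(f); return 1; }
    header[header_len] = '\0';

    cJSON *root = cJSON_Parse(header);              // tensor name -> {dtype, shape, data_offsets}
    if (root) {
        printf("parsed header with %d entries\n", cJSON_GetArraySize(root));
        cJSON_Delete(root);
    }
    free(header);
    fclose(f);
    return 0;
}
```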
## Acknowledgments
A non-exhaustive list of sources:
1. [**Attention Is All You Need**](https://arxiv.org/abs/1706.03762)
2. [**LLaMA: Open and Efficient Foundation Language Models**](https://arxiv.org/abs/2302.13971)
3. [**RoFormer: Enhanced Transformer with Rotary Position Embedding**](https://arxiv.org/abs/2104.09864)
4. This project makes use of the [cJSON library by DaveGamble](https://github.com/DaveGamble/cJSON), which is licensed under the MIT License.