Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/J-sephB-lt-n/hosting-7B-llm-on-google-cloud
Speed benchmarking a 7B LLM on different gcloud VMs (using llama.cpp)
Last synced: 3 days ago
- Host: GitHub
- URL: https://github.com/J-sephB-lt-n/hosting-7B-llm-on-google-cloud
- Owner: J-sephB-lt-n
- License: gpl-3.0
- Created: 2024-07-22T14:10:34.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-07-23T11:39:58.000Z (6 months ago)
- Last Synced: 2025-01-01T22:41:43.465Z (7 days ago)
- Topics: agent, benchmark, benchmarking, compute-engine, google-cloud, google-cloud-platform, gpu, internlm, internlm-7b, internlm-chat-7b, llamacpp, llm, llm-agent, llms, nlp, python, speedtest
- Language: Python
- Homepage:
- Size: 21.5 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome_ai_agents - Hosting-7B-Llm-On-Google-Cloud - Speed benchmarking a 7B LLM on different gcloud VMs (using llama.cpp) (Building / Benchmarks)
README
# hosting-7B-llm-on-google-cloud
In this repo, I measure how fast the large language model InternLM2.5-7B-Chat (q5_k_m quantized) runs on different Google Cloud Compute Engine virtual machines.
On each machine, I run the same 5 queries, each of which answers questions about ~1000 words of text taken from a website. The benchmarking code is in [./query_speed_benchmark.py](./query_speed_benchmark.py).
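A single timed query against the local llama.cpp server looks roughly like the sketch below (the prompt, port and request parameters here are illustrative placeholders; the actual queries are defined in the benchmark script):

```bash
# minimal sketch of timing one query against the local llama.cpp server
# (assumes llama-server is already running on port 6969, as shown at the bottom of this README)
time curl -s http://localhost:6969/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Using only the following text, answer the question below.\n\n<~1000 words of website text>\n\nQuestion: ..."}
        ],
        "max_tokens": 256
      }'
```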
| machine type  | GPU(s)      | specs                          | boot disk size | GCP Image                            | cost per hour (USD) | mean inference time (single query) | individual inference times (seconds)         |
| ------------- | ----------- | ------------------------------ | -------------- | ------------------------------------ | ------------------- | ---------------------------------- | -------------------------------------------- |
| e2-highmem-2  | 0           | 2 vCPUs, 1 core, 16 GB memory  | 10 GB          |                                      | $0.12               | ~15 minutes                        | 878 (I got bored and stopped after this one) |
| e2-highmem-4  | 0           | 4 vCPUs, 2 cores, 32 GB memory | 10 GB          |                                      | $0.23               | ~7 minutes                         | 418, 440, 422, 419, 435                      |
| e2-highmem-8  | 0           | 8 vCPUs, 4 cores, 64 GB memory | 10 GB          |                                      | $0.47               | ~3.5 minutes                       | 205, 215, 209, 204, 215                      |
| n1-standard-4 | 1 Nvidia T4 | 4 vCPUs, 2 cores, 15 GB memory | 50 GB          | Deep Learning VM with CUDA 11.8 M123 | $0.67               | ~20 seconds                        | 7, 30, 13, 12, 41                            |

Code used for VM setup:
[./setup_vm.sh](./setup_vm.sh)
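For reference, creating the GPU machine from the last row of the table with `gcloud` might look roughly like this (a sketch only; the instance name, zone and image family are assumptions, and setup_vm.sh is the authoritative setup script):

```bash
# sketch: create an n1-standard-4 VM with one Nvidia T4 GPU from a Deep Learning VM image
# (instance name, zone and image family below are illustrative assumptions)
gcloud compute instances create llm-benchmark-vm \
  --zone=europe-west1-b \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=common-cu118-debian-11 \
  --image-project=deeplearning-platform-release \
  --boot-disk-size=50GB \
  --metadata=install-nvidia-driver=True
```

The CPU-only e2 machines in the table need none of the GPU-specific flags.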
Run the benchmark on a virtual machine:
```bash
# launch a local model server #
llama.cpp/llama-server -m './llm_models/model.gguf' --port 6969 --ctx-size 2000 > /dev/null 2>&1 &
# run the benchmark #
python3 query_speed_benchmark.py
# stop the local model server #
pkill llama-server
```