# Qwen in a Lambda

Last updated: 11/09/2024

(Marking the date because LLM APIs in Python move fast and may introduce breaking changes by the time anyone else reads this!)

## Intro:

- This is a small research project on packaging Qwen GGUF model files into AWS Lambda using Docker and the SAM CLI

- Adapted from https://makit.net/blog/llm-in-a-lambda-function/
- As of September '24, some required OS packages are missing from the guide above (and hence from its Dockerfile), possibly because llama-cpp-python @ 0.2.90 no longer bundles them (?)
- Who knows what else will break in the future :shrugs:

## Motivation:

- I wanted to find out if I could reduce my AWS spending by leveraging Lambda alone rather than Lambda + Bedrock, since using both services would incur more costs in the long run.

- The idea was to fit a small language model, which would be relatively less resource intensive, and hopefully get sub-second to single-second latency on a 128 - 256 MB memory configuration

- I also wanted to use GGUF models so I could try different quantization levels and find the best trade-off between performance and the file size loaded into memory
- My experimentation led me to Qwen2 1.5b Q5_K_M, as it had the best "performance" and "latency" locally at taking a prompt and spitting out a JSON structure using llama-cpp

## Prerequisites:

- Docker
- AWS SAM CLI
- AWS CLI
- Python 3.11
- ECR permissions
- Lambda permissions
- Download `qwen2-1_5b-instruct-q5_k_m.gguf` into `qwen_function/function/`
- Or download any other .gguf model that you'd like and change your model path in `app.py` / `LOCAL_PATH` (see the sketch after this list)
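
For orientation, here's a minimal sketch of what a handler wired up this way could look like. This is not the repo's actual `app.py`: the handler body, `n_ctx`, and `max_tokens` values are illustrative assumptions; only `LOCAL_PATH` and the model filename come from this README.

```python
# Illustrative sketch only -- not the repo's actual app.py.
# Assumes llama-cpp-python (~0.2.90) with the GGUF file bundled into the image.
import json

from llama_cpp import Llama

LOCAL_PATH = "./qwen2-1_5b-instruct-q5_k_m.gguf"

# Load once at module import so warm invocations can reuse the model.
llm = Llama(model_path=LOCAL_PATH, n_ctx=2048, verbose=False)


def lambda_handler(event, context):
    body = json.loads(event.get("body") or "{}")
    prompt = body.get("prompt", "")
    result = llm(prompt, max_tokens=256)  # plain completion API
    return {
        "statusCode": 200,
        "body": json.dumps({"response": result["choices"][0]["text"]}),
    }
```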

## Setup Guide:

- Install pip packages under `qwen_function/function/requirements.txt` (preferably in a venv/conda env)
- Run `sam build` / `sam validate`
- Run `sam local start-api` to test locally
- Run `curl --header "Content-Type: application/json" \
--request POST \
--data '{"prompt":"hello"}' \
http://localhost:3000/generate` to prompt the LLM
- Or use your preferred API clients (a Python example follows this list)
- Run `sam deploy --guided` to deploy to AWS
- This will deploy a CloudFormation stack consisting of an API Gateway and a Lambda function
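
If you'd rather script the request than use curl, here's the same call in Python. The `/generate` path and payload come from the curl example above; the `requests` dependency and the timeout value are assumptions.

```python
import requests

# POST the same payload as the curl example to the local SAM endpoint.
resp = requests.post(
    "http://localhost:3000/generate",
    json={"prompt": "hello"},
    timeout=60,  # generation (and local cold starts) can be slow
)
print(resp.json())
```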

## Metrics

- Localhost - MacBook M3 Pro, 32 GB

![alt text](/images/image.png)

- AWS
  - Initial config - 128 MB, 30s timeout
    - Lambda timed out! Cold start was timing out the Lambda
  - Adjusted config #1 - 512 MB, 30s timeout
    - Lambda timed out! Cold start was timing out the Lambda
  - Adjusted config #2 - 512 MB, 30s timeout
    - Lambda timed out! Cold start was timing out the Lambda

![alt text](/images/image-1.png)

- Adjusted config #3 - 3008 MB, 30s timeout - cold start

![alt text](/images/image-2.png)

- Adjusted config #3 - 3008 MB, 30s timeout - warm start

![alt text](/images/image-3.png)

## Observation

- Referring back to the pricing structure of Lambda:
  - [Pricing]()
  - 1536 MB / 1.465 s: $0.024638 over 1000 Lambda invocations
  - Qwen2 1.5b had me cranking the memory up to 3008 MB just to avoid timeouts, and responses still took 4 - 11 seconds!
  - Claude 3 Haiku: $0.00025 per 1000 input tokens and $0.00125 per 1000 output tokens (Asia Pacific - Tokyo)
- It may be cheaper to just use a hosted LLM on the cloud via AWS Bedrock, etc., as the pricing structure for Lambda w/ Qwen does not look competitive compared to Claude 3 Haiku (see the back-of-the-envelope numbers after this list)
- Furthermore, the API Gateway timeout is not easily configurable beyond 30s; depending on your use case, this may not be ideal
- Results on localhost depend on your machine specs!! They may heavily skew your perception - expectation vs. reality
- Depending on your use case, the latency per Lambda invocation and response may also make for a poor user experience
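
To make that comparison concrete, here's a back-of-the-envelope sketch in Python. The GB-second rate below is the published x86 on-demand Lambda compute rate at the time of writing and is an assumption (it varies by region and architecture), as is the 5 s duration picked from the middle of the observed 4 - 11 s range:

```python
# Back-of-the-envelope cost comparison -- all rates are assumptions,
# check the current AWS pricing pages before relying on them.
GB_SECOND_RATE = 0.0000166667  # USD per GB-second, x86 on-demand (assumed)

memory_gb = 3008 / 1024  # the config that finally stopped timing out
duration_s = 5.0         # mid-range of the observed 4 - 11 s latency
invocations = 1000

lambda_cost = memory_gb * duration_s * GB_SECOND_RATE * invocations
print(f"Lambda w/ Qwen2 1.5b: ~${lambda_cost:.4f} per 1000 invocations")
# -> roughly $0.24, before request and API Gateway charges

# Claude 3 Haiku (Tokyo): $0.00025 / 1K input + $0.00125 / 1K output tokens
haiku_cost = 0.00025 + 0.00125
print(f"Claude 3 Haiku: ~${haiku_cost:.4f} per 1K input + 1K output tokens")
```

Even ignoring the per-request and API Gateway charges, the self-hosted path comes out roughly two orders of magnitude more expensive per interaction than Haiku's per-token pricing for a 1K-in / 1K-out exchange.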

## Conclusion

All in all, I think this was a fun little experiment, even though Qwen2 1.5b didn't quite pan out against the budget & latency requirements of my side project. Thanks again to [@makit](https://github.com/makit) for the guide!