Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/buddylim/qwen2-in-a-lambda
Deploying Qwen2 (or any other GGUF models) into AWS Lambda
- Host: GitHub
- URL: https://github.com/buddylim/qwen2-in-a-lambda
- Owner: BuddyLim
- Created: 2024-09-11T07:47:28.000Z (17 days ago)
- Default Branch: main
- Last Pushed: 2024-09-11T13:54:35.000Z (17 days ago)
- Last Synced: 2024-09-24T21:38:46.660Z (3 days ago)
- Topics: aws, genai, genai-poc, generative-ai, gguf, lambda, llamacpp, llm, python, qwen2, serverless
- Language: Python
- Homepage:
- Size: 123 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Qwen in a Lambda
Updated: 11/09/2024
(Marking the date because LLM APIs in Python move fast and may introduce breaking changes by the time anyone else reads this!)
## Intro:
- This is a small research project on running Qwen GGUF model files in AWS Lambda using Docker and the SAM CLI
- Adapted from https://makit.net/blog/llm-in-a-lambda-function/
- As of September '24, some OS packages required at runtime are missing from the guide above (and its Dockerfile), possibly because llama-cpp-python @ 0.2.90 no longer bundles them (?)
- Who knows what new and breaking changes will appear in the future :shrug:

## Motivation:
- I wanted to find out if I could reduce my AWS spending by using Lambda alone rather than Lambda + Bedrock, since the two services together would cost more in the long run.
- The idea was to fit a small language model, which would be relatively less resource intensive, and hopefully get subsecond-to-second latency on a 128 - 256 MB memory configuration
- I also wanted to use GGUF models so I could try different quantization levels and find the best performance-to-file-size trade-off for loading into memory
- My experimentation led me to Qwen2 1.5B Q5_K_M, as it had the best "performance" and "latency" locally when receiving a prompt and emitting a JSON structure via llama-cpp

## Prerequisites:
- Docker
- AWS SAM CLI
- AWS CLI
- Python 3.11
- ECR permissions
- Lambda permissions
- Download `qwen2-1_5b-instruct-q5_k_m.gguf` into `qwen_function/function/`
- Or download any other .gguf model you'd like and change the model path in `app.py` (`LOCAL_PATH`)

## Setup Guide:
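For orientation, a minimal handler in the spirit of this project's `app.py` might look like the sketch below. The event shape, response format, and llama-cpp-python parameters here are assumptions for illustration, not the repo's exact code; only `LOCAL_PATH` and the GGUF file name come from the prerequisites above:

```python
import json

# The model file from the prerequisites; adjust if you use another GGUF model.
LOCAL_PATH = "qwen2-1_5b-instruct-q5_k_m.gguf"

_llm = None

def _get_llm():
    """Load the model lazily, once per warm container, so cold starts
    pay the cost only on the first invocation."""
    global _llm
    if _llm is None:
        from llama_cpp import Llama  # requires llama-cpp-python in the image
        _llm = Llama(model_path=LOCAL_PATH, n_ctx=2048)
    return _llm

def lambda_handler(event, context):
    body = json.loads(event.get("body") or "{}")
    prompt = body.get("prompt", "")
    if not prompt:
        return {"statusCode": 400,
                "body": json.dumps({"error": "missing prompt"})}
    out = _get_llm()(prompt, max_tokens=256)
    text = out["choices"][0]["text"]
    return {"statusCode": 200,
            "body": json.dumps({"completion": text})}
```

The lazy `Llama(...)` load is what makes warm invocations fast while cold starts remain expensive, which matches the timeout behaviour in the metrics below.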
- Install pip packages under `qwen_function/function/requirements.txt` (preferably in a venv/conda env)
- Run `sam build` / `sam validate`
- Run `sam local start-api` to test locally
- Run the following to prompt the LLM:

  ```sh
  curl --header "Content-Type: application/json" \
    --request POST \
    --data '{"prompt":"hello"}' \
    http://localhost:3000/generate
  ```
- Or use your preferred API clients
- Run `sam deploy --guided` to deploy to AWS
- This will deploy a CloudFormation stack consisting of an API Gateway and a Lambda function

## Metrics
- Localhost - MacBook M3 Pro 32 GB
![alt text](/images/image.png)
- AWS
  - Initial config - 128 MB, 30s timeout
    - Lambda timed out! Cold start was timing out the Lambda
  - Adjusted config #1 - 512 MB, 30s timeout
    - Lambda timed out! Cold start was timing out the Lambda
  - Adjusted config #2 - 512 MB, 30s timeout
    - Lambda timed out! Cold start was timing out the Lambda
    ![alt text](/images/image-1.png)
  - Adjusted config #3 - 3008 MB, 30s timeout - cold start
    ![alt text](/images/image-2.png)
  - Adjusted config #3 - 3008 MB, 30s timeout - warm start
    ![alt text](/images/image-3.png)
## Observation
- Referring back to the pricing structure of Lambda,
  - [Pricing]()
  - 1536 MB / 1.465 s / $0.024638 over 1000 Lambda invocations
- Qwen2 1.5B had me cranking the memory up to 3008 MB just to avoid timing out, and responses still took 4 - 11 seconds!
- Claude 3 Haiku / $0.00025 / $0.00125 over 1000 input tokens & 1000 output tokens / Asia - Tokyo
- It may be cheaper to just use a hosted LLM (e.g. via AWS Bedrock), as the pricing structure for Lambda w/ Qwen does not look competitive compared to Claude 3 Haiku
- Furthermore, the API Gateway timeout is not easily configurable beyond 30s; depending on your use case, this may not be ideal
- Results measured locally depend heavily on your machine specs!! and may skew your perception - expectation vs reality
- Also depending on your use case, the latency per Lambda invocation and response might make for a poor user experience
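The cost comparison above can be sketched with back-of-the-envelope arithmetic. The rates and token counts below are assumptions, not measurements from this repo: the standard x86 Lambda rate of ~$0.0000166667 per GB-second plus $0.20 per 1M requests, the best-case warm latency of ~4 s at 3008 MB from the metrics above, and a modest ~100 input / 100 output tokens per request for Haiku:

```python
# Rough cost for 1000 requests: self-hosted Qwen2 on Lambda vs Claude 3 Haiku.
GB_SECOND_RATE = 0.0000166667    # assumed standard x86 Lambda rate per GB-s
REQUEST_RATE = 0.20 / 1_000_000  # assumed Lambda per-request charge

memory_gb = 3008 / 1024          # config that finally stopped timing out
duration_s = 4.0                 # best-case warm latency observed
invocations = 1000

lambda_cost = (memory_gb * duration_s * GB_SECOND_RATE + REQUEST_RATE) * invocations

# Claude 3 Haiku (Tokyo) rates from above, per 1000 tokens.
in_tokens, out_tokens = 100, 100  # assumed tokens per request
haiku_cost = invocations * (in_tokens / 1000 * 0.00025
                            + out_tokens / 1000 * 0.00125)

print(f"Lambda+Qwen2: ${lambda_cost:.4f} / Haiku: ${haiku_cost:.4f} per 1000 requests")
```

Even at the best-case warm latency, the self-hosted path lands in the same ballpark as (or above) Haiku for modest prompts - before counting cold starts and the 3008 MB floor.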
## Conclusion
All in all, I think this was a fun little experiment, even though Qwen2 1.5B didn't quite meet the budget & latency requirements for my side project. Thanks to [@makit](https://github.com/makit) again for the guide!