{"id":15055058,"url":"https://github.com/buddylim/qwen2-in-a-lambda","last_synced_at":"2026-02-28T17:31:35.472Z","repository":{"id":256577348,"uuid":"855635131","full_name":"BuddyLim/qwen2-in-a-lambda","owner":"BuddyLim","description":"Deploying Qwen2 (or any other  GGUF models) into AWS Lambda","archived":false,"fork":false,"pushed_at":"2024-09-11T13:54:35.000Z","size":126,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-21T06:50:18.413Z","etag":null,"topics":["aws","genai","genai-poc","generative-ai","gguf","lambda","llamacpp","llm","python","qwen2","serverless"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BuddyLim.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-11T07:47:28.000Z","updated_at":"2024-09-11T16:43:50.000Z","dependencies_parsed_at":"2024-09-11T21:42:23.152Z","dependency_job_id":"2c0a39e3-d3ec-4d4d-9618-08ab99147c93","html_url":"https://github.com/BuddyLim/qwen2-in-a-lambda","commit_stats":null,"previous_names":["buddylim/qwen2-in-a-lambda"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BuddyLim%2Fqwen2-in-a-lambda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BuddyLim%2Fqwen2-in-a-lambda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BuddyLim%2Fqwen2-in-a-lambda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/BuddyLim%2Fqwen2-in-a-lambda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BuddyLim","download_url":"https://codeload.github.com/BuddyLim/qwen2-in-a-lambda/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230852588,"owners_count":18290081,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","genai","genai-poc","generative-ai","gguf","lambda","llamacpp","llm","python","qwen2","serverless"],"created_at":"2024-09-24T21:39:22.054Z","updated_at":"2025-10-06T07:31:08.350Z","avatar_url":"https://github.com/BuddyLim.png","language":"Python","readme":"# Qwen in a Lambda\n\nUpdated at 11/09/2024\n\n(Marking the date because LLM APIs in Python move fast and may introduce breaking changes by the time anyone else reads this!)\n\n## Intro:\n\n- This is a small research project on how we can put Qwen GGUF model files into AWS Lambda using Docker and the SAM CLI\n\n- Adapted from https://makit.net/blog/llm-in-a-lambda-function/\n  - As of September '24, some required OS packages are not included in the above guide and subsequently in the Dockerfile, possibly because llama-cpp-python @ 0.2.90 does not bundle the required OS packages (?)\n  - Who knows if there's anything new and breaking that will appear in the future :shrugs:\n\n## Motivation:\n\n- I wanted to find out whether I could reduce my AWS spending by leveraging only the capabilities of Lambda and not Lambda + Bedrock, as using both services would incur more cost in the long run.\n\n- The idea was to fit a small language model 
that would be relatively less resource intensive and, hopefully, deliver subsecond-to-second latency on a 128 - 256 MB memory configuration\n\n- I also wanted to use GGUF models so I could try different levels of quantization and find the best performance / file size trade-off to load into memory\n  - My experimentation led me to Qwen2 1.5b Q5_K_M, as it had the best \"performance\" and \"latency\" locally at receiving a prompt and spitting out a JSON structure using llama-cpp\n\n## Prerequisites:\n\n- Docker\n- AWS SAM CLI\n- AWS CLI\n- Python 3.11\n- ECR permissions\n- Lambda permissions\n- Download `qwen2-1_5b-instruct-q5_k_m.gguf` into `qwen_function/function/`\n  - Or download any other .gguf model that you'd like and change your model path in `app.py` / `LOCAL_PATH`\n\n## Setup Guide:\n\n- Install the pip packages under `qwen_function/function/requirements.txt` (preferably in a venv/conda env)\n- Run `sam build` / `sam validate`\n- Run `sam local start-api` to test locally\n- Run `curl --header \"Content-Type: application/json\" \\\n--request POST \\\n--data '{\"prompt\":\"hello\"}' \\\nhttp://localhost:3000/generate` to prompt the LLM\n  - Or use your preferred API client\n- Run `sam deploy --guided` to deploy to AWS\n- This will deploy a CloudFormation stack consisting of an API Gateway and a Lambda function\n\n## Metrics\n\n- Localhost - MacBook M3 Pro 32 GB\n\n![alt text](/images/image.png)\n\n- AWS\n\n  - Initial config - 128 MB, 30s timeout\n    - Lambda timed out! Cold start was timing out the Lambda\n  - Adjusted config #1 - 512 MB, 30s timeout\n\n    - Lambda timed out! Cold start was timing out the Lambda\n\n  - Adjusted config #2 - 512 MB, 30s timeout\n    - Lambda timed out! 
Cold start was timing out the Lambda\n\n![alt text](/images/image-1.png)\n\n- Adjusted config #3 - 3008 MB, 30s timeout - cold start\n\n![alt text](/images/image-2.png)\n\n- Adjusted config #3 - 3008 MB, 30s timeout - warm start\n\n![alt text](/images/image-3.png)\n\n## Observation\n\n- Referring back to the pricing structure of Lambda,\n\n  - [Pricing](\u003chttps://docs.aws.amazon.com/lambda/latest/operatorguide/computing-power.html#:~:text=Since%20the%20Lambda%20service%20charges,and%20duration%20(in%20seconds)\u003e)\n  - 1536 MB / 1.465 s / $0.024638 over 1000 Lambda invocations\n    - Qwen2 1.5b had me cranking the memory up to 3008 MB just to avoid timing out, and responses still took 4 - 11 seconds!\n  - Claude 3 Haiku / $0.00025 / $0.00125 over 1000 input tokens \u0026 1000 output tokens / Asia - Tokyo\n\n- It may be cheaper to just use a hosted LLM on the cloud via AWS Bedrock, etc., as the pricing structure for Lambda w/ Qwen does not look competitive compared to Claude 3 Haiku\n\n- Furthermore, the API Gateway timeout is not easily configurable beyond 30s; depending on your use case, this may not be ideal\n\n- Local results depend on your machine specs!! and may heavily skew your perception: expectation vs reality\n\n- Also, depending on your use case, the latency per Lambda invocation might make for a poor user experience\n\n## Conclusion\n\nAll in all, I think this was a fun little experiment, even though Qwen2 1.5b didn't quite meet the budget \u0026 latency requirements for my side project. 
Thanks to [@makit](https://github.com/makit) again for the guide!\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbuddylim%2Fqwen2-in-a-lambda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbuddylim%2Fqwen2-in-a-lambda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbuddylim%2Fqwen2-in-a-lambda/lists"}