{"id":29139400,"url":"https://github.com/trackit/eks-auto-mode-gpu","last_synced_at":"2025-06-30T15:04:36.994Z","repository":{"id":287960935,"uuid":"956040814","full_name":"trackit/eks-auto-mode-gpu","owner":"trackit","description":"Terraform and Helm configuration for hosting Fooocus and DeepSeek-R1 on Amazon EKS with Auto Mode","archived":false,"fork":false,"pushed_at":"2025-05-09T11:51:44.000Z","size":181,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-09T12:37:27.820Z","etag":null,"topics":["deepseek-r1","eks","fooocus","gpu","helm","kubernetes","terraform"],"latest_commit_sha":null,"homepage":"","language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit-0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/trackit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-27T15:48:56.000Z","updated_at":"2025-05-09T11:51:48.000Z","dependencies_parsed_at":null,"dependency_job_id":"04679ca1-d2fb-43b7-8af7-7bfcc071be5f","html_url":"https://github.com/trackit/eks-auto-mode-gpu","commit_stats":null,"previous_names":["trackit/eks-auto-mode-gpu"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/trackit/eks-auto-mode-gpu","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trackit%2Feks-auto-mode-gpu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trackit%2Feks-auto-mode-gpu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trackit%2Feks-auto-mode-gpu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trackit%2Feks-auto-mode-gpu/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/trackit","download_url":"https://codeload.github.com/trackit/eks-auto-mode-gpu/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trackit%2Feks-auto-mode-gpu/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262797621,"owners_count":23365889,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deepseek-r1","eks","fooocus","gpu","helm","kubernetes","terraform"],"created_at":"2025-06-30T15:04:35.686Z","updated_at":"2025-06-30T15:04:36.963Z","avatar_url":"https://github.com/trackit.png","language":"HCL","readme":"# Hosting Fooocus and DeepSeek-R1 on Amazon EKS\n\n*This repository is a fork of the [original project by AWS Samples](https://github.com/aws-samples/deepseek-using-vllm-on-eks). 
## Deploying Fooocus on Amazon EKS Auto Mode

Build the Fooocus container image and push it to Amazon ECR.

`export ECR_REPO=$(terraform output ecr_repository_uri_fooocus | jq -r)`

`docker build -t $ECR_REPO:latest ./fooocus-chart/application/`

`aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REPO`

`docker push $ECR_REPO:latest`

Then update your `terraform.tfvars` file to set both `deploy_fooocus` and `enable_gpu` to `true`.
Fooocus does not support Neuron-based instances at the moment.

```hcl
deploy_fooocus = true
enable_gpu = true
```

Then execute `terraform plan -out="plan.out"` and `terraform apply "plan.out"`.

After the deployment is finished, you can check the status of the pods in the `fooocus` namespace.

```bash
kubectl get pods -n fooocus
```

You should see the `fooocus` pod running.

### Interact with the Fooocus Web UI

To access the Fooocus web UI, you need to set up a port-forwarding session to the Fooocus service.

```bash
# Set up port forwarding to access the Fooocus web UI
kubectl port-forward svc/fooocus-service -n fooocus 7865:80
```

Then open your web browser and navigate to `http://localhost:7865`.

## Scaling Fooocus with sticky sessions

Set `enable_autoscaling` to `true` in the `tfvars` file to enable autoscaling, then plan and apply the changes.

```hcl
enable_autoscaling = true
```

```bash
# Plan and apply the changes
terraform plan -out="plan.out"
terraform apply "plan.out"
```

Check [this part of the readme](#scaling-deepseek-r1-api-on-amazon-eks-auto-mode) to see how it works; autoscaling relies on the same components as the DeepSeek-R1 API.

We use an ALB to route traffic to the Fooocus service with sticky sessions enabled. Sticky sessions maintain a stateful connection via cookies, allowing the ALB to route all of a given user's traffic to the same pod. If you want to test different sessions, use different browsers or incognito mode.
For our DeepSeek-R1 API we use the same ALB, but without sticky sessions, because the API is stateless.

To access the Fooocus web UI, you need to get the URL of the load balancer.

```bash
# With terraform
terraform output -raw fooocus_ingress_hostname
# Or with kubectl
kubectl get ingress -n fooocus
```

Then open your web browser and navigate to the URL.
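If you would rather verify stickiness from the command line than with two browsers, a curl sketch along these lines can help. It assumes the ALB issues its stickiness cookie under the default `AWSALB`/`AWSALBAPP` names; the exact cookie name depends on how stickiness is configured on the target group.

```bash
HOST=$(terraform output -raw fooocus_ingress_hostname)

# First request: save the stickiness cookie the ALB sets
curl -s -c cookies.txt -o /dev/null "http://$HOST/"
grep -i awsalb cookies.txt   # stickiness cookie (name may vary)

# Requests replaying the cookie jar should keep hitting the same pod;
# requests without it simulate a fresh session and may land elsewhere
curl -s -b cookies.txt -o /dev/null -w "with cookie: %{http_code}\n" "http://$HOST/"
curl -s -o /dev/null -w "no cookie:   %{http_code}\n" "http://$HOST/"
```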
## Deploying DeepSeek-R1 on Amazon EKS Auto Mode

For this tutorial, we'll use the [***DeepSeek-R1-Distill-Llama-8B***](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) distilled model.
It requires fewer resources (such as GPUs) than the full 671B-parameter [***DeepSeek-R1***](https://huggingface.co/deepseek-ai/DeepSeek-R1) model, making it a lighter, though less powerful, alternative.

If you'd prefer to deploy the full DeepSeek-R1 model, simply replace the distilled model in the vLLM configuration.

### Prerequisites

- [Check AWS Instance Quota](https://docs.aws.amazon.com/ec2/latest/instancetypes/ec2-instance-quotas.html)
- [Install kubectl](https://kubernetes.io/docs/tasks/tools/)
- [Install terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli)
- [Install finch](https://runfinch.com/docs/getting-started/installation/) or [docker](https://docs.docker.com/get-started/get-docker/)

### Create an Amazon EKS Cluster w/ Auto Mode using Terraform

We'll use Terraform to easily provision the infrastructure, including a VPC, an ECR repository, and an EKS cluster with Auto Mode enabled.

```bash
# Clone the GitHub repo with the manifests
git clone https://github.com/aws-samples/deepseek-using-vllm-on-eks
cd deepseek-using-vllm-on-eks

# Apply the Terraform configuration
terraform init
terraform apply -auto-approve

# Configure kubectl using the command exposed as a Terraform output
$(terraform output configure_kubectl | jq -r)
```

### Deploy DeepSeek Model

In this step, we will deploy the **DeepSeek-R1-Distill-Llama-8B** model using vLLM on Amazon EKS.
We will walk through deploying the model on GPU-based instances, Neuron-based instances (Inferentia and Trainium), or both, by configuring the parameters accordingly.

#### Configuring Node Pools

The `enable_auto_mode_node_pool` parameter can be set to `true` to automatically create node pools when using EKS Auto Mode.
This configuration is defined in the [nodepool_automode.tf](./nodepool_automode.tf) file. If you're using EKS Auto Mode, this ensures that the appropriate node pools are provisioned.

#### Customizing Helm Chart Values

To customize the values used to host your model with vLLM, check the [helm.tf](./helm.tf) file.
This file defines the model to be deployed (**deepseek-ai/DeepSeek-R1-Distill-Llama-8B**) and allows you to pass additional parameters to vLLM.
You can modify this file to change resource configurations, node selectors, or tolerations as needed.

```bash
# Let's start by just enabling the GPU-based option:
terraform apply -auto-approve -var="enable_gpu=true" -var="enable_auto_mode_node_pool=true"

# Check the pods in the 'deepseek' namespace
kubectl get po -n deepseek
```

<details>
  <summary>Click to deploy with Neuron based Instances</summary>

  ```bash
  # Before adding Neuron support we need to build the image for the vLLM DeepSeek Neuron-based deployment.

  # Let's start by getting the ECR repo name where we'll be pushing the image
  export ECR_REPO_NEURON=$(terraform output ecr_repository_uri_neuron | jq -r)

  # Now, let's clone the official vLLM repo and use its official container image with the Neuron drivers installed
  git clone https://github.com/vllm-project/vllm
  cd vllm

  # Build the image
  finch build --platform linux/amd64 -f Dockerfile.neuron -t $ECR_REPO_NEURON:0.1 .

  # Log in to the ECR repository
  aws ecr get-login-password | finch login --username AWS --password-stdin $ECR_REPO_NEURON

  # Push the image
  finch push $ECR_REPO_NEURON:0.1

  # Remove the vllm repo and container image from the local machine
  cd ..
  rm -rf vllm
  finch rmi $ECR_REPO_NEURON:0.1

  # Enable the additional node pool and deploy the vLLM DeepSeek model
  terraform apply -auto-approve -var="enable_gpu=true" -var="enable_neuron=true" -var="enable_auto_mode_node_pool=true"
  ```
</details>

Initially, the pod might be in a **Pending state** while EKS Auto Mode provisions the underlying EC2 instances with the required drivers.

<details>
  <summary>Click if your pod is stuck in a "pending" state for several minutes</summary>

  ```bash
  # Check if the node was provisioned
  kubectl get nodes -l owner=yourname
  ```
  If no nodes are displayed, verify that your AWS account has sufficient service quota to launch the required instances.
  Check the quota limits for G, P, or Inf instances (i.e., GPU or Neuron-based instances).

  For more information, refer to the [AWS EC2 Instance Quotas documentation](https://docs.aws.amazon.com/ec2/latest/instancetypes/ec2-instance-quotas.html).

  **Note:** Those quotas are based on vCPUs, not the number of instances, so be sure to request accordingly.

</details>

```bash
# Wait for the pod to reach the 'Running' state
kubectl get po -n deepseek --watch

# Verify that a new Node has been created
kubectl get nodes -l owner=yourname -o wide

# Check the logs to confirm that vLLM has started.
# Select the command based on the accelerator you chose to deploy.
kubectl logs deployment.apps/deepseek-gpu-vllm-chart -n deepseek
kubectl logs deployment.apps/deepseek-neuron-vllm-chart -n deepseek
```

You should see the log entry **Application startup complete** once the deployment is ready.

### Interact with the LLM

Next, we can create a local proxy to interact with the model using a curl request.

```bash
# Set up a proxy to forward the service port to your local terminal
# We expose the Neuron-based service on port 8080 and the GPU-based service on port 8081
kubectl port-forward svc/deepseek-neuron-vllm-chart -n deepseek 8080:80 > port-forward-neuron.log 2>&1 &
kubectl port-forward svc/deepseek-gpu-vllm-chart -n deepseek 8081:80 > port-forward-gpu.log 2>&1 &

# Send a curl request to the model (change the port according to the accelerator you are using)
curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" --data '{
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "messages": [
    {
      "role": "user",
      "content": "What is Kubernetes?"
    }
  ]
}'
```

The response may take a few seconds to generate, depending on the complexity of the model's output.
You can monitor the progress via the `deepseek-gpu-vllm-chart` or `deepseek-neuron-vllm-chart` deployment logs.
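Because vLLM serves an OpenAI-compatible API, you can also ask for a streamed response, which prints tokens as they are generated instead of waiting for the full completion. A minimal sketch against the same port-forward as above:

```bash
# Streamed completion: tokens arrive as server-sent events ("data: ..." lines)
curl -N -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "stream": true,
    "messages": [
      {"role": "user", "content": "What is Kubernetes?"}
    ]
  }'
```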
### Build a Chatbot UI for the Model

While direct API requests work fine, let's build a more user-friendly Chatbot UI to interact with the model. The source code for the UI is already available in the GitHub repository.

```bash
# Retrieve the ECR repository URI created by Terraform
export ECR_REPO=$(terraform output ecr_repository_uri | jq -r)

# Build the container image for the Chatbot UI
finch build --platform linux/amd64 -t $ECR_REPO:0.1 chatbot-ui/application/.

# Log in to ECR and push the image
aws ecr get-login-password | finch login --username AWS --password-stdin $ECR_REPO
finch push $ECR_REPO:0.1

# Update the deployment manifest to use the image
sed -i "s#__IMAGE_DEEPSEEK_CHATBOT__#$ECR_REPO:0.1#g" chatbot-ui/manifests/deployment.yaml

# Generate a random password for the Chatbot UI login
sed -i "s|__PASSWORD__|$(openssl rand -base64 12 | tr -dc A-Za-z0-9 | head -c 16)|" chatbot-ui/manifests/deployment.yaml

# Deploy the UI and create the ingress class required for load balancers
kubectl apply -f chatbot-ui/manifests/ingress-class.yaml
kubectl apply -f chatbot-ui/manifests/deployment.yaml

# Get the URL for the load balancer to access the application
echo http://$(kubectl get ingress/deepseek-chatbot-ingress -n deepseek -o json | jq -r '.status.loadBalancer.ingress[0].hostname')
```

To access the Chatbot UI, you'll need the username and password stored in a Kubernetes secret.

```bash
echo -e "Username=$(kubectl get secret deepseek-chatbot-secrets -n deepseek -o jsonpath='{.data.admin-username}' | base64 --decode)\nPassword=$(kubectl get secret deepseek-chatbot-secrets -n deepseek -o jsonpath='{.data.admin-password}' | base64 --decode)"
```

After logging in, you'll see a new **Chatbot tab** where you can interact with the model!
In this tab, you'll notice a dropdown menu that lets you switch between Neuron-based and GPU-based deployments.

![chatbot-ui](/static/images/chatbot.jpg)

## Scaling DeepSeek-R1 API on Amazon EKS Auto Mode

Set `enable_autoscaling` to `true` in the `tfvars` file to enable autoscaling, then plan and apply the changes.

```hcl
enable_autoscaling = true
```

```bash
# Plan and apply the changes
terraform plan -out="plan.out"
terraform apply "plan.out"
```

This will deploy the following resources to handle autoscaling:

- **DCGM Exporter**: an NVIDIA tool for monitoring GPU metrics.
- **Prometheus Operator**: a Kubernetes operator that manages Prometheus instances, allowing you to scrape metrics from the DCGM Exporter.
- **Prometheus Adapter**: a component that exposes Prometheus metrics through the Kubernetes custom metrics API.
- **Horizontal Pod Autoscaler**: a Kubernetes resource that automatically scales the number of pods in a deployment based on observed GPU utilization (a quick way to verify this metrics pipeline is sketched below).

It also deploys the AWS Application Load Balancer Controller, which is responsible for managing the ALB resources in your cluster. When you deploy multiple instances of the same model, the ALB automatically routes traffic to the appropriate instance based on the load.
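To sanity-check that GPU metrics actually flow from the DCGM Exporter through Prometheus to the HPA, you can query the custom metrics API that the Prometheus Adapter registers. This is a sketch: the metric names it returns (typically derived from DCGM metrics such as `DCGM_FI_DEV_GPU_UTIL`) depend entirely on the adapter's rule configuration.

```bash
# List the custom metrics the Prometheus Adapter exposes to Kubernetes
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq -r '.resources[].name'

# Watch the HPA's current metric values and replica counts
kubectl get hpa -n deepseek --watch
```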
Run the following command to get the URL of the load balancer:

```bash
terraform output -raw deepseek_ingress_hostname
```

You can use this curl command to send a request to the model:

```bash
curl -s -X POST "http://$(terraform output -raw deepseek_ingress_hostname)/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "messages": [
      {
        "role": "user",
        "content": "What is Kubernetes?"
      }
    ]
  }' | jq -r '.choices[0].message.content'
```

Or use the `stress-test.sh` script to send multiple requests to the model.
Make sure the model name matches in both the script and the curl command.
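If you just want a quick burst of load without the script, a minimal sketch like the following fires concurrent requests so you can watch the autoscaler react (endpoint and request body as above; `N` is arbitrary):

```bash
HOST=$(terraform output -raw deepseek_ingress_hostname)
N=20   # number of concurrent requests; tune to your quota and patience

for i in $(seq 1 "$N"); do
  curl -s -o /dev/null -X POST "http://$HOST/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{"model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B","messages":[{"role":"user","content":"What is Kubernetes?"}]}' &
done
wait

# In another terminal, watch the autoscaler respond
kubectl get hpa,pods -n deepseek --watch
```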
---
### Disclaimer

**This repository is intended for demonstration and learning purposes only.**
It is **not** intended for production use. The code provided here is for educational purposes and should not be used in a live environment without proper testing, validation, and modifications.

Use at your own risk. The authors are not responsible for any issues, damages, or losses that may result from using this code in production.