{"id":24803129,"url":"https://github.com/ajithvcoder/emlo4-session-16-ajithvcoder","last_synced_at":"2025-03-25T05:38:10.399Z","repository":{"id":273937724,"uuid":"918858841","full_name":"ajithvcoder/emlo4-session-16-ajithvcoder","owner":"ajithvcoder","description":null,"archived":false,"fork":false,"pushed_at":"2025-01-23T20:42:17.000Z","size":403,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-23T21:29:05.083Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ajithvcoder.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-19T03:34:36.000Z","updated_at":"2025-01-23T20:42:21.000Z","dependencies_parsed_at":"2025-01-23T21:39:10.614Z","dependency_job_id":null,"html_url":"https://github.com/ajithvcoder/emlo4-session-16-ajithvcoder","commit_stats":null,"previous_names":["ajithvcoder/emlo4-session-16-ajithvcoder"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Femlo4-session-16-ajithvcoder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Femlo4-session-16-ajithvcoder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Femlo4-session-16-ajithvcoder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ajithvcoder%2Femlo4-session-16-ajithvcoder/manifests","owner_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ajithvcoder","download_url":"https://codeload.github.com/ajithvcoder/emlo4-session-16-ajithvcoder/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245407755,"owners_count":20610232,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-30T05:17:41.587Z","updated_at":"2025-03-25T05:38:10.379Z","avatar_url":"https://github.com/ajithvcoder.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"### EMLOV4-Session-16 Assignment - Kubernetes - IV: IRSA, Volumes, ISTIO \u0026 KServe\n\nDeploy an LLM/diffusion model on AWS EKS with a front end (Next.js), a backend (FastAPI), network configuration such as ISTIO and S3 association, serving tools like KServe, and monitoring tools like Kiali, Prometheus and Grafana.\n\n**Note:**\n- I have used \"OFA-Sys/small-stable-diffusion-v0\" with 256x256 image generation instead of SD3-medium with 1024x1024, as the cost of debugging and developing on g6.2xlarge is very high.\n\n- A g6.2xlarge instance costs 0.4$ per hour even as a spot instance, so develop with a small model on g4dn.xlarge first and move to the SD3 models only once everything works. Otherwise you will end up losing 5-7 dollars on GPU alone. While developing the assignment, use a small model with \u003c=256x256 image generation with `diffusers`. 
Don't start with SD3 or 1024x1024 inference, as it needs 24GB of GPU RAM to load.\n\n**Wait patiently until all deletions show as successful on the AWS CloudFormation stack page before closing your system, because sometimes\nthe deletion fails, something keeps running in the background, and it may cost you a lot.**\n\n**If you launch a spot instance manually with the `persistent` type, ensure that the spot request is cancelled manually\nand that the AWS instance is terminated as well.**\n\n### Contents\n\n- [Requirements](#requirements)\n- [Development Method](#development-method)\n    - [Architecture Diagram](#architecture-diagram)\n    - [Installation](#installation)\n    - [Cluster creation and configuration](#cluster-creation-and-configuration)\n    - [Install ISTIO and loadbalancer](#install-istio-and-loadbalancer)\n    - [KServe and Helm Deployment](#kserve-and-helm-deployment)\n    - [Monitoring and visuvalization](#monitoring-and-visuvalization)\n    - [Deletion Procedure](#deletion-procedure)\n- [Learnings](#learnings)\n- [Results Screenshots](#results-screenshots)\n\n### Requirements\n\n- Redo the SD3 Deployment that we did in the class on KServe\n- Create README.md\n    - Write instructions to create the .mar file\n    - Write instructions to deploy the model on KServe\n- What to Submit\n    - Output of kubectl get all -A\n    - Manifest Files used for deployments\n    - Kiali Graph of the Deployment\n    - GPU Usage from Grafana and Prometheus while on LOAD\n    - Logs of your torchserve-predictor\n    - 5 Outputs of the SD3 Model\n        - Make sure you copy the logs of torchserve pod while the model is inferencing\n    - GitHub Repo with the README.md file and manifests and logs\n\n### Architecture Diagram\n\n![](./assets/images/snap_assgn_16_arch.png)\n\nNote: You can refer to [class-work](./eks-dev-class-work) and develop the deployments stage by stage, as in the session-16 class.\n\nRefer: [class-work-readme](./eks-dev-class-work/README.md) 
for proper usage of the classwork files (it gives the commands in a proper manner. todo: restructure it)\n\nNote: it took 5$ for class work debugging and development and another 7$ for assignment debugging and development, as I used g6.2xlarge initially, so don't make that mistake.\n\nLocal installations (no need for a new EC2 instance for the work below)\n\n### Installation\n\n**AWS install**\n\n```\ncurl \"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip\" -o \"awscliv2.zip\"\nunzip awscliv2.zip\nsudo ./aws/install\n```\n\n**Provide credentials**\n\n```\naws configure\n```\n\n**EKSCTL Install**\n\n```\n# for ARM systems, set ARCH to: `arm64`, `armv6` or `armv7`\nARCH=amd64\nPLATFORM=$(uname -s)_$ARCH\n\ncurl -sLO \"https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz\"\n\n# (Optional) Verify checksum\ncurl -sL \"https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt\" | grep $PLATFORM | sha256sum --check\n\ntar -xzf eksctl_$PLATFORM.tar.gz -C /tmp \u0026\u0026 rm eksctl_$PLATFORM.tar.gz\n\nsudo mv /tmp/eksctl /usr/local/bin\n```\n\n**Generate the default SSH key on your local machine**\n\nThis default SSH key is used by AWS for the default SSH login.\n\n```\nssh-keygen -t rsa -b 4096\n```\n\n**Install kubectl for AWS EKS on your local machine**\n\n```\ncurl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.32.0/2024-12-20/bin/linux/amd64/kubectl\n\nchmod +x ./kubectl\n\nmkdir -p $HOME/bin \u0026\u0026 cp ./kubectl $HOME/bin/kubectl \u0026\u0026 export PATH=$HOME/bin:$PATH\n```\n\n**Docker images to ECR**\n\nBuild and push docker images to AWS ECR \n\nModel server\n\n- KServe and the Kubernetes inference service take care of it, so no Docker image is needed for the model server\n\nWeb server\n\n- `docker build -t web-server -f Dockerfile.web-server .`\n\nUI server\n\n- `docker build -t ui-server -f Dockerfile.ui-server .`\n\n**Model file preparation**\n\nLaunch a new `g4dn.xlarge` spot instance **manually** and check if it works, else go 
for a smaller model or take a larger GPU instance.\n\nUse the code inside [model-server/sd3_deploy/test_small_model_infer](./src/model-server/sd3_deploy/test_small_model_infer.py)\n\n- `python test_small_model_infer.py`\n\nUse the code inside [model-server/sd3_deploy](./src/model-server/sd3_deploy/)\n```\n# Take a g4dn.xlarge instance. \n\naws configure\n\npython download_small_model.py \nmkdir -p ../model-store\nsh create_mar.sh\nsh upload_to_s3.sh\n```\nVerify in S3 that you can see all the files above. Don't forget to add the config folder while uploading.\n\n**Note: Make sure you change your account number in all `.yaml` files**\n\n### Cluster creation and configuration\n\nGo into the `src/eks-cluster-config` folder. Cluster creation takes 7-15 minutes, depending on the number of resources listed in the `.yaml` file.\n\n```\neksctl create cluster -f eks-cluster.yaml\n```\n\n```\n\u003cdebug-facts\u003e\n# only useful during debugging\n\n# Create a new nodegroup\n# You can comment out a resource, later uncomment it, and create the new nodegroup in the same cluster\n# This can save some cost of gpu nodes. 
But make sure to create it before doing GPU-related installations\neksctl create nodegroup --config-file=eks-cluster.yaml\n\n# Delete nodegroup\neksctl delete nodegroup --cluster basic-cluster --name ng-gpu-spot-1\n\n# Delete cluster. Also check \"AWS CloudFormation\", as sometimes the CLI reports success\n# while \"AWS CloudFormation\" shows the deletion as failed.\neksctl delete cluster -f eks-cluster.yaml --disable-nodegroup-eviction\n\n\u003c/debug-facts\u003e\n```\n\nCheck the instances in EC2\n\nFor this to work, the default SSH key must have been configured; it is used to establish the connection with EC2\n\n```\nssh ec2-user@43.204.212.5\nkubectl config view\nkubectl get all\n```\n\n**IRSA for s3 usage**\n\nEnsure [iam-s3-test-policy.json](./src/eks-cluster-config/iam-s3-test-policy.json) is in your current path\n\n```\n# Associates an IAM OIDC (OpenID Connect) provider with your EKS cluster. This association allows you to enable Kubernetes\n# service accounts in your cluster to use IAM roles for fine-grained permissions.\neksctl utils associate-iam-oidc-provider --region ap-south-1 --cluster basic-cluster --approve\n\n# Create IAM policy\naws iam create-policy --policy-name S3ListTestEMLO --policy-document file://iam-s3-test-policy.json\n\n# Verification\naws iam get-policy-version --policy-arn arn:aws:iam::ACCOUNT_ID:policy/S3ListTestEMLO --version-id v1\n\n# Create a service account with the policy attached\neksctl create iamserviceaccount --name s3-list-sa   --cluster basic-cluster   --attach-policy-arn arn:aws:iam::306093656765:policy/S3ListTestEMLO   --approve --region ap-south-1\n```\n\nVerify\n\n- `kubectl get sa`\n- `aws s3 ls mybucket-emlo-mumbai` - \"mybucket-emlo-mumbai\" is my bucket name\n\n**EBS on EKS**\n\n```\n# If needed: IAM OIDC association\neksctl utils associate-iam-oidc-provider --region ap-south-1 --cluster basic-cluster --approve\n\n# Service account for EBS usage\neksctl create iamserviceaccount \\\n  --name ebs-csi-controller-sa \\\n  --namespace 
kube-system \\\n  --region ap-south-1 \\\n  --cluster basic-cluster \\\n  --attach-role-arn arn:aws:iam::306093656765:role/AmazonEKS_EBS_CSI_DriverRole\n\n# Create an addon in the cluster\neksctl create addon --name aws-ebs-csi-driver --cluster basic-cluster --service-account-role-arn arn:aws:iam::306093656765:role/AmazonEKS_EBS_CSI_DriverRole --region ap-south-1 --force\n```\n\nVerify resources\n\n- `kubectl get nodes -L node.kubernetes.io/instance-type`\n- `kubectl get sc`\n- `kubectl get sa -n kube-system`\n\n### Install ISTIO and loadbalancer\n\nIstio is an open source service mesh that helps organizations run distributed, microservices-based apps anywhere. Why use Istio? Istio enables organizations to secure, connect, and monitor microservices, so they can modernize their enterprise apps more swiftly and securely.\n\nSet up ISTIO\n\n- `helm repo add istio https://istio-release.storage.googleapis.com/charts`\n- `helm repo update`\n\n- `kubectl create namespace istio-system`\n- `helm install istio-base istio/base --version 1.20.2  --namespace istio-system --wait`\n- `helm install istiod istio/istiod  --version 1.20.2  --namespace istio-system --wait`\n- `kubectl create namespace istio-ingress`\n\n```\n\u003cdebug-facts\u003e\n# Can be used if something in the istio-ingress service didn't start or didn't get an external IP after a long time\n# (helm uninstall takes the release name only, not the chart name)\nhelm uninstall istio-ingress --namespace istio-ingress\n\u003c/debug-facts\u003e\n```\n\nInstall istio-ingress\n```\nhelm install istio-ingress istio/gateway \\\n  --version 1.20.2 \\\n  --namespace istio-ingress \\\n  --set labels.istio=ingressgateway \\\n  --set service.annotations.\"service\\\\.beta\\\\.kubernetes\\\\.io/aws-load-balancer-type\"=external \\\n  --set service.annotations.\"service\\\\.beta\\\\.kubernetes\\\\.io/aws-load-balancer-nlb-target-type\"=ip \\\n  --set service.annotations.\"service\\\\.beta\\\\.kubernetes\\\\.io/aws-load-balancer-scheme\"=internet-facing \\\n  --set 
service.annotations.\"service\\\\.beta\\\\.kubernetes\\\\.io/aws-load-balancer-attributes\"=\"load_balancing.cross_zone.enabled=true\" \n```\n- `kubectl rollout restart deployment istio-ingress -n istio-ingress`\n\nVerify\n- `kubectl get deployment.apps/istio-ingress  -n istio-ingress`\n\nNote: Install the load balancer controller, which is a few steps below -\u003e wait for the load balancer to become active, then run the `kubectl rollout restart` command -\u003e check if istio gets an external IP once the load balancer is active; if not, redo the steps above and debug. If `istio` didn't get an external IP, or if any of its Kubernetes resources has status `imagepullfailed`, redo the steps above, else you can't get an accessible URL.\n\n\n**ALB**\n\n- The policy (`AWSLoadBalancerControllerIAMPolicy`) is assumed to be already created in session-15; otherwise refer to that session\n\n```\neksctl create iamserviceaccount \\\n--cluster=basic-cluster \\\n--namespace=kube-system \\\n--name=aws-load-balancer-controller \\\n--attach-policy-arn=arn:aws:iam::306093656765:policy/AWSLoadBalancerControllerIAMPolicy \\\n--override-existing-serviceaccounts \\\n--region ap-south-1 \\\n--approve\n```\n\n- `helm repo add eks https://aws.github.io/eks-charts`\n- `helm repo update`\n- `helm install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --set clusterName=basic-cluster --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller`\n\nVerify\n\n- `kubectl get pods,svc -n istio-system`\n- `kubectl get pods,svc -n istio-ingress`\n\nIf the istio-ingress external IP is not assigned even after the load balancer becomes active, use the command below\n\n- `kubectl rollout restart deployment istio-ingress -n istio-ingress`\n\n**Install Metrics**\n\n- [Metrics Server installation reference](https://medium.com/@cloudspinx/fix-error-metrics-api-not-available-in-kubernetes-aa10766e1c2f)\n\n```\n\u003cdebug-facts\u003e\nkubectl delete -f 
https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml\nkubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml\n# Additional patching is described in the reference above\n\u003c/debug-facts\u003e\n```\n\nKubernetes Dashboard\n\n- `helm repo add kubernetes-dashboard https://kubernetes.github.io/dashboard/`\n- `helm upgrade --install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard --create-namespace --namespace kubernetes-dashboard`\n\n- `kubectl label namespace default istio-injection=enabled`\n\nInstall the Gateway CRDs\n\n- `kubectl get crd gateways.gateway.networking.k8s.io \u0026\u003e /dev/null || \\\n  { kubectl kustomize \"github.com/kubernetes-sigs/gateway-api/config/crd?ref=v1.2.0\" | kubectl apply -f -; }`\n\n### KServe and Helm Deployment\n\nApply [istio-kserve-ingress](./src/eks-cluster-config/istio-kserve-ingress.yaml) to create the `istio` class and assign it to KServe\n\n- `kubectl apply -f istio-kserve-ingress.yaml`\n\nInstall Cert Manager\n- `kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.16.2/cert-manager.yaml`\n\n\nInstall KServe using HELM\n\n- `helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.14.1`\n\n- `helm install kserve oci://ghcr.io/kserve/charts/kserve \\\n  --version v0.14.1 \\\n  --set kserve.controller.deploymentMode=RawDeployment \\\n  --set kserve.controller.gateway.ingressGateway.className=istio`\n\n\n**GPU installations**\n\n- `helm repo add nvidia https://helm.ngc.nvidia.com/nvidia  \u0026\u0026 helm repo update`\n- `helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=v24.9.1`\n- `kubectl -n gpu-operator logs -f $(kubectl -n gpu-operator get pods | grep dcgm | cut -d ' ' -f 1 | head -n 1)`\n\nLinking S3 policy to Cluster\n\n- `eksctl utils associate-iam-oidc-provider --region ap-south-1 --cluster basic-cluster --approve`\n\n- `eksctl 
create iamserviceaccount \\\n\t--cluster=basic-cluster \\\n\t--name=s3-read-only \\\n\t--attach-policy-arn=arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \\\n\t--override-existing-serviceaccounts \\\n\t--region ap-south-1 \\\n\t--approve`\n\n(I was not able to move this into the helm charts, but there is a way)\n\n- `kubectl apply -f s3-secret.yaml` \n\n- `kubectl patch serviceaccount s3-read-only -p '{\"secrets\": [{\"name\": \"s3-secret\"}]}'`\n\n**Helm Installation**\n\nNote:\n- Make sure you have updated the istio endpoint in the file `fastapi-helm/templates/model-server.cm.yml`. The istio endpoint doesn't change until you reinstall the istio service in the cluster, so as long as you don't touch the istio services it remains constant.\n- Your `ui-server` endpoint is dynamic for every helm install, as it's configured inside `fastapi-helm/templates/ui-server.service.yml`\n\nInstall fastapi-helm\n- `helm install fastapi-release-default fastapi-helm --values fastapi-helm/values.yaml`\n\nVerify\n- `kubectl get all`\n\n```\n\u003cdebug-facts\u003e\n# Uninstallation\nhelm uninstall fastapi-release-default\n\n# If uninstallation fails, run the command below\nkubectl delete secret sh.helm.release.v1.fastapi-release-default.v1  -n default\n\n# Want to change the image of a pod? Set the image pull policy to \"Always\", update the image in AWS ECR and then delete the pod\n# After deletion the cluster recreates it with the new image. This helps in faster debugging without needing to delete the entire cluster.\nkubectl delete pods/web-server-669c6cc8f8-zjvsd \n\u003c/debug-facts\u003e\n```\n\nHacks made to make this code work: I was not able to install into any of the existing namespaces, so only after running the commands below was I able to give Helm access to deploy in preexisting namespaces. 
`fastapi-release-default` is the Helm release name chosen by the user.\n\n```\nkubectl label namespace default app.kubernetes.io/managed-by=Helm\nkubectl annotate namespace default meta.helm.sh/release-name=fastapi-release-default\nkubectl annotate namespace default meta.helm.sh/release-namespace=default\nhelm install fastapi-release-default fastapi-helm --values fastapi-helm/values.yaml --namespace default\n```\n\n**Debugging inside helm pods**\n\nGet inside a specific pod with an interactive shell\n- `kubectl exec -it ui-server-59cb8d9f96-8559p -- /bin/bash`\n\nCheck if the web service is reachable from the ui-server pods. If a service like ui-server or web-server cannot be reached, get inside the specific pod and curl the service\n\n- `curl -X POST \"http://web-server-service/generate_image?text=horseriding\"`\n\nWhile debugging Dockerfiles, rebuild with `--no-cache`, as otherwise it may sometimes push the same image again\n\n- `docker build -t a16/web-server -f Dockerfile.web-server  . --no-cache`\n\nDelete pods; don't worry, they are recreated automatically to match the replica count\n- `kubectl delete pods/web-server-669c6cc8f8-zjvsd`\n\nUse [fake-server.py](./src/model-server/fake_server.py) to set up a server in place of the GPU service and verify the network configuration and main code with local docker runs and debugging.\n\nUse [test_sd3_from_local](./src/model-server/sd3_deploy/test_sd3_from_local.py) for testing the model server. 
Check the torchserve predictor pod logs only after the model has loaded; once the logs show torchserve listening on port 8080, it is ready to serve.\n\n### Monitoring and Visuvalization\n\n**Install Kiali, Prometheus and Grafana**\n\nJust run the commands below in a Linux terminal\n\n```\nfor ADDON in kiali jaeger prometheus grafana\ndo\n    ADDON_URL=\"https://raw.githubusercontent.com/istio/istio/release-1.20/samples/addons/$ADDON.yaml\"\n    kubectl apply -f \"$ADDON_URL\"\ndone\n```\n\nFor Prometheus to fetch GPU utilization, apply the following file, [prometheus.yaml](./src/eks-cluster-config/prometheus.yaml)\n\n- `kubectl apply -f prometheus.yaml`\n\nPort forwarding and visualization\n\n- `kubectl port-forward svc/kiali 20001:20001 -n istio-system`\n\nGet to the Prometheus UI\n- `kubectl port-forward svc/prometheus 9090:9090 -n istio-system`\n\nVisualize metrics using Grafana\n- `kubectl port-forward svc/grafana 3000:3000 -n istio-system`\n\nAdd a `DCGM_FI_DEV_GPU_UTIL` panel in Prometheus so that it can fetch those metrics. Then download the [nvidia-dcgm dashboard packages](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/) JSON file and upload it in the new dashboard section in Grafana. Now you can see GPU metrics\n\n- ![](./assets/images/snap_grafana_gpu_util.png)\n\nAfter this you can go to the ui-server URL and send a query; it should return some output. 
After you make 5-10 queries you can see some significant metrics in Grafana, Prometheus and Kiali.\n\n- ![](./assets/images/snap_result_ui_1.png)\n\n### Deletion Procedure\n\nThis may throw an error, as we made some hacky changes for `helm install fastapi-release-default`\n- `helm uninstall fastapi-release-default`  \n\nThis removes Helm's stored record of the release\n- `kubectl delete secret sh.helm.release.v1.fastapi-release-default.v1  -n default`\n\nDelete cluster and all resources\n- `eksctl delete cluster -f eks-cluster.yaml --disable-nodegroup-eviction`\n\nWait patiently until all deletions show as successful in CloudFormation before closing your system, because sometimes\nthe deletion fails, something keeps running in the background, and it may cost you a lot\n\n### Learnings\n- If you force delete resources in the CLI, they may keep running in the background, so check CloudFormation or the EC2 instances. But if you do it in the \"AWS CloudFormation\" UI, it will delete all resources completely.\n- I could have moved the service account creation to yaml files and done it with the helm install command, but the assignment had already taken a lot of time, so that could be improved.\n- Using a small model with the same `diffusers` and `StableDiffusionPipeline` modules could save cost and time in debugging and development.\n- Deletion of pods and auto recreation\n    - `kubectl delete pods/web-server-669c6cc8f8-zjvsd` \n- Debugging a particular pod by getting inside it\n    - `kubectl exec -it ui-server-59cb8d9f96-8559p -- /bin/bash`\n- Managing nodegroups through `.yaml` files\n\n    Create nodegroup\n    - `eksctl create nodegroup --config-file=eks-cluster.yaml`\n\n    Delete nodegroup\n    - `eksctl delete nodegroup --cluster basic-cluster --name ng-gpu-spot-1`\n\n- Connecting KServe with istio and managing pods\n\n### Results Screenshots\n\n- Output of kubectl get all -A\n\n    - [kubectl get all](./assets/logs/kubectl_logs_all.txt)\n\n- Kiali Graph of the Deployment\n\n    
![](./assets/images/snap_kiali_graph_1.png)\n\n    ![](./assets/images/snap_kiali_torchserve.png)\n\n    ![](./assets/images/snap_kiali_services.png)\n\n- GPU Usage from Grafana and Prometheus while on LOAD\n\n    ![](./assets/images/snap_grafana_gpu_util.png)\n    ![](./assets/images/snap_grafana_gpu.png)\n    ![](./assets/images/snap_grafana_1.png)\n    ![](./assets/images/snap_prometheus_on_load.png)\n\n- Logs of your torchserve-predictor\n\n    - [logs of torch serve predictor](./assets/logs/torchserve_logs.txt)\n\n- 5 Outputs of the SD3 Model\n\n    ![](./assets/images/snap_result_ui_1.png)\n    ![](./assets/images/snap_result_ui_2.png)\n    ![](./assets/images/snap_result_ui_3.png)\n    ![](./assets/images/snap_result_ui_4.png)\n    ![](./assets/images/snap_result_ui_5.png)\n\n- [logs of torch serve 5 inference](./assets/logs/torchserve_logs.txt)\n\n- Other logs\n\n    - [all logs](./assets/logs/)\n\n- Other Screenshots\n\n    ![](./assets/images/snap_loadbalancers.png)\n    ![](./assets/images/snap_get_nodes.png)\n    ![](./assets/images/snap_get_all_res.png)\n    ![](./assets/images/snap_get_all_res_2.png)\n\n\n    ![](./assets/images/snap_describe_nodegroup.png)\n    ![](./assets/images/snap_describe_ingress.png)\n\n- Architecture\n\n    ![](./assets/images/snap_assgn_16_arch.png)\n\n### Group Members\n\n1. Ajith Kumar V (myself)\n2. Pravin Sagar\n3. Hema M\n4. Muthukamalan","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fajithvcoder%2Femlo4-session-16-ajithvcoder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fajithvcoder%2Femlo4-session-16-ajithvcoder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fajithvcoder%2Femlo4-session-16-ajithvcoder/lists"}