{"id":20475391,"url":"https://github.com/ashwinpn/containers--ml-and-cloud-computing","last_synced_at":"2025-04-13T12:29:40.968Z","repository":{"id":201600085,"uuid":"313727450","full_name":"ashwinpn/Containers--ML-and-Cloud-Computing","owner":"ashwinpn","description":"Everything about how to deploy projects on the cloud, run ML workloads on the HPC cluster and on the cloud and the efficient configuration and management of related collaborative platforms [e.g. container orchestration].","archived":false,"fork":false,"pushed_at":"2021-01-31T14:01:13.000Z","size":771,"stargazers_count":19,"open_issues_count":0,"forks_count":6,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-27T03:35:10.423Z","etag":null,"topics":["aws","cloud","cloud-computing","containers","deep-learning","deployment","docker","docker-image","dockerfile","google","google-cloud-platform","k8s-cluster","kubernetes","kubernetes-cluster","machine-learning","pytorch","vagrant"],"latest_commit_sha":null,"homepage":"","language":"Dockerfile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashwinpn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-11-17T19:54:27.000Z","updated_at":"2024-11-16T02:51:49.000Z","dependencies_parsed_at":null,"dependency_job_id":"c742bafc-a7ac-4641-a754-a4ec6f6a001e","html_url":"https://github.com/ashwinpn/Containers--ML-and-Cloud-Computing","commit_stats":null,"previous_names":["ashwinpn/containers--ml-and-cloud-computing"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashwinpn%2FContainers--ML-and-Cloud-Computi
ng","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashwinpn%2FContainers--ML-and-Cloud-Computing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashwinpn%2FContainers--ML-and-Cloud-Computing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashwinpn%2FContainers--ML-and-Cloud-Computing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashwinpn","download_url":"https://codeload.github.com/ashwinpn/Containers--ML-and-Cloud-Computing/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248713944,"owners_count":21149807,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","cloud","cloud-computing","containers","deep-learning","deployment","docker","docker-image","dockerfile","google","google-cloud-platform","k8s-cluster","kubernetes","kubernetes-cluster","machine-learning","pytorch","vagrant"],"created_at":"2024-11-15T15:15:53.024Z","updated_at":"2025-04-13T12:29:40.932Z","avatar_url":"https://github.com/ashwinpn.png","language":"Dockerfile","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Containers, ML and Cloud Computing\nAshwin Nalwade.\n\n- Containers have become very important to deep learning as it is critical to leverage them to ensure scalability, now that computer scientists are capable of developing ML applications which can run seamlessly on smartphones too.\n\n![cont](https://github.com/ashwinpn/Containers/blob/main/resources/cool.png)\n\n# Technologies\n- Google Cloud 
Platform [GCP], Amazon Web Services [AWS], IBM Cloud.\n- Docker, Kubernetes, Vagrant, Slurm [For High Performance Cluster Computing - Workload Management].\n- Python, Gunicorn, Flask, PyTorch, Pandas, TensorFlow, Jupyter Notebook, CUDA.\n- CSS, HTML, JavaScript, Bash.\n\n# Docker\n\n## Troubleshooting\n- Docker Daemon Permission Denied :\n  ```bash\n  sudo usermod -aG docker ${USER}\n  su - ${USER}\n  ```\n- Cannot connect to the docker daemon :\n  ```bash\n  sudo systemctl start docker\n  ```\n- Error processing Tar / No space left on device:\n  ```bash\n  docker rmi -f $(docker images | grep \"\u003cnone\u003e\" | awk \"{print \\$3}\")\n  # or\n  docker system prune\n  docker image prune\n  ```\n- Conflict =\u003e already in use:\n  ```bash\n  # Find CONTAINER_ID using\n  docker ps -a\n  \n  # Use docker start instead of docker run\n  docker start \u003cCONTAINER_ID\u003e\n  ```\n- I initially used ```alpine ``` for my dockerfile, but ran into a lot of issues when I started building the docker image. \nSpecifically, I faced networking issues, memory errors and had to spend a lot of time with the ```pip install ``` commands because they were not running smoothly. \nThus, I decided against using alpine as I felt that the issues that came up due to its use defeated the purpose of using containers in the first place - \nwhich is to make our job easier while building environments and running code. Also, I felt that the small docker image size which alpine provides does not \ncompensate for the time spent debugging and dealing with errors.\n\n## Cases\n\n### \u003cins\u003epytorch_mnist\u003c/ins\u003e\n- Run\n```bash\ndocker build -t asn/pyt .\n\ndocker run -i -t asn/pyt:latest\n```\n- Check container status using\n```bash\ndocker ps -a\n```\n\n# Vagrant\n- Starting up a vagrant environment via VirtualBox.\nRun\n```bash\nvagrant up\n\nvagrant ssh\n\n# If it says \"Machine already provisioned\"\n\nvagrant provision\n```\n- Configuring the vagrantfile [vagrantfiles are written in Ruby]. 
Here, we provision docker - which means that we can start working with docker as soon as we start up using the ``` vagrant up``` and ```vagrant ssh``` commands.\n\n```ruby\nVagrant.configure(\"2\") do |config|\n\n  config.vm.box = \"ubuntu/bionic64\"\n\n  config.vagrant.plugins = \"vagrant-docker-compose\"\n\n  # install docker and docker-compose\n  config.vm.provision :docker\n  config.vm.provision :docker_compose\n\n  config.vm.provider \"virtualbox\" do |vb|\n    vb.customize [\"modifyvm\", :id, \"--ioapic\", \"on\"]\n    vb.customize [\"modifyvm\", :id, \"--memory\", \"8000\"]\n    vb.customize [\"modifyvm\", :id, \"--cpus\", \"2\"]\n  end\n\nend\n```\n\n- Provisioning in vagrant using the shell.\n\nUse the ```-y``` option with ```apt-get``` to enable non-interactive installation. That is, no more \"X amount of data would be downloaded on installing Y. Do you still want to continue? [Y|N]\"\n```ruby\n# -*- mode: ruby -*-\n# vi: set ft=ruby :\n\nVagrant.configure(\"2\") do |config|\n\n  config.vm.box = \"ubuntu/bionic64\"\n\n  \n   config.vm.provider \"virtualbox\" do |vb|\n     # Do not display the VirtualBox GUI when booting the machine\n     vb.gui = false\n\n     # Customize the amount of memory on the VM:\n     vb.memory = \"2048\"\n   end\n \n   config.vm.provision \"shell\", inline: \u003c\u003c-SHELL\n     sudo apt-get update\n     sudo apt-get -y upgrade\n     sudo apt-get -y install python3.8 python3.8-dev python3.8-distutils python3.8-venv\n     sudo apt-get -y install git-all python-dev \n     curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py\n     python3.8 get-pip.py\n     pip install future nose mock coverage numpy flake8  \n     pip --no-cache-dir install torch torchvision\n   SHELL\nend\n\n```\n\n## Troubleshooting\n\n# Kubernetes\n\n```bash\nkubectl create -f dply.yaml\n\nkubectl get pods\n\nkubectl get deployments\n\nkubectl expose deployment sentiment-inference-service --type=LoadBalancer --port 80 --target-port 8080\n```\n For the training part I used a 
```pod``` object, but I also experimented using a job\nobject since in real-life production environments, we deal with batch jobs too.\nFor inference, I used a ```deployment object``` because we can have ```ReplicaSets``` and\ncan thus ensure that the service is reliable [multiple replicas] and can also be\nscaled [That is, our deployments can stay up, remain healthy, and keep\nrunning automatically without the requirement of some manual intervention].\n\nOther advantages are:\n- We can rollout and rollback our services during deployment.\n- We can observe the status of each pod.\n- There is a facility for performing updates in a rolling manner.\n- Unlike a deployment, pods are not rescheduled in the case of termination of the pod or node failure.\n\nWe create a ```PersistentVolumeClaim``` and Kubernetes automatically\nprovisions a persistent disk for us. We request a 2GiB disk and configure the\naccess mode so that it can be mounted in a read-write-once manner by one\nnode. When we create a ```PersistentVolumeClaim``` object, Kubernetes\ndynamically creates a corresponding ```PersistentVolume``` object.\n\nThis ```PersistentVolume``` is backed by an empty and new Compute Engine\npersistent disk. 
We thus use this disk in a pod by using the claim as a volume.\nSince we use ```PersistentVolumes```, we can transfer data between pods\n[since they have access to the same storage space - depending on the\naccess mode we have selected, either a single pod or multiple pods can have\nReadWrite access at the same time], and along with git, docker, and Docker\nHub, we were able to use the trained model file in the inference service\n\n\n## Running on GCP with Google Kubernetes Engine\n![k8s](https://github.com/ashwinpn/Containers-and-Cloud-Computing/blob/main/resources/hw4_16.JPG)\n\n## Observations and Experimentations\n\n- I worked with kubernetes on my local machine (via ```minikube```), \n\n![](https://github.com/ashwinpn/Containers-and-Cloud-Computing/blob/main/resources/hw4_5.JPG) \n\non the terminal present at the kubernetes website (```Katacoda```), and also on Google\nKubernetes Engine. While working with Kubernetes on my local machine and\nalso on the kubernetes website terminal (Katacoda), there were some issues\nwith the ```LoadBalancer type```. It was always showing the ```EXTERNAL-IP``` as\n```\u003cpending\u003e```\n\n![](https://github.com/ashwinpn/Containers-and-Cloud-Computing/blob/main/resources/hw4_11.JPG)\n\nand the service cannot process any test image without the ```EXTERNAL-IP```, so\nthe progress had stalled. I tried the following to try and get the ```IP```:\n\n![](https://github.com/ashwinpn/Containers-and-Cloud-Computing/blob/main/resources/hw4_12.JPG)\n![](https://github.com/ashwinpn/Containers-and-Cloud-Computing/blob/main/resources/hw4_13.JPG)  \n \nBut, it did not work. Another method that I tried was to add an ```externalIPs```\nspecification in the ```.yaml``` file, but that did not work either. 
After trying some\nmore commands and going through the kubernetes website, I found the\nrelevant documentation here:\n\nhttps://kubernetes.io/docs/tutorials/stateless-application/expose-external-ip-address/\n\nAnd that’s why I decided to work on Google Kubernetes Engine, and did not face\nany problems there (if the external ```IP``` address is still shown as ```\u003cpending\u003e```,\nwait for a minute and enter the same command again; the address should then appear).\n\n![](https://github.com/ashwinpn/Containers-and-Cloud-Computing/blob/main/resources/hw4_23.JPG)\n  \n- Different types of services - ```ClusterIP, LoadBalancer, ExternalName, NodePort.```\n\n    1] ``` ClusterIP``` is the default kubernetes service, and applications within the cluster can access it, but external applications cannot.\n\n    2]  With ```NodePort```, we can direct any access requests to some particular port that has been opened on all nodes, but we can only have one service corresponding to one port.\n\n    3] With ```LoadBalancer```, we can directly expose the service. This is the one which I used.\n\n    4] Also read about ```ingress```, using which we can manage external access requests without the need for creating ```LoadBalancer``` / exposing all services present on a node.\n\n- Became familiar with ```secrets``` which are useful for dealing with\ncontainers (pulling containers) which are present in private Docker Hub\nrepositories. Using secrets, we can also manage secure communication\nchannels, keep and handle confidential data like ssh keys, passwords,\nand access tokens for OAuth.\n\n## Troubleshooting / Discussion\n\n- \u003cins\u003eThe difference between ```kubectl create``` and ```kubectl apply```\u003c/ins\u003e\n\n```kubectl create``` is an imperative management approach. Here we tell the Kubernetes API what we want to create/replace/delete, not what we want our K8s cluster to look like. 
\nOn the other hand, ```kubectl apply``` is a declarative management approach, where changes that you may have applied to a live object (e.g. through ```scale```) are \"maintained\" even if other changes are applied to the object. Also, when running in a CI script, we can potentially face trouble with the imperative approach as ```create``` will raise an error if the resource already exists. To put it simply,\n\n```apply``` makes incremental changes to an existing object.\n\n```create``` creates a whole new object (previously non-existing / deleted).\n\n- \u003cins\u003e If pods are stuck in terminating status \u003c/ins\u003e\n\nDelete pods forcefully using:\n\n```bash\nkubectl delete pod --grace-period=0 --force --namespace \u003cNAMESPACE\u003e \u003cPODNAME\u003e\n```\n\n# Vagrant v/s Docker Comparison\n\nSetting up the project for the first time took a while in both docker and\nvagrant, but on subsequent runs, starting up and running our project in docker was much\nfaster than in vagrant - however, setting up the project workflow was better in vagrant. \n\nBoth vagrant and docker have dedicated communities if you require any help for troubleshooting any issues, but the\ndocker documentation and forums are more user friendly and docker in general has many resources,\ntutorials, and guides available. For instance, one major disadvantage with vagrant is that there is\nless scope for collaboration [even with Vagrant Cloud] - when working with docker, we can easily\npush our docker images to our Docker Hub repo, and teammates/others can pull the image\nand run the application. Docker also has support for functions similar to git, where we can track\ndifferent versions of a container, see the changes (diff) between the versions, and\nupdate via commits. 
\n\nWhile provisioning can potentially get messy in both vagrant and docker [e.g.\npip install instructions - need to check conflicting dependencies and updates to eliminate probable\nerrors], vagrant based provisioning can be a big headache because I personally had to spend about half\nan hour for debugging the provisioner shell in the vagrantfile, making sure that all packages had the\nsuitable versions to enable working with each other. \n\nRegarding vagrant, while a completely virtualized system provides more isolation, it also comes at the cost of more resources being allocated to it [it\nis heavier], and has minimum sharing capability. With docker we have less isolation [it has sufficient\nsupport to isolate processes from each other though], but the containers are lightweight (require fewer\nresources, and they share the same kernel). \n\nWe cannot run completely different operating systems in containers like we do in virtual machines, however it is possible to run different distros [distributions] of\nLinux since they do share the same kernel. Talking about booting times, when we load a container we\ndon’t have to start a new copy of the operating system like in a virtual machine, so booting times are\nsmaller. Unlike virtual machines, we also do not have to pre-allocate a considerable amount of memory\nand hardware resources when working with containers.\n\n## Summary\n- Docker containers are faster to start (and also to stop). One of the reasons behind this is probably\nthe fact that docker leverages the existing host OS [which already has major system processes\ninitialized], whereas with vagrant we have to load a whole vm image [and also initialize major\nsystem processes].\n\n- With regards to isolation, vagrant fares better. But with docker we can still get isolation at the\nuser level since a docker container runs as an isolated process. 
Also, we can collaborate better by\nusing Docker Hub.\n\n- Vagrant in general is heavy, as compared to docker which is lightweight because it includes only\nthose libraries which are strictly necessary to the application as a part of the container image.\n\n- Docker requires fewer resources than vagrant, since we only need to load the libraries\nwhich are essential for the application. Therefore, for a given computing\ncapacity, we can run more applications. On the other hand, vagrant has to\nload a whole OS in the memory, and thus it will always lead to more consumption of resources.\n\n## Case Study\nWe derive insights on performance by using the ```profile, cProfile```, and ```pstats``` modules for profiling purposes.\n\n- With docker it took seconds to start up, whereas for vagrant it took minutes.\n\n- The total execution time on docker was ```24m 38s 19ms```, compared to the total execution time on\nvagrant, which was ```27m 08s 34ms```, and thus on docker it was smaller.\n\n- Average training time per epoch on docker was ```2m 27s 78ms```, while the average training time per\nepoch on vagrant was ```2m 42s 79ms```. Best of 3 average training time on docker was ```2m 25s 41ms```,\nand on vagrant it was ```2m 41s 07ms```.\n\n- Average memory usage per epoch on docker was ```peak memory: 450.03 MiB```, whereas for vagrant\nit was ```peak memory: 448.63 MiB```. The difference here was negligible.\n\n- Regarding the profiler output for the application code: for docker we had ```4616095 function calls (4612290 primitive calls) in 11.988 seconds```, and for vagrant we had ```4616095 function calls (4612290 primitive calls) in 16.632 seconds```.\n\n\n# Machine Learning\n## Running GPU profilers on the cloud [Google Colab Pro]\n1. Change the runtime type, selecting GPU as the hardware accelerator.\n2. Clone the repository:\n```\n!git clone https://github.com/ashwinpn/imgnet.git\n```\n3. 
Change permissions:\n```\n!chmod 755 imgnet/INSTALL.sh\n```\n4. Install \u003cins\u003ecuda, nvcc, gcc, and g++\u003c/ins\u003e:\n```\n!./imgnet/INSTALL.sh\n```\n5. Add `/usr/local/cuda/bin` to `PATH`:\n```python\nimport os\nos.environ['PATH'] += ':/usr/local/cuda/bin'\n```\n6. Run [Example, imagenet]\n```\n!nvprof python main.py -a alexnet -b 8 --epochs 1 --lr 0.01 \u003cTraining and Validation folders parent directory\u003e\n```\n\n## Accessing GPUs on the NYU Cluster\n- A GPU request [Unless you have already obtained access to specific ones]\n```bash\n[NYUNetID@log-0 ~]$ srun --gres=gpu:1 --pty /bin/bash\n```\n![gpu](https://github.com/ashwinpn/Containers-and-Cloud-Computing/blob/main/resources/specifications.JPG)\n\n## Activating environments, installing requirements, training on NYU Prince Cluster.\n![nyup](https://github.com/ashwinpn/Containers-and-Cloud-Computing/blob/main/resources/Training%20Time.JPG)\n\n- First, log in to the bastion host, and then ssh into the cluster.\n```bash\nssh NYUNetID@gw.hpc.nyu.edu\nssh NYUNetID@prince.hpc.nyu.edu\n```\n\n- You can get access to three filesystems: ```/home, /scratch, and /archive```.\n\n```Scratch``` is a file system mounted on Prince that is connected to the compute nodes where we can upload files faster. \nNote that the content gets periodically flushed. ```/home``` and ```/scratch``` are separate filesystems in separate places, \nbut you should use ```/scratch``` to store your files.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashwinpn%2Fcontainers--ml-and-cloud-computing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashwinpn%2Fcontainers--ml-and-cloud-computing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashwinpn%2Fcontainers--ml-and-cloud-computing/lists"}