{"id":19831734,"url":"https://github.com/robrohan/skoupidia","last_synced_at":"2025-05-01T16:32:21.529Z","repository":{"id":218159170,"uuid":"744883888","full_name":"robrohan/skoupidia","owner":"robrohan","description":"Scripts to build an on-premisis machine learning lab","archived":false,"fork":false,"pushed_at":"2024-04-25T21:14:07.000Z","size":931,"stargazers_count":6,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-04-26T19:26:15.387Z","etag":null,"topics":["ai","ansible","devops","home-lab","kubernetes","machine-learning","mlops"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/robrohan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null},"funding":{"github":["robrohan"],"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"lfx_crowdfunding":null,"custom":null}},"created_at":"2024-01-18T07:47:27.000Z","updated_at":"2024-04-25T21:14:11.000Z","dependencies_parsed_at":"2024-01-27T01:30:54.403Z","dependency_job_id":"b5c6c07a-853e-4fa2-8722-7dfe9378c21f","html_url":"https://github.com/robrohan/skoupidia","commit_stats":null,"previous_names":["robrohan/skoupidia"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robrohan%2Fskoupidia","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robrohan%2Fskoupidia/tags","releases_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/robrohan%2Fskoupidia/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robrohan%2Fskoupidia/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/robrohan","download_url":"https://codeload.github.com/robrohan/skoupidia/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224266892,"owners_count":17283225,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ansible","devops","home-lab","kubernetes","machine-learning","mlops"],"created_at":"2024-11-12T11:34:19.647Z","updated_at":"2025-05-01T16:32:21.521Z","avatar_url":"https://github.com/robrohan.png","language":"Jupyter Notebook","funding_links":["https://github.com/sponsors/robrohan"],"categories":[],"sub_categories":[],"readme":"# Skoupidia\n\nYour home machine learning cluster made from rubbish and things found 'round the house.\n\n![Oscar the Grouch making Science from Skoupidia](./docs/a_oscar.png)\n![using the terminal](./docs/b_terminal.png)\n![and a bunch of skoupidia!](./docs/c_cluster.jpg)\n\n![system layout](./docs/layout.png)\n\n1. [Create A Kubernetes Cluster](#create-a-kubernetes-cluster)\n1. [Install Kubeflow](#install-kubeflow)\n1. [My Own Personal Setup](#my-own-personal-setup)\n1. [References](#references)\n\n## Create A Kubernetes Cluster\n\n### Create Autobot User\n\nOn each of the servers, create a user named `autobot`. 
This will be the user ansible will use to install and remove software:\n\n```bash\nsudo adduser autobot\nsudo usermod -aG sudo autobot\n```\n\nAutobot has now been added to the sudo group, but to make ansible less of a pain, remove the need to interactively type the password every time autobot runs a command as root.\n\n```bash\nsudo su\necho \"autobot ALL=(ALL) NOPASSWD:ALL\" \u003e\u003e /etc/sudoers.d/peeps\n```\n\n#### Generate an SSH key locally\n\nYou can do this a number of ways, but here is an easy version:\n\n```bash\nssh-keygen -f ~/.ssh/home-key -t rsa -b 4096\n```\n\nor you can use the premade task\n\n```bash\nmake ssh_keys\n```\n\nBy default, this will save the public and private keys into your `~/.ssh` directory.\n\nWe want to take the `.pub` (public) key from our local machine, and add the contents of the pub key to the file `~/.ssh/authorized_keys` in the `autobot` user's home directory on each of the servers.\n\nThis will allow your local machine (or any machine that has the private key) to automatically log in to any of the nodes without typing a password. Again, this will be very helpful when running the ansible scripts.\n\nOne way to do this is to copy the file contents to the clipboard:\n\n```bash\nxclip -selection clipboard \u003c ~/.ssh/home-key.pub\n```\n\nThen ssh into the machine as user `autobot`, and paste the clipboard contents into the `~/.ssh/authorized_keys` file (or create the file if it doesn't already exist).\n\n### Run Ansible Setups\n\nWe should now be ready to run the [ansible](./ansible/README.md) code.\n\nThe Ansible scripts will install some prerequisites, docker, and kubeadm and kubectl on the master and the worker nodes.\n\nFollow along with the [README in the ansible directory](./ansible/README.md) to continue.\n\n### Few Difficult to Script Tasks\n\nNow that ansible has run completely, we need to do a few manual steps. 
Don't skip these!\n\nAfter ansible has finished without error, you'll need to run these commands on all the nodes (master and workers).\n\n```bash\nsudo su\nsystemctl stop containerd\nrm /etc/containerd/config.toml\n# it's ok if that errors --^\ncontainerd config default \u003e /etc/containerd/config.toml\n```\n\nYou'll need to use a text editor and change a value from `false` to `true`. \n\nEdit the file `/etc/containerd/config.toml`, and change the setting `SystemdCgroup = false` to `SystemdCgroup = true` in the section:\n\n```toml\n...\n[plugins.\"io.containerd.grpc.v1.cri\".containerd.runtimes.runc.options]\n    ...\n    SystemdCgroup = true\n...\n```\n\n---\n\n**Note:** For adding GPUs to the config.toml, see https://github.com/NVIDIA/k8s-device-plugin#quick-start\n\n---\n\n\nThis will enable the systemd cgroup driver for the containerd container runtime.\n\n```bash\nsystemctl restart containerd\n```\n\nRun the following to pull down the container images kubeadm needs:\n\n```bash\nkubeadm config images pull\n```\n\n### Run Kube Init\n\nNow, only on the designated **master** node, run\n\n```bash\nkubeadm init --v=5\n```\n\nIf all goes well, the output of this command will give us the commands we can use on the worker nodes to join the cluster.\n\nFor example:\n\n```bash\nkubeadm join 192.168.1.15:6443 --token 3e...l --discovery-token-ca-cert-hash sha256:65...ec \n```\n\nSave a copy of this command, as we'll use it on the worker nodes to have them join the master node.\n\nIt's a good idea to add the kube config to the autobot user so you can do things like `kubectl get nodes`. 
Run these as the autobot user:\n\n```bash\nmkdir -p $HOME/.kube\nsudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config\nsudo chown $(id -u):$(id -g) $HOME/.kube/config\n```\n\n### Join Workers to Cluster\n\nOn each worker node, run the join command (note your values will be slightly different):\n\n```bash\nkubeadm join 192.168.1.15:6443 --token 3eo...6il --discovery-token-ca-cert-hash sha256:92a23........e9dee765ec\n```\n\n### Cluster Networking Using Calico\n\nAfter the workers have joined the cluster, we should be able to see them:\n\n```\nautobot@kmaster:~$ kubectl get nodes\nNAME       STATUS     ROLES           AGE     VERSION\nkmaster    NotReady   control-plane   77m     v1.28.6\nkworker1   NotReady   \u003cnone\u003e          25m     v1.28.6\nkworker2   NotReady   \u003cnone\u003e          3m34s   v1.28.6\n```\n\nBut we need to add container networking before they will be in the ready state:\n\n```bash\nkubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml\n```\n(or use the version in the repo's `./kubernetes/` folder)\n\nAfter applying calico, give the cluster some time to get all the nodes into the ready state.\n\n```\nautobot@kmaster:~$ kubectl get nodes -w\nNAME       STATUS   ROLES           AGE   VERSION\nkmaster    Ready    control-plane   91m   v1.28.6\nkworker1   Ready    \u003cnone\u003e          40m   v1.28.6\nkworker2   Ready    \u003cnone\u003e          17m   v1.28.6\n```\n\nWe now have a working cluster!\n\n### Add MetalLB\n\nKubernetes only really works correctly with an external load balancer that feeds it IPs. Without that it's difficult to run several workloads on the cluster - for example, there is only one port 80.\n\nMetalLB runs inside of Kubernetes and pretends to be a load balancer. 
It's great.\n\nFor up to date info see here: https://metallb.org/installation/ but in summary, we are going to install the load balancer:\n\n```bash\nkubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.12/config/manifests/metallb-native.yaml\n```\n\nAfter it is installed, we need to configure what IP addresses it gives out. We'll associate our local IP pool with the local load balancer by editing and applying `./kubernetes/metallb_stage2.yaml`.\n\n### Add Some Storage\n\nThere are many ways to add storage to the cluster. For something more than just playing around in a home lab, have a look into [Minio][minio]. Minio creates a local storage system that is S3 compatible. This will allow any pod to use the storage. We'll actually set that up later. Here though, we are just using local disk storage.\n\nThis is quite specific to how you have your disks set up, but here is an example from mine. I have an external USB drive plugged into the master node. It is already formatted and ready to go. First you need to find which `/dev` the USB is plugged into, and then you can create a folder and mount it:\n\n```bash\nlsblk\n```\n\n```bash\nsudo mkdir /mnt/usbd\nsudo mount /dev/sdb1 /mnt/usbd\nsudo mkdir /mnt/usbc\nsudo mount /dev/sdc1 /mnt/usbc\n```\n\nYou should be able to see the contents of the drive now with `ls /mnt/usbd`.\n\nOnce you have that working, you can add the listing to the `/etc/fstab` file so that it will mount again if you reboot the server:\n\n```bash\nsudo su\necho \"/dev/sdb1 /mnt/usbd ext4 defaults 0 0\" \u003e\u003e /etc/fstab\necho \"/dev/sdc2 /mnt/usbc ext4 defaults 0 0\" \u003e\u003e /etc/fstab\n```\n\nWith the drive mounted, you can make PersistentVolume definitions. Have a look at `./kubernetes/oscar/local-pv-volume.yaml` for some examples.\n\n### Add Labels \n\nYou can make sure different pods run on the correct nodes by adding a label to the nodes and then using the `affinity` section of the spec. 
For example, I label my slow drive nodes and the nodes with a GPU like so:\n\n```bash\nkubectl label nodes kmaster disktype=hdd\nkubectl label nodes kworker1 disktype=ssd\nkubectl label nodes kworker2 disktype=ssd\nkubectl label nodes kworker1 has_gpu=true\nkubectl label nodes kworker2 has_tpu=true\n```\n\n---\n\n**Note:** use `accelerator=` with a type such as `example-gpu-x100` to mark a node's hardware\n\n---\n\nYou can see the labels with\n\n```bash\nkubectl get nodes --show-labels\n```\n\nThen, when deploying a service, you can do something like:\n\n```yaml\nspec:\n  affinity:\n    nodeAffinity:\n      requiredDuringSchedulingIgnoredDuringExecution:\n        nodeSelectorTerms:\n        - matchExpressions:\n          - key: disktype\n            operator: In\n            values:\n              - hdd\n```\n\nThis ensures the pod gets the correct hardware.\n\n## Optional - Install Tailscale\n\n(tbd)\n\n```bash\ncurl -fsSL https://tailscale.com/install.sh | sh\n```\n\n## Install Kubeflow\n\n(tbd)\n\n# My Own Personal Setup\n\nIn the folder `./kubernetes/oscar/` you will find the configuration definitions I have for my cluster. You may find them useful as examples.\n\nThe first script creates my namespaces. I use `tools` for things like a [Jellyfin media server](https://jellyfin.org/) and `science` for doing actual ML work.\n\n```bash\nkubectl apply -f https://raw.githubusercontent.com/robrohan/skoupidia/main/kubernetes/oscar/namespaces.yaml\n```\n\nI've already added a mount point on all the nodes for scratch storage (`/mnt/kdata`). See the `local-pv-volume` for details. This will let pods claim some of that storage.\n\n```bash\nkubectl apply -f https://raw.githubusercontent.com/robrohan/skoupidia/main/kubernetes/oscar/local-pv-volume.yaml\n```\n\nThose storage options get reused once the pod goes down, and are only useful for temporary storage. For longer term storage I have a few USB drives plugged into various nodes. 
I use these drives for local media for Jellyfin, and for downloaded or trained machine learning models I want to keep around.\n\n```bash\nkubectl apply -f https://raw.githubusercontent.com/robrohan/skoupidia/main/kubernetes/oscar/usb-pv-volume.yaml\n```\n\nDeployment of our local media server:\n\n\n```bash\nkubectl apply -f https://raw.githubusercontent.com/robrohan/skoupidia/main/kubernetes/oscar/jellyfin.yaml\n```\n\nDeployment of my custom-built Jupyter notebook install:\n\n\n```bash\nkubectl apply -f https://raw.githubusercontent.com/robrohan/skoupidia/main/kubernetes/oscar/jupyter.yaml\n```\n\nOnce installed, you can upload the [Verify](./kubernetes/oscar/VerifyGPU.ipynb) notebook to make sure your GPU is working.\n\n![Jupyter seeing GPU](./docs/jupyter.png)\n\n\nCreate an S3-like (S3-compatible) storage service locally:\n\n```bash\nkubectl apply -f https://raw.githubusercontent.com/robrohan/skoupidia/main/kubernetes/oscar/minio.yaml\n```\n\nOnce installed, you can upload the [Verify](./kubernetes/oscar/VerifyMinio.ipynb) notebook to make sure storage is working.\n\n![Minio cli example](./docs/minio.png)\n\nYou can also install the [CLI client](https://min.io/docs/minio/linux/reference/minio-mc.html) on your workstations to interact with the buckets:\n\n```bash\ncurl https://dl.min.io/client/mc/release/linux-amd64/mc --create-dirs -o $HOME/bin/mc\nchmod u+x $HOME/bin/mc\nmc --help\n```\n\nExample usage:\n\n```bash\nmc alias set oscar http://192.168.1.23:80 minio minio123  # setup the client\nmc admin info oscar\nmc mb oscar/test-bucket\nmc ls oscar\n```\n\nThis is just a utility I made to generate random MIDI files for musical inspiration and to feed into some models I am playing with (probably not interesting).\n\n```bash\nkubectl apply -f https://raw.githubusercontent.com/robrohan/skoupidia/main/kubernetes/oscar/songomatic.yaml\n```\n\nIf a volume ever gets stuck, and you want to allow others to claim it, you can release it by clearing its claim reference like this:\n\n```bash\nkubectl 
patch pv usb-jelly-1-pv-volume -p '{\"spec\":{\"claimRef\": null}}'\n```\n\n# Troubleshooting \n\n## DNS\n\nThis can be quite frustrating. I think this is just an Ubuntu thing. A worker node will use the master node for DNS. The master node will first try to resolve the DNS entry itself, but if it doesn't know the name, it'll look in `/etc/resolv.conf` for a name server.\n\nThere are seemingly several processes that overwrite that file from time to time, which seems to kill resolution. The netplan process, the systemd-resolved process, and tailscale can overwrite it too. This is the process I've found that works, but if it gets to be too much of a problem, you can unlink the `/etc/resolv.conf` file and just edit it.\n\nOn the kmaster node:\n\n```bash\nsudo vi /etc/systemd/resolved.conf\n\nsudo unlink /etc/resolv.conf\nsudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf\nsudo systemctl restart systemd-resolved\nsudo systemctl enable systemd-resolved\n\nsudo cat /etc/resolv.conf\n```\n\n![Pods running on the cluster](./docs/pods.png)\n\n# References\n\n- [Build a Kubernetes Home Lab from Scratch step-by-step!](https://www.youtube.com/watch?v=_WW16Sp8-Jw)\n- [How to Install Containerd Container Runtime on Ubuntu 22.04](https://www.howtoforge.com/how-to-install-containerd-container-runtime-on-ubuntu-22-04/)\n- [How to with new stuff](https://v1-28.docs.kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/)\n- [Local kubectl access](https://blog.christianposta.com/kubernetes/logging-into-a-kubernetes-cluster-with-kubectl/)\n\n[minio]: https://medium.com/@karrier_io/minio-s3-compatible-storage-on-kubernetes-74e2cf0902f3\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobrohan%2Fskoupidia","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frobrohan%2Fskoupidia","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobrohan%2Fskoupidia/lists"}