https://github.com/88plug/akash-provider-tools
A collection of tools for setting up / deploying / and managing Kubernetes clusters on Akash.Network
https://github.com/88plug/akash-provider-tools
Last synced: about 1 year ago
JSON representation
A collection of tools for setting up / deploying / and managing Kubernetes clusters on Akash.Network
- Host: GitHub
- URL: https://github.com/88plug/akash-provider-tools
- Owner: 88plug
- Created: 2022-02-20T19:10:53.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2024-05-24T06:18:00.000Z (about 2 years ago)
- Last Synced: 2025-02-15T10:31:43.582Z (over 1 year ago)
- Language: Shell
- Size: 309 KB
- Stars: 3
- Watchers: 1
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# akash-provider-tools
A collection of tools for setting up / deploying / and managing Kubernetes clusters on Akash.Network
# Fix nvidia-smi broken
```
First disable Secure Boot so the driver updates will work - in BIOS required to disable!
then
sudo apt-get update ; sudo apt-get autoremove nvidia* --purge -y ; sudo apt-get install -y nvidia-driver-535 nvidia-cuda-toolkit nvidia-container-runtime
reboot now
```
# Always keep Provider Pod and Node running no matter what
```
Here's the PDB for the "akash-provider":
yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
name: akash-provider-pdb
namespace: akash-services
annotations:
meta.helm.sh/release-name: akash-provider
meta.helm.sh/release-namespace: akash-services
spec:
selector:
matchLabels:
app: akash-provider
app.kubernetes.io/instance: akash-provider
app.kubernetes.io/name: provider
minAvailable: 1
And here's a PDB for the "akash-node-1":
yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
name: akash-node-1-pdb
namespace: akash-services
annotations:
meta.helm.sh/release-name: akash-node
meta.helm.sh/release-namespace: akash-services
spec:
selector:
matchLabels:
akash.network/node: "1"
app: akash-node
minAvailable: 1
You should apply both PDBs to ensure protection for both sets of pods:
bash
kubectl apply -f akash-provider-pdb.yaml
kubectl apply -f akash-node-1-pdb.yaml
```
# Delete all Pods + Namespaces in a Terminating state - can cause a stuck cluster:
```
kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | "-n \(.metadata.namespace) \(.metadata.name)"' | xargs -L 1 kubectl delete pod --force --grace-period=0
kubectl get namespaces -o json | jq -r '.items[] | select(.status.phase=="Terminating") | .metadata.name' | xargs -I {} kubectl patch namespace {} --type json -p '[{"op": "remove", "path": "/metadata/finalizers"}]'
```
# Run akash command until it works
```
#!/bin/bash
while true; do
# Run just the akash command first, capturing both stdout and stderr
output=$(akash query market lease list --node="${AKASH_NODE}" --provider $i --gseq 0 --oseq 0 --page 1 --limit 2000 --state active -o json 2>&1)
# Check if the output contains the word "error"
if [[ ! "$output" =~ "error" ]]; then
# If there are no errors, pipe the output to jq, awk, and save it to summary_leases.log
echo "$output" | etc | > summary_leases.log
break
else
# If there is an error, print a retrying message
echo "Retrying..."
# Optional: sleep for a few seconds before retrying
sleep 3
fi
done
```
# Run mainnet/testnet over Chisel on a single IP:
Check out chisel.sh!
# Limit GPU power usage when nodes boot
```
[Unit]
Description=GPU power limit script
[Service]
ExecStart=/bin/bash /home/akash/gpu-power.sh
User=root
Type=oneshot
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
# Run testnet on a single IP:
Run SOCAT on a public VPS / port forward the alternative ports to the proper port on firewall.
```
[Unit]
Description=Socat Service
After=network.target
[Service]
ExecStartPre=/bin/bash -c "sleep 10" # optional: wait a bit for the network to be ready
ExecStart=/bin/bash -c '/usr/bin/socat TCP-LISTEN:80,fork TCP:136.24.x.x:38472 & \
/usr/bin/socat TCP-LISTEN:443,fork TCP:136.24.x.x:38473 & \
/usr/bin/socat TCP-LISTEN:1317,fork TCP:136.24.x.x:38474 & \
/usr/bin/socat TCP-LISTEN:26656,fork TCP:136.24.x.x:38475 & \
/usr/bin/socat TCP-LISTEN:26657,fork TCP:136.24.x.x:38476 & \
/usr/bin/socat TCP-LISTEN:8443,fork TCP:136.24.x.x:38477'
Restart=always
User=root
Group=root
Environment=PATH=/usr/bin:/usr/local/bin:/sbin:/bin
KillMode=process
[Install]
WantedBy=multi-user.target
```
# Limit User bandwidth on every pod
Create /etc/systemd/system/limits.service
```
[Unit]
Description=Run limits.sh script every 60 seconds
[Service]
ExecStart=/bin/bash -c 'while true; do /root/limits.sh; echo "Sleeping 60 seconds"; sleep 60; done'
Restart=always
[Install]
WantedBy=multi-user.target
```
Create limits.sh to limit specific images
```
#!/bin/bash
export KUBECONFIG=/root/kubeconfig
echo "Starting to patch with 1G down and 1M up per pod"
deployments=$(kubectl get deployments -A | grep -E 'softether|dante|honeygain|cc-worker|pkt|miner|xmrig' | awk '{print $1,$2}')
while read -r namespace deployment; do
echo "Patching $deployment in namespace $namespace"
kubectl patch deployment -n "$namespace" "$deployment" -p '{"spec": {"template":{"metadata":{"annotations":{"kubernetes.io/ingress-bandwidth":"50M"}}}}}'
kubectl patch deployment -n "$namespace" "$deployment" -p '{"spec": {"template":{"metadata":{"annotations":{"kubernetes.io/egress-bandwidth":"50M"}}}}}'
done <<< "$deployments"
```
or create a limits.sh to limit all Akash deployments bandwidth
```
#!/bin/bash
export KUBECONFIG=/root/kubeconfig
echo "Starting to patch with 50M down and 50M up per pod"
namespaces=$(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}')
for namespace in $namespaces; do
echo "Patching deployments in namespace $namespace"
deployments=$(kubectl get deployments -n "$namespace" --selector="akash.network/namespace" -o jsonpath='{.items[*].metadata.name}')
for deployment in $deployments; do
echo "Patching $deployment in namespace $namespace"
kubectl patch deployment -n "$namespace" "$deployment" -p '{"spec": {"template":{"metadata":{"annotations":{"kubernetes.io/ingress-bandwidth":"50M"}}}}}'
kubectl patch deployment -n "$namespace" "$deployment" -p '{"spec": {"template":{"metadata":{"annotations":{"kubernetes.io/egress-bandwidth":"50M"}}}}}'
done
done
```
Enable with
```
systemctl daemon-reload
systemctl enable --now limit.service
systemctl status limits
```
# Upgrade ingress-nginx to new format helm charts
```
Create ingress-nginx-custom.yaml
helm uninstall akash-ingress -n ingress-nginx
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx --version 4.6.0 --namespace ingress-nginx --create-namespace -f ingress-nginx-custom.yaml
kubectl label ingressclass akash-ingress-class akash.network=true
kubectl label ns ingress-nginx app.kubernetes.io/name=ingress-nginx app.kubernetes.io/instance=ingress-nginx
```
# Keep a small node / VPS clean of logs : requires bleachbit
`(crontab -l | grep -q '0 3 \* \* \* bleachbit --clean system.rotated_logs; bleachbit --clean system.cache; journalctl --vacuum-size=1M; bleachbit --clean apt.\*; k3s crictl rmi --prune' || (apt update && apt --assume-yes install bleachbit) && (crontab -l 2>/dev/null; echo '0 3 * * * bleachbit --clean system.rotated_logs; bleachbit --clean system.cache; journalctl --vacuum-size=1M; bleachbit --clean apt.*; k3s crictl rmi -a') | crontab -)`
# Enable Security Updates on a node at 6:00am daily using a cronjob / run once during node setup:
`(crontab -l | grep -q "unattended-upgrades" || (crontab -l ; echo "0 6 * * * unattended-upgrades -d")) | crontab - && if ! dpkg -s unattended-upgrades >/dev/null 2>&1; then apt-get update && apt-get install -y unattended-upgrades; fi && if ! grep -qE '^\"\${distro_id}:\${distro_codename}-security\";' /etc/apt/apt.conf.d/50unattended-upgrades; then sed -i 's/^\/\/\s*\"\${distro_id}:\${distro_codename}-security\"/\"\${distro_id}:\${distro_codename}-security\"\;/' /etc/apt/apt.conf.d/50unattended-upgrades; fi && if ! grep -qE '^\"\${distro_id}:\${distro_codename}-updates\";\s*\"\${distro_id}:\${distro_codename}-security\";' /etc/apt/apt.conf.d/50unattended-upgrades; then sed -i 's/^\/\/\s*\"\${distro_id}:\${distro_codename}-updates\"/\"\${distro_id}:\${distro_codename}-updates\"\;\n\"\${distro_id}:\${distro_codename}-security\"\;/' /etc/apt/apt.conf.d/50unattended-upgrades; fi && unattended-upgrades -d`
# Run payouts on your provider - source code for the Docker is under Dockerfile-payouts
```
docker run -it -v key.pem:/key.pem --env PROVIDER=yourprovider.com --env PASS=replace_with_key_pass cryptoandcoffee/akash-provider-payout:1
```
# Deploy Akash RPC nodes one liner using Helm Charts. Set your DOMAIN= first.
```
DOMAIN=mydomain.com ; helm repo add akash https://ovrclk.github.io/helm-charts ; helm repo update ; kubectl create ns akash-services ; kubectl create ns ingress-nginx ; kubectl label ns ingress-nginx app.kubernetes.io/name=ingress-nginx app.kubernetes.io/instance=ingress-nginx ; helm upgrade --install akash-ingress akash/akash-ingress -n ingress-nginx --set domain=$DOMAIN ; helm upgrade --install akash-node akash/akash-node -n akash-services --set akash_node.api_enable=true --set akash_node.minimum_gas_prices=0uakt --set image.tag="0.16.4" --set state_sync.enabled=true
```
# Enable HPA - never let your provider / node / hostname-operator pods go down! This will migrate them if a host fails.
```
#Setup HPA / Easy !
kubectl patch deployment -n akash-services akash-provider -p='{"spec":{"template":{"spec":{"containers":[{"name":"akash-provider","resources":{"requests":{"cpu":"4000m"}}}]}}}}'
kubectl patch deployment -n akash-services akash-node-1 -p='{"spec":{"template":{"spec":{"containers":[{"name":"akash-node","resources":{"requests":{"cpu":"1750m"}}}]}}}}'
kubectl patch deployment -n akash-services hostname-operator -p='{"spec":{"template":{"spec":{"containers":[{"name":"hostname-operator","resources":{"requests":{"cpu":"500m"}}}]}}}}'
#Default policy
kubectl autoscale deployment -n akash-services akash-provider --min=1 --max=10
kubectl autoscale deployment -n akash-services akash-node-1 --min=1 --max=10
kubectl autoscale deployment -n akash-services hostname-operator --min=1 --max=10
#Scale based on CPU Utilization - if you need it
#kubectl autoscale deployment -n akash-services akash-provider --cpu-percent=50 --min=1 --max=10
#kubectl autoscale deployment -n akash-services akash-node-1 --cpu-percent=50 --min=1 --max=10
#kubectl autoscale deployment -n akash-services hostname-operator --cpu-percent=50 --min=1 --max=10
```
# Cluster status monitoring
The best tool to use for cluster uptime monitoring is [UpDown.io](https://updown.io/r/ygC5V). Here is a reference for how to configure your page: [status.akash.world.](https://status.akash.world). Follow the instructions on UpDown to configure your status url to : `status.providerdomain.com`
# Remove a failed node from your cluster
# Change internal ip of microk8s node
On every node (including the master(s)):
microk8s stop (Stop all nodes before changing configuration files)
Get the VPN IP of the node, e.g. 10.x.y.z. Command ip a show dev tun1 will show info for interface tun1.
Add this to the bottom of /var/snap/microk8s/current/args/kubelet:
--node-ip=10.x.y.z
Add this to the bottom of /var/snap/microk8s/current/args/kube-apiserver:
--advertise-address=10.x.y.z
microk8s start
Now I see the correct values in the INTERNAL-IP column with microk8s kubectl get nodes -o wide.
# Excessive kubernetes master pod restarts
https://platform9.com/kb/kubernetes/excessive-kubernetes-master-pod-restarts
Edit `nano /etc/etcd.env` and update `heartbeat-interval` and `election-timeout` to 100 and 1000.
# Enable DNS over TLS for Akash Provider / Cloudflare Secure DNS
On your Kubernetes cluster you need to update coredns with the Cloudflare config.
In a terminal with access to your cluster with kubectl:
```
KUBE_EDITOR="nano" kubectl edit cm coredns -n kube-system
```
Then change Forward to
forward . tls://1.1.1.1 tls://1.0.0.1 {
tls_servername tls.cloudflare-dns.com
health_check 5s
}
# Backup and Restore Akash Provider from Storj
Use Velero and Storj to create snapshot backups.
```
velero install --provider tardigrade \
--plugins storjlabs/velero-plugin \
--bucket provider-backups \
--backup-location-config accessGrant=replaceme \
--no-secret
```
## Backup command
`velero backup create $(hostname)`
## Restore command
`velero restore create --from-backup $(hostname)`
## Create a daily backup, each living for 90 days (2160 hours).
`velero create schedule $(hostname) --schedule="@every 24h" --ttl 2160h0m0s`
# Withdraw
```
apt-get install -y bc jq
export AKASH_OUTPUT=json
export AKASH_NODE=http://
PROVIDER=
HEIGHT=$(akash query block | jq -r '.block.header.height')
akash query market lease list \
--provider $PROVIDER \
--gseq 0 --oseq 0 \
--state active --page 1 --limit 5000 \
| jq -r '.leases[].lease | [(.lease_id | .owner, (.dseq|tonumber), (.gseq|tonumber), (.oseq|tonumber), .provider)] | @tsv | gsub("\\t";",")' \
| while IFS=, read owner dseq gseq oseq provider; do \
REMAINING=$(akash query escrow blocks-remaining --dseq $dseq --owner $owner | jq -r '.balance_remaining')
## FOR DEBUGGING/INFORMATIONAL PURPOSES
echo "INFO: $owner/$dseq/$gseq/$oseq balance remaining $REMAINING"
if (( $(echo "$REMAINING < 0" | bc -l) )); then
## UNCOMMENT WHEN READY
( sleep 2s; cat key-pass.txt; cat key-pass.txt ) | akash tx market lease withdraw --provider $provider --owner $owner --dseq $dseq --oseq $oseq --gseq $gseq --gas-prices=0.025uakt --gas=auto --gas-adjustment=1.3 -y --from $provider
sleep 10
## TODO: sleep 10 is necessary as a safeguard against account sequence re-use.
## BUG: this script needs NOT to run at the same time provider withdraws the lease.
## FOR DEBUGGING PURPOSES, COMMENT WHEN READY
#echo "INFO: akash tx market lease withdraw --provider $provider --owner $owner --dseq $dseq --oseq $oseq --gseq $gseq --gas-prices=0.025uakt --gas=auto --gas-adjustment=1.3 -y";
fi
done
```
# Zerotier
```
#Vultr setup script
ufw disable
apt-get remove --purge -y ufw
curl -s https://install.zerotier.com | sudo bash
zerotier-cli join xxx
# Ensure that we can forward packets between interfaces
sysctl net.ipv4.ip_forward=1
sed -i 's/#net.ipv4.ip_forward=1/net.ipv4.ip_forward=1/g' /etc/sysctl.conf
# Set up iptables rules
ip link | awk -F: '$0 !~ "lo|vir|wl|^[^0-9]"{print $2;getline}'
# eth0 <== This is our physical ethernet
# ztyou2j6dw <==This is our ZeroTier Virtual Adapter
PHY_IFACE="$(ip link | grep 'enp' | awk '{print substr($2,1,length($2)-1)}')"
ZT_IFACE="$(ip link | grep 'zt' | awk '{print substr($2,1,length($2)-1)}')" # <== This command will grab your ZeroTier interface name
iptables -t nat -A POSTROUTING -o $PHY_IFACE -j MASQUERADE
iptables -A FORWARD -i $PHY_IFACE -o $ZT_IFACE -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -i $ZT_IFACE -o $PHY_IFACE -j ACCEPT
# Make sure the rules are persistent after reboot/poweroff
apt-get install -y iptables-persistent
bash -c iptables-save > /etc/iptables/rules.v4
# Ensure that ZeroTier always comes back up after a reboot
systemctl enable zerotier-one
WORKING!
FIRST NODE: (can be on LAN/ZT)
k3sup install --cluster --user akash --ip 192.168.1.x --k3s-extra-args "--disable servicelb --disable traefik --disable metrics-server --disable-network-policy --flannel-backend=none --flannel-iface ztjlhz343e"
Additional Control Plane/SERVERS: (on WAN/ZT)
k3sup join --user root --ip 45.32.x.x --server-user akash --server-ip 172.24.x.x --server --k3s-extra-args "--disable servicelb --disable traefik --disable metrics-server --disable-network-policy --flannel-backend=none --flannel-iface ztjlhz343e"
AGENTS: (on WAN/ZT)
k3sup join --user root --ip 45.32.x.x --server-user akash --server-ip 172.24.x.x --k3s-extra-args "--flannel-iface ztjlhz343e"
```
# Run an Akash Provider with k3sup + zerotier + helm
1. Setup mysql/postgres server
2. Setup zerotier account and create a new network
3. Join the zerotier network on the machine you plan to run commands from (install plane)
3. Install Ubuntu 22.04 on first server (full control plane)
4. Install k3sup on install plane
Replace Server IP with zerotier IP
Change tls-san to your load balancer
Change node-external-ip to the public IP of the node
Change node-ip to the $SERVER_IP
```
export SERVER_IP=172.22.x.x
export USER=root
export datastore="mysql://user:pass@tcp(dbserver:25060)/databasename"
k3sup install --ip $SERVER_IP --user $USER --datastore $datastore --token yoursupersecretokenthatnobodyknows --no-extras --tls-san balance.x.com --k3s-extra-args '--node-external-ip x.x.x.x --node-ip 172.22.x.x --flannel-iface ztyxa36bu3'
```
to add an agent -
Install Ubuntu 22.04 and run
```
curl -s 'https://raw.githubusercontent.com/zerotier/ZeroTierOne/master/doc/contact%40zerotier.com.gpg' | gpg --import && \
if z=$(curl -s 'https://install.zerotier.com/' | gpg); then echo "$z" | sudo bash; fi
zerotier-cli join YOURZEROTIERNETWORK
```
Replace AGENT_IP with zerotier IP
```
export AGENT_IP=172.22.x.x
export SERVER_IP=balance.bdl.computer
export USER=root
k3sup join --user $USER --ip $AGENT_IP --server-host $SERVER_IP --server-ip x.x.x.x --k3s-extra-args '--node-ip 172.22.x.x --flannel-iface ztyxa36bu3'
```
```