extreme instabilities on well-populated cluster #13
Leaving it on a single container, the system won't stand a reboot, similar to kubernetes-sigs/kind#2879 (comment): it errors on kube-api calls, and the logs are full of these errors.

Suspecting this might be related to canonical/microk8s#2790 (comment).
It all leads here, to the kubeadmConfigPatches:

    - |
      apiVersion: kubeadm.k8s.io/v1beta1
      kind: ClusterConfiguration
      metadata:
        name: config
      apiServer:
        extraArgs:
          advertise-address: "0.0.0.0"
For reference, these flags are documented at https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/#synopsis
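For anyone trying to reproduce this, a minimal sketch of how that patch sits inside a full kind config; the cluster name and node layout here are my own assumptions, not taken from the report:

```bash
# Sketch: apply the ClusterConfiguration patch quoted above through a full kind
# config. Cluster name and node layout are assumptions for illustration only.
cat <<EOF | kind create cluster --name test --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta1
  kind: ClusterConfiguration
  metadata:
    name: config
  apiServer:
    extraArgs:
      advertise-address: "0.0.0.0"
EOF
```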
Never mind... I tried changing the network layer to Cilium; it is still happening.
Nothing: it is unstable even with a plain installation, without any config patching.
Do you mean without injecting GPUs? One thing to note is that for GPU support one needs to set the …

A further note: the default mechanism for injecting GPUs does not support restarting / restoring a container.

When you say "but it won't stand a reboot", do you mean the kind node, or the node where kind is running?
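Guessing at the setting referred to in the truncated sentence above (this is an assumption on my part, not something stated in the comment), kind-style GPU injection is usually paired with the volume-mount device-listing option in the NVIDIA container runtime config, along the lines of:

```bash
# Assumption: the cut-off sentence above likely refers to this NVIDIA container
# runtime option; shown only as a sketch, not as the author's instruction.
sudo nvidia-ctk config --in-place --set accept-nvidia-visible-devices-as-volume-mounts=true
# Restart the container runtime afterwards so the change takes effect, e.g.:
sudo systemctl restart docker
```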
I mean when starting just kind, not through nvkind. We had to switch in an emergency to minikube: that worked out of the box with no further changes to the node, and was way more stable. (I'm not saying this to steer people away from nvkind, otherwise I wouldn't be here commenting on a ticket... it is to describe the condition of the machines we installed on.)
Wait, what? I'm expecting to be able to restart a node and get everything back online and operational.
I think the issue is related to creating a pod without a GPU but still being able to access the GPU. I am having this issue when I only run one pod in my cluster, so it's unrelated to how populated the cluster is. I also did not do any reboots, so it's unrelated to reboots in my case. Below are events from a pod that is NOT requesting any GPU in my nvkind cluster; it is constantly crashlooping every few minutes (4 times in 86 minutes):
Container section from the `kubectl describe pod` output:
I just tried out minikube with GPU support and it's working flawlessly; I don't see the same issue there. So it does seem related to kind + GPU support when using nvkind.
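For comparison, a sketch of how a GPU-enabled minikube cluster can be started; the driver and flags are assumptions based on minikube's documented GPU support, not the exact command used above:

```bash
# Sketch of a GPU-enabled minikube setup for comparison; driver and flags are
# assumptions, the comment above does not state how minikube was started.
minikube start --driver=docker --container-runtime=docker --gpus=all
kubectl get pods -A   # check that nothing is crashlooping after startup
```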
I have an nvkind cluster with only Istio installed, and pods keep restarting over and over in CrashLoopBackOff with status "Completed" and exit status 0. The "pod sandbox changed" messages also appear in the istiod pod events. This happens even when no nvidia-device-plugin is present in the cluster.
Actually, after running …
(wrongly felt) It works for us on a single container, but it won't stand a reboot (see: kubernetes-sigs/kind#2879 (comment)).

On Debian 12 (installing from `bookworm-backports` for the NVIDIA drivers) we have tons of Pods being killed completely randomly... we are having a hard time isolating the issue beyond pointing at the OS we bootstrapped from (identical machine specs: Hetzner's GEX44).

I initially assumed it did work on Ubuntu Jammy, but the reality was we were running on a single container.
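For context, a sketch of how the backports driver install typically looks on Debian 12; the exact package set is an assumption, the report only mentions the bookworm-backports suite:

```bash
# Sketch of a typical NVIDIA driver install from bookworm-backports on Debian 12;
# package names are assumptions, only the backports suite is stated in the issue.
sudo apt update
sudo apt install -t bookworm-backports nvidia-driver firmware-misc-nonfree
sudo reboot   # the instability described above shows up after reboots like this one
```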