extreme instabilities on well populated cluster #13

Open

zeph opened this issue Nov 15, 2024 · 14 comments

Comments


zeph commented Nov 15, 2024

(as it wrongly seemed) it works for us on a single container, but it won't survive a reboot (see: kubernetes-sigs/kind#2879 (comment))

on Debian 12 (with the NVIDIA drivers installed from bookworm-backports) we have tons of Pods being
killed completely at random... I'm having a hard time isolating the issue beyond pointing at the
OS we bootstrapped from (identical machine specs -> Hetzner's GEX44)

I initially assumed it did work on Ubuntu Jammy, but in reality we were running on a single container

@zeph zeph changed the title extreme instabilities on Debian 12 (bookworm) extreme instabilities on double container Nov 16, 2024

zeph commented Nov 16, 2024

staying on a single container, the system won't survive a reboot, similar to kubernetes-sigs/kind#2879 (comment)

kube-api calls are erroring with net/http: TLS handshake timeout


zeph commented Nov 17, 2024

it now works (after moving the GPUs to a separate container)...
but it still struggles with performance (this is a 192 GB RAM machine, not the Hetzner mentioned above)

[screenshot omitted] here's the status of a host after reboot... this is the applied configuration:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
{{- range $gpu := until numGPUs }}
- role: worker
  extraMounts:
    # We inject all NVIDIA GPUs using the nvidia-container-runtime.
    # This requires `accept-nvidia-visible-devices-as-volume-mounts = true` be set
    # in `/etc/nvidia-container-runtime/config.toml`
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/{{ $gpu }}
{{- end }}
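
For reference, a minimal sketch of how this template is meant to be consumed, based on my reading of the nvkind / nvidia-container-toolkit docs (the exact flag names are assumptions; check your installed versions):

# Prerequisite referenced in the comment inside the template above: let the NVIDIA
# runtime accept GPUs requested as volume mounts (edits /etc/nvidia-container-runtime/config.toml).
sudo nvidia-ctk config --in-place --set accept-nvidia-visible-devices-as-volume-mounts=true

# Render the Go template and create the kind cluster from it.
nvkind cluster create --config-template=nvkind-cluster.yaml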


zeph commented Nov 17, 2024

logs are full of Normal SandboxChanged 25m (x2 over 27m) kubelet Pod sandbox changed, it will be killed and re-created. on the failing containers
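
A quick way to gauge how widespread the sandbox churn is (plain kubectl, nothing nvkind-specific; SandboxChanged is the event reason the kubelet emits):

# List SandboxChanged events across all namespaces, oldest first.
kubectl get events -A --field-selector reason=SandboxChanged --sort-by=.lastTimestamp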


zeph commented Nov 17, 2024

suspecting this might be related to canonical/microk8s#2790 (comment)


zeph commented Nov 17, 2024

it all leads here, to the kicker, but this setting seems to get ignored...

kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta1
  kind: ClusterConfiguration
  metadata:
    name: config
  apiServer:
    extraArgs:
      advertise-address: "0.0.0.0"
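
One way to verify whether the patch actually reached the API server is to inspect the generated manifest inside the control-plane node container (the container name is assumed to follow kind's default <cluster-name>-control-plane pattern):

# Check the rendered kube-apiserver flags inside the kind control-plane container.
docker exec -it <cluster-name>-control-plane \
  grep advertise-address /etc/kubernetes/manifests/kube-apiserver.yaml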



zeph commented Nov 17, 2024

never mind... I tried switching the network layer to Cilium, and it is still happening


zeph commented Nov 17, 2024

no luck... it's unstable even with a plain installation without any config patching... a plain nvkind cluster create loading our full stack gives us tons of pods restarting... it doesn't happen with a vanilla kind create cluster... I don't know where to look anymore

@zeph zeph changed the title extreme instabilities on double container extreme instabilities on well populated cluster Nov 19, 2024
elezar (Member) commented Nov 19, 2024

unstable even with a plain installation without any config patching

Do you mean without injecting GPUs?

One thing to note is that for GPU support one needs to set the nvidia runtime as the default runtime. In cases where no GPUs are requested, this should forward the request as-is to runc. It could be that this is not behaving as expected.
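
For reference, a sketch of how the nvidia runtime is typically made the default for Docker (assuming Docker is the kind node provider here; nvidia-ctk rewrites /etc/docker/daemon.json):

# Register the nvidia runtime with Docker and mark it as the default runtime.
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker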

A further note: the default mechanism for injecting GPUs does not support restarting / restoring a container. When you say "but it won't survive a reboot", do you mean the kind node, or the node where kind is running?


zeph commented Nov 22, 2024

@elezar

Do you mean without injecting GPUs?

I mean when starting plain kind, not through nvkind

we had to switch to minikube as an emergency measure: it worked out of the box with no further changes to the node and is way more stable (I'm not saying this to steer people away from nvkind, otherwise I wouldn't be here commenting on a ticket... it's to describe the condition of the machines we installed on)

injecting GPUs does not support restarting / restoring a container

wait, what? I expect to be able to restart a node and get everything back online and operational
(yes, even in a minimalistic setup like the one kind/nvkind gives me) -> prototyping/demoing
...so, to answer your question: both the kind node and the host it is running on


samos123 commented Dec 3, 2024

I think the issue is related to creating a pod without a GPU that is still able to access the GPU. I am hitting this with only 1 pod running in my cluster, so it's unrelated to how populated the cluster is. I also did not do any reboots, so it's unrelated to reboots in my case.

Events from a pod that's NOT requesting any GPU in my nvkind cluster. It's constantly crashlooping every few minutes (happened 4 times in 86 minutes):

  Type     Reason          Age                 From     Message
  ----     ------          ----                ----     -------
  Normal   Pulling         58m (x5 over 90m)   kubelet  Pulling image "ollama/ollama:latest"
  Normal   SandboxChanged  58m (x4 over 86m)   kubelet  Pod sandbox changed, it will be killed and re-created.

describe pod container output:

Containers:
  server:
    Container ID:   containerd://ee3c057927cb8da49dbef13837fb85313590ae1d94d532a802f67ed86358dab0
    Image:          ollama/ollama:latest
    Image ID:       docker.io/ollama/ollama@sha256:55977eb618082df0f81ea197a75dc1710e54524f2ef71fa1a8b83cc0148b6e2f
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 03 Dec 2024 12:03:11 -0800
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 03 Dec 2024 11:56:08 -0800
      Finished:     Tue, 03 Dec 2024 12:02:51 -0800
    Ready:          True
    Restart Count:  12
    Requests:
      cpu:      1
      memory:   2Gi
    Liveness:   http-get http://:http/ delay=900s timeout=3s period=30s #success=1 #failure=3
    Readiness:  http-get http://:http/ delay=0s timeout=2s period=10s #success=1 #failure=3
    Startup:    exec [bash -c /bin/ollama pull qwen2.5-coder:1.5b && /bin/ollama cp qwen2.5-coder:1.5b qwen2.5-coder-1.5b-cpu && /bin/ollama run qwen2.5-coder-1.5b-cpu hi] delay=1s timeout=10800s period=3s #success=1 #failure=10
    Environment:
      OLLAMA_HOST:        0.0.0.0:8000
      OLLAMA_KEEP_ALIVE:  999999h
    Mounts:
      /dev/shm from dshm (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r9mfz (ro)


samos123 commented Dec 3, 2024

I just tried out minikube with GPU support and it's working flawlessly. I don't see the same issue there. So it does seem related to kind + GPU support when using nvkind.
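
For comparison, the minikube setup referred to here is presumably along these lines (recent minikube releases support GPU passthrough on the docker driver; the exact invocation is an assumption):

# Single-node minikube cluster with all NVIDIA GPUs exposed via the docker driver.
minikube start --driver=docker --container-runtime=docker --gpus=all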


LogExE commented Feb 13, 2025

I have an nvkind cluster with only Istio installed, and it keeps restarting over and over in CrashLoopBackOff with status "Completed", exit code 0. The "pod sandbox changed" messages also appear in the istiod pod events. This happens even when no nvidia-device-plugin is present in the cluster.


LogExE commented Feb 13, 2025


Actually, after running git pull and rebuilding nvkind, Istio seems to be running okay now.
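
(For anyone else trying the same fix: the rebuild above amounts to something like the following, assuming a standard Go module layout; defer to the repo's own build instructions.)

# Pull the latest changes and rebuild the module.
git pull
go build ./...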
