extreme instabilities on well populated cluster #13

Open

zeph opened this issue Nov 15, 2024 · 14 comments

Comments


zeph commented Nov 15, 2024

(as it wrongly seemed) it works for us on a single container, but it won't survive a reboot (see: kubernetes-sigs/kind#2879 (comment))

on Debian 12 (with the NVIDIA drivers installed from bookworm-backports) we have tons of Pods being
killed completely at random... I'm having a hard time isolating the issue beyond pointing at the
OS we bootstrapped from (identical machine specs -> Hetzner's GEX44)

I initially assumed it did work on Ubuntu Jammy, but in reality we were running on a single container

@zeph zeph changed the title extreme instabilities on Debian 12 (bookworm) extreme instabilities on double container Nov 16, 2024

zeph commented Nov 16, 2024

staying on a single container, the system won't survive a reboot, similar to kubernetes-sigs/kind#2879 (comment)

kube-api calls are erroring with net/http: TLS handshake timeout


zeph commented Nov 17, 2024

it now works (after moving the GPUs to a separate container)...
but it still struggles with performance (this is a 192 GB RAM machine, not the Hetzner mentioned above)

[screenshot omitted] here's the status of a host after reboot... this is the applied configuration:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
{{- range $gpu := until numGPUs }}
- role: worker
  extraMounts:
    # We inject all NVIDIA GPUs using the nvidia-container-runtime.
    # This requires `accept-nvidia-visible-devices-as-volume-mounts = true` be set
    # in `/etc/nvidia-container-runtime/config.toml`
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/{{ $gpu }}
{{- end }}
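
For reference, a minimal sketch of how this template is meant to be consumed, based on my reading of the nvkind / nvidia-container-toolkit docs (the exact flag names are assumptions; check your installed versions):

# Prerequisite referenced in the comment inside the template above: let the NVIDIA
# runtime accept GPUs requested as volume mounts (edits /etc/nvidia-container-runtime/config.toml).
sudo nvidia-ctk config --in-place --set accept-nvidia-visible-devices-as-volume-mounts=true

# Render the Go template and create the kind cluster from it.
nvkind cluster create --config-template=nvkind-cluster.yaml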


zeph commented Nov 17, 2024

logs are full of Normal SandboxChanged 25m (x2 over 27m) kubelet Pod sandbox changed, it will be killed and re-created. on the failing containers
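
A quick way to gauge how widespread the sandbox churn is (plain kubectl, nothing nvkind-specific; SandboxChanged is the event reason the kubelet emits):

# List SandboxChanged events across all namespaces, oldest first.
kubectl get events -A --field-selector reason=SandboxChanged --sort-by=.lastTimestamp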


zeph commented Nov 17, 2024

suspecting this might be related to canonical/microk8s#2790 (comment)


zeph commented Nov 17, 2024

it all leads here, to the kicker, but this setting seems to get ignored...

kubeadmConfigPatches:
- |
  apiVersion: kubeadm.k8s.io/v1beta1
  kind: ClusterConfiguration
  metadata:
    name: config
  apiServer:
    extraArgs:
      advertise-address: "0.0.0.0"
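
One way to verify whether the patch actually reached the API server is to inspect the generated manifest inside the control-plane node container (the container name is assumed to follow kind's default <cluster-name>-control-plane pattern):

# Check the rendered kube-apiserver flags inside the kind control-plane container.
docker exec -it <cluster-name>-control-plane \
  grep advertise-address /etc/kubernetes/manifests/kube-apiserver.yaml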



zeph commented Nov 17, 2024

never mind... I tried switching the network layer to Cilium, and it is still happening


zeph commented Nov 17, 2024

no luck... it's unstable even with a plain installation without any config patching... a plain nvkind cluster create loading our full stack gives us tons of pods restarting... it doesn't happen with a vanilla kind create cluster... I don't know where to look anymore

@zeph zeph changed the title extreme instabilities on double container extreme instabilities on well populated cluster Nov 19, 2024
elezar (Member) commented Nov 19, 2024

unstable even with a plain installation without any config patching

Do you mean without injecting GPUs?

One thing to note is that for GPU support one needs to set the nvidia runtime as the default runtime. In cases where no GPUs are requested, this should forward the request as-is to runc. It could be that this is not behaving as expected.
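
For reference, a sketch of how the nvidia runtime is typically made the default for Docker (assuming Docker is the kind node provider here; nvidia-ctk rewrites /etc/docker/daemon.json):

# Register the nvidia runtime with Docker and mark it as the default runtime.
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker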

A further note: the default mechanism for injecting GPUs does not support restarting / restoring a container. When you say "but it won't survive a reboot", do you mean the kind node, or the node where kind is running?


zeph commented Nov 22, 2024

@elezar

Do you mean without injecting GPUs?

I mean when starting plain kind, not through nvkind

we had to switch to minikube as an emergency measure: it worked out of the box with no further changes to the node and is way more stable (I'm not saying this to steer people away from nvkind, otherwise I wouldn't be here commenting on a ticket... it's to describe the condition of the machines we installed on)

injecting GPUs does not support restarting / restoring a container

wait, what? I expect to be able to restart a node and get everything back online and operational
(yes, even in a minimalistic setup like the one kind/nvkind gives me) -> prototyping/demoing
...so, to answer your question: both the kind node and the host it is running on


samos123 commented Dec 3, 2024

I think the issue is related to creating a pod without a GPU that is still able to access the GPU. I am hitting this with only 1 pod running in my cluster, so it's unrelated to how populated the cluster is. I also did not do any reboots, so it's unrelated to reboots in my case.

Events from a pod that's NOT requesting any GPU in my nvkind cluster. It's constantly crashlooping every few minutes (happened 4 times in 86 minutes):

  Type     Reason          Age                 From     Message
  ----     ------          ----                ----     -------
  Normal   Pulling         58m (x5 over 90m)   kubelet  Pulling image "ollama/ollama:latest"
  Normal   SandboxChanged  58m (x4 over 86m)   kubelet  Pod sandbox changed, it will be killed and re-created.

describe pod container output:

Containers:
  server:
    Container ID:   containerd://ee3c057927cb8da49dbef13837fb85313590ae1d94d532a802f67ed86358dab0
    Image:          ollama/ollama:latest
    Image ID:       docker.io/ollama/ollama@sha256:55977eb618082df0f81ea197a75dc1710e54524f2ef71fa1a8b83cc0148b6e2f
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 03 Dec 2024 12:03:11 -0800
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 03 Dec 2024 11:56:08 -0800
      Finished:     Tue, 03 Dec 2024 12:02:51 -0800
    Ready:          True
    Restart Count:  12
    Requests:
      cpu:      1
      memory:   2Gi
    Liveness:   http-get http://:http/ delay=900s timeout=3s period=30s #success=1 #failure=3
    Readiness:  http-get http://:http/ delay=0s timeout=2s period=10s #success=1 #failure=3
    Startup:    exec [bash -c /bin/ollama pull qwen2.5-coder:1.5b && /bin/ollama cp qwen2.5-coder:1.5b qwen2.5-coder-1.5b-cpu && /bin/ollama run qwen2.5-coder-1.5b-cpu hi] delay=1s timeout=10800s period=3s #success=1 #failure=10
    Environment:
      OLLAMA_HOST:        0.0.0.0:8000
      OLLAMA_KEEP_ALIVE:  999999h
    Mounts:
      /dev/shm from dshm (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r9mfz (ro)


samos123 commented Dec 3, 2024

I just tried out minikube with GPU support and it's working flawlessly. I don't see the same issue there. So it does seem related to kind + GPU support when using nvkind.
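
For comparison, the minikube setup referred to here is presumably along these lines (recent minikube releases support GPU passthrough on the docker driver; the exact invocation is an assumption):

# Single-node minikube cluster with all NVIDIA GPUs exposed via the docker driver.
minikube start --driver=docker --container-runtime=docker --gpus=all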


LogExE commented Feb 13, 2025

I have an nvkind cluster with only Istio installed, and it keeps restarting over and over in CrashLoopBackOff with status "Completed", exit code 0. The "pod sandbox changed" messages also appear in the istiod pod events. This happens even when no nvidia-device-plugin is present in the cluster.


LogExE commented Feb 13, 2025


Actually, after running git pull and rebuilding nvkind, Istio seems to be running okay now.
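
(For anyone else trying the same fix: the rebuild above amounts to something like the following, assuming a standard Go module layout; defer to the repo's own build instructions.)

# Pull the latest changes and rebuild the module.
git pull
go build ./...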
