[Issue]: Controller and Webhook remain in pending #23
Comments
Hi @happytreees, thanks for using the AMD GPU Operator. Please take a look at the example. Specifically, if you want to remove the
nodeSelector: {}
nodeAffinity:
  nodeSelectorTerms: []
you can configure it for the AMD GPU Operator controller manager, the KMM controller manager, and the KMM webhook separately. Then use the new
We will discuss the default values in |
Hello @yansun1996, thanks for the reply. We already know how to work around this issue; however, there doesn't appear to be a valid reason for the node selectors, which is why this issue was opened. The following workaround is being used:
|
Thanks for the suggestion, @happytreees, that's a good point. We will discuss it and provide a fix in a future release. |
Problem Description
When deploying the gpu-operator on a Kubernetes cluster where the control planes are hidden, such as in a managed Kubernetes environment, the controller and webhook pods remain in Pending.
The pods are unable to schedule because of the node affinity rules in the default values:
https://github.com/ROCm/gpu-operator/blob/main/helm-charts/values.yaml
There doesn't appear to be any reason why these pods require a control-plane node rather than simply targeting a non-GPU node. It would be better to recommend those values in the install instructions than to force them in the default values.
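To illustrate why the pods stay in Pending, here is a rough sketch of how required node affinity terms filter candidate nodes. It is a simplified model, not the real scheduler, and it makes several assumptions: the default rule is taken to require the `node-role.kubernetes.io/control-plane` label, and the node names and the `example.com/amd-gpu` label are hypothetical. Check the linked values.yaml for the actual defaults.

```python
# Simplified model of required node affinity matching (not the real scheduler).
# Node names and the "example.com/amd-gpu" label are hypothetical.

def expr_matches(labels, expr):
    """Evaluate one matchExpression against a node's labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    raise ValueError(f"unsupported operator: {op}")


def node_matches(labels, terms):
    """Terms are ORed; expressions inside one term are ANDed.
    An empty term matches nothing."""
    for term in terms:
        exprs = term.get("matchExpressions", [])
        if exprs and all(expr_matches(labels, e) for e in exprs):
            return True
    return False


def schedulable_nodes(nodes, terms):
    """Return the sorted names of nodes a pod could land on.
    terms=None models a pod with no required node affinity."""
    if terms is None:
        return sorted(nodes)
    return sorted(n for n, labels in nodes.items() if node_matches(labels, terms))


# A managed cluster: only worker nodes are visible, no control-plane nodes.
nodes = {
    "cpu-worker-1": {"kubernetes.io/os": "linux"},
    "gpu-worker-1": {"kubernetes.io/os": "linux", "example.com/amd-gpu": "true"},
}

# Assumed default: require a control-plane node.
control_plane_only = [
    {"matchExpressions": [
        {"key": "node-role.kubernetes.io/control-plane", "operator": "Exists"},
    ]},
]

# Alternative suggested in the issue: target any non-GPU node.
non_gpu_only = [
    {"matchExpressions": [
        {"key": "example.com/amd-gpu", "operator": "DoesNotExist"},
    ]},
]

print(schedulable_nodes(nodes, control_plane_only))  # no candidates, pod stays Pending
print(schedulable_nodes(nodes, non_gpu_only))
print(schedulable_nodes(nodes, None))
```

With the control-plane requirement, no visible node qualifies, which matches the Pending behaviour described above. Relaxing the rule to "any non-GPU node", or dropping it entirely, leaves at least one candidate.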
Operating System
Ubuntu
CPU
AMD Epyc
GPU
AMD Instinct MI300X
ROCm Version
ROCm 6.2.3
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response