
[Issue]: Controller and Webhook remain in pending #23

Open
happytreees opened this issue Dec 4, 2024 · 3 comments

Comments

@happytreees

Problem Description

When deploying the gpu-operator on a Kubernetes cluster where the control planes are hidden, such as in a managed Kubernetes environment, the controller and webhook pods remain in Pending.

The pods are unable to schedule because of the node affinity rules in the default values:

  nodeAffinity:
    nodeSelectorTerms:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
      - key: node-role.kubernetes.io/master
        operator: Exists

https://github.com/ROCm/gpu-operator/blob/main/helm-charts/values.yaml

There doesn't appear to be any reason why these pods require a control plane node rather than simply targeting a non-GPU node. It would be better if those values were recommended in the install instructions rather than forced in the default values.
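For context, the `nodeSelectorTerms` entries above amount to a hard node-affinity requirement on the controller and webhook Deployments. A sketch of the approximate rendered spec, assuming the chart maps these values into a standard `requiredDuringSchedulingIgnoredDuringExecution` rule:

```yaml
# Approximate rendered affinity on the controller/webhook pods
# (assumption: the chart turns the values.yaml nodeSelectorTerms into a
# hard scheduling requirement; the two terms are OR'ed, so a node with
# either label would satisfy it)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
        - matchExpressions:
            - key: node-role.kubernetes.io/master
              operator: Exists
```

In a managed cluster neither label exists on any schedulable node, so the scheduler reports FailedScheduling and the pods stay Pending indefinitely.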

Operating System

Ubuntu

CPU

AMD Epyc

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.2.3

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@yansun1996
Contributor

Hi @happytreees, thanks for using the AMD GPU Operator. Please take a look at the example values.yaml for our Helm charts at https://github.com/ROCm/gpu-operator/blob/main/example/helm_charts_k8s_values_example.yaml, where we define the template for specifying nodeAffinity and nodeSelector for the related resources.

Specifically, if you want to remove the nodeAffinity, you can modify them in the values.yaml to be:

  nodeSelector: {}
  nodeAffinity:
    nodeSelectorTerms: []

You can configure this separately for the AMD GPU Operator controller manager, the KMM controller manager, and the KMM webhook. Then pass the new values.yaml to the -f option of your helm install command:

helm install xxxxxx -f new_values.yaml

We will discuss the default values in values.yaml for the next release.

@happytreees
Author

Hello @yansun1996, thanks for the reply. We are already aware of how to work around this issue; however, there doesn't appear to be a valid reason for the node selectors, which is why this issue was opened.

The following workaround is being used:

controllerManager:
  nodeSelector: {}
  nodeAffinity:
    nodeSelectorTerms: []
kmm:
  controller:
    nodeSelector: {}
    nodeAffinity:
      nodeSelectorTerms: []
  webhookServer:
    nodeSelector: {}
    nodeAffinity:
      nodeSelectorTerms: []

@yansun1996
Contributor

yansun1996 commented Dec 11, 2024

Thanks for the suggestion @happytreees, that's a good point. We will discuss it and provide a fix in a future release.
