Node NotReady Disruption Controller #1659

Open
diranged opened this issue Sep 12, 2024 · 2 comments · May be fixed by #1755
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@diranged

Description

What problem are you trying to solve?
Sometimes nodes become NotReady for a variety of reasons (bad cloud provider instance, non-responsive kubelet, etc.). When a Node has been Ready and then transitions to NotReady, I think Karpenter should have another Disruption Controller that watches for these nodes and terminates them.
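
To make the request concrete, here is a minimal sketch of the kind of check such a controller might run on each node. It is an illustration only, not Karpenter's implementation. `NodeCondition` mirrors the `type`, `status`, and `lastTransitionTime` fields of a Kubernetes node condition; the names `not_ready_since`, `should_disrupt`, `previously_ready`, and `toleration`, along with the 10-minute default, are assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class NodeCondition:
    # Mirrors the fields of a Kubernetes node condition that matter here.
    type: str                       # e.g. "Ready"
    status: str                     # "True", "False", or "Unknown"
    last_transition_time: datetime  # when the status last changed


def not_ready_since(conditions: List[NodeCondition]) -> Optional[datetime]:
    """Return when the node left Ready, or None if it is currently Ready
    or reports no Ready condition at all."""
    for cond in conditions:
        if cond.type == "Ready":
            # "False" and "Unknown" both count as NotReady; "Unknown" is what
            # a node typically shows once its kubelet stops reporting.
            return None if cond.status == "True" else cond.last_transition_time
    return None


def should_disrupt(
    conditions: List[NodeCondition],
    now: datetime,
    previously_ready: bool,                        # hypothetical: tracked by the caller
    toleration: timedelta = timedelta(minutes=10),  # hypothetical default
) -> bool:
    """True when a node that was once Ready has been NotReady for at least
    the toleration window."""
    if not previously_ready:
        # Nodes that never became Ready are out of scope for this sketch.
        return False
    since = not_ready_since(conditions)
    return since is not None and now - since >= toleration


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stuck = [NodeCondition("Ready", "Unknown", now - timedelta(minutes=30))]
    print(should_disrupt(stuck, now, previously_ready=True))
```

The toleration window is there so that a brief kubelet hiccup does not trigger termination. A real controller would also need to respect disruption budgets and decide how to drain pods from a node whose kubelet is unresponsive.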

Third-party controllers like the Spot.io Ocean product and the Cluster Autoscaler both handle nodes that become NotReady for you automatically. Karpenter should be able to do the same thing.

(Note: we have also raised this with our AWS TAM via a support ticket, and they recommended that we open a feature request here.)

Related: #1573

How important is this feature to you?

This is actually a blocker for us migrating off of our current tools - we launch enough nodes and we have enough failures throughout the day that we cannot fully migrate unless we have a completely automated self healing system where these nodes get cycled out once they become NotReady.

(Separate but related is the ongoing discussion at bottlerocket-os/bottlerocket#4075 about EKS nodes becoming unready due to heavy memory pressure.)

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@diranged diranged added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 12, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 12, 2024
@mariuskimmina

This feature is also important to us: we have run into this case a few times now, where a node gets stuck in NotReady and requires manual intervention. We've been slowly migrating our test envs from Spot.io to Karpenter, and this is currently stopping us from going to production.

We would also be interested in helping to implement this feature.

@mariuskimmina mariuskimmina linked a pull request Oct 16, 2024 that will close this issue

3 participants