Feature: LowNodeUtilization can soft taint overutilized nodes #1626

Open
tiraboschi opened this issue Feb 7, 2025 · 0 comments · May be fixed by #1625
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Is your feature request related to a problem? Please describe.
The descheduler LowNodeUtilization plugin evicts pods from overutilized nodes. Ideally the evicted pods should land on underutilized nodes, but in reality the choice is entirely up to the scheduler.
The descheduler LowNodeUtilization plugin is only responsible for identifying overutilized and underutilized nodes and for descheduling; it is the scheduler that decides where to place the newly recreated pods.
The assumption was that the descheduler and the scheduler act upon similar criteria, achieving a converging result.
In the past the descheduler classified node utilization only according to CPU and memory requests, not actual usage.
With #1555 the LowNodeUtilization plugin gained the ability to classify nodes according to real utilization metrics, which is an interesting feature when the actual resource usage of pods is significantly larger than their requests.
On the other hand, this creates an asymmetry with the scheduler which, at least with the default kube-scheduler, schedules only according to resource requests.
This could break the assumption that a pod descheduled from an overutilized node is likely to land on an underutilized node, since the scheduler is not aware of which nodes the descheduler considers over- and underutilized.

Describe the solution you'd like
A simple and elegant solution is to have the descheduler provide a hint to the scheduler by dynamically setting/removing a soft taint (effect: PreferNoSchedule) on the nodes that it considers overutilized according to the metrics and the thresholds configured on the descheduler side.
A PreferNoSchedule taint is just a "preference", or a "soft" version of a NoSchedule taint: the scheduler will try to avoid placing pods on nodes that the descheduler considers overutilized, but this is not guaranteed.
On the other hand, being just a hint and not a predicate, this does not introduce any risk of limiting the cluster capacity or making it unschedulable, so there is no need to implement complex logic to keep the number of tainted nodes below a certain ratio.
So, on each round, the LowNodeUtilization plugin can simply try to apply the soft taint to all the nodes that are now classified as overutilized and remove it from nodes that are no longer considered overutilized, as sketched below.
This should help the scheduler take decisions that are consistent with the descheduler's expectation that pods evicted from overutilized nodes land on appropriately utilized or underutilized ones.
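
A minimal sketch, in Go with client-go, of how such a per-round reconciliation could look. The taint key is just the default proposed below, and `reconcileSoftTaints` / the `overutilized` map are hypothetical names for illustration, not existing descheduler APIs:

```go
package nodeutilization

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// overutilizedTaint is the soft taint proposed in this issue; the key would be
// user configurable, this is only the suggested default.
var overutilizedTaint = v1.Taint{
	Key:    "nodeutilization.descheduler.kubernetes.io/overutilized",
	Effect: v1.TaintEffectPreferNoSchedule,
}

// reconcileSoftTaints (hypothetical helper) adds the PreferNoSchedule taint to
// every node currently classified as overutilized and removes it from the others.
func reconcileSoftTaints(ctx context.Context, client kubernetes.Interface, nodes []*v1.Node, overutilized map[string]bool) error {
	for _, node := range nodes {
		hasTaint := false
		for _, t := range node.Spec.Taints {
			if t.Key == overutilizedTaint.Key && t.Effect == overutilizedTaint.Effect {
				hasTaint = true
				break
			}
		}
		wantTaint := overutilized[node.Name]
		if hasTaint == wantTaint {
			continue // node already in the desired state
		}

		updated := node.DeepCopy()
		if wantTaint {
			updated.Spec.Taints = append(updated.Spec.Taints, overutilizedTaint)
		} else {
			kept := make([]v1.Taint, 0, len(updated.Spec.Taints))
			for _, t := range updated.Spec.Taints {
				if !(t.Key == overutilizedTaint.Key && t.Effect == overutilizedTaint.Effect) {
					kept = append(kept, t)
				}
			}
			updated.Spec.Taints = kept
		}
		if _, err := client.CoreV1().Nodes().Update(ctx, updated, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```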

We can have this as an optional (disabled by default) sub-feature of the LowNodeUtilization plugin, with a user-configurable key for the soft taint (nodeutilization.descheduler.kubernetes.io/overutilized sounds like a reasonable default).
Users can still set a toleration for that taint on their critical workloads, as sketched below.
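
For illustration, the matching toleration expressed with the client-go types (the key is only the proposed default above, not an existing well-known taint):

```go
package main

import v1 "k8s.io/api/core/v1"

// A toleration that critical workloads could add to their pod spec so that
// the proposed soft taint does not discourage the scheduler from placing
// them on nodes currently classified as overutilized.
var overutilizedToleration = v1.Toleration{
	Key:      "nodeutilization.descheduler.kubernetes.io/overutilized",
	Operator: v1.TolerationOpExists,
	Effect:   v1.TaintEffectPreferNoSchedule,
}
```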

Describe alternatives you've considered
The node controller already sets condition-based hard taints (effect: NoSchedule) for some node conditions (MemoryPressure, DiskPressure, PIDPressure, ...), but those are hard taints applied when the node is already in a serious/critical condition.
Here the idea is to use a similar approach, but with a soft taint when a node is overutilized yet not in a critical condition.
thresholds, targetThresholds and metricsUtilization can already be configured on the LowNodeUtilization plugin, so the plugin is the only component that has an accurate view of which nodes are considered under- and overutilized according to its configuration.
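
For comparison, a sketch of the existing hard taint set by the node lifecycle controller for memory pressure next to the proposed soft taint (the first key is the well-known node.kubernetes.io taint; the second is only the default suggested in this issue):

```go
package main

import v1 "k8s.io/api/core/v1"

// Hard taint set by the node lifecycle controller when a node reports the
// MemoryPressure condition: new pods without a matching toleration are not scheduled there.
var memoryPressureTaint = v1.Taint{
	Key:    "node.kubernetes.io/memory-pressure",
	Effect: v1.TaintEffectNoSchedule,
}

// Proposed soft taint: the scheduler will merely prefer other nodes.
var overutilizedTaint = v1.Taint{
	Key:    "nodeutilization.descheduler.kubernetes.io/overutilized",
	Effect: v1.TaintEffectPreferNoSchedule,
}
```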

What version of descheduler are you using?
descheduler version: v0.32.1

Additional context
Something like this has already been reported in the past in #994, but at that time the descheduler was only able to classify nodes as under/overutilized according to CPU/memory requests, so the assumption that we can rely on an implicit symmetry between descheduler and scheduler behaviour was solid.
Now that the descheduler can consume actual node resource utilization through Kubernetes metrics, the issue could become more frequent when static resource requests do not match the actual usage patterns.
For instance, to achieve higher workload density, in the KubeVirt project the pods that execute KubeVirt VMs are configured by default with a CPU request of 1/10 of what the user requested for the VM.
With such a high overcommit ratio, and a descheduler able to classify nodes as overutilized according to actual utilization metrics that can differ significantly from the static requests, the capability to provide a soft hint for the scheduler to avoid nodes classified as overutilized sounds like a more than reasonable feature.

tiraboschi added the kind/feature label on Feb 7, 2025