Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add node maintenance api support #76

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

heyvister1
Copy link

@heyvister1 heyvister1 commented Jan 14, 2025

Following PR contains the following changes:

  • Breaks upstream pkg to support inplace (legacy)/requestor (maintenance OP) upgrade logics
.
├── base
├── insplace
├── manager
 │   ├── mocks
└── requestor
  • Adding nodeMaintenance API support
  • Using runtime-controller reconcile loop for watching nodeMaintenance objs status change.

The motivation is to instantiate both upgrade logics, since there could be some edge cases when some cluster nodes state have already started inplace upgrade before Network operator was set to use maintenance OP logic.
We plan to retire inplace login once requestor flow is robust and used by both operators (Network/GPU)

Copy link

copy-pr-bot bot commented Jan 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@heyvister1 heyvister1 force-pushed the add-node-maintenance-api branch 5 times, most recently from 2008c9d to f5be649 Compare January 19, 2025 13:34
@adrianchiris
Copy link
Collaborator

/ok-to-test

@@ -104,7 +104,7 @@ linters-settings:
threshold: 100
funlen:
lines: 120
statements: 55
statements: 60
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not make this a habbit :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, tried to avoid it

@@ -0,0 +1,285 @@
---
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this file ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for test suite, as it need to register nodeMaintnence API scheme. Do you have an idea which other folder I can place it?

@@ -0,0 +1,125 @@
/*
Copyright 2022 NVIDIA CORPORATION & AFFILIATES
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for new files please update the year here

m.Log.V(consts.LogLevelInfo).Info("Driver Pod has no NodeName, skipping", "pod", pod.Name)
continue
}
//TODO: Add existing nodeMaintenance objs to sync with nodeState reference
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this TODO? i think the state can (and imo should) be built based on node labels

base.UpgradeStateValidationRequired, len(currentState.NodeStates[base.UpgradeStateValidationRequired]),
base.UpgradeStateUncordonRequired, len(currentState.NodeStates[base.UpgradeStateUncordonRequired]))

// Determine the object to log this event
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: comment in original ApplyState() code


return manager, nil
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an alternative to consider that would IMO simplify implementation and be more in-line with how most of this lib is implemented:

  • drop the internal reconciler
    have upgradeManager implement the needed
  • ProcessNodeMaintenenaceRequiredNodes()
  • ProcessPostMaintenanceRequiredNodes()
  • ProcessUncordonRequiredNodes() - modify to set upgrade state done if NodeMaintenance was not found

then upgrade controllers that use this lib would need to watch on node maintenance to use this implementation
IMO its good since then they consciously need to add rbac role for it.

I need to digest this PR abit more

just a thought.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point here, only that the idea here was not to implement a full fledged nodeMaintenance controller, rather than just watching its status change. In addition watcher(controller's) logic is fairly generic and simple. The motivation was to maintain the following:
Reusability - Changes are seamless to GPU/Net Operators, no need to change upgrade controllers reconcile loop with nodeMaintenance logic (additional if/else cases for enabled/disabled feat).
Consistency - If we move reconciler to upper level probably require different code/tests in GPU/Net OP

Alternately, we can go with the hybrid approach. Upgrade controller Reconciler will fetch updated nodeMaintenance obj. While its predicateFunc as well as state machine will be part of lib logic. I mean that fetched nodeMaintnence obj will need (keep)to be a part of state object and current lib reconciler logic will be divided between ProcessPostMaintenanceRequiredNodes/ProcessUncordonRequiredNodes

MaintenanceOPFinalizerName = "maintenance.nvidia.com/finalizer"
MaintenanceOPRequestorNS string
MaintenanceOPPodEvictionFilter *string
MaintenanceOPControllerName = "node-maintenance"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: -node-maintenance-requestor ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok


var (
// UseMaintenanceOperator enables requestor updrade mode
UseMaintenanceOperator bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance to roll those globals into RequestorOptions and reduce the ammount of globals in the pkg ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

// UseMaintenanceOperator enables requestor updrade mode
UseMaintenanceOperator bool
MaintenanceOPRequestorID = "nvidia.operator.com"
MaintenanceOPFinalizerName = "maintenance.nvidia.com/finalizer"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one is a const no ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I can set it as const


for _, nodeState := range currentClusterState.NodeStates[base.UpgradeStateUncordonRequired] {
m.Log.V(consts.LogLevelDebug).Info("deleting node maintenance",
nodeState.NodeMaintenance.GetName(), nodeState.NodeMaintenance.GetNamespace())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo you should have a predictable name for the NodeMaintenance and not cache it.
moreover controller may restart and this information will be lost so either way you need to understand how to derive it

if you use the node name it should be fine.

Copy link
Author

@heyvister1 heyvister1 Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using node name as nodeMaintenance name it could be non-exclusive name while there are other operators(requestors) generating nodeMaintenance with same node name (unless they use a different namespaces)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets' move this to hack/test

inplace(legacy) and requestor (maintenance OP) upgrade modes

Signed-off-by: Ido Heyvi <[email protected]>
* Adding 'post-maintenance-required' state support

Signed-off-by: Ido Heyvi <[email protected]>
@heyvister1 heyvister1 force-pushed the add-node-maintenance-api branch from f5be649 to c44ec93 Compare January 27, 2025 13:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants