-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add node maintenance api support #76
base: main
Are you sure you want to change the base?
Conversation
2008c9d
to
f5be649
Compare
/ok-to-test |
@@ -104,7 +104,7 @@ linters-settings: | |||
threshold: 100 | |||
funlen: | |||
lines: 120 | |||
statements: 55 | |||
statements: 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not make this a habbit :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree, tried to avoid it
@@ -0,0 +1,285 @@ | |||
--- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this file ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for test suite, as it need to register nodeMaintnence API scheme. Do you have an idea which other folder I can place it?
@@ -0,0 +1,125 @@ | |||
/* | |||
Copyright 2022 NVIDIA CORPORATION & AFFILIATES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: for new files please update the year here
m.Log.V(consts.LogLevelInfo).Info("Driver Pod has no NodeName, skipping", "pod", pod.Name) | ||
continue | ||
} | ||
//TODO: Add existing nodeMaintenance objs to sync with nodeState reference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need this TODO? i think the state can (and imo should) be built based on node labels
base.UpgradeStateValidationRequired, len(currentState.NodeStates[base.UpgradeStateValidationRequired]), | ||
base.UpgradeStateUncordonRequired, len(currentState.NodeStates[base.UpgradeStateUncordonRequired])) | ||
|
||
// Determine the object to log this event |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note to self: comment in original ApplyState() code
|
||
return manager, nil | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an alternative to consider that would IMO simplify implementation and be more in-line with how most of this lib is implemented:
- drop the internal reconciler
have upgradeManager implement the needed - ProcessNodeMaintenenaceRequiredNodes()
- ProcessPostMaintenanceRequiredNodes()
- ProcessUncordonRequiredNodes() - modify to set upgrade state done if NodeMaintenance was not found
then upgrade controllers that use this lib would need to watch on node maintenance to use this implementation
IMO its good since then they consciously need to add rbac role for it.
I need to digest this PR abit more
just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get your point here, only that the idea here was not to implement a full fledged nodeMaintenance controller, rather than just watching its status change. In addition watcher(controller's) logic is fairly generic and simple. The motivation was to maintain the following:
Reusability - Changes are seamless to GPU/Net Operators, no need to change upgrade controllers reconcile loop with nodeMaintenance logic (additional if/else cases for enabled/disabled feat).
Consistency - If we move reconciler to upper level probably require different code/tests in GPU/Net OP
Alternately, we can go with the hybrid approach. Upgrade controller Reconciler will fetch updated nodeMaintenance obj. While its predicateFunc as well as state machine will be part of lib logic. I mean that fetched nodeMaintnence obj will need (keep)to be a part of state
object and current lib reconciler logic will be divided between ProcessPostMaintenanceRequiredNodes/ProcessUncordonRequiredNodes
MaintenanceOPFinalizerName = "maintenance.nvidia.com/finalizer" | ||
MaintenanceOPRequestorNS string | ||
MaintenanceOPPodEvictionFilter *string | ||
MaintenanceOPControllerName = "node-maintenance" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: -node-maintenance-requestor ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
||
var ( | ||
// UseMaintenanceOperator enables requestor updrade mode | ||
UseMaintenanceOperator bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any chance to roll those globals into RequestorOptions
and reduce the ammount of globals in the pkg ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
// UseMaintenanceOperator enables requestor updrade mode | ||
UseMaintenanceOperator bool | ||
MaintenanceOPRequestorID = "nvidia.operator.com" | ||
MaintenanceOPFinalizerName = "maintenance.nvidia.com/finalizer" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one is a const no ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I can set it as const
|
||
for _, nodeState := range currentClusterState.NodeStates[base.UpgradeStateUncordonRequired] { | ||
m.Log.V(consts.LogLevelDebug).Info("deleting node maintenance", | ||
nodeState.NodeMaintenance.GetName(), nodeState.NodeMaintenance.GetNamespace()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo you should have a predictable name for the NodeMaintenance and not cache it.
moreover controller may restart and this information will be lost so either way you need to understand how to derive it
if you use the node name it should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using node name as nodeMaintenance name it could be non-exclusive name while there are other operators(requestors) generating nodeMaintenance with same node name (unless they use a different namespaces)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets' move this to hack/test
inplace(legacy) and requestor (maintenance OP) upgrade modes Signed-off-by: Ido Heyvi <[email protected]>
* Adding 'post-maintenance-required' state support Signed-off-by: Ido Heyvi <[email protected]>
Signed-off-by: Ido Heyvi <[email protected]>
controller layer - WIP
f5be649
to
c44ec93
Compare
Following PR contains the following changes:
The motivation is to instantiate both upgrade logics, since there could be some edge cases when some cluster nodes state have already started inplace upgrade before Network operator was set to use maintenance OP logic.
We plan to retire inplace login once requestor flow is robust and used by both operators (Network/GPU)