-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scheduler: prevent nil pointer ref when reschedule policy is missing #24893
base: main
Are you sure you want to change the base?
Conversation
When upgrading from older versions of Nomad, the reschedule policy block may be nil. There is logic to handle this safely in the `NextRescheduleTimeByTime` used for allocs on disconnected clients, but it's missing from the `NextRescheduleTime` method used by more typical allocations. Return an empty time object in this case. Fixes: #24846
//If reschedule is disabled, return early | ||
if reschedulePolicy.Attempts == 0 && !reschedulePolicy.Unlimited { | ||
// If reschedule is disabled, return early | ||
if reschedulePolicy == nil || (reschedulePolicy.Attempts == 0 && !reschedulePolicy.Unlimited) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for reviewers: this behavior is lifted from NextRescheduleTimeByTime
but I'm not 100% convinced it's correct there either. This effectively disables rescheduling rather than falling back to the default behavior. But we should never have a nil reschedule policy because we canonicalize the allocation on restore, which in turn canonicalizes the job, which in turns sets the reschedule policy to the default for the job type. So it's not clear to me how we could get into this situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed what I think is the root cause of the panic: TaskGroup.Canonicalize
would set a nil
ReschedulePolicy for system batch jobs! We never updated the NewRestartPolicy func when we added those! I pushed that fix and switched the logic around to prevent nil panics in the future (although we obviously very rarely add job types).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Slack you pointed out this is not the root cause as the panic mentions the GenericScheduler while sysbatch jobs use the SystemScheduler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted my push
3c91c19
to
df1a994
Compare
When upgrading from older versions of Nomad, the reschedule policy block may be nil. There is logic to handle this safely in the
NextRescheduleTimeByTime
used for allocs on disconnected clients, but it's missing from theNextRescheduleTime
method used by more typical allocations. Return an empty time object in this case.Fixes: #24846