Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: prevent nil pointer ref when reschedule policy is missing #24893

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tgross
Copy link
Member

@tgross tgross commented Jan 17, 2025

When upgrading from older versions of Nomad, the reschedule policy block may be nil. There is logic to handle this safely in the NextRescheduleTimeByTime used for allocs on disconnected clients, but it's missing from the NextRescheduleTime method used by more typical allocations. Return an empty time object in this case.

Fixes: #24846

When upgrading from older versions of Nomad, the reschedule policy block may be
nil. There is logic to handle this safely in the `NextRescheduleTimeByTime` used
for allocs on disconnected clients, but it's missing from the
`NextRescheduleTime` method used by more typical allocations. Return an empty
time object in this case.

Fixes: #24846
//If reschedule is disabled, return early
if reschedulePolicy.Attempts == 0 && !reschedulePolicy.Unlimited {
// If reschedule is disabled, return early
if reschedulePolicy == nil || (reschedulePolicy.Attempts == 0 && !reschedulePolicy.Unlimited) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for reviewers: this behavior is lifted from NextRescheduleTimeByTime but I'm not 100% convinced it's correct there either. This effectively disables rescheduling rather than falling back to the default behavior. But we should never have a nil reschedule policy because we canonicalize the allocation on restore, which in turn canonicalizes the job, which in turns sets the reschedule policy to the default for the job type. So it's not clear to me how we could get into this situation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed what I think is the root cause of the panic: TaskGroup.Canonicalize would set a nil ReschedulePolicy for system batch jobs! We never updated the NewRestartPolicy func when we added those! I pushed that fix and switched the logic around to prevent nil panics in the future (although we obviously very rarely add job types).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Slack you pointed out this is not the root cause as the panic mentions the GenericScheduler while sysbatch jobs use the SystemScheduler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted my push

@tgross tgross added this to the 1.9.x milestone Jan 17, 2025
@tgross tgross added backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/1.9.x backport to 1.9.x release line labels Jan 17, 2025
@schmichael schmichael force-pushed the b-scheduler-panic-nil-reschedule-policy branch from 3c91c19 to df1a994 Compare January 17, 2025 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/1.9.x backport to 1.9.x release line
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nil pointer dereference after upgrade from 1.5 to 1.7
2 participants