K8SPSMDB-1263: retry on error during running backup #1838
base: main
}

time.Sleep(5 * time.Second)
return bcp.Status(ctx, cr)

err := retry.OnError(defaultBackoff, func(err error) bool { return err != nil }, func() error {
Since the first attempt already runs after 5 seconds because of the existing time.Sleep(5 * time.Second), maybe we can adjust the default backoff config to start at 10 seconds. With the current config, the first attempt starts after 5 seconds, the next one 5 seconds later, and the one after that 10 seconds later. WDYT?
var err error
status, err = bcp.Status(ctx, cr)
return err
Maybe writing it like this
updatedStatus, err := bcp.Status(ctx, cr)
if err == nil {
status = updatedStatus
}
return err
will make it clearer which status we are setting and which error we return.
commit: ea18c4a
https://perconadev.atlassian.net/browse/K8SPSMDB-1263
DESCRIPTION
Problem:
During a long backup, psmdb-backup may enter an error state with the error: check for concurrent jobs: getting pbm object: create PBM connection to...
Cause:
The operator attempts to connect to the mongod pods but may fail due to a connection problem. It does not retry the connection and instead sets the backup to an error state, even though PBM continues and successfully completes the backup.
Solution:
The operator should retry connecting to the database during a backup if an error occurs.
CHECKLIST
Jira
(Needs Doc) and QA (Needs QA)?
Tests
(compare/*-oc.yml)?
Config/Logging/Testability