K8SPSMDB-1263: retry on error during running backup #1838

Open · wants to merge 7 commits into base: main

Conversation

@pooknull (Contributor) commented on Feb 19, 2025:

https://perconadev.atlassian.net/browse/K8SPSMDB-1263

DESCRIPTION

Problem:
During a long backup, psmdb-backup may enter an error state with the error "check for concurrent jobs: getting pbm object: create PBM connection to...".

Cause:
The operator attempts to connect to the mongod pods, but the connection can fail transiently. Instead of retrying, the operator immediately sets an error state for the backup, even though PBM keeps running and completes the backup successfully.

Solution:
The operator should retry connecting to the database during a backup if an error occurs.
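
For reference, below is a minimal, self-contained sketch of the retry pattern, assuming the k8s.io/client-go retry helper visible in the diff (retry.OnError with a wait.Backoff); the backoff values and the fetchStatus placeholder are illustrative assumptions, not the operator's actual code:

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// Illustrative backoff: first retry after 5s, doubling each attempt, 4 attempts total.
var defaultBackoff = wait.Backoff{
	Duration: 5 * time.Second,
	Factor:   2.0,
	Steps:    4,
}

// fetchStatus stands in for a PBM status call such as bcp.Status(ctx, cr);
// it is a placeholder for this sketch only.
func fetchStatus(ctx context.Context) (string, error) {
	return "", fmt.Errorf("create PBM connection to replset: transient error")
}

func main() {
	ctx := context.Background()

	var status string
	// retry.OnError re-runs the closure for every error accepted by the
	// predicate, sleeping according to defaultBackoff between attempts.
	err := retry.OnError(defaultBackoff, func(err error) bool { return err != nil }, func() error {
		s, err := fetchStatus(ctx)
		if err == nil {
			status = s
		}
		return err
	})
	if err != nil {
		fmt.Println("backup status check failed after retries:", err)
		return
	}
	fmt.Println("backup status:", status)
}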

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size bot added the size/M (30-99 lines) label on Feb 19, 2025
@pooknull marked this pull request as ready for review on February 20, 2025, 12:42
}

time.Sleep(5 * time.Second)
return bcp.Status(ctx, cr)

err := retry.OnError(defaultBackoff, func(err error) bool { return err != nil }, func() error {

Contributor:

Since the 1st retry will be executed after 5 seconds due to the time.Sleep(5 * time.Second) we already have, maybe we can adjust the default backoff config to start after 10 seconds. With the current config we start the 1st retry after 5 seconds, then again after 5 seconds, and then after 10. WDYT?
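
For reference only, a backoff whose first retry starts after 10 seconds, as suggested above, could look like the sketch below; the concrete field values are assumptions, not the PR's actual defaultBackoff:

package backup

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// adjustedBackoff is illustrative only: the first retry would fire 10 seconds
// after the existing 5-second sleep, with the delay doubling on each attempt.
var adjustedBackoff = wait.Backoff{
	Duration: 10 * time.Second,
	Factor:   2.0,
	Steps:    3,
}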

Comment on lines 243 to 245
var err error
status, err = bcp.Status(ctx, cr)
return err

Contributor:

Maybe writing it like this:

		updatedStatus, err := bcp.Status(ctx, cr)
		if err == nil {
			status = updatedStatus
		}
		return err

will make it clearer which status we are setting and which error we return.
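
One benefit of the suggested form: when bcp.Status returns an error, the previously captured status is left untouched rather than being reassigned, and the closure no longer needs the separate var err error declaration.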

@hors added this to the v1.20.0 milestone on Feb 25, 2025
@JNKPercona (Collaborator) commented:

Test name Status
arbiter passed
balancer passed
custom-replset-name passed
custom-tls passed
custom-users-roles passed
custom-users-roles-sharded passed
cross-site-sharded passed
data-at-rest-encryption passed
data-sharded passed
demand-backup passed
demand-backup-fs passed
demand-backup-eks-credentials-irsa passed
demand-backup-physical passed
demand-backup-physical-sharded passed
demand-backup-sharded passed
expose-sharded passed
ignore-labels-annotations passed
init-deploy passed
finalizer passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade passed
mongod-major-upgrade-sharded passed
monitoring-2-0 passed
multi-cluster-service passed
non-voting passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-sharded passed
pitr-physical passed
preinit-updates passed
pvc-resize passed
recover-no-primary passed
replset-overrides passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
stable-resource-version passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls passed
upgrade-sharded passed
users passed
version-service passed
We ran 55 out of 55

commit: ea18c4a
image: perconalab/percona-server-mongodb-operator:PR-1838-ea18c4a3
