K8SPSMDB-1263: retry on error during running backup #1838

Open · wants to merge 7 commits into base: main

Conversation

@pooknull (Contributor) commented on Feb 19, 2025:

https://perconadev.atlassian.net/browse/K8SPSMDB-1263

DESCRIPTION

Problem:
During a long backup, psmdb-backup may enter an error state with the error "check for concurrent jobs: getting pbm object: create PBM connection to...".

Cause:
The operator attempts to connect to the mongod pods, but the connection can fail transiently. Instead of retrying, the operator immediately sets an error state for the backup, even though PBM keeps running and completes the backup successfully.

Solution:
The operator should retry connecting to the database during a backup if an error occurs.
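
For reference, below is a minimal, self-contained sketch of the retry pattern, assuming the k8s.io/client-go retry helper visible in the diff (retry.OnError with a wait.Backoff); the backoff values and the fetchStatus placeholder are illustrative assumptions, not the operator's actual code:

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// Illustrative backoff: first retry after 5s, doubling each attempt, 4 attempts total.
var defaultBackoff = wait.Backoff{
	Duration: 5 * time.Second,
	Factor:   2.0,
	Steps:    4,
}

// fetchStatus stands in for a PBM status call such as bcp.Status(ctx, cr);
// it is a placeholder for this sketch only.
func fetchStatus(ctx context.Context) (string, error) {
	return "", fmt.Errorf("create PBM connection to replset: transient error")
}

func main() {
	ctx := context.Background()

	var status string
	// retry.OnError re-runs the closure for every error accepted by the
	// predicate, sleeping according to defaultBackoff between attempts.
	err := retry.OnError(defaultBackoff, func(err error) bool { return err != nil }, func() error {
		s, err := fetchStatus(ctx)
		if err == nil {
			status = s
		}
		return err
	})
	if err != nil {
		fmt.Println("backup status check failed after retries:", err)
		return
	}
	fmt.Println("backup status:", status)
}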

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size bot added the size/M (30-99 lines) label on Feb 19, 2025
@pooknull marked this pull request as ready for review on February 20, 2025, 12:42
}

time.Sleep(5 * time.Second)
return bcp.Status(ctx, cr)

err := retry.OnError(defaultBackoff, func(err error) bool { return err != nil }, func() error {

Contributor:

Since the 1st retry will be executed after 5 seconds due to the time.Sleep(5 * time.Second) we already have, maybe we can adjust the default backoff config to start after 10 seconds. With the current config we start the 1st retry after 5 seconds, then again after 5 seconds, and then after 10. WDYT?
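
For reference only, a backoff whose first retry starts after 10 seconds, as suggested above, could look like the sketch below; the concrete field values are assumptions, not the PR's actual defaultBackoff:

package backup

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// adjustedBackoff is illustrative only: the first retry would fire 10 seconds
// after the existing 5-second sleep, with the delay doubling on each attempt.
var adjustedBackoff = wait.Backoff{
	Duration: 10 * time.Second,
	Factor:   2.0,
	Steps:    3,
}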

Comment on lines 243 to 245
var err error
status, err = bcp.Status(ctx, cr)
return err

Contributor:

Maybe writing it like this:

		updatedStatus, err := bcp.Status(ctx, cr)
		if err == nil {
			status = updatedStatus
		}
		return err

will make it clearer which status we are setting and which error we return.
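
One benefit of the suggested form: when bcp.Status returns an error, the previously captured status is left untouched rather than being reassigned, and the closure no longer needs the separate var err error declaration.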

@hors added this to the v1.20.0 milestone on Feb 25, 2025
@JNKPercona (Collaborator) commented:

Test name Status
arbiter passed
balancer passed
custom-replset-name passed
custom-tls passed
custom-users-roles passed
custom-users-roles-sharded passed
cross-site-sharded passed
data-at-rest-encryption passed
data-sharded passed
demand-backup passed
demand-backup-fs passed
demand-backup-eks-credentials-irsa passed
demand-backup-physical passed
demand-backup-physical-sharded passed
demand-backup-sharded passed
expose-sharded passed
ignore-labels-annotations passed
init-deploy passed
finalizer passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade passed
mongod-major-upgrade-sharded passed
monitoring-2-0 passed
multi-cluster-service passed
non-voting passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-sharded passed
pitr-physical passed
preinit-updates passed
pvc-resize passed
recover-no-primary passed
replset-overrides passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
stable-resource-version passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls passed
upgrade-sharded passed
users passed
version-service passed
We ran 55 out of 55

commit: ea18c4a
image: perconalab/percona-server-mongodb-operator:PR-1838-ea18c4a3
