Inaccurate status-"RUNNING" shown when pod is stuck in pending state #2864

Open
RavinaChidambaram opened this issue Feb 14, 2025 · 1 comment
@RavinaChidambaram

  • Which image of the operator are you using? e.g. ghcr.io/zalando/postgres-operator:v1.12.2
  • Where do you run it - cloud or metal? Kubernetes or OpenShift? K8s
  • Are you running Postgres Operator in production? yes
  • Type of issue? Bug

Hi,
I deployed a PostgreSQL instance, and the pods were stuck in the Pending state. During this time, the PostgresClusterStatus was set to Creating. After some time, the status changed to CreateFailed, and the following errors and warning appeared in the operator logs:

time="2025-02-14T11:07:11Z" level=error msg="failed to create cluster: pod labels error: still failing after 200 retries" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:07:11Z" level=warning msg="cluster created failed: pod labels error: still failing after 200 retries" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:07:11Z" level=error msg="could not create cluster: pod labels error: still failing after 200 retries" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=controller worker=2

Later, when the sync event runs, the operator patches the PostgresClusterStatus to Running, which is incorrect because the pods are still stuck in the Pending state.

[Screenshot from the original issue]

The incorrect status is misleading because the status field is the primary way for users to track the PostgreSQL cluster's state.

Logs from the operator during this sync event:

time="2025-02-14T11:11:29Z" level=debug msg="syncing Patroni config" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:11:29Z" level=warning msg="Patroni config updated? false - errors during config sync: could not get Postgres config from pod test-pg-demo/tcl-minimal-cluster-demo-0: could not get Postgres config from pod test-pg-demo/tcl-minimal-cluster-demo-0:  is not a valid IP', 'could not get Postgres config from pod test-pg-demo/tcl-minimal-cluster-demo-1: could not get Postgres config from pod test-pg-demo/tcl-minimal-cluster-demo-1:  is not a valid IP" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:11:29Z" level=error msg="errors while restarting Postgres in pods via Patroni API: could not restart Postgres in  pod test-pg-demo/tcl-minimal-cluster-demo-0: could not get member data:  is not a valid IP', 'could not restart Postgres in  pod test-pg-demo/tcl-minimal-cluster-demo-1: could not get member data:  is not a valid IP" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:11:29Z" level=debug msg="syncing pod disruption budgets" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:11:29Z" level=debug msg="syncing roles" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:11:29Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:11:44Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:11:59Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:12:14Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:12:29Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:12:44Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:12:59Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:13:14Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:13:14Z" level=error msg="could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:13:14Z" level=debug msg="syncing databases" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:13:14Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:13:29Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:13:44Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:13:59Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:14:14Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:14:29Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:14:44Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:14:59Z" level=warning msg="could not connect to Postgres database: dial tcp 10.245.50.134:5432: connect: connection refused" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:14:59Z" level=error msg="could not sync databases: could not init database connection" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:14:59Z" level=debug msg="syncing prepared databases with schemas" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:14:59Z" level=debug msg="syncing connection pooler (master, replica) from (false, nil) to (false, nil)" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:14:59Z" level=info msg="identified non running pod, potentially skipping major version upgrade" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:14:59Z" level=info msg="identified non running pod, potentially skipping major version upgrade" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2
time="2025-02-14T11:14:59Z" level=info msg="cluster has been synced" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=controller worker=2
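
To make the mismatch easier to spot, the excerpt above can be checked mechanically: collect every error-level message that was logged before "cluster has been synced". The following is only a rough sketch for reading the logs, not part of the operator. It assumes the lines are in logrus text format (key=value pairs with quoted values), and the helper names `parse_logrus_line` and `errors_before_synced` are made up for illustration.

```python
import shlex


def parse_logrus_line(line):
    """Split one logrus text-format line into a dict of its key=value fields."""
    fields = {}
    # shlex keeps quoted values such as msg="..." together as a single token.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def errors_before_synced(lines):
    """Return the error-level messages logged before 'cluster has been synced'."""
    errors = []
    for line in lines:
        fields = parse_logrus_line(line)
        if fields.get("level") == "error":
            errors.append(fields.get("msg", ""))
        elif fields.get("msg") == "cluster has been synced":
            return errors
    return []  # the sync never reported completion in these lines


# Three lines copied verbatim from the log excerpt above.
sample = [
    'time="2025-02-14T11:13:14Z" level=error msg="could not sync roles: could not init db connection: could not init db connection: still failing after 8 retries" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2',
    'time="2025-02-14T11:14:59Z" level=error msg="could not sync databases: could not init database connection" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=cluster worker=2',
    'time="2025-02-14T11:14:59Z" level=info msg="cluster has been synced" cluster-name=test-pg-demo/tcl-minimal-cluster-demo pkg=controller worker=2',
]

for msg in errors_before_synced(sample):
    print(msg)
```

Any message printed here is an error that the sync logged and then still reported success after.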
@FxKu
Member

FxKu commented Feb 14, 2025

Interesting case. The SYNC completes even though no pod is running. Sounds like we should fail the sync when no pod is running.
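
A minimal sketch of that idea, written in Python for brevity rather than as operator code: decide the status to patch from the pod phases observed at the end of the sync. The function name `status_after_sync` is hypothetical, and the "SyncFailed" status value is an assumption here; only Creating, CreateFailed and Running are mentioned in this issue. "Pending" and "Running" are standard Kubernetes pod phases.

```python
from typing import Iterable


def status_after_sync(pod_phases: Iterable[str]) -> str:
    """Pick the cluster status to patch after a sync (illustrative only).

    The sync counts as failed when no pod is in the Running phase,
    which also covers the case where no pods exist at all.
    """
    if not any(phase == "Running" for phase in pod_phases):
        return "SyncFailed"  # assumed status name, see the note above
    return "Running"


# Both pods from the report are stuck in Pending.
print(status_after_sync(["Pending", "Pending"]))
```

With a check like this, the cluster in the report would not be patched to Running while both pods are still Pending.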

FxKu added the bug label Feb 14, 2025