Operator 1.14.0 and configAwsOrGcp.log_s3_bucket break cluster #2852
I tracked the issue down to this Helm chart setting (configAwsOrGcp.log_s3_bucket). Container startup without the option:
Container startup with the option:
So, we see the configuration job failed and terminated here, leaving the configuration process incomplete, but the container didn't terminate and postgres started anyway. It expected certificates to exist, but they weren't created, and postgres failed to start:
Additional ENV variables for the failed container (all others are the same):
Not sure whether this is a problem with spilo only or with the operator too. Spilo shouldn't fail this way, but it is probably the operator providing incomplete or wrong settings to the spilo container. Another issue: spilo configures its sections in a somewhat random order, which is why some of my clusters start up successfully but fail to restart. If the certificates happened to be created before the log configuration, postgres starts; sometimes the log setup breaks things too early and the container hangs indefinitely. Complete logs and steps to reproduce: https://gist.github.com/baznikin/5d4f5d78613d3f333bd0a34fbd070433
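For reference, the chart value in question would look roughly like this in the operator's Helm values. This is only a sketch: the key path configAwsOrGcp.log_s3_bucket comes from the issue title, while the bucket name is a placeholder.

```yaml
# values.yaml for the postgres-operator Helm chart (sketch, not a verified config)
configAwsOrGcp:
  # Setting this appears to be what triggers the broken container startup;
  # the bucket name below is hypothetical.
  log_s3_bucket: my-postgres-logs
```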
Same problem for me
TL;DR - compact explanation here #2852 (comment)
First of all, sorry for the long logs and the unstructured message. To write a clean issue you have to have at least some understanding of what is happening, but I have no idea yet. I read the release notes for 1.12, 1.13 and 1.14 and decided I could upgrade straight to 1.14.0. But...
few kilobytes of logs and perplexity
After upgrading postgres-operator from 1.11.0 to 1.14.0 my clusters won't start up:
3 clusters successfully started with the updated spilo image (payments-pg, asana-automate-db and develop-postgresql) and 2 did not (brandadmin-pg and games-aggregator-pg). Before I noticed that not all clusters were updated, I initiated the 16 -> 17 upgrade on cluster develop-postgresql and it got stuck with the same symptoms (at first I thought that was the reason, but now I don't think so, see below):
...and no more logs.
Some clusters managed to start, but there is the same error:
After I deleted this pod, it got stuck too!
Processes inside the failed clusters:
After one more deletion it managed to start.
I noticed one thing in the logs: sometimes the container starts with WAL-E variables, sometimes not. The operator shows its status as OK, but it's not:
While I was writing this issue about an hour passed; in despair I restarted this failed pod one more time and it STARTED (container postgres became Ready), but it is still not working: all my clusters consisting of two nodes can't start the replica node. Probably the problem is with the WAL variables...
It's a complete mess!
The operator is installed with Helm and Terraform, and configured with a ConfigMap:
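To make the setup concrete, an operator ConfigMap carrying this option could look roughly like the sketch below. All names here are assumptions for illustration (the ConfigMap name and bucket are placeholders, and only the one relevant key is shown), not the actual configuration from this report.

```yaml
# Hypothetical excerpt of the postgres-operator ConfigMap (names are placeholders)
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  # The suspect option; removing it is what reportedly lets clusters start again.
  log_s3_bucket: my-postgres-logs
```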