Operator 1.14.0 and configAwsOrGcp.log_s3_bucket break cluster #2852

Open
baznikin opened this issue Jan 23, 2025 · 2 comments

baznikin commented Jan 23, 2025

TL;DR: a compact explanation is here: #2852 (comment)

First of all, sorry for the long logs and the unstructured message. To write a clean issue you need at least some understanding of what is happening, and I have no idea yet. I read the release notes for 1.12, 1.13 and 1.14 and decided I could upgrade straight to 1.14.0. But...

(a few kilobytes of logs and perplexity follow)

After upgrading postgres-operator from 1.11.0 to 1.14.0 my clusters won't start up:

$ kubectl get postgresqls.acid.zalan.do -A
NAMESPACE            NAME                  TEAM               VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE    STATUS
brandadmin-staging   brandadmin-pg         develop            16        1      100Gi    1             500Mi            429d   SyncFailed
ga                   games-aggregator-pg   games-aggregator   16        2      125Gi    1000m         512Mi            157d   SyncFailed
payments             payments-pg           develop            16        1      20Gi     1             500Mi            457d   Running
sprint-reports       asana-automate-db     sprint             16        1      25Gi     1             500Mi            358d   Running
staging              develop-postgresql    develop            17        2      250Gi    1             2Gi              435d   UpdateFailed

Three clusters started successfully with the updated spilo image (payments-pg, asana-automate-db and develop-postgresql) and two did not (brandadmin-pg and games-aggregator-pg). Before I noticed that not all clusters had been updated, I initiated a 16 -> 17 upgrade on the develop-postgresql cluster and it got stuck with the same symptoms (at first I thought that was the reason, but now I don't think so, see below):

2025-01-23 15:59:28,706 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

and no further log output.
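The traceback shows json.loads() being called directly on os.getenv('LOG_S3_TAGS'), which returns None when the variable is not set. A minimal sketch of that failure mode and of a defensive parse (my own reconstruction for illustration, not the actual spilo code):

import json
import os

def parse_log_s3_tags():
    # Mirrors the failing call from configure_spilo.py: json.loads(None) raises
    # "TypeError: the JSON object must be str, bytes or bytearray, not NoneType"
    return json.loads(os.getenv('LOG_S3_TAGS'))

def parse_log_s3_tags_guarded():
    # Defensive variant (a sketch, not the upstream fix): fall back to an
    # empty JSON object when LOG_S3_TAGS is missing.
    return json.loads(os.getenv('LOG_S3_TAGS') or '{}')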

Some clusters managed to start even though they show the same error:

$ kubectl -n sprint-reports logs asana-automate-db-0
2025-01-23 15:38:54,983 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 15:38:55,040 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 15:38:55,043 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 15:38:55,191 - bootstrapping - INFO - Configuring certificate
2025-01-23 15:38:55,192 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 15:38:55,775 - bootstrapping - INFO - Configuring pgqd
2025-01-23 15:38:55,776 - bootstrapping - INFO - Configuring wal-e
2025-01-23 15:38:55,777 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2025-01-23 15:38:55,777 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2025-01-23 15:38:55,778 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2025-01-23 15:38:55,780 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2025-01-23 15:38:55,780 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2025-01-23 15:38:55,782 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2025-01-23 15:38:55,782 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2025-01-23 15:38:55,793 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2025-01-23 15:38:55,793 - bootstrapping - INFO - Configuring crontab
2025-01-23 15:38:55,794 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring pgbouncer
2025-01-23 15:38:55,808 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring patroni
2025-01-23 15:38:55,826 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-23 15:38:55,827 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
2025-01-23 15:38:57,916 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2025-01-23 15:38:57,974 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-01-23 15:38:57,995 WARNING: Postgresql is not running.
2025-01-23 15:38:57,995 INFO: Lock owner: ; I am asana-automate-db-0
2025-01-23 15:38:58,000 INFO: pg_controldata:

After I deleted this pod it got stuck too!

Processes inside one of the failed clusters:

root@develop-postgresql-0:/home/postgres# ps ax
    PID TTY      STAT   TIME COMMAND
      1 ?        Ss     0:00 /usr/bin/dumb-init -c --rewrite 1:0 -- /bin/sh /launch.sh
      7 ?        S      0:00 /bin/sh /launch.sh
     20 ?        S      0:00 /usr/bin/runsvdir -P /etc/service
     21 ?        Ss     0:00 runsv pgqd
     22 ?        S      0:00 /bin/bash /scripts/patroni_wait.sh --role primary -- /usr/bin/pgqd /home/postgres/pgq_ticker.ini
     83 ?        S      0:00 sleep 60
     84 pts/0    Ss     0:00 bash
     97 pts/0    R+     0:00 ps ax

After one more deletion it managed to start.

I noticed one thing in the logs: sometimes the container starts with the WAL-E variables, sometimes not. The operator shows the cluster status as OK, but it isn't:

$ kubectl -n brandadmin-staging logs brandadmin-pg-0
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 15:38:43,529 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 15:38:43,587 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 15:38:43,588 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 15:38:43,726 - bootstrapping - INFO - Configuring wal-e
2025-01-23 15:38:43,727 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2025-01-23 15:38:43,728 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2025-01-23 15:38:43,728 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2025-01-23 15:38:43,729 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2025-01-23 15:38:43,729 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2025-01-23 15:38:43,730 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2025-01-23 15:38:43,730 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2025-01-23 15:38:43,730 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2025-01-23 15:38:43,731 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2025-01-23 15:38:43,731 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2025-01-23 15:38:43,732 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2025-01-23 15:38:43,732 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2025-01-23 15:38:43,732 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2025-01-23 15:38:43,733 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2025-01-23 15:38:43,733 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2025-01-23 15:38:43,736 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2025-01-23 15:38:43,736 - bootstrapping - INFO - Configuring certificate
2025-01-23 15:38:43,736 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 15:38:43,910 - bootstrapping - INFO - Configuring crontab
2025-01-23 15:38:43,910 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 15:38:43,931 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

$ kubectl -n brandadmin-staging get postgresqls.acid.zalan.do -A
NAMESPACE            NAME                  TEAM               VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE    STATUS
brandadmin-staging   brandadmin-pg         develop            16        1      100Gi    1             500Mi            429d   SyncFailed
ga                   games-aggregator-pg   games-aggregator   16        2      125Gi    1000m         512Mi            157d   SyncFailed
payments             payments-pg           develop            16        1      20Gi     1             500Mi            457d   Running
sprint-reports       asana-automate-db     sprint             16        1      25Gi     1             500Mi            358d   Running
staging              develop-postgresql    develop            17        2      250Gi    1             2Gi              435d   UpdateFailed

$ kubectl -n brandadmin-staging delete pod brandadmin-pg-0
pod "brandadmin-pg-0" deleted

$ kubectl -n brandadmin-staging get pod
NAME                                                         READY   STATUS             RESTARTS         AGE
brand-admin-backend-api-7b7856c75-d2ktr                      1/1     Running            0                22h
brand-admin-backend-api-7b7856c75-vczsg                      1/1     Running            0                22h
brand-admin-backend-async-tasks-69c5876799-nm4nh             1/1     Running            0                22h
brandadmin-pg-0                                              1/2     Running            0                82s

$ kubectl -n brandadmin-staging logs brandadmin-pg-0
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 15:59:27,840 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 15:59:27,896 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 15:59:27,897 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 15:59:28,051 - bootstrapping - INFO - Configuring crontab
2025-01-23 15:59:28,053 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 15:59:28,070 - bootstrapping - INFO - Configuring certificate
2025-01-23 15:59:28,070 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 15:59:28,706 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

$ kubectl -n brandadmin-staging get pod brandadmin-pg-0
NAME              READY   STATUS    RESTARTS   AGE
brandadmin-pg-0   1/2     Running   0          81m

$ kubectl -n brandadmin-staging get postgresqls.acid.zalan.do brandadmin-pg
NAME            TEAM      VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE    STATUS
brandadmin-pg   develop   16        1      100Gi    1             500Mi            429d   Running

While I was writing this issue an hour or so passed; in despair I restarted the failed pod one more time and it STARTED (the postgres container became Ready), but it is still not working:

kubectl -n brandadmin-staging delete pod brandadmin-pg-0
pod "brandadmin-pg-0" deleted

$ kubectl  -n brandadmin-staging describe pod brandadmin-pg-0
Name:             brandadmin-pg-0
Namespace:        brandadmin-staging
Priority:         0
Service Account:  postgres-pod
Node:             pri-staging-wx2ci/10.106.0.35
Start Time:       Thu, 23 Jan 2025 18:26:41 +0100
Labels:           application=spilo
                  apps.kubernetes.io/pod-index=0
                  cluster-name=brandadmin-pg
                  controller-revision-hash=brandadmin-pg-5f65fc8dbd
                  spilo-role=master
                  statefulset.kubernetes.io/pod-name=brandadmin-pg-0
                  team=develop
Annotations:      prometheus.io/path: /metrics
                  prometheus.io/port: 9187
                  prometheus.io/scrape: true
                  status:
                    {"conn_url":"postgres://10.244.2.104:5432/postgres","api_url":"http://10.244.2.104:8008/patroni","state":"running","role":"primary","versi...
Status:           Running
IP:               10.244.2.104
IPs:
  IP:           10.244.2.104
Controlled By:  StatefulSet/brandadmin-pg
Containers:
  postgres:
    Container ID:   containerd://d67d695d8bce177e07b0ec3c23efbe59cc5349cb81e95abea6ba6e913fe7d836
    Image:          ghcr.io/zalando/spilo-17:4.0-p2
    Image ID:       ghcr.io/zalando/spilo-17@sha256:23861da069941ff5345e6a97455e60a63fc2f16c97857da8f85560370726cbe7
    Ports:          8008/TCP, 5432/TCP, 8080/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 23 Jan 2025 18:26:46 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     10
      memory:  6Gi
    Requests:
      cpu:      1
      memory:   500Mi
    Readiness:  http-get http://:8008/readiness delay=6s timeout=5s period=10s #success=1 #failure=3
    Environment:
      SCOPE:                        brandadmin-pg
      PGROOT:                       /home/postgres/pgdata/pgroot
      POD_IP:                        (v1:status.podIP)
      POD_NAMESPACE:                brandadmin-staging (v1:metadata.namespace)
      PGUSER_SUPERUSER:             postgres
      KUBERNETES_SCOPE_LABEL:       cluster-name
      KUBERNETES_ROLE_LABEL:        spilo-role
      PGPASSWORD_SUPERUSER:         <set to the key 'password' in secret 'postgres.brandadmin-pg.credentials.postgresql.acid.zalan.do'>  Optional: false
      PGUSER_STANDBY:               standby
      PGPASSWORD_STANDBY:           <set to the key 'password' in secret 'standby.brandadmin-pg.credentials.postgresql.acid.zalan.do'>  Optional: false
      PAM_OAUTH2:                   https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
      HUMAN_ROLE:                   zalandos
      PGVERSION:                    16
      KUBERNETES_LABELS:            {"application":"spilo"}
      SPILO_CONFIGURATION:          {"postgresql":{"parameters":{"shared_buffers":"1536MB"}},"bootstrap":{"initdb":[{"auth-host":"md5"},{"auth-local":"trust"}],"dcs":{"postgresql":{"parameters":{"checkpoint_completion_target":"0.9","default_statistics_target":"100","effective_cache_size":"4608MB","effective_io_concurrency":"200","hot_standby_feedback":"on","huge_pages":"off","jit":"false","maintenance_work_mem":"384MB","max_connections":"100","max_standby_archive_delay":"900s","max_standby_streaming_delay":"900s","max_wal_size":"4GB","min_wal_size":"1GB","random_page_cost":"1.1","wal_buffers":"16MB","work_mem":"7864kB"}},"failsafe_mode":true}}}
      DCS_ENABLE_KUBERNETES_API:    true
      ALLOW_NOSSL:                  true
      AWS_ACCESS_KEY_ID:            xxxx
      AWS_ENDPOINT:                 https://fra1.digitaloceanspaces.com
      AWS_SECRET_ACCESS_KEY:        xxxx
      CLONE_AWS_ACCESS_KEY_ID:      xxx
      CLONE_AWS_ENDPOINT:           https://fra1.digitaloceanspaces.com
      CLONE_AWS_SECRET_ACCESS_KEY:  xxxx
      LOG_S3_ENDPOINT:              https://fra1.digitaloceanspaces.com
      WAL_S3_BUCKET:                xxx-staging-db-wal
      WAL_BUCKET_SCOPE_SUFFIX:      /79c4fff8-6efb-477a-83bc-a43d34e8160a
      WAL_BUCKET_SCOPE_PREFIX:      
      LOG_S3_BUCKET:                xxx-staging-db-backups-all
      LOG_BUCKET_SCOPE_SUFFIX:      /79c4fff8-6efb-477a-83bc-a43d34e8160a
      LOG_BUCKET_SCOPE_PREFIX:      
    Mounts:
      /dev/shm from dshm (rw)
      /home/postgres/pgdata from pgdata (rw)
      /var/run/postgresql from postgresql-run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9mghg (ro)
  exporter:
    Container ID:   containerd://48c54ad6591eaf9e60aa92b3235cb4878900fb46e94aacfeedcb70465d005619
    Image:          quay.io/prometheuscommunity/postgres-exporter:latest
    Image ID:       quay.io/prometheuscommunity/postgres-exporter@sha256:6999a7657e2f2fb0ca6ebf417213eebf6dc7d21b30708c622f6fcb11183a2bb0
    Port:           9187/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 23 Jan 2025 18:26:47 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  256Mi
    Requests:
      cpu:     100m
      memory:  200Mi
    Environment:
      POD_NAME:                             brandadmin-pg-0 (v1:metadata.name)
      POD_NAMESPACE:                        brandadmin-staging (v1:metadata.namespace)
      POSTGRES_USER:                        postgres
      POSTGRES_PASSWORD:                    <set to the key 'password' in secret 'postgres.brandadmin-pg.credentials.postgresql.acid.zalan.do'>  Optional: false
      DATA_SOURCE_URI:                      127.0.0.1:5432
      DATA_SOURCE_USER:                     $(POSTGRES_USER)
      DATA_SOURCE_PASS:                     $(POSTGRES_PASSWORD)
      PG_EXPORTER_AUTO_DISCOVER_DATABASES:  true
    Mounts:
      /home/postgres/pgdata from pgdata (rw)
      /var/run/postgresql from postgresql-run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9mghg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  pgdata:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pgdata-brandadmin-pg-0
    ReadOnly:   false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  postgresql-run:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  kube-api-access-9mghg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             workloadKind=postgres:NoSchedule
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  22s   default-scheduler  Successfully assigned brandadmin-staging/brandadmin-pg-0 to pri-staging-wx2ci
  Normal  Pulled     18s   kubelet            Container image "ghcr.io/zalando/spilo-17:4.0-p2" already present on machine
  Normal  Created    18s   kubelet            Created container postgres
  Normal  Started    18s   kubelet            Started container postgres
  Normal  Pulling    18s   kubelet            Pulling image "quay.io/prometheuscommunity/postgres-exporter:latest"
  Normal  Pulled     17s   kubelet            Successfully pulled image "quay.io/prometheuscommunity/postgres-exporter:latest" in 455ms (455ms including waiting). Image size: 11070758 bytes.
  Normal  Created    17s   kubelet            Created container exporter
  Normal  Started    17s   kubelet            Started container exporter

$ kubectl -n brandadmin-staging logs brandadmin-pg-0
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 17:26:47,349 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 17:26:47,407 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 17:26:47,408 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 17:26:47,460 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 17:26:47,462 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 17:26:47,462 - bootstrapping - INFO - Configuring certificate
2025-01-23 17:26:47,463 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 17:26:47,768 - bootstrapping - INFO - Configuring patroni
2025-01-23 17:26:47,792 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-23 17:26:47,793 - bootstrapping - INFO - Configuring wal-e
2025-01-23 17:26:47,794 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2025-01-23 17:26:47,794 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2025-01-23 17:26:47,795 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2025-01-23 17:26:47,795 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2025-01-23 17:26:47,795 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2025-01-23 17:26:47,798 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2025-01-23 17:26:47,798 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2025-01-23 17:26:47,801 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2025-01-23 17:26:47,802 - bootstrapping - INFO - Configuring crontab
2025-01-23 17:26:47,803 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 17:26:47,816 - bootstrapping - INFO - Configuring pam-oauth2
2025-01-23 17:26:47,817 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-01-23 17:26:47,817 - bootstrapping - INFO - Configuring pgbouncer
2025-01-23 17:26:47,817 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-01-23 17:26:47,818 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
2025-01-23 17:26:49,683 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2025-01-23 17:26:49,754 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-01-23 17:26:49,774 WARNING: Postgresql is not running.
2025-01-23 17:26:49,775 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:26:49,781 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7369539194529993100
  Database cluster state: shut down
  pg_control last modified: Thu Jan 23 17:32:16 2025
  Latest checkpoint location: 5A/82000028
  Latest checkpoint's REDO location: 5A/82000028
  Latest checkpoint's REDO WAL file: 0000001B0000005A00000082
  Latest checkpoint's TimeLineID: 27
  Latest checkpoint's PrevTimeLineID: 27
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:929334
  Latest checkpoint's NextOID: 873526
  Latest checkpoint's NextMultiXactId: 19
  Latest checkpoint's NextMultiOffset: 37
  Latest checkpoint's oldestXID: 717
  Latest checkpoint's oldestXID's DB: 5
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 5
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Thu Jan 23 17:32:16 2025
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: 389f9007f77b578836bfcee51eabb488b11042d00c48d0d84e718a826ce23d29

2025-01-23 17:32:36,148 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:36,326 INFO: starting as a secondary
2025-01-23 17:32:36 UTC [51]: [1-1] 67927d34.33 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-23 17:32:36 UTC [51]: [2-1] 67927d34.33 0     LOG:  pg_stat_kcache.linux_hz is set to 125000
2025-01-23 17:32:36 UTC [51]: [3-1] 67927d34.33 0     FATAL:  could not load server certificate file "/run/certs/server.crt": No such file or directory
2025-01-23 17:32:36 UTC [51]: [4-1] 67927d34.33 0     LOG:  database system is shut down
2025-01-23 17:32:36,971 INFO: postmaster pid=51
/var/run/postgresql:5432 - no response
2025-01-23 17:32:46,146 WARNING: Postgresql is not running.
2025-01-23 17:32:46,146 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:46,149 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7369539194529993100
  Database cluster state: shut down
  pg_control last modified: Thu Jan 23 17:32:16 2025
  Latest checkpoint location: 5A/82000028
  Latest checkpoint's REDO location: 5A/82000028
  Latest checkpoint's REDO WAL file: 0000001B0000005A00000082
  Latest checkpoint's TimeLineID: 27
  Latest checkpoint's PrevTimeLineID: 27
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:929334
  Latest checkpoint's NextOID: 873526
  Latest checkpoint's NextMultiXactId: 19
  Latest checkpoint's NextMultiOffset: 37
  Latest checkpoint's oldestXID: 717
  Latest checkpoint's oldestXID's DB: 5
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 5
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Thu Jan 23 17:32:16 2025
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: 389f9007f77b578836bfcee51eabb488b11042d00c48d0d84e718a826ce23d29

2025-01-23 17:32:46,162 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:46,190 INFO: starting as a secondary
2025-01-23 17:32:46 UTC [62]: [1-1] 67927d3e.3e 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-23 17:32:46 UTC [62]: [2-1] 67927d3e.3e 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
2025-01-23 17:32:46 UTC [62]: [3-1] 67927d3e.3e 0     FATAL:  could not load server certificate file "/run/certs/server.crt": No such file or directory
2025-01-23 17:32:46 UTC [62]: [4-1] 67927d3e.3e 0     LOG:  database system is shut down
2025-01-23 17:32:46,821 INFO: postmaster pid=62
/var/run/postgresql:5432 - no response
2025-01-23 17:32:56,143 WARNING: Postgresql is not running.
2025-01-23 17:32:56,144 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:56,146 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071

None of my two-node clusters can start the replica node; the problem is probably with the WAL variables:

$ kubectl -n staging exec -it develop-postgresql-0 -- patronictl topology
Defaulted container "postgres" out of: postgres, exporter
+ Cluster: develop-postgresql (7369262358642845868) --------+----+-----------+
| Member                 | Host         | Role    | State   | TL | Lag in MB |
+------------------------+--------------+---------+---------+----+-----------+
| develop-postgresql-0   | 10.244.0.253 | Leader  | running | 39 |           |
| + develop-postgresql-1 |              | Replica |         |    |   unknown |
+------------------------+--------------+---------+---------+----+-----------+
$ kubectl -n staging logs develop-postgresql-0 | head -20
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 16:20:51,723 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 16:20:51,766 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 16:20:51,767 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 16:20:51,823 - bootstrapping - INFO - Configuring patroni
2025-01-23 16:20:51,846 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-23 16:20:51,847 - bootstrapping - INFO - Configuring pgbouncer
2025-01-23 16:20:51,847 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-01-23 16:20:51,848 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 16:20:51,848 - bootstrapping - INFO - Configuring pam-oauth2
2025-01-23 16:20:51,848 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-01-23 16:20:51,848 - bootstrapping - INFO - Configuring crontab
2025-01-23 16:20:51,848 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 16:20:51,868 - bootstrapping - INFO - Configuring certificate
2025-01-23 16:20:51,868 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 16:20:53,422 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 16:20:53,423 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main

$ kubectl -n staging exec -it develop-postgresql-0 -- tail /home/postgres/pgdata/pgroot/pg_log/postgresql-4.log
Defaulted container "postgres" out of: postgres, exporter
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist

$ kubectl -n staging logs develop-postgresql-1 | head -20
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 16:38:15,383 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 16:38:15,424 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 16:38:15,424 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 16:38:15,473 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 16:38:15,473 - bootstrapping - INFO - Configuring crontab
2025-01-23 16:38:15,473 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 16:38:15,482 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 16:38:15,482 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

$ kubectl -n staging exec -it develop-postgresql-1 -- tail /home/postgres/pgdata/pgroot/pg_log/postgresql-4.log
Defaulted container "postgres" out of: postgres, exporter
        STRUCTURED: time=2025-01-23T16:30:08.235670-00 pid=8215 action=push-wal key=s3://xxx-staging-db-wal/spilo/develop-postgresql/939ea78b-0caf-458f-a088-989352a97300/wal/16/wal_005/00000026000006700000009C.lzo prefix=spilo/develop-postgresql/939ea78b-0caf-458f-a088-989352a97300/wal/16/ rate=18353.3 seg=00000026000006700000009C state=complete
2025-01-23 16:30:12 UTC [8234]: [5-1] 67926e94.202a 0     LOG:  ending log output to stderr
2025-01-23 16:30:12 UTC [8234]: [6-1] 67926e94.202a 0     HINT:  Future log output will go to log destination "csvlog".
ERROR: 2025/01/23 16:30:12.698764 Archive '00000027.history' does not exist.
ERROR: 2025/01/23 16:30:13.204088 Archive '00000026000006700000009D' does not exist.
ERROR: 2025/01/23 16:30:13.573033 Archive '00000027.history' does not exist.
ERROR: 2025/01/23 16:30:13.845528 Archive '00000028.history' does not exist.
ERROR: 2025/01/23 16:30:14.117082 Archive '00000027.history' does not exist.
ERROR: 2025/01/23 16:30:14.478060 Archive '00000027000006700000009D' does not exist.
ERROR: 2025/01/23 16:30:14.807988 Archive '00000026000006700000009D' does not exist.

$ kubectl -n staging describe pod develop-postgresql-0
Name:             develop-postgresql-0
Namespace:        staging
Priority:         0
Service Account:  postgres-pod
Node:             pri-staging-wx2cv/10.106.0.46
Start Time:       Thu, 23 Jan 2025 17:20:44 +0100
Labels:           application=spilo
                  apps.kubernetes.io/pod-index=0
                  cluster-name=develop-postgresql
                  controller-revision-hash=develop-postgresql-5f869975bf
                  spilo-role=master
                  statefulset.kubernetes.io/pod-name=develop-postgresql-0
                  team=develop
Annotations:      prometheus.io/path: /metrics
                  prometheus.io/port: 9187
                  prometheus.io/scrape: true
                  status:
                    {"conn_url":"postgres://10.244.0.253:5432/postgres","api_url":"http://10.244.0.253:8008/patroni","state":"running","role":"primary","versi...
Status:           Running
IP:               10.244.0.253
IPs:
  IP:           10.244.0.253
Controlled By:  StatefulSet/develop-postgresql
Containers:
  postgres:
    Container ID:   containerd://5004728ea5d71484a313b6124f2534a839da5ef0527427cec1942f135aa33e93
    Image:          ghcr.io/zalando/spilo-17:4.0-p2
    Image ID:       ghcr.io/zalando/spilo-17@sha256:23861da069941ff5345e6a97455e60a63fc2f16c97857da8f85560370726cbe7
    Ports:          8008/TCP, 5432/TCP, 8080/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 23 Jan 2025 17:20:50 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     10
      memory:  13500Mi
    Requests:
      cpu:      1
      memory:   2Gi
    Readiness:  http-get http://:8008/readiness delay=6s timeout=5s period=10s #success=1 #failure=3
    Environment:
      SCOPE:                        develop-postgresql
      PGROOT:                       /home/postgres/pgdata/pgroot
      POD_IP:                        (v1:status.podIP)
      POD_NAMESPACE:                staging (v1:metadata.namespace)
      PGUSER_SUPERUSER:             postgres
      KUBERNETES_SCOPE_LABEL:       cluster-name
      KUBERNETES_ROLE_LABEL:        spilo-role
      PGPASSWORD_SUPERUSER:         <set to the key 'password' in secret 'postgres.develop-postgresql.credentials.postgresql.acid.zalan.do'>  Optional: false
      PGUSER_STANDBY:               standby
      PGPASSWORD_STANDBY:           <set to the key 'password' in secret 'standby.develop-postgresql.credentials.postgresql.acid.zalan.do'>  Optional: false
      PAM_OAUTH2:                   https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
      HUMAN_ROLE:                   zalandos
      PGVERSION:                    17
      KUBERNETES_LABELS:            {"application":"spilo"}
      SPILO_CONFIGURATION:          {"postgresql":{"parameters":{"shared_buffers":"3GB","shared_preload_libraries":"bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,pg_cron,pg_stat_kcache,decoderbufs"}},"bootstrap":{"initdb":[{"auth-host":"md5"},{"auth-local":"trust"}],"dcs":{"postgresql":{"parameters":{"checkpoint_completion_target":"0.9","default_statistics_target":"100","effective_cache_size":"9GB","effective_io_concurrency":"200","hot_standby_feedback":"on","huge_pages":"off","jit":"false","maintenance_work_mem":"768MB","max_connections":"200","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_standby_archive_delay":"900s","max_standby_streaming_delay":"900s","max_wal_size":"4GB","max_worker_processes":"8","min_wal_size":"1GB","random_page_cost":"1.1","wal_buffers":"16MB","wal_level":"logical","work_mem":"4MB"}},"failsafe_mode":true}}}
      DCS_ENABLE_KUBERNETES_API:    true
      ALLOW_NOSSL:                  true
      AWS_ACCESS_KEY_ID:            xxx
      AWS_ENDPOINT:                 https://fra1.digitaloceanspaces.com
      AWS_SECRET_ACCESS_KEY:        xxx
      CLONE_AWS_ACCESS_KEY_ID:      xxx
      CLONE_AWS_ENDPOINT:           https://fra1.digitaloceanspaces.com
      CLONE_AWS_SECRET_ACCESS_KEY:  xxx
      LOG_S3_ENDPOINT:              https://fra1.digitaloceanspaces.com
      WAL_S3_BUCKET:                xxx-staging-db-wal
      WAL_BUCKET_SCOPE_SUFFIX:      /939ea78b-0caf-458f-a088-989352a97300
      WAL_BUCKET_SCOPE_PREFIX:      
      LOG_S3_BUCKET:                xxx-staging-db-backups-all
      LOG_BUCKET_SCOPE_SUFFIX:      /939ea78b-0caf-458f-a088-989352a97300
      LOG_BUCKET_SCOPE_PREFIX:      
    Mounts:

It's a complete mess!

The operator is installed with Helm via Terraform and configured with a ConfigMap:

resource "kubectl_manifest" "postgres-pod-config" {
  yaml_body = <<-EOF
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: postgres-pod-config
      namespace: ${var.namespace}
    data:
      ALLOW_NOSSL: "true"
      # WAL archiving and physical basebackups for PITR
      AWS_ENDPOINT: ${local.s3_endpoint}
      AWS_SECRET_ACCESS_KEY: ${local.s3_secret_key}
      AWS_ACCESS_KEY_ID: ${local.s3_access_id}
      # default values for cloning a cluster (same as above)
      CLONE_AWS_ENDPOINT: ${local.clone_s3_endpoint}
      CLONE_AWS_SECRET_ACCESS_KEY: ${local.clone_s3_secret_key}
      CLONE_AWS_ACCESS_KEY_ID: ${local.clone_s3_access_id}
      # send pg_logs to s3 (work in progress)
      LOG_S3_ENDPOINT: ${local.s3_endpoint}
    EOF
}

resource "helm_release" "postgres-operator" {
  name       = "postgres-operator"
  namespace  = var.namespace
  chart      = "postgres-operator"
  repository = "https://opensource.zalando.com/postgres-operator/charts/postgres-operator"
  version    = "1.14.0"

  depends_on = [kubectl_manifest.postgres-pod-config]

  dynamic "set" {
    for_each = var.wal_backup ? ["yes"] : []
    content {
      name  = "configAwsOrGcp.wal_s3_bucket"
      value = local.bucket_name_wal
    }
  }

  dynamic "set" {
    for_each = var.log_backup ? ["yes"] : []
    content {
      name  = "configAwsOrGcp.log_s3_bucket"
      value = "${var.name}-db-backups-all" # bucket with logical backups; 15 days ttl
    }
  }

  set {
    name  = "configLogicalBackup.logical_backup_s3_access_key_id"
    value = local.s3_access_id
  }

  set {
    name  = "configLogicalBackup.logical_backup_s3_bucket"
    value = local.bucket_name_backups
  }

  set {
    name  = "configLogicalBackup.logical_backup_s3_region"
    value = var.bucket_region
  }

  set {
    name  = "configLogicalBackup.logical_backup_s3_endpoint"
    value = local.s3_endpoint
  }

  set {
    name  = "configKubernetes.pod_environment_configmap"
    value = "${var.namespace}/postgres-pod-config"
  }
  set {
    name  = "configLogicalBackup.logical_backup_s3_secret_access_key"
    value = local.s3_secret_key
  }

  values = [<<-YAML
    configConnectionPooler:
      connection_pooler_image: "registry.xxx.com/devops/postgres-zalando-pgbouncer:master-32"

    configLogicalBackup:
      logical_backup_docker_image: "registry.xxx.com/devops/postgres-logical-backup:0.6"
      logical_backup_schedule: "32 8 * * *"
      logical_backup_s3_retention_time: "2 week"

    configKubernetes:
      enable_pod_antiaffinity: true
      # it doesn't influence pulling of images from public repos (like operator image) if there is no such secret
      # but will help to fetch postgres-logical-backup image
      pod_service_account_definition: |
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: postgres-pod
        imagePullSecrets:
          - name: gitlab-registry-token
      # became disabled by default since 1.9.0 https://github.com/zalando/postgres-operator/releases/tag/v1.9.0
      # Quote: "We recommend enable_readiness_probe: true with pod_management_policy: parallel"
      enable_readiness_probe: true
      # Quote: "We recommend enable_readiness_probe: true with pod_management_policy: parallel"
      pod_management_policy: "parallel"
      enable_sidecars: true
      share_pgsocket_with_sidecars: true
      custom_pod_annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "9187"

    configPatroni:
      # https://patroni.readthedocs.io/en/master/dcs_failsafe_mode.html
      enable_patroni_failsafe_mode: true

    configGeneral:
      sidecars:
        - name: exporter
          image: quay.io/prometheuscommunity/postgres-exporter:latest
          ports:
            - name: exporter
              containerPort: 9187
              protocol: TCP
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 200Mi
          env:
            - name: DATA_SOURCE_URI
              value: "127.0.0.1:5432"
            - name: DATA_SOURCE_USER
              value: "$(POSTGRES_USER)"
            - name: DATA_SOURCE_PASS
              value: "$(POSTGRES_PASSWORD)"
            - name: PG_EXPORTER_AUTO_DISCOVER_DATABASES
              value: "true"
  YAML
  ]
}


baznikin commented Feb 13, 2025

I tracked the issue down to this Helm chart setting: configAwsOrGcp.log_s3_bucket. Long ago I wanted to ship logs to S3 storage, but at that time there was no way to specify a custom endpoint, so I put it off until later. Everything was fine until chart version 1.14.0 and spilo-17:4.0-p2.

Container startup without the option:

2025-02-12 14:54:48,556 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-02-12 14:54:50,560 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
2025-02-12 14:54:50,561 - bootstrapping - INFO - No meta-data available for this provider
2025-02-12 14:54:50,561 - bootstrapping - INFO - Looks like you are running local
2025-02-12 14:54:50,595 - bootstrapping - INFO - Configuring pgqd
2025-02-12 14:54:50,596 - bootstrapping - INFO - Configuring patroni
2025-02-12 14:54:50,601 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-02-12 14:54:50,601 - bootstrapping - INFO - Configuring pgbouncer
2025-02-12 14:54:50,601 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-02-12 14:54:50,602 - bootstrapping - INFO - Configuring wal-e
2025-02-12 14:54:50,602 - bootstrapping - INFO - Configuring log
2025-02-12 14:54:50,602 - bootstrapping - INFO - Configuring pam-oauth2
2025-02-12 14:54:50,602 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-02-12 14:54:50,602 - bootstrapping - INFO - Configuring crontab
2025-02-12 14:54:50,602 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-02-12 14:54:50,602 - bootstrapping - INFO - Configuring certificate
2025-02-12 14:54:50,602 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-02-12 14:54:50,774 - bootstrapping - INFO - Configuring bootstrap
2025-02-12 14:54:50,774 - bootstrapping - INFO - Configuring standby-cluster
2025-02-12 14:54:52,191 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.

Container startup with the option:

2025-02-12 15:27:28,872 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-02-12 15:27:30,874 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
2025-02-12 15:27:30,875 - bootstrapping - INFO - No meta-data available for this provider
2025-02-12 15:27:30,875 - bootstrapping - INFO - Looks like you are running local
2025-02-12 15:27:30,897 - bootstrapping - INFO - Configuring patroni
2025-02-12 15:27:30,902 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-02-12 15:27:30,902 - bootstrapping - INFO - Configuring pam-oauth2
2025-02-12 15:27:30,902 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-02-12 15:27:30,902 - bootstrapping - INFO - Configuring bootstrap
2025-02-12 15:27:30,902 - bootstrapping - INFO - Configuring wal-e
2025-02-12 15:27:30,902 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
2025-02-12 15:27:32,181 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.

So we see the configuration script failed and terminated here, leaving the configuration process incomplete, but the container did not terminate and Postgres was started anyway. It expected the certificates to exist, but they were never created, so Postgres failed to start:

2025-02-12 15:27:32 UTC [53]: [3-1] 67acbde4.35 0     FATAL:  could not load server certificate file "/run/certs/server.crt": No such file or directory

Additional ENV variables in the failing container (all others are the same):

> LOG_BUCKET_SCOPE_PREFIX=
> LOG_BUCKET_SCOPE_SUFFIX=/b715f8ec-2584-41fd-892a-bda4cba3a5ff
  LOG_ENV_DIR=/run/etc/log.d/env
> LOG_S3_BUCKET=ttt-db-backups-all

I'm not sure whether this is a problem only in spilo or in the operator too: spilo shouldn't fail in such a way, but it is probably the operator that provides incomplete or wrong settings to the spilo container.
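Since the traceback shows os.getenv('LOG_S3_TAGS') returning None, one possible stopgap (untested on my side, purely an assumption drawn from the traceback) would be to inject that variable through the same pod_environment_configmap that already feeds the pods, so configure_spilo.py parses an empty JSON object instead of crashing. Roughly, the ConfigMap managed by the Terraform resource above would gain one entry:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config
  namespace: <operator namespace>   # hypothetical placeholder
data:
  LOG_S3_ENDPOINT: https://fra1.digitaloceanspaces.com
  # valid JSON for configure_spilo.py to parse instead of None (untested workaround)
  LOG_S3_TAGS: '{}'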

Another issue: spilo configures its sections in a seemingly random order, which is why some of my clusters start up successfully but then fail to restart. If the certificates are created before the log configuration, Postgres starts; sometimes the log step breaks things too early and the container hangs indefinitely.

Complete logs and reproduction steps: https://gist.github.com/baznikin/5d4f5d78613d3f333bd0a34fbd070433

petrushinvs commented:

Same problem for me
