Skip to content

Commit

Permalink
Critical operation PDB (#2830)
Browse files Browse the repository at this point in the history
Create the second PDB to cover Pods with a special "critical operation" label set.

This label is going to be assigned to all pg cluster's Pods by the Operator during a PG major version upgrade, by Patroni during a cluster/replica bootstrap. It can also be set manually or by any other automation tool.
  • Loading branch information
hughcapet authored Jan 29, 2025
1 parent f49b4f1 commit a56ecaa
Show file tree
Hide file tree
Showing 11 changed files with 456 additions and 123 deletions.
30 changes: 21 additions & 9 deletions docs/administrator.md
Original file line number Diff line number Diff line change
Expand Up @@ -620,22 +620,34 @@ By default the topology key for the pod anti affinity is set to
`kubernetes.io/hostname`, you can set another topology key e.g.
`failure-domain.beta.kubernetes.io/zone`. See [built-in node labels](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#interlude-built-in-node-labels) for available topology keys.

## Pod Disruption Budget
## Pod Disruption Budgets

By default the operator uses a PodDisruptionBudget (PDB) to protect the cluster
from voluntarily disruptions and hence unwanted DB downtime. The `MinAvailable`
parameter of the PDB is set to `1` which prevents killing masters in single-node
clusters and/or the last remaining running instance in a multi-node cluster.
By default the operator creates two PodDisruptionBudgets (PDB) to protect the cluster
from voluntarily disruptions and hence unwanted DB downtime: so-called primary PDB and
and PDB for critical operations.

### Primary PDB
The `MinAvailable` parameter of this PDB is set to `1` and, if `pdb_master_label_selector`
is enabled, label selector includes `spilo-role=master` condition, which prevents killing
masters in single-node clusters and/or the last remaining running instance in a multi-node
cluster.

## PDB for critical operations
The `MinAvailable` parameter of this PDB is equal to the `numberOfInstances` set in the
cluster manifest, while label selector includes `critical-operation=true` condition. This
allows to protect all pods of a cluster, given they are labeled accordingly.
For example, Operator labels all Spilo pods with `critical-operation=true` during the major
version upgrade run. You may want to protect cluster pods during other critical operations
by assigning the label to pods yourself or using other means of automation.

The PDB is only relaxed in two scenarios:

* If a cluster is scaled down to `0` instances (e.g. for draining nodes)
* If the PDB is disabled in the configuration (`enable_pod_disruption_budget`)

The PDB is still in place having `MinAvailable` set to `0`. If enabled it will
be automatically set to `1` on scale up. Disabling PDBs helps avoiding blocking
Kubernetes upgrades in managed K8s environments at the cost of prolonged DB
downtime. See PR [#384](https://github.com/zalando/postgres-operator/pull/384)
The PDBs are still in place having `MinAvailable` set to `0`. Disabling PDBs
helps avoiding blocking Kubernetes upgrades in managed K8s environments at the
cost of prolonged DB downtime. See PR [#384](https://github.com/zalando/postgres-operator/pull/384)
for the use case.

## Add cluster-specific labels
Expand Down
2 changes: 1 addition & 1 deletion docs/quickstart.md
Original file line number Diff line number Diff line change
Expand Up @@ -230,7 +230,7 @@ kubectl delete postgresql acid-minimal-cluster
```

This should remove the associated StatefulSet, database Pods, Services and
Endpoints. The PersistentVolumes are released and the PodDisruptionBudget is
Endpoints. The PersistentVolumes are released and the PodDisruptionBudgets are
deleted. Secrets however are not deleted and backups will remain in place.

When deleting a cluster while it is still starting up or got stuck during that
Expand Down
4 changes: 2 additions & 2 deletions docs/reference/operator_parameters.md
Original file line number Diff line number Diff line change
Expand Up @@ -334,13 +334,13 @@ configuration they are grouped under the `kubernetes` key.
pod namespace).

* **pdb_name_format**
defines the template for PDB (Pod Disruption Budget) names created by the
defines the template for primary PDB (Pod Disruption Budget) name created by the
operator. The default is `postgres-{cluster}-pdb`, where `{cluster}` is
replaced by the cluster name. Only the `{cluster}` placeholders is allowed in
the template.

* **pdb_master_label_selector**
By default the PDB will match the master role hence preventing nodes to be
By default the primary PDB will match the master role hence preventing nodes to be
drained if the node_readiness_label is not used. If this option if set to
`false` the `spilo-role=master` selector will not be added to the PDB.

Expand Down
5 changes: 4 additions & 1 deletion e2e/tests/test_e2e.py
Original file line number Diff line number Diff line change
Expand Up @@ -2547,7 +2547,10 @@ def check_cluster_child_resources_owner_references(self, cluster_name, cluster_n
self.assertTrue(self.has_postgresql_owner_reference(config_ep.metadata.owner_references, inverse), "config endpoint owner reference check failed")

pdb = k8s.api.policy_v1.read_namespaced_pod_disruption_budget("postgres-{}-pdb".format(cluster_name), cluster_namespace)
self.assertTrue(self.has_postgresql_owner_reference(pdb.metadata.owner_references, inverse), "pod disruption owner reference check failed")
self.assertTrue(self.has_postgresql_owner_reference(pdb.metadata.owner_references, inverse), "primary pod disruption budget owner reference check failed")

pdb = k8s.api.policy_v1.read_namespaced_pod_disruption_budget("postgres-{}-critical-op-pdb".format(cluster_name), cluster_namespace)
self.assertTrue(self.has_postgresql_owner_reference(pdb.metadata.owner_references, inverse), "pod disruption budget for critical operations owner reference check failed")

pg_secret = k8s.api.core_v1.read_namespaced_secret("postgres.{}.credentials.postgresql.acid.zalan.do".format(cluster_name), cluster_namespace)
self.assertTrue(self.has_postgresql_owner_reference(pg_secret.metadata.owner_references, inverse), "postgres secret owner reference check failed")
Expand Down
64 changes: 31 additions & 33 deletions pkg/cluster/cluster.go
Original file line number Diff line number Diff line change
Expand Up @@ -59,16 +59,17 @@ type Config struct {
}

type kubeResources struct {
Services map[PostgresRole]*v1.Service
Endpoints map[PostgresRole]*v1.Endpoints
PatroniEndpoints map[string]*v1.Endpoints
PatroniConfigMaps map[string]*v1.ConfigMap
Secrets map[types.UID]*v1.Secret
Statefulset *appsv1.StatefulSet
VolumeClaims map[types.UID]*v1.PersistentVolumeClaim
PodDisruptionBudget *policyv1.PodDisruptionBudget
LogicalBackupJob *batchv1.CronJob
Streams map[string]*zalandov1.FabricEventStream
Services map[PostgresRole]*v1.Service
Endpoints map[PostgresRole]*v1.Endpoints
PatroniEndpoints map[string]*v1.Endpoints
PatroniConfigMaps map[string]*v1.ConfigMap
Secrets map[types.UID]*v1.Secret
Statefulset *appsv1.StatefulSet
VolumeClaims map[types.UID]*v1.PersistentVolumeClaim
PrimaryPodDisruptionBudget *policyv1.PodDisruptionBudget
CriticalOpPodDisruptionBudget *policyv1.PodDisruptionBudget
LogicalBackupJob *batchv1.CronJob
Streams map[string]*zalandov1.FabricEventStream
//Pods are treated separately
}

Expand Down Expand Up @@ -343,14 +344,10 @@ func (c *Cluster) Create() (err error) {
c.logger.Infof("secrets have been successfully created")
c.eventRecorder.Event(c.GetReference(), v1.EventTypeNormal, "Secrets", "The secrets have been successfully created")

if c.PodDisruptionBudget != nil {
return fmt.Errorf("pod disruption budget already exists in the cluster")
if err = c.createPodDisruptionBudgets(); err != nil {
return fmt.Errorf("could not create pod disruption budgets: %v", err)
}
pdb, err := c.createPodDisruptionBudget()
if err != nil {
return fmt.Errorf("could not create pod disruption budget: %v", err)
}
c.logger.Infof("pod disruption budget %q has been successfully created", util.NameFromMeta(pdb.ObjectMeta))
c.logger.Info("pod disruption budgets have been successfully created")

if c.Statefulset != nil {
return fmt.Errorf("statefulset already exists in the cluster")
Expand Down Expand Up @@ -1081,9 +1078,9 @@ func (c *Cluster) Update(oldSpec, newSpec *acidv1.Postgresql) error {
}
}

// pod disruption budget
if err := c.syncPodDisruptionBudget(true); err != nil {
c.logger.Errorf("could not sync pod disruption budget: %v", err)
// pod disruption budgets
if err := c.syncPodDisruptionBudgets(true); err != nil {
c.logger.Errorf("could not sync pod disruption budgets: %v", err)
updateFailed = true
}

Expand Down Expand Up @@ -1228,10 +1225,10 @@ func (c *Cluster) Delete() error {
c.logger.Info("not deleting secrets because disabled in configuration")
}

if err := c.deletePodDisruptionBudget(); err != nil {
if err := c.deletePodDisruptionBudgets(); err != nil {
anyErrors = true
c.logger.Warningf("could not delete pod disruption budget: %v", err)
c.eventRecorder.Eventf(c.GetReference(), v1.EventTypeWarning, "Delete", "could not delete pod disruption budget: %v", err)
c.logger.Warningf("could not delete pod disruption budgets: %v", err)
c.eventRecorder.Eventf(c.GetReference(), v1.EventTypeWarning, "Delete", "could not delete pod disruption budgets: %v", err)
}

for _, role := range []PostgresRole{Master, Replica} {
Expand Down Expand Up @@ -1730,16 +1727,17 @@ func (c *Cluster) GetCurrentProcess() Process {
// GetStatus provides status of the cluster
func (c *Cluster) GetStatus() *ClusterStatus {
status := &ClusterStatus{
Cluster: c.Name,
Namespace: c.Namespace,
Team: c.Spec.TeamID,
Status: c.Status,
Spec: c.Spec,
MasterService: c.GetServiceMaster(),
ReplicaService: c.GetServiceReplica(),
StatefulSet: c.GetStatefulSet(),
PodDisruptionBudget: c.GetPodDisruptionBudget(),
CurrentProcess: c.GetCurrentProcess(),
Cluster: c.Name,
Namespace: c.Namespace,
Team: c.Spec.TeamID,
Status: c.Status,
Spec: c.Spec,
MasterService: c.GetServiceMaster(),
ReplicaService: c.GetServiceReplica(),
StatefulSet: c.GetStatefulSet(),
PrimaryPodDisruptionBudget: c.GetPrimaryPodDisruptionBudget(),
CriticalOpPodDisruptionBudget: c.GetCriticalOpPodDisruptionBudget(),
CurrentProcess: c.GetCurrentProcess(),

Error: fmt.Errorf("error: %s", c.Error),
}
Expand Down
40 changes: 37 additions & 3 deletions pkg/cluster/k8sres.go
Original file line number Diff line number Diff line change
Expand Up @@ -109,10 +109,15 @@ func (c *Cluster) servicePort(role PostgresRole) int32 {
return pgPort
}

func (c *Cluster) podDisruptionBudgetName() string {
func (c *Cluster) PrimaryPodDisruptionBudgetName() string {
return c.OpConfig.PDBNameFormat.Format("cluster", c.Name)
}

func (c *Cluster) criticalOpPodDisruptionBudgetName() string {
pdbTemplate := config.StringTemplate("postgres-{cluster}-critical-op-pdb")
return pdbTemplate.Format("cluster", c.Name)
}

func makeDefaultResources(config *config.Config) acidv1.Resources {

defaultRequests := acidv1.ResourceDescription{
Expand Down Expand Up @@ -2207,7 +2212,7 @@ func (c *Cluster) generateStandbyEnvironment(description *acidv1.StandbyDescript
return result
}

func (c *Cluster) generatePodDisruptionBudget() *policyv1.PodDisruptionBudget {
func (c *Cluster) generatePrimaryPodDisruptionBudget() *policyv1.PodDisruptionBudget {
minAvailable := intstr.FromInt(1)
pdbEnabled := c.OpConfig.EnablePodDisruptionBudget
pdbMasterLabelSelector := c.OpConfig.PDBMasterLabelSelector
Expand All @@ -2225,7 +2230,36 @@ func (c *Cluster) generatePodDisruptionBudget() *policyv1.PodDisruptionBudget {

return &policyv1.PodDisruptionBudget{
ObjectMeta: metav1.ObjectMeta{
Name: c.podDisruptionBudgetName(),
Name: c.PrimaryPodDisruptionBudgetName(),
Namespace: c.Namespace,
Labels: c.labelsSet(true),
Annotations: c.annotationsSet(nil),
OwnerReferences: c.ownerReferences(),
},
Spec: policyv1.PodDisruptionBudgetSpec{
MinAvailable: &minAvailable,
Selector: &metav1.LabelSelector{
MatchLabels: labels,
},
},
}
}

func (c *Cluster) generateCriticalOpPodDisruptionBudget() *policyv1.PodDisruptionBudget {
minAvailable := intstr.FromInt32(c.Spec.NumberOfInstances)
pdbEnabled := c.OpConfig.EnablePodDisruptionBudget

// if PodDisruptionBudget is disabled or if there are no DB pods, set the budget to 0.
if (pdbEnabled != nil && !(*pdbEnabled)) || c.Spec.NumberOfInstances <= 0 {
minAvailable = intstr.FromInt(0)
}

labels := c.labelsSet(false)
labels["critical-operation"] = "true"

return &policyv1.PodDisruptionBudget{
ObjectMeta: metav1.ObjectMeta{
Name: c.criticalOpPodDisruptionBudgetName(),
Namespace: c.Namespace,
Labels: c.labelsSet(true),
Annotations: c.annotationsSet(nil),
Expand Down
Loading

0 comments on commit a56ecaa

Please sign in to comment.