June 6, 2022

How to handle blocking PodDisruptionBudgets on K8s with distributed storage

Kevin Reeuwijk
Principal Solution Architect

PDBs: a safety net for data availability

Pod Disruption Budgets, or PDBs, are the safekeepers of both your cluster and your applications. They can ensure that your distributed application never entirely goes down during cluster rebalancing, and they help postpone eviction of a pod until enough other suitable pods are online to handle the application’s traffic.

For core cluster components such as distributed storage, PDBs should be your first line of defense when it comes to ensuring data resiliency, and your last line of defense when it comes to preventing storage cluster outages. Bottom line: good applications have PDBs, poor applications don't.

Outages are never a good thing

In the real world, applications aren’t all written by Kubernetes experts to be perfectly distributed. The dev team may not see it as worth the additional hassle and complexity to implement a distributed/replicated solution (like this one for PostgreSQL).

The end result? There may be, for example, a single database pod that brings down the whole application when you kill it during maintenance.

Sure, many applications can recover from these single points of failure in a minute or two, a disruption that, arguably, most users won’t even notice. But evicting a Rook-Ceph OSD pod at the wrong time might result in catastrophic cluster storage failure. A PDB can protect against that.

These catastrophic failures are the ones you’ll lose sleep over, but in our opinion it’s never a good idea to introduce more application outages than strictly necessary. If you allow the K8s scheduler to evict application pods whenever it wants to, you’re potentially worsening the UX of your application. Especially if you use a K8s rebalancer like the descheduler, this could happen several times a day.

So you shouldn’t shy away from using PDBs widely wherever they might improve the UX of your application, even if that means using a PDB that doesn’t allow any disruptions: a blocking PDB.
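To make the idea of a blocking PDB concrete, here is a minimal sketch that prints a PodDisruptionBudget manifest with `maxUnavailable: 0`. The function name, the PDB name and the `app` label are hypothetical placeholders, and the manifest assumes a cluster that serves the `policy/v1` API.

```shell
# Sketch: print a PodDisruptionBudget manifest that allows zero voluntary
# disruptions (a "blocking" PDB). All names here are hypothetical.
blocking_pdb_manifest() {
  pdb_name="$1"   # name for the PDB object, e.g. my-db-pdb
  app_label="$2"  # value of the "app" label on the pods to protect
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${pdb_name}
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: ${app_label}
EOF
}

# Example (requires cluster access):
#   blocking_pdb_manifest my-db-pdb my-db | kubectl apply -f -
```

With `maxUnavailable: 0`, every eviction request for a matching pod is refused, which is exactly what makes a plain node drain hang.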

What to do when PDBs get in the way of Day 2 operations

The challenge then is how to deal with these blocking PDBs during automated maintenance activity. A PDB is a double-edged sword there: it helps prevent the pod from being killed voluntarily, but it also makes K8s operations difficult, since it prevents the node from draining.
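Before starting maintenance, it can help to know which PDBs currently allow no disruptions at all. The sketch below is one possible way to list them; it relies on the `status.disruptionsAllowed` field of each PDB, and the function names are hypothetical.

```shell
# Keep only PDB rows whose third column (disruptionsAllowed) is 0.
# Input: tab-separated "namespace<TAB>name<TAB>disruptionsAllowed" lines.
filter_blocking_pdbs() {
  awk -F '\t' '$3 == 0 { print $1 "/" $2 }'
}

# List blocking PDBs across all namespaces (requires cluster access).
list_blocking_pdbs() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}' \
    | filter_blocking_pdbs
}

# Example (requires cluster access):
#   list_blocking_pdbs
```

Any PDB printed by `list_blocking_pdbs` will stall a drain of the node that hosts its pods until the budget allows a disruption again.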

Platforms like Spectro Cloud Palette provide built-in capabilities for OS patching, OS upgrades, Kubernetes upgrades and CNI/CSI upgrades. All of these depend on successful draining of K8s nodes to gracefully move application workloads to other nodes as the cluster goes through a rolling upgrade.

You can read a nice example here of how blocking PDBs can cause your cloud-managed cluster to spin up more and more nodes as a result of rolling upgrades with nodes that never successfully drain.

Another complication is that you’ll typically use a StatefulSet to run your application’s stateful workload (such as the database), which implies that you actually want to get the database pod evicted before the node goes down. This is because K8s won’t automatically reschedule an unresponsive pod on a different node when the pod is part of a StatefulSet. So, if the node was given a forced reboot, the PDB would prevent the pod from getting evicted, and the StatefulSet would prevent the pod from getting scheduled on a different node when the node goes down. If the node doesn’t come up again soon for any reason, the application outage could last a lot longer than you were anticipating, since a replacement pod won’t get created until Kubernetes gives up on waiting for the offline node to come back.
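One way to see which pods on a node are affected by this is to list the pods whose owner is a StatefulSet. The following is a rough sketch, not a complete tool; the function names and the node name in the example are hypothetical, and it assumes each pod has at most one owner reference.

```shell
# Keep only pod rows whose third column (owner kind) is StatefulSet.
# Input: tab-separated "namespace<TAB>pod<TAB>ownerKind" lines.
filter_statefulset_pods() {
  awk -F '\t' '$3 == "StatefulSet" { print $1 "/" $2 }'
}

# List StatefulSet-owned pods scheduled on a node (requires cluster access).
statefulset_pods_on_node() {
  node="$1"
  kubectl get pods --all-namespaces --field-selector "spec.nodeName=${node}" \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.ownerReferences[*].kind}{"\n"}{end}' \
    | filter_statefulset_pods
}

# Example (requires cluster access, node name is a placeholder):
#   statefulset_pods_on_node worker-1
```

These are the pods that should be moved off the node before it goes down, rather than left to wait for it to come back.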

The right way to drain nodes

So how do you deal with this situation? The answer lies in a multi-step node draining procedure. Essentially, we want node drains to follow this logic:

  1. Evict all (non-DaemonSet) pods from the node, honoring PDBs if they exist, with a maximum timeout (e.g. 10 minutes) for the whole operation to complete.
  2. If the operation aborts because the timeout threshold was reached, perform a cluster storage check to determine whether e.g. Rook-Ceph PDBs are preventing the node drain while storage resiliency is being ensured.
     a) If storage PDBs are preventing the node drain, go back to step 1 and try again.
     b) If the storage cluster is healthy enough to allow our node to drain, go to the next step.
  3. Drain the remaining pods on the node without using eviction (effectively killing those pods), which will trigger K8s to immediately schedule them on other nodes.

This procedure can be achieved for a cluster with the Rook-Ceph CSI in the following way:


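# The node to drain is passed as the first argument
nodetodrain="$1"
if ! kubectl get node "$nodetodrain"; then
  echo "This node does not exist! Exiting..."
  exit 1
fi

drained=false
until $drained; do
  # Step 1: evict pods, honoring PDBs, until the timeout is reached
  if kubectl drain "$nodetodrain" --ignore-daemonsets --delete-emptydir-data --timeout=300s; then
    echo "Node successfully drained"
    drained=true
  # Step 2: check whether a Rook-Ceph PDB is blocking this node
  elif kubectl -n rook-ceph get pdb "rook-ceph-osd-host-$nodetodrain"; then
    echo "A Rook-Ceph PDB is preventing this node from draining, retrying until Ceph cluster state is healthy..."
  else
    # Step 3: delete the remaining pods directly, bypassing eviction and PDBs
    echo "No Rook-Ceph PDB preventing this node from draining, continuing with forced drain..."
    kubectl drain "$nodetodrain" --ignore-daemonsets --delete-emptydir-data --force --disable-eviction || exit 1
    echo "Node successfully drained"
    drained=true
  fi
done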

With the procedure above, you should be able to safely use PDBs for applications that don’t support non-disruptive eviction of some of their components, without endangering the resiliency of your storage cluster. It ensures the lowest possible downtime of your application during maintenance work.
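As written, the procedure keeps retrying for as long as a storage PDB blocks the drain. If you would rather give up after a fixed number of attempts, a small helper along these lines could wrap the drain command. This is only a sketch: the helper name, the attempt count and the delay are assumptions, not part of the procedure described above.

```shell
# Run a command repeatedly until it succeeds or max_attempts is reached.
# Usage: retry_with_limit <max_attempts> <delay_seconds> <command...>
retry_with_limit() {
  max_attempts="$1"
  delay_seconds="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "Attempt ${attempt}/${max_attempts} failed" >&2
    attempt=$((attempt + 1))
    # Only wait if another attempt will follow
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay_seconds"
    fi
  done
  return 1
}

# Example (requires cluster access, values are placeholders):
#   retry_with_limit 6 30 kubectl drain "$nodetodrain" \
#     --ignore-daemonsets --delete-emptydir-data --timeout=300s
```

The helper returns a non-zero status once the attempts are used up, so the calling script can decide whether to alert an operator or fall through to the forced drain.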

In the Spectro Cloud Palette platform, we are adding this feature as a user-selectable option for individual clusters in an upcoming release. Stay tuned for more information once we release this capability.
