Why application availability matters
If you’re reading this, you’re probably concerned about how you can most easily ensure high availability (HA) for the applications running in your Kubernetes clusters.
You’re not alone: apps running in production are in many cases mission-critical. If they perform poorly, or suffer downtime, it can be costly: whether that’s lost revenue, lost reputation, or something even more serious.
That’s not to say that only production environments matter. Clusters used for development and testing are vital, too — no CIO wants hundreds of developers sitting twiddling their thumbs because something in the staging environment is broken.
Kubernetes is a complex distributed system, and many factors — from unexpected interactions to human error to hardware failure — can cause an application to fail or its performance to degrade. Misconfigurations can cause problems to cascade through the cluster. Loosely coupled applications may fail if one of their dependent microservices suffers issues.
Given the velocity of changes to Kubernetes itself and other elements in the cluster, availability may also be non-trivially affected by planned downtime events and maintenance windows — see Darren’s blog on Kubernetes upgrades for an angle on this.
However, there are common design patterns you can follow, using the capabilities Kubernetes provides, to address single points of failure and minimize downtime.
A multilayered approach to HA Kubernetes
In Kubernetes, high availability is achieved through a multilayered approach. This blog, and others to follow, will describe an architecture that is designed to survive outages of parts of a Kubernetes cluster, including worker nodes, control plane components like the API server and scheduler, etcd (or another backing datastore), and underlying services like storage or application pods.
The first step is designing and deploying a resilient cluster architecture on your underlying infrastructure, be it cloud, bare metal or edge. That’s where a solution like Spectro Cloud Palette can help. But it’s also important to make sure your applications are well-architected to best take advantage of your highly available cluster, ensuring the applications themselves are resilient to changes and outages within the cluster, either planned or unplanned.
Let’s jump into five ways you can make that happen.
Load balancing service traffic
Starting from the outside in, you should plan for high availability in the layer that brings traffic into your application.
The three most common ways of bringing traffic into your Kubernetes cluster are NodePort services, Ingress and LoadBalancer services.
If you create a NodePort service, Kubernetes will allocate a high-numbered port (30000-32767 by default) for your application, which you can find with `kubectl get service -n <your-application-namespace>`. That port will be open on every worker node in your cluster, and you can use round-robin DNS, or a hardware or software load balancer, to spread traffic across the nodes.
This is quick and dirty, and relies on your client implementing round-robin DNS correctly, which is not always a safe assumption. Your client application will also determine how failures on nodes and pods are handled, and this level of uncertainty may make NodePort unattractive or infeasible.
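As a sketch, a NodePort Service for a hypothetical `my-app` workload listening on container port 8080 might look like this (names and ports are illustrative, not from any particular deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app          # hypothetical service name
spec:
  type: NodePort
  selector:
    app: my-app         # hypothetical pod label
  ports:
    - port: 80          # cluster-internal service port
      targetPort: 8080  # port the container listens on
      # nodePort is allocated from 30000-32767 if omitted;
      # it will be open on every worker node
```

Your external DNS or load balancer would then point at the allocated node port on each worker node.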
If your application speaks HTTP, it’s generally better to use an ingress layer, like nginx, Kong or HAProxy. You will need to run multiple redundant ingress pods, so that if one pod becomes unavailable or its underlying node fails, another ingress pod can take its place. A hardware or software load balancer in front of the ingress layer then spreads load across the ingress pods and ensures access to the Ingress service.
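Assuming an nginx ingress controller is already running in the cluster, a hypothetical Ingress routing a host to a backend Service might look like this (the hostname, class name and service name are all illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  ingressClassName: nginx      # assumes an nginx ingress controller is installed
  rules:
    - host: app.example.com    # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app   # hypothetical backend Service
                port:
                  number: 80
```

The Ingress object itself is just routing configuration; the availability comes from running multiple replicas of the ingress controller pods behind a load balancer.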
A third common way to bring traffic into a Kubernetes cluster is to use a LoadBalancer object. This is an abstraction in front of several different kinds of services, which can be fulfilled by different cloud objects (AWS ELB, as an example), external hardware (like F5), or cluster-served providers like MetalLB or Kube-vip.
You can use a load balancer to provide fail-over or highly available access to a service (including your ingress service) so that access to your application is maintained in the case of a pod or node failure. When a LoadBalancer is used, traffic will be routed only to pod replicas in Ready state (see below for more on Ready).
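A minimal LoadBalancer Service sketch, again using the hypothetical `my-app` selector; the external address is fulfilled by your cloud provider or an in-cluster provider like MetalLB:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-lb
spec:
  type: LoadBalancer
  selector:
    app: my-app       # hypothetical pod label
  ports:
    - port: 443       # externally exposed port
      targetPort: 8443
```

Traffic arriving at the load balancer is routed only to endpoints backed by pods in Ready state.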
Who handles your replicas?
There are several objects in Kubernetes that describe pods: for example, StatefulSets, DaemonSets, ReplicaSets and Jobs, among others.
If you want Kubernetes to handle your application availability, you should probably use Deployments, which create ReplicaSets. The other types take a less opinionated approach to how and when to operate on pods in response to a problem, so that work falls on you — and we want the work to be on the cluster!
When you create a Deployment, you’re telling the cluster how to create and update your ReplicaSet; the ReplicaSet then ensures that the specified number of pods is running and ready.
When a Deployment is updated or restarted, it will create a new ReplicaSet using one of two strategies: RollingUpdate or Recreate.
RollingUpdate is the default; it replaces the old ReplicaSet by gradually scaling down its pods while gradually scaling up the pods in the new one. See the Kubernetes docs for maxSurge and maxUnavailable in the RollingUpdate section for the math that governs how many pods are created and killed at a time.
This is the behavior that is best for maintaining the availability of your application, and you should take advantage of it.
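As one illustrative example, a Deployment can pin down the rolling-update math explicitly; with `maxUnavailable: 0`, the desired replica count is never reduced during a rollout (the image name and labels below are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most one extra pod beyond the desired count
      maxUnavailable: 0  # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0  # hypothetical image
          ports:
            - containerPort: 8080
```

With these settings, each new pod must become Ready before an old one is terminated, so capacity never dips during an upgrade.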
Availability of persistent storage in stateful workloads
Pods with persistent data have additional considerations, especially when using persistent storage volumes that are ReadWriteOnce (RWO). Examples of RWO volumes include block-based volumes like AWS EBS, Ceph block devices, or Longhorn block devices.
Pods managed by a Deployment can get stuck in Pending during a rollout when using RWO volumes, because the volume cannot be attached to the new pod until it is detached from the old one. StatefulSet pods with persistent data further complicate matters: in the event of a node failure, a StatefulSet pod will be put in Unknown status, and will not be rescheduled until the node recovers or the pod is force-deleted.
In order to work around these concerns, it is important to also plan for your persistent data to be highly available.
One way to address this is to avoid persistentVolumes entirely, and instead use a highly available data store — a clustered database, a highly available object or key/value store, or a messaging system — exclusively for storing data. This could be an S3 bucket, or a Postgres database running either on an external cloud-managed service like AWS RDS or via an in-cluster operator like Cloud Native Postgres, Crunchy Data or Percona.
If you want to stick with persistentVolumes, you can use a read-write-many (RWX) persistent volume, backed by a shared file system like NFS, to avoid getting stuck in a rollout. An RWX volume can be attached simultaneously to several pods and nodes, though classic concerns like performance and shared file-locking still exist with these systems.
There are several ways to deliver an RWX volume, including rook-ceph, which uses CephFS as a backend; Longhorn, which relies on NFS; as well as proprietary solutions like Portworx, which you can deploy directly from Palette.
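A PersistentVolumeClaim requesting an RWX volume might look like the sketch below; the `storageClassName` is hypothetical and depends on which RWX provisioner (NFS, CephFS, Longhorn, etc.) is installed in your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany          # RWX: mountable by many pods on many nodes
  storageClassName: nfs-rwx  # hypothetical StorageClass from an RWX provisioner
  resources:
    requests:
      storage: 10Gi
```

Because the claim can be mounted by old and new pods simultaneously, rolling updates no longer block on volume detach.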
Health checks with probes
Once we have our pod up and storage attached, we need to make sure the pod stays in a good state. Kubernetes supports three kinds of probes for checking the state of running pods, and you should use all three (carefully!) in your workload definitions to help ensure your pods are healthy.
We’ll give an overview here, but refer to the Kubernetes documentation for a deeper dive on the specifics of how to define each probe.
Readiness probes are probably the most familiar to new users of Kubernetes: if you run a command like `kubectl get pod --all-namespaces`, you’ll see a column that shows how many replicas of a pod are in Ready state, which comes from the readiness probe. Kubernetes uses the Readiness state of a pod to decide whether to send network traffic to that pod or not.
Multiple readiness checks can be defined on a pod, and they must all pass in order for the pod to be considered Ready.
Liveness probes are configured similarly to Readiness probes, using tests like HTTP GET requests, shell commands, TCP connection tests, or some mix of those. The difference between Readiness and Liveness is that when a liveness probe fails for longer than the configured tolerance, the container being probed is restarted. This is significantly more risk-prone than the Readiness probe, which simply directs traffic away from that instance of the container.
The Kubernetes documentation describes possible cascading failure scenarios stemming from the use of Liveness probes, and they definitely need to be evaluated and implemented carefully.
Liveness or Readiness? Liveness and Readiness!
There is some very important nuance here.
The obvious difference between these two checks is what the checks do when they fail. A failed readiness check will direct traffic away from your pod, while a failed liveness check will restart your container. The nuance is in what the checks describe when they succeed: readiness tells the cluster your pod is ready to serve traffic, while liveness proves the container is up.
These two checks combine to give you the most granular control when a pod fails.
If an application in your pod deadlocks, you can specify a low threshold for removing the Ready status, and a higher threshold for failing Liveness; this lets you quickly divert traffic away from a pod that is having a problem, but give the pod some time to recover before using the big hammer and restarting it.
If, by comparison, you only specify a Readiness probe, your pod will never be restarted as a result of health checks. If you only specify a Liveness probe, the cluster will restart your container in response to a problem, but with no readiness probe the pod is considered Ready as soon as its containers start, so it may receive traffic before the application can actually serve it.
Finally, Startup probes suppress the other probes, usually so that you can give a slow-starting application time to come up. The startup probe runs according to its definition in the pod spec; if it succeeds within the configured window, it stops and the liveness and readiness probes take over. If it does not succeed in time, the container is restarted. These probes are meant to accommodate a pod that spends a few moments getting itself ready (instantiating a JVM, for example).
Using a combination of Readiness, Liveness and Startup probes helps your cluster determine which pods should receive traffic, and helps it route around pods that may be having problems.
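Pulling the three probes together, a container spec might combine them along these lines; the endpoints, ports and thresholds are illustrative, with the readiness probe tuned to fail fast and the liveness probe tuned to tolerate transient problems:

```yaml
containers:
  - name: my-app
    image: registry.example.com/my-app:1.0   # hypothetical image
    ports:
      - containerPort: 8080
    startupProbe:            # allows up to 30 x 5s = 150s for slow startup
      httpGet:
        path: /healthz       # hypothetical health endpoint
        port: 8080
      failureThreshold: 30
      periodSeconds: 5
    readinessProbe:          # fails fast so traffic is diverted quickly
      httpGet:
        path: /ready         # hypothetical readiness endpoint
        port: 8080
      periodSeconds: 5
      failureThreshold: 1
    livenessProbe:           # more tolerant; restarts only on sustained failure
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 6
```

Note the asymmetry: readiness trips after one failed check (~5s), while liveness tolerates roughly a minute of failures before the restart hammer falls.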
Pod Disruption Budgets
Pod Disruption Budgets (PDBs) let an application owner define how many pods in a replica set can be down at a given time.
It can be helpful to think of this as a layer of protection against administrative actions taken on the cluster: through the API, you can require that a minimum number of replicas stay running, and voluntary disruptions (such as evictions during a node drain) that would take your application below that threshold will be blocked.
It is good practice to create PDBs for your applications that accurately describe the number of lost replicas your application can tolerate; your cluster administrator will be alerted if they attempt actions that would violate the PDB.
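A minimal PDB sketch, assuming the application’s pods carry a hypothetical `app: my-app` label:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2      # keep at least two replicas running during
                       # voluntary disruptions (drains, evictions);
                       # maxUnavailable: 1 is an alternative form
  selector:
    matchLabels:
      app: my-app      # hypothetical pod label
```

With three replicas and `minAvailable: 2`, only one pod can be voluntarily evicted at a time.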
Administrators can still take your pod replicas below your requested minimum by deliberately bypassing your PDB, for example with `kubectl drain some-node --disable-eviction`. This flag makes drain delete pods directly instead of going through the eviction API, which is what enforces PDBs. Without the flag, the administrator will see a warning, and the node will not fully drain if doing so would violate a pod’s PDB.
Cluster administrators should be educated about and aware of PDBs: they should not use the `--disable-eviction` flag without good reason, and ideally they would contact an application owner before forcing a pod off a node in violation of its PDB.
If you want to learn a little more about PDBs, check out this blog.
Putting it all together
Let’s pull this into a high-level description of an application that will show no downtime to end users, even when faced with Kubernetes cluster component upgrades, unplanned node failures, and upgrades of the application itself.
This application should, as an example:
- Run multiple replicas
- Use a Pod Disruption Budget
- Store stateful data either in an external store (like a cache, message queue, database, or object storage service) or in an RWX persistentVolume
- Be deployed from a Deployment object, with deployment strategy type RollingUpdate (again, this is one example, not the only way to achieve zero-downtime upgrades)
- Use a Readiness probe to ensure that traffic only routes to healthy pod replicas
- Use a Liveness probe to detect and correct deadlocking and other hard failures of a pod or node
- Use a Startup probe in the case of a pod that takes longer to start up than a Liveness probe would allow
- Bring traffic to the application in a way that also ensures high availability: this could be an Ingress service in the cluster running multiple replicas, an in-cluster load balancer that detects and corrects for failures, like MetalLB or Kube-vip, or a cloud or on-prem load balancer like an AWS ALB or an F5
Let’s review how this configuration will operate when faced with a failure:
Deployment Foo has pods PodA, PodB and PodC, deployed on nodes node1, node2 and node3. There is a fourth node, node4, running none of these workloads. They are fronted by an nginx ingress service, which has pods nginx1 and nginx2 deployed on node2 and node3.
A goat eats the power cable attached to node3, and the node goes offline (the goat was properly grounded, and is fine).
Goat Meets Electric Fence
The Kubernetes controller-manager attempts to retrieve the status of PodC and nginx2 from the kubelet on node3; under normal circumstances, the kubelet on node3 would be running the liveness and readiness probes for those pods and reporting the results back. In this case, however, the connection to the kubelet on node3 times out.
The controller-manager sets the status of these pods to Unknown. The number of ready replicas for both the Ingress and our application is reduced by one, and the endpoints for those pods are removed, so no new traffic is routed to them. Replacement pods for each workload are then created, and the scheduler assigns them to healthy nodes.
The Ingress pod is scheduled on node4; being stateless, the kubelet on that node pulls the container image (if it’s not already present) and starts the workload. The kubelet runs the readiness check against the pod and, when it succeeds, reports the result back, at which point the API reflects a new ready state and a new endpoint, allowing traffic to flow to the now-ready ingress container.
The application workload will operate similarly, but any startup probes, persistentVolume mounts or sidecar containers will slow the start of the newly scheduled replica.
However, because the endpoint for the old pod was removed, and a new endpoint for the new replica has not been created yet, the other replicas will receive all the traffic for the application, and users should see no impact in accessing the application. Which is, of course, the outcome we’re looking for!
We hope these examples and best practices will help you design more resilient applications, taking advantage of native design patterns to achieve high availability with Kubernetes.
We also hope you’ll stick around for the next in this series, which will describe architecting your cluster for the best resilience and availability.