Kubernetes configuration drift: what it is, how to stop it

For decades, operations teams have been battling configuration drift and its unexpected effects on production infrastructure and application environments.

From the sysadmins of the past to today’s SREs and platform engineers, config drift continues to plague teams with security and reliability risks.

This might be surprising. After all, the push towards immutability and GitOps that came with the DevOps movement should have solved this issue a long time ago. But the rise of GitOps was accompanied by a rise in complexity with Kubernetes.

Many organizations struggle to operate and maintain their Kubernetes clusters, let alone keep up with the rapid pace of innovation. With the expanding scope of Kubernetes deployments from multi-cluster to multi-cloud to hybrid environments, there are more opportunities for configuration drift to creep into the system.

In this article, we’ll explore the configuration drift problem, looking at the common causes, especially in the context of Kubernetes environments. We’ll then look at some strategies to combat configuration drift, and end with how Spectro Cloud tackles this issue to promise zero configuration drift.

What is configuration drift?

Configuration drift is when the true state of an environment drifts away from the intended state.

The word “configuration” can refer to anything from the underlying infrastructure components (e.g., type of VM, number of instances) to configuration for applications (e.g. version of Docker image, values of environment variables). As long as the state of the deployed environment differs from the desired or recorded state, it is regarded as configuration drift.
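To make the definition concrete, drift can be thought of as the difference between two snapshots: the recorded desired state and the observed actual state. The snippet below is a minimal sketch of that idea in Python. The keys and values (VM type, instance count, image tag, a stray DEBUG variable) are hypothetical and do not come from any particular tool.

```python
from typing import Any


def find_drift(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return every key whose actual value differs from the desired value.

    Each entry maps key -> (desired_value, actual_value). A key that is
    missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical recorded state vs. what is actually running.
desired = {"vm_type": "m5.large", "instances": 3, "image": "app:1.4.0"}
actual = {"vm_type": "m5.large", "instances": 3, "image": "app:1.4.1", "DEBUG": "true"}

for key, (want, have) in find_drift(desired, actual).items():
    print(f"{key}: desired={want!r} actual={have!r}")
```

Here the image tag was changed and an environment variable was added by hand, so both show up as drift even though the infrastructure-level values still match.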

The impacts of configuration drift can vary depending on the degree and components affected by the drift. Sometimes small changes may go undetected and ultimately not cause a significant issue. But configuration drift can sometimes lead to catastrophic results, such as prolonged downtime or even loss of data.

Often, configuration drift catches teams by surprise, and this unexpected nature is why configuration drift can be so dangerous. The last thing an on-call engineer wants to find when dealing with an outage is configuration drift, because it makes it hard to determine whether the cause is from prolonged effects of the drift or some other bug in the application.

Besides these visible impacts, configuration drift also poses a threat to security, as either outdated components or manually modified parts can create weak points. For organizations in regulated industries, configuration drift can put compliance at risk as well. For example, if an engineer manually opens a port for testing or debugging, that change may violate PCI DSS rules.

Common causes of configuration drift

Many things can cause configuration drift. Most commonly it’s brought on by change, whether planned or unplanned:

Temporary fixes: Sometimes engineers must issue manual fixes in production to respond to outages or incidents. The normal software development loop may take too long or not be feasible to test in time, and an engineer logs directly into the system to modify configuration files. Manual changes issued at this time may not be reflected back in the recorded desired state; they may not be documented at all.
Scheduled updates: Automated updates to various infrastructure or application components can also cause drift. This includes third-party software that may have either scheduled or automated updates as well as external dependencies from other teams or organizations (e.g., open-source libraries).
Unexpected changes: Even when your team’s state is captured and unmodified, it can still be affected by unexpected changes from external parties who may have inadvertently gained access to your system (e.g., a platform team upgrading the wrong AWS account) or, more likely, underestimated the impact of a change (e.g., upgrading the database version without realizing that your team depends on it).

It’s important to note that configuration drift can also happen in the case of inaction:

Deprecated components: Not all pieces of software are supported forever. External vendors may stop support or issue a deprecation notice. Old software with security vulnerabilities may be blocked for execution.
Broken/unhealthy components: If a component cannot self-recover from a failure event, then it will remain broken. Also, sometimes self-recovery mechanisms may not be perfect and cause unexpected outcomes.

In the context of Kubernetes, configuration drift mostly comes from the scale of things that need to be kept in sync. While Kubernetes provides a consistent API, it’s still a challenge to keep everything from application Kubernetes manifests to the entire Kubernetes stack (e.g., infrastructure, tools, policies, etc) in sync. If your organization is considering going multi-cluster or multi-cloud, the burden multiplies as now detecting and addressing drift is that much more difficult.

How to stop config drift

The key to combating configuration drift is immutability. Immutability in software engineering, and more specifically in the context of infrastructure, means that any object that’s created (e.g., infrastructure state, applications, etc) should not be modified afterwards; a change is made by replacing the object with a new version. In practice, this means the following:

On the application level, immutability is implemented by pinning library versions and using Docker containers to encapsulate everything the application needs as a single artifact.

For example, pin packages in a Dockerfile to specific versions instead of installing whatever version is latest:

RUN apt-get update && apt-get install -y package-foo=1.3.2

Instead of:

RUN apt-get update && apt-get install -y package-foo

On the infrastructure side, Infrastructure-as-Code (IaC) tools such as Terraform and Pulumi are used to capture infrastructure state in a declarative manner.
Both of these elements are then checked into a Git repository, following the GitOps pattern, so that the desired state is tracked and version-controlled as closely as possible.
Configuration drift is then checked periodically via the IaC tool’s plan step (e.g., running terraform plan to detect changes) or via a dedicated detection tool (e.g., drift detection in Terraform Cloud, the Pulumi operator).
Finally, access to production is restricted to discourage manual changes that are hard to track and revert.
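As a rough illustration of the periodic plan check described above, the sketch below wraps terraform plan in Python. It relies on the -detailed-exitcode flag, which makes the command exit with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The function names and the working directory are assumptions made for this example, and a non-empty plan can also mean that a code change simply has not been applied yet, so treat a "drift" result as a prompt to investigate rather than as proof.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in-sync"  # plan succeeded and found nothing to change
    if code == 2:
        return "drift"    # plan succeeded and found differences
    return "error"        # 1 (or anything unexpected): the plan itself failed


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` (a hypothetical path) and report the status.

    Assumes the terraform CLI is installed and the directory is initialized.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A scheduled job could call check_drift("infra/production") and raise an alert whenever the result is anything other than "in-sync".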

Of course, all of this is easier said than done. Within Kubernetes, there is a growing number of things to capture and keep in sync.

Most organizations by now have IaC and application immutability fairly well under control. But Kubernetes adds other kinds of “configuration” to track, such as security and network policies as well as monitoring and incident management playbooks. To make matters worse, Kubernetes clusters deployed to different environments inevitably have incompatibilities, and their interactions with non-Kubernetes-native components (e.g., cloud-specific infrastructure, IAM policies, etc) must be accounted for and updated frequently.

Preventing config drift with Palette

Palette addresses this problem by applying Kubernetes’ declarative, desired-state-based management approach to the entire application stack. Specifically, Palette uses Kubernetes Cluster API and Cluster Profiles under the hood:

Cluster API (CAPI) is a Kubernetes Special Interest Group project that extends the Kubernetes APIs to cover cluster lifecycle management. In other words, Kubernetes APIs are used to provision not only the underlying infrastructure components (e.g., VMs, networking, etc) but also the Kubernetes cluster configuration, such as the topology and the cluster resources.
Cluster Profiles are templates or collections of workloads that are needed to run and operate a cluster. These include core Kubernetes components like CSI and CNI as well as common application-level tooling like logging, monitoring, and ingress controllers. Cluster Profiles can also include custom Helm charts for application-level deployments.

With Cluster API and Cluster Profiles together, both the infrastructure and the application-level components are codified in a declarative manner.

Palette goes further to centrally manage these and continuously enforce them in a distributed manner via its decentralized architecture. This means that Palette uses a central management plane to define the desired state and policies for various clusters, while enforcing said policies locally at a cluster level.

Best of all, the reconciliation loop to enforce the policies continues to function even if the cluster is air-gapped or temporarily disconnected from the internet.
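The general idea of a reconciliation loop can be sketched in a few lines of Python. This is a generic illustration of the pattern, not Palette's implementation: the reconcile function, the component names, and the version strings are all hypothetical. The point it shows is that the loop only needs a local copy of the desired state, which is why enforcement can continue while a cluster is disconnected.

```python
def reconcile(desired: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Mutate `actual` until it matches `desired`; return the actions taken."""
    actions = []
    # Create anything that is missing and repair anything that has changed.
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actions.append(f"update {name}")
        actual[name] = spec
    # Remove anything that is running but is not part of the desired state.
    for name in list(actual):
        if name not in desired:
            actions.append(f"delete {name}")
            del actual[name]
    return actions


# Hypothetical locally cached desired state and observed cluster state.
desired = {"cni": "calico-3.26", "logging": "fluentbit-2.1"}
actual = {"cni": "calico-3.25", "debug-pod": "busybox"}

first_pass = reconcile(desired, actual)   # repairs the drift
second_pass = reconcile(desired, actual)  # nothing left to do
print(first_pass)
print(second_pass)
```

Running the function repeatedly is safe: once the actual state matches the desired state, later passes take no action, which is the property that lets a loop like this run continuously.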

By addressing both the X-as-Code for immutability and continuous remediation at the same time, Palette is able to prevent config drift at scale.

As for Day 2 concerns (e.g., OS/K8s upgrades, certificate rotation, security patches) and the config drift that may occur due to inaction, Palette simplifies that process as well: updates can be made at the control plane level and automatically pushed out to all the clusters. Common tasks such as OS patching can be configured to run on boot or on a scheduled cadence.

Finally, for Kubernetes on the edge, Spectro Cloud supports the Kairos project, an open source solution for managing immutable Linux distributions for the edge.

Conclusion

Configuration drift is a pesky problem with serious ramifications. It’s a problem that can easily get out of hand given the complexity of Kubernetes and the scale to which enterprises are rolling out Kubernetes solutions.

While existing tools such as IaC and GitOps can mitigate the impact to a certain extent, it is often challenging to patch together multiple tools in a way that covers all the scenarios in which configuration drift can occur.

On the other hand, Palette provides a single solution that not only declaratively manages the desired state but also continuously enforces it.