Why is it so painful to upgrade Kubernetes?
Kubernetes is well known for being a challenging system to maintain and keep current. The shifting landscape of the cloud native ecosystem means that changes are frequent. New versions not only patch bugs and close vulnerabilities, but add or enhance features.
Staying on top of these changes can become an all-consuming task for even the largest operations team, with upgrades taking multiple weeks every quarter.
And the burden of upgrades grows even more complex when:
- Responsibilities are shared between platform teams, DevOps, or other invested parties.
- More clusters, more locations, or more providers are added to the mix.
- Edge locations, with their unique challenges, are in play.
You may naturally worry that patches and upgrades may break the infrastructure and compromise availability of your mission-critical applications. Even large, experienced organizations running Kubernetes can struggle with upgrades — like Reddit did with its major Pi-day outage.
After reading these post-mortems you likely build up a fear of the unknown — has anything changed in the new version that will cause downstream problems? Any unintended consequences? It can seem better to stay with the devil you know. Or turn to coping mechanisms…
But upgrading regularly is important, because:
- New versions not only provide new features, but contain important patches to security vulnerabilities and bugs that could affect the integrity and availability of your workloads
- Older versions of Kubernetes are only supported for a limited period, meaning if you encounter a problem you may be on your own
- Other software and services you use in your infrastructure will assume you’re running a recent version
- Making big version jumps in a single upgrade is riskier and is not endorsed by the Kubernetes docs
In this article we'll help you select the right strategy for upgrading Kubernetes versions, and share some best practices that will ensure your next upgrade goes smoothly.
Which version should you upgrade to?
Understanding the upstream Kubernetes release cycle
Kubernetes follows a typical versioning system expressed as x.y.z, where “x” is the “major” version, “y” is the “minor” version, and “z” is the “patch” version.
For example, at the time of this writing the current Kubernetes version is v1.28.1 from the August release, Planternetes.
Although the semantic versioning scheme and the Kubernetes release cycle plan imply that major versions could occur every few years, there has not been a “major” version change since the original release of 1.0 in 2015.
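To make the x.y.z scheme concrete, here is a minimal Python sketch that splits a version string into its three parts. The names `KubeVersion` and `parse_version` are made up for this example and are not part of any Kubernetes tooling.

```python
from typing import NamedTuple


class KubeVersion(NamedTuple):
    """Illustrative container for the three parts of an x.y.z version."""
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> KubeVersion:
    """Parse a string such as 'v1.28.1' into (major, minor, patch)."""
    parts = text.strip().lstrip("v").split(".")
    if len(parts) != 3:
        raise ValueError(f"expected x.y.z, got {text!r}")
    major, minor, patch = (int(p) for p in parts)
    return KubeVersion(major, minor, patch)


current = parse_version("v1.28.1")
print(current.major, current.minor, current.patch)
```

In practice the "minor" field is the one that matters most for upgrade planning, since the major version has stayed at 1.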
Prior to 2021, the maintainers of Kubernetes aimed for quarterly “minor” releases. But both developers and operators found this cadence too frequent. We explored this in a panel discussion at KubeCon EU 2022, where the audience consensus was that keeping up with upgrades was a very real challenge, with many organizations lagging behind multiple versions.
The release teams struggled with this pace, too. Several releases were delayed or incomplete.
Starting with Kubernetes version 1.22, “minor” upgrades are now released three times per year. This new cadence was accompanied by a calendar of expected release dates and a process for managing and testing included items.
Releases typically land at the beginning of the year, in April-May (usually coinciding with KubeCon EU), and in August-September (often coinciding with KubeCon NA). Maintenance and patch releases are published more frequently throughout the year.
Support follows an N-2 model, meaning that support is available for the current release and the previous two. This means effectively that each Kubernetes release has a full one-year support cycle.
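The N-2 window is easy to express in code. The sketch below is only an illustration of the rule as described above (the current minor release plus the previous two); the function names are hypothetical, and it assumes the major version stays at 1.

```python
def supported_minors(latest_minor: int, window: int = 3) -> list[int]:
    """Return the minor versions covered by an N-2 policy, newest first."""
    return [m for m in range(latest_minor, latest_minor - window, -1) if m >= 0]


def is_supported(cluster_minor: int, latest_minor: int) -> bool:
    """True if a cluster's minor version is still inside the support window."""
    return cluster_minor in supported_minors(latest_minor)


# With 1.28 as the latest release, 1.28, 1.27 and 1.26 are in the window.
print(supported_minors(28))
print(is_supported(25, 28))
```

With three minor releases per year, a window of three releases is what produces the roughly one-year support cycle.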
What do managed Kubernetes providers do?
Many organizations run their Kubernetes clusters inside a managed Kubernetes service such as EKS, AKS, and GKE. This makes upgrades easier in one major way: because the provider manages the control plane node(s), you will not have to perform upgrades on anything besides your worker nodes and other components.
But you still need to be aware of versions. With each new release, the providers must perform additional integration testing and customer support, so their releases lag behind the upstream release by approximately 4-8 weeks. For example, if you check out EKS today, it doesn’t yet support 1.28.
Each provider’s support policies typically follow the same N-2 model, with the time based on their releases.
It is essential to check with your provider and understand their policies and published release windows.
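For planning purposes, the 4-8 week lag mentioned above can be turned into a rough date range. This is a back-of-the-envelope sketch, not a prediction of any provider's schedule: `provider_window` and its parameters are names invented for this example, and the only real-world input is the upstream 1.28 release date of 15 August 2023.

```python
from datetime import date, timedelta


def provider_window(upstream_release: date,
                    min_weeks: int = 4,
                    max_weeks: int = 8) -> tuple[date, date]:
    """Estimate when a managed provider might offer a new minor version.

    Assumes the provider lags the upstream release by min_weeks..max_weeks.
    """
    earliest = upstream_release + timedelta(weeks=min_weeks)
    latest = upstream_release + timedelta(weeks=max_weeks)
    return earliest, latest


earliest, latest = provider_window(date(2023, 8, 15))
print(f"Expect provider support between {earliest} and {latest}")
```

Treat the output as a placeholder for your calendar only; the provider's published release windows remain the authoritative source.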
Kubernetes version skew policy
While the details of Kubernetes component interoperability are beyond the scope of this article, it may be helpful to understand that Kubernetes can operate with different but close versions. The Version Skew Policy document provides details of this support.
This is important to understand as many tools and components will only allow you to upgrade one minor Kubernetes release at a time. For example, going straight from 1.26 to 1.28 would not be supported. Upgrading components in an unsupported manner could leave you without access to the cluster. While having to perform intermediate upgrades is more time consuming, it is the only way to ensure that the upgrade completes successfully without issue.
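The "one minor release at a time" rule translates into a simple list of intermediate stops. The following sketch assumes a 1.x version line and uses a hypothetical helper name, `upgrade_path`.

```python
def upgrade_path(current_minor: int, target_minor: int) -> list[str]:
    """List each minor version to step through, one minor release at a time."""
    if target_minor < current_minor:
        raise ValueError("downgrades are not covered by this sketch")
    return [f"1.{m}" for m in range(current_minor + 1, target_minor + 1)]


# Going from 1.26 to 1.28 needs an intermediate stop at 1.27.
print(upgrade_path(26, 28))
```

Each entry in the returned list is a full upgrade in its own right, with its own testing and verification, which is why falling several versions behind gets expensive quickly.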
Defining your upgrade strategy
With a solid understanding of the Kubernetes release landscape, it is now time to see how that calendar integrates into your business processes. Obviously, each organization will have different priorities, risk tolerance, and resources available to tackle the upgrade task. It is essential to find a balance between keeping business moving and staying on top of maintenance.
Oftentimes companies adopt a policy of “if it ain’t broke, don’t fix it”. Kubernetes and other essential software upgrades are pushed to the back burner and quickly become an afterthought.
This line of thinking is extremely dangerous in production environments. Not only is support not available should you encounter issues, but you may be opening yourself up to vulnerabilities that could compromise your applications and data.
Lastly, managed Kubernetes providers will enforce upgrades on unsupported versions, which may result in instability within your environment. This is particularly evident when using auto-scaling to spin up new nodes.
There are three main strategies, each with its own pros and cons:
Latest and greatest
These organizations often want to take advantage of the latest features and improvements so typically plan to perform upgrades shortly after release.
In version 1.28 for example, organizations may want to take advantage of the new sidecar support and improved crashed node support.
Teams that adopt this strategy are usually testing release candidate versions in order to be ready to go on release day.
Generally these are newer, smaller organizations that are nimble and quick enough to support this way of working. However, companies with a solid investment in Kubernetes and a well-managed crew can also operate in this manner.
This strategy has the benefit of never having to worry about running out of support and always being able to take advantage of the latest innovations. However, it requires a very regimented release process and commitment from all stakeholders.
The second strategy is more cautious. Many larger organizations, or those in regulated industries such as healthcare, tend not to want to live on the leading (or bleeding) edge.
They take a deliberate stance of waiting until large releases are rolled out in a variety of different environments and any issues are well sorted by the time they are ready to perform their own upgrades.
This strategy means they miss out on the very latest features, but it also leads to a nice stable environment that still ensures plenty of remaining time for support should the need arise.
The third strategy is to put upgrades off for as long as possible. Many organizations prioritize stability over getting the latest features, and they may see a new release as a distraction from other work, and a risk of breaking something that is running fine.
These organizations may choose to upgrade at the last minute, staying one step ahead of the end-of-support deadline.
This strategy can work but without proper controls you can quickly find yourself encountering problems with no easy path to get back to a supported environment.
A twist on this strategy is to upgrade to a brand new release, then wait until that release is nearly end-of-support (roughly a year), then jump all the way up to the current latest version. This upgrade process is usually more in-depth, complicated, and prone to errors. Attempts to leapfrog versions are often met with failure, forcing many interim steps to complete the upgrade successfully. Given these potential pitfalls, it is hard to justify the time saved compared with regular upgrades.
Best practices for your next cluster upgrade in Kubernetes
Know your environment
Creating a Kubernetes upgrade plan can only be done once you understand the full scope of your entire environment. You must understand not only the infrastructure, but also the applications running on your clusters.
You must have a clear grasp of your environment stages and how upgrades will flow through them. This is particularly important for hybrid- and multi-cloud deployments. Staging environments should match production in order to ensure no surprises.
Research the release in depth
Make sure to fully understand the changes in the new version of Kubernetes. This includes reading the release notes and understanding any new features, deprecations, or changes to the API. You should also check for any known issues with the intended version.
Social media and Reddit are great sources of blogs and comments from those on the front lines of implementing a new release, who may have encountered ‘edge case’ bugs and conflicts.
Plan ahead for key dates
Having a defined calendar for releases, based upon the published Kubernetes release schedule or that of your chosen managed provider, will help ensure resources and processes are available to support your goals.
Consider maintenance windows and provide ample notification to your customers and end-users. Weekend nights are the typical window but may not be appropriate if you are running overnight or long-running jobs. You may want to perform upgrades only to a subset of clusters based on geography, application, or user base.
Test as you go
It is of course best to test Kubernetes upgrades in “Dev” and “Test” environments before following a formal process to roll them into “Staging” and eventually into “Production”. You should make sure to perform integration testing with your applications at each stage.
It is highly recommended to keep infrastructure upgrades as a separate process from application or development release cycles. This ensures proper attention is given to the underlying infrastructure without introducing additional changes to an already complicated release.
Follow best practices for configuration and availability
As upgrades are typically done in a rolling fashion with nodes being drained completely before upgrading, your application will need to be able to survive this process. This includes all front-end and back-end components of your application.
Ensure that your application is fully resilient. Understand any stateful vs stateless requirements. Make use of liveness and readiness checks. If high availability is required, make sure your clusters contain multiple control plane and worker nodes.
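As a rough illustration of the difference between liveness and readiness checks, the sketch below shows an application exposing two HTTP endpoints. The `/healthz` and `/readyz` paths, the `READY` flag, and the handler names are assumptions made for this example; your application and probe configuration may use different paths and logic.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical readiness flag: the app would flip this to True once its
# dependencies (database, caches, and so on) are reachable.
READY = {"value": False}


def probe_status(path: str, ready: bool) -> int:
    """Return the HTTP status code for a probe request."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer at all.
        return 200
    if path == "/readyz":
        # Readiness: only accept traffic once the app says it is ready.
        return 200 if ready else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    """Minimal handler that answers probe requests and nothing else."""

    def do_GET(self):
        self.send_response(probe_status(self.path, READY["value"]))
        self.end_headers()
```

To try it locally, `ProbeHandler` can be served with Python's standard `http.server.HTTPServer`. During a rolling upgrade, a pod that reports "not ready" is kept out of service endpoints until it recovers, which is what lets traffic shift away from nodes being drained.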
Perform regular backups and have a tested restore plan in place in case an upgrade needs to be rolled back. While it is often easier to proceed forward, there are rare occasions where you would have to cancel and revert an upgrade in order to preserve the integrity of your systems.
Consider ‘stand up and switch’
An alternative to upgrading clusters in place is to stand up brand new clusters at the new version and migrate workloads over to them. While this can help eliminate a lot of the potential risk of upgrading in place, it does introduce new challenges of application stack migration.
Make sure to consider ingress, load balancing, networking, storage, database, observability, and other required changes to support this.
Check that all is well
Following any update, of course you’ll need to ensure that your systems are running correctly. Verify all performance and connectivity metrics. Perform user acceptance testing and continue to monitor the entire system for a period of time.
This will help you to identify any problems that may have been introduced by the upgrade, even if they’re small or intermittent. This might be a good time to keep your best engineers or outside experts on call!
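One simple post-upgrade check is to confirm that every node reports the target version and is ready. The sketch below works on a plain list of dictionaries; the field names (`name`, `version`, `ready`) and the sample data are invented for illustration, and you would populate them from whatever inventory or monitoring source you use.

```python
def nodes_needing_attention(nodes: list[dict], target_version: str) -> list[str]:
    """Return names of nodes that are on the wrong version or not ready."""
    return [
        node["name"]
        for node in nodes
        if node["version"] != target_version or not node["ready"]
    ]


# Hypothetical inventory after an upgrade to v1.28.1.
inventory = [
    {"name": "worker-1", "version": "v1.28.1", "ready": True},
    {"name": "worker-2", "version": "v1.27.5", "ready": True},   # not upgraded
    {"name": "worker-3", "version": "v1.28.1", "ready": False},  # not ready
]

print(nodes_needing_attention(inventory, "v1.28.1"))
```

A check like this only covers the basics; it complements, rather than replaces, the performance, connectivity, and user acceptance testing described above.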
How Spectro Cloud Palette can help
Palette Cluster Overview
Spectro Cloud Palette provides a listing of all of the clusters within your environment. Drilling into the details of each cluster quickly shows the current Kubernetes version and the status and version of each node that comprises the cluster.
Use Cluster Profiles to control versions
Spectro Cloud Palette uses “Cluster Profiles” as the declarative definition of your clusters. The entire stack can be defined from the underlying OS and Kubernetes versions to the infrastructure components and applications layered on top.
Cluster Profiles help ensure that all clusters are created with the same components at the same tested and approved versions. During upgrades, they can also be used to control when and how clusters are upgraded.
There are a variety of ways to utilize Cluster Profiles, such as defining a profile for each environment stage running different versions. Cluster Profiles themselves are versioned, so that any changes are tracked and can be deployed as necessary. This also allows for easy rollback should that be needed.
Use native SBOM scans to audit software elements
A Software Bill of Materials, or SBOM, is a comprehensive list of all of the software components that are running on your clusters. SBOMs can be used for tracking security and compliance but are equally useful for planning upgrades.
Understanding all of the components and their version support and interoperability is key to early upgrade planning. It can be difficult to gather definitive support matrixes, so again it is critical to test complete upgrades in Dev and Test environments.
Spectro Cloud Palette provides built-in tools to perform Kubernetes SBOM scans and output the results in a variety of formats that can be digested however you need. It is helpful to run these scans in each environment before and after an upgrade to ensure that everything is completed as planned.
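Comparing a "before" and "after" scan can be as simple as diffing two component lists. The sketch below assumes each scan has already been reduced to a mapping of component name to version; that shape, the function name `diff_components`, and the sample data are all assumptions for illustration and do not reflect Palette's actual output formats.

```python
def diff_components(before: dict[str, str], after: dict[str, str]) -> dict[str, list]:
    """Summarize what was added, removed, or changed between two scans."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(
        (name, before[name], after[name])
        for name in set(before) & set(after)
        if before[name] != after[name]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical scan results before and after an upgrade.
before = {"kubelet": "1.27.5", "coredns": "1.10.1", "flannel": "0.22.0"}
after = {"kubelet": "1.28.1", "coredns": "1.10.1", "cilium": "1.14.1"}

for kind, items in diff_components(before, after).items():
    print(kind, items)
```

Anything that shows up under "added" or "removed" unexpectedly, or a component whose version did not change when it should have, is a cue to investigate before promoting the upgrade to the next environment.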
Upgrading clusters with Palette
Because Spectro Cloud Palette uses Cluster API (CAPI) and Cluster Profiles for managing clusters, the actual process of upgrading clusters that were created using Palette becomes much simpler than with self-managed or DIY clusters.
Since each cloud provider and Kubernetes distribution offers different tools to handle its own upgrades, working with multiple providers requires a level of expertise that not every person in your team may possess. By contrast, Palette provides a single consistent interface that works across all cloud providers, enabling push-button efficiency. Since Palette will automatically reconcile any manual changes, it becomes essential to drive changes from within Palette. This helps ensure that all clusters maintain the desired configuration as defined by your organization.
There are several steps involved in updating the Cluster Profile definition:
- Creating a new version following a numbering convention that works for your environment.
- Making the necessary changes to the pack version for the corresponding components.
- Updating any required configuration changes in the YAML definition.
- Saving the updated profile in order to lock in your definition changes.
When a Cluster Profile version is updated, Palette’s UI will indicate a new version is available for each cluster that has that profile applied. When the upgrade is approved, Palette’s reconciliation engine will communicate with the cluster via CAPI and ensure the running state of the cluster matches the new desired state.
A rolling upgrade then begins, and it can be monitored from within the console. Nodes can be seen coming and going with their version and age clearly visible. Any errors can be viewed and acted upon as needed.
Once the upgrade has completed and the cluster has returned to a Healthy state, verify that your application is accessible and fully functioning. Continue to monitor the entire system until you are satisfied that everything is stable.
Go forth and upgrade!
Hopefully the advice in this blog has taken some of the fear out of running a Kubernetes upgrade, and just as importantly, helped you decide on what your strategy for upgrades will be going forward.
Of course, we believe that using Palette is the best way to make large-scale, multi-cluster upgrades easier — and we encourage you to get in touch if you want to see a personal demo or get access to try it yourself.
If you want to learn more about deploying and upgrading Kubernetes clusters, come and watch our webinar on September 14th to get a live insight and ask your questions! All the details are here.