Published
June 23, 2025

Upgrading your Amazon EKS clusters without the pain — a practical guide

Alex Shiroma
Senior Solutions Architect
Jesse Benoit
Sr. Solutions Architect

Running Kubernetes on Amazon Elastic Kubernetes Service (EKS) is a little like driving a modern electric car. The platform handles most of the engineering complexity for you, so you can focus on the journey. 

EKS sounds like a dream: easy cluster setup, no control plane to maintain, and a certified-conformant Kubernetes API, so portability is assured.

But no matter how easy the vehicle is to drive, you still need to keep it serviced. And that’s true in the cloud‑native world, too. The service is the regular Kubernetes upgrade — and it comes around faster than you might think.

Kubernetes ships a new minor release every four months. Amazon supports each version for only 14 months of standard support and, for those who need extra time, 12 months of paid extended support. If you drift further behind, AWS steps in and upgrades the control plane automatically, on its own schedule, and begins charging an extended‑support fee that can run into thousands of dollars a year per cluster.

But it’s not just about the control plane. With every upgrade, you also need to touch the entire Kubernetes stack, including the operating system, Amazon EKS add-ons, and all applications. You need to monitor the rollout, update IAM roles, back up data, and keep pod security current — and do so consistently across potentially dozens or hundreds of clusters.

Although we’ve blogged about upgrading K8s before, this guide distils everything we have learned helping organisations large and small to upgrade their EKS clusters specifically. We will explain why staying current matters, unpack the EKS lifecycle, explore common upgrade pitfalls and show you a proven, repeatable workflow. Finally, we’ll look at how Spectro Cloud Palette turns an annual chore into a low‑touch routine.

Why bother? Four reasons upgrades pay off

It is tempting to live by the maxim “if it ain’t broke, don’t fix it,” but in Kubernetes that approach quickly becomes risky and expensive. Upgrading on a predictable cadence delivers four concrete benefits:

Security. Every minor release addresses newly discovered CVEs in the Kubernetes codebase and its dependencies. Falling two or three versions behind can leave you exposed to vulnerabilities that the wider community has already patched and moved on from.

Features and performance. Recent releases have introduced sidecar containers, refined the scheduler and stabilised the Gateway API. Upgrades are your ticket to these capabilities without having to rebuild your platform from scratch.

Operational consistency. When development, staging and production clusters run different versions they behave differently. Teams lose time tracking down “works‑on‑my‑machine” mysteries that stem from API drift rather than code defects.

Cost control. At the 14‑month mark AWS stops shipping patches for free. From month 15 the meter starts, and from month 26 the platform pushes the upgrade for you. Staying ahead of this curve avoids both the service charge and the scramble.

In short, upgrades are not optional housekeeping; they are the safest, fastest and most economical way to keep your platform in shape.

Decoding the EKS and Kubernetes lifecycle

Although Amazon follows upstream Kubernetes closely, it overlays its own schedule. A version is

  • in standard support for 14 months after GA;
  • in extended support for a further 12 months, billed per cluster‑hour; and
  • eligible for forced upgrade after 26 months.

Because upgrades must proceed one minor version at a time, you need at least two upgrade windows inside every 26‑month period just to stay inside support. Many teams aim for a semi‑annual rhythm: plan in January, execute in February; plan in July, execute in August. Aligning that rhythm with release dates ensures you always land on a well‑tested patch release rather than day‑zero GA code.

AWS publishes the definitive timetable in the EKS documentation and console. Checking it quarterly helps you spot if a long‑running project or holiday freeze is about to collide with end of support.
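
If you prefer the CLI, recent versions of the AWS CLI can list each Kubernetes version’s support window directly. A quick sketch, assuming your CLI is new enough to include the describe-cluster-versions command:

        # list available EKS Kubernetes versions and their support dates
        aws eks describe-cluster-versions --output table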

Whose job is it anyway? The shared‑responsibility model

EKS is a managed service, but “managed” does not mean “hands‑off”. The control plane is Amazon’s responsibility; everything else is yours.

[Diagram: the Amazon EKS shared‑responsibility model]

  • Control plane (owned by AWS). You initiate the version change with the API, the CLI or Terraform; AWS replaces the API servers and controller managers behind the scenes.
  • Data plane (owned by you). All worker nodes — whether self‑managed EC2 instances, managed node groups or Fargate — must be recreated or re‑imaged at the new version. Upgrades can be initiated via the AWS CLI or tools like eksctl, CloudFormation, Spectro Cloud Palette, or Terraform.
  • Add‑ons (owned by you). CoreDNS, kube‑proxy, the VPC CNI, CSI drivers, Ingress controllers and any custom operators must be upgraded or redeployed.
  • Workloads (owned by you). Application manifests that rely on deprecated APIs need updating; some may require code or configuration changes.

Kubernetes allows only a limited skew between the control plane and the kubelet: two minor versions in releases prior to v1.28, three minor versions thereafter. If you upgrade the control plane and forget the nodes, sooner or later the kubelets’ API calls will be refused. That is the most common upgrade outage we see in the field.

The hidden snags that trip teams up

Even with a clear responsibility split, upgrades can falter. Five issues appear again and again:

API removals. As Kubernetes matures, beta APIs graduate and old paths disappear. A manifest that still declares extensions/v1beta1 Ingress, or a Helm chart that assumes PodSecurityPolicies exist, will break. Tools such as Pluto, kubent or the upstream kubectl convert plug‑in can scan a repository in minutes and list the objects to fix.
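
For instance, a quick deprecation scan might look like this (the manifest directory and target version are illustrative):

        # scan rendered manifests for removed or deprecated APIs
        pluto detect-files -d ./manifests
        # check live cluster objects against the target release
        kubent --target-version 1.30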

Add‑on drift. AWS enforces an even tighter version window for some system components. The VPC CNI plug‑in, for example, must be updated before the control‑plane jump if its existing version does not understand the target Kubernetes API. Overlooking this order of operations can sever pod networking cluster‑wide.

Multiple stakeholders. Platform engineering, SRE, security, networking and dozens of application teams each own a slice of your cluster. Unless someone centralises the timeline and the communication plan, upgrades stall waiting for sign‑offs.

Operational fatigue. A production cluster rarely has a maintenance window longer than an hour or two. Staff turnover, holidays, PTO and incident firefighting all eat into the time available for careful rehearsals, leading teams to push upgrades past the safe window.

Node group misunderstandings. Self-managed node groups aren’t automatically upgraded when the control plane version changes, whether you initiated the change or Amazon EKS did it on your behalf. Nothing in the console indicates that a self-managed node group needs updating; to find out, check the kubelet version of each node in the Nodes list on the Overview tab of your cluster, then update the nodes manually. And watch out: managed node groups create Amazon EC2 instances in your account, and those instances aren’t automatically upgraded when the control plane moves either.

Planning an upgrade that never surprises you

Successful upgrades start well before the first API call. The following preparation sequence has proven itself useful across hundreds of clusters:

  1. Make a comprehensive checklist. Document every pre‑flight and post‑flight action: backups, deprecation scans, IAM policy verifications, add‑on compatibility checks, smoke tests and rollback steps.
  2. Snapshot the cluster state. Whether you use Velero, Kasten or Palette’s built‑in backup engine, capture the cluster’s API objects and any persistent‑volume data to S3 so that a failed upgrade becomes an inconvenience, not a disaster. Note that EKS doesn’t include a backup feature itself! (A Velero sketch follows this list.)
  3. Read the changelogs line by line. Both the Kubernetes release notes and the EKS “differences” page flag behavioural changes. Spotting them early avoids the 3 a.m. Slack crisis.
  4. Upgrade add‑ons first in staging. Rebuilding CoreDNS or the CNI in a test environment verifies image references, IAM roles and service‑account mappings before production users are watching.
  5. Turn on control‑plane audit logs. Streaming API events to CloudWatch during the rehearsal surfaces unexpected patterns, such as a noisy controller that suddenly starts requesting deprecated endpoints. (An example CLI call follows this list.)
  6. Dry‑run everything in staging. A cluster whose node count, region and network policies mirror prod will reveal 95% of issues.
  7. Publish the timeline. Leave no room for doubt: tell application owners the exact freeze window and the fallback plan, and have them sign off.
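
For step 2, a minimal Velero sketch, assuming Velero is already installed with an S3‑backed storage location (the backup name is illustrative):

        velero backup create pre-upgrade --snapshot-volumes

And for step 5, control‑plane audit logging can be enabled with the AWS CLI:

        aws eks update-cluster-config --name <cluster-name> \
          --logging '{"clusterLogging":[{"types":["api","audit"],"enabled":true}]}'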

With these steps complete you can pick the strategy that suits your risk appetite.

In‑place versus blue/green: choosing your path

Most organisations follow an in‑place upgrade: bump the control plane, patch the add‑ons, roll the node groups, then let the workloads drain and reschedule. The API endpoint remains unchanged, so scripts and integrations keep working, and you pay for only one cluster’s worth of infrastructure. But note that only one minor version upgrade can be executed at a time (e.g., from 1.29 to 1.30); if multiple versions separate the current cluster version from the target, you must upgrade sequentially until you reach the target version.

Some, though, prefer a blue/green approach: spin up a brand‑new cluster at the target version, migrate workloads gradually, then decommission the old cluster. Blue/green is attractive when you need to leap more than one minor version, when you need an instant rollback path, or when you want to migrate workloads individually. The trade‑off is higher cost during the overlap and extra work to wire cross‑cluster network policies, service discovery and secrets replication.

Neither path is inherently better. Your compliance obligations, your cost model and, most importantly, your organization’s tolerance for change will decide.

A proven sequence for an in‑place upgrade

Below is the workflow we recommend and automate with Palette. It assumes you are stepping from version N‑1 to N.

1. Confirm version skew

Run kubectl version and compare the server version with kubectl get nodes -o wide. If you are already at the maximum supported skew, upgrade the nodes first (rehearsed in staging) to regain breathing room before touching the control plane.
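
For example:

        kubectl version
        kubectl get nodes -o wide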

2. Upgrade the control plane

Use the AWS Management Console, AWS CLI, or eksctl to upgrade the control plane. For example (note that eksctl performs a dry run unless you pass --approve):

eksctl upgrade cluster --name <cluster-name> --version <target-version> --approve

The operation is zero‑downtime for the API endpoint but briefly disrupts control‑plane webhook calls. Controllers with aggressive timeouts may need retries.

You can confirm that the EKS control plane has been updated to the target Kubernetes version by using the AWS CLI to check the cluster version:

        aws eks describe-cluster --name <cluster-name> --query cluster.version

Alternatively, check the version in the AWS Management Console.

3. Bring managed add‑ons up to date

The EKS console shows a yellow warning icon for any add‑on that is behind. Upgrade each one in reverse order of dependency — for example, start with the VPC CNI, then CoreDNS, then kube‑proxy — so that later components can discover the services they need.

It’s as simple as: 

aws eks update-addon --cluster-name <cluster-name> --addon-name <addon-name> --addon-version <version>
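
Before you run the update, you can check which add‑on versions are compatible with the target Kubernetes version (the add‑on name and version here are illustrative):

        aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.30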

4. Refresh your local tools

Download the matching kubectl, eksctl or AWS SDK version. Using a newer client against an older cluster is fine; the reverse often fails because the client tries to reach APIs that no longer exist.
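
A quick way to confirm what you have installed locally:

        kubectl version --client
        eksctl version
        aws --version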

5. Rotate the node groups

For a managed node group you can ask AWS to create a new launch template and roll the instances:

eksctl upgrade nodegroup --cluster <cluster-name> --name <nodegroup-name> --kubernetes-version <target-version>

A self‑managed group needs a new AMI. Build it with eks‑optimized‑ami or your Packer pipeline, then cordon and drain each node, replace the EC2 instance and uncordon.
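
The cordon‑and‑drain cycle uses standard kubectl commands (node names are placeholders):

        kubectl cordon <node-name>
        kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
        # replace or re-image the EC2 instance, then:
        kubectl uncordon <node-name>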

You can then ensure all node groups (managed and self-managed) are running the same version as the control plane.

List nodes and check their versions:

        kubectl get nodes

Verify that the kubelet version matches the upgraded Kubernetes version.

6. Validate workloads under load

Ensure that all deployed workloads are running and performing as expected:

Check the status of Pods across all namespaces:

        kubectl get pods --all-namespaces

Look for issues such as CrashLoopBackOff, ImagePullBackOff, or Pending.
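
Sorting by restart count makes regressions jump out; this JSONPath sorts by the first container in each pod:

        kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount'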

Smoke tests alone are not enough. Run your synthetic transactions or load‑testing tools like Apache JMeter, Locust, or k6 to validate application performance, while monitoring pod restarts, error rates and latency in Prometheus or CloudWatch. Catching a mis‑timed readiness probe now saves a customer‑visible outage later.

7. Prune the leftovers

Old launch templates, unused AMIs and Helm releases at superseded versions clutter your automation and may be picked up by compliance scanners. Deleting them while the context is fresh closes the loop.
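
A few illustrative cleanup calls, once you have identified the superseded resources (the IDs are placeholders):

        aws ec2 delete-launch-template --launch-template-id <template-id>
        aws ec2 deregister-image --image-id <ami-id>
        helm uninstall <release-name> --namespace <namespace>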

8. Capture lessons learned

Add the time it took, the quirks you hit and the dashboards that proved most useful back into the runbook. Next quarter’s upgrade will be easier.

How Spectro Cloud Palette makes all this boring (in a good way)

At Spectro Cloud, over half of our employees are technical, and many are dedicated to ensuring that every component of a Kubernetes cluster is prevalidated before a deployment or upgrade begins. Palette deploys to EKS or any other cloud provider, checks and validates that all components are interoperable, and provides clear guidance, with specific instructions and details, when components are not supported together.

In essence, Palette acts as a guardrail to ensure that the cluster owner and administrators know all cluster components will work together before an upgrade and to provide guidance and insight when misconfigurations are identified. 

Palette treats a cluster as code. You define a Cluster Profile that pins the Kubernetes version, the OS image, every add‑on and even your own applications. When a new version lands you edit the profile and click Save. Palette then:

  1. Validates dependencies. It refuses to apply a profile that pairs an incompatible CoreDNS image with the target Kubernetes API.
  2. Surfaces drift. The dashboard shows at a glance which clusters deviate from their profile and how far behind they are.
  3. Orchestrates the upgrade. Palette upgrades the control plane and node pools in the right order, with health checks, back‑off logic and automatic retry.

The result is a dramatic drop in toil. When GE HealthCare needed to patch more than 100 clusters across nine business units, Palette cut the calendar time from weeks to days because operators no longer had to stitch together CLI scripts.

Bringing it all together

Upgrading an EKS cluster is unavoidable, but it should never feel like a gamble. By understanding the Amazon support lifecycle, preparing thoroughly, following a disciplined sequence and using a declarative platform such as Palette, you can turn what used to be a fraught midnight exercise into a predictable, daytime routine.

Cloud‑native teams that keep pace reap the rewards: tighter security, faster‑running applications, lower AWS bills and happier auditors. The upgrade is not the goal; it is the runway to everything you want to build next.

Next steps

Ready to see how Palette can simplify your next upgrade? Get in touch with your AWS account team or your Spectro Cloud representative today to arrange a personalized demo.

Tags:
Cloud
Operations
Cluster Profiles