Kubernetes: multi-tenant vs single-tenant clusters tradeoffs

Deep-dive into how isolation, availability, and complexity requirements may influence your decision

As Kubernetes matures into critical business infrastructure, IT teams are increasingly centralizing control, providing governance, and applying constraints to Kubernetes operations. A question we are asked repeatedly is: should IT organizations run larger, multi-tenant clusters, or would they be better served by running smaller clusters for each tenant? If you are a repeat reader of our blog, then you know that the answer is “IT DEPENDS.” In this post, we’ll evaluate how Kubernetes isolation, application availability, and operational complexity may influence your decision one way or another.

What is Multi-tenancy?

Multi-tenancy is a deployment design in which different applications, teams, or even departments run and operate in a shared environment; hence a “multi-tenant” system. A cluster “tenant” doesn’t necessarily map to a typical organizational unit such as a team or department; it depends on how your organization defines it. For example, an extremely large multi-tenant cluster could host both dev and production workloads across multiple departments, where even components of a much larger application are separated into different tenants, completely isolated and unaware of each other’s presence.

Proper isolation and application availability are key for multi-tenant systems to be feasible; would AWS have been as successful if there were even a sliver of doubt that other “tenants” (aka potentially malicious actors) could extract all your VM data? Even if you’re not concerned about malicious scripts stealing all your app data, multi-tenant systems are inherently less isolated; what if your sibling tenant’s PrintSecret benchmarking script accidentally pulls all the secrets out of your namespace and publishes sensitive user data to the company’s bulletin board (people still use those, right?)? Similarly, performance consistency is important since you don’t want another team’s Jenkins CI job impacting your application.

In Kubernetes land, with proper configuration from the Infrastructure and Operations team (also known as Cluster Operators), it’s possible to address most of these concerns through virtual isolation, but Kubernetes’ soft-tenancy model has known limitations. Please read on if you’re interested in understanding these limitations and when it might be better to give each tenant its own cluster and live with the added operational overhead of maintaining separate clusters.

Isolation

The highest level of isolation is achieved by system isolation, where each cluster is completely independent of other teams, applications, and even location boundaries. However, as you might expect, running and operating such a large number of clusters can add significantly to operational overhead. On the flip side, having no isolation, or only partial isolation, is easier to manage but comes at the expense of isolation granularity and control.

Within a single cluster, Kubernetes provides logical isolation through different namespaces. The platform provides quite a few namespace policies to help with governance and control:

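RBAC: isolation and permissions for users and service accounts,
LimitRange: default requests and limits for CPU and memory,
ResourceQuota: aggregate memory and CPU quotas for a namespace,
NetworkPolicy: network communication policies,
PodSecurityPolicy: fine-grained authorization of pod creation and updates (deprecated in Kubernetes v1.21 and removed in v1.25 in favor of Pod Security Admission).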
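
To make a few of these policies concrete, here is a minimal sketch that builds a Namespace, ResourceQuota, LimitRange, and NetworkPolicy for one tenant as plain Python dictionaries and prints them as JSON. The function name `tenant_guardrails`, the object names, and every CPU and memory value are assumptions chosen for illustration, not recommendations. Keep in mind that a NetworkPolicy only takes effect when the cluster's network plugin enforces it.

```python
import json


def tenant_guardrails(tenant, cpu="4", memory="8Gi"):
    """Build namespace-scoped guardrail objects for one tenant as plain dicts."""

    def meta(name):
        return {"name": name, "namespace": tenant}

    return [
        {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": tenant}},
        {
            # Caps the total CPU and memory that pods in the namespace may request.
            "apiVersion": "v1",
            "kind": "ResourceQuota",
            "metadata": meta("tenant-quota"),
            "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory}},
        },
        {
            # Fills in requests and limits for containers that declare none.
            "apiVersion": "v1",
            "kind": "LimitRange",
            "metadata": meta("tenant-defaults"),
            "spec": {"limits": [{
                "type": "Container",
                "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                "default": {"cpu": "500m", "memory": "512Mi"},
            }]},
        },
        {
            # Admits ingress traffic only from pods in the same namespace.
            "apiVersion": "networking.k8s.io/v1",
            "kind": "NetworkPolicy",
            "metadata": meta("same-namespace-only"),
            "spec": {
                "podSelector": {},
                "policyTypes": ["Ingress"],
                "ingress": [{"from": [{"podSelector": {}}]}],
            },
        },
    ]


# kubectl accepts JSON, so a v1 List of these objects can be piped to `kubectl apply -f -`.
manifest = json.dumps(
    {"apiVersion": "v1", "kind": "List", "items": tenant_guardrails("team-a")},
    indent=2,
)
print(manifest)
```

Generating the objects from one function keeps every tenant namespace on the same baseline, which is most of the point of namespace-level governance.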

All of these policies help protect and isolate namespaces from each other, but at best they provide soft-tenancy isolation. Not only do all tenant workloads run on shared infrastructure (compute nodes, networking, and storage), but they also share critical Kubernetes system components such as the api-server, kube-proxy, kubelet, and others. At the time of writing, all known critical vulnerabilities have been resolved, but it’s likely that more critical escape vulnerabilities will be discovered (or introduced) over time. If complete isolation is important to your applications and clusters, then you should probably be deploying more, and smaller, single-tenant clusters.

Certain Kubernetes resources, such as CRDs and Operators, are by design global to the cluster. You could add RBAC roles and permissions so that only users in a certain namespace can access such CRDs and Operators, but that is extra work every time an admin introduces new CRDs and Operators into the cluster. Also, because of their global nature, there is potential for name and version conflicts. For CRDs and Operators designed in-house, this can be addressed by applying unique prefixes to the names, but with 3rd-party CRDs or Operators it can be challenging when users from two different namespaces want to use two different versions of a global resource of the same name/type.
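
The version-conflict problem can be checked before onboarding a tenant onto a shared cluster. The sketch below assumes you already keep an inventory of which global resources each tenant needs and at which version; the function `find_conflicts`, the tenant names, the operator names, and the version strings are all hypothetical.

```python
from collections import defaultdict


def find_conflicts(wanted):
    """Return global resources for which tenants ask for more than one version.

    `wanted` maps tenant -> {resource_name: version}.
    """
    tenants_by_version = defaultdict(lambda: defaultdict(list))
    for tenant, resources in wanted.items():
        for resource, version in resources.items():
            tenants_by_version[resource][version].append(tenant)

    return {
        resource: {version: sorted(tenants) for version, tenants in versions.items()}
        for resource, versions in tenants_by_version.items()
        if len(versions) > 1  # tenants disagree on the version of a shared resource
    }


# Hypothetical inventory: both tenants share one cluster-wide install of each operator.
wanted = {
    "team-a": {"widget-operator": "1.4", "queue-operator": "2.0"},
    "team-b": {"widget-operator": "2.1", "queue-operator": "2.0"},
}
conflicts = find_conflicts(wanted)
print(conflicts)
```

Any resource that shows up in the result is a candidate reason to place those tenants on separate clusters, or to negotiate a common version first.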

Availability

As a container orchestration platform, Kubernetes does a fantastic job of orchestrating and maintaining your workloads — handling resiliency, scaling, and other application lifecycle management concerns with ease. However, if the underlying platform fails — either due to some misconfiguration, a bad upgrade, or a critical bug — then all workloads on the cluster will grind to a halt. Oftentimes, the downtime may not even be something of your own doing, but a misconfiguration or mistake on the part of your cloud provider.

Degraded application experience and downtime hurt the end-user experience, so what actions should you take to mitigate the risk of cluster downtime? For one: don’t run your application on only one cluster! Properly designed cloud-native applications scale outwards, and should scale across cluster boundaries, ideally into different regions and maybe even across clouds.

In addition to a more fault-tolerant, highly-available application, workloads deployed across different regions will also provide higher performance, lower latencies and a more pleasant user-experience for your global end-users.
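
To put rough numbers on the fault-tolerance argument, here is a back-of-the-envelope sketch. It assumes that cluster failures are independent and that the application can serve all traffic from any one surviving cluster; both are simplifications, since a shared cloud provider or a bad rollout can take down several clusters at once. The 99% per-cluster availability is an illustrative figure, not a measurement.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(per_cluster, clusters):
    """Probability that at least one of `clusters` independent clusters is up."""
    return 1 - (1 - per_cluster) ** clusters


def downtime_minutes_per_year(availability):
    """Expected minutes per year during which no cluster is available."""
    return (1 - availability) * MINUTES_PER_YEAR


for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    print(f"{n} cluster(s): availability {a:.6f}, "
          f"about {downtime_minutes_per_year(a):.1f} minutes of downtime per year")
```

Under these assumptions each additional cluster multiplies the expected downtime by the per-cluster failure probability, which is why even a second cluster makes such a large difference.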

Related to application availability is separating Kubernetes’ upgrade lifecycle from your application’s lifecycle. Cluster operators of large, multi-tenant clusters may find it more challenging to schedule and coordinate cluster and infrastructure upgrades, as cluster tenants may impose unique requirements to not impact their application’s SLAs/SLOs.

Operational Complexity

So far, the advantages of deploying multiple clusters, compared to larger multi-tenant clusters, look promising and encouraging! The not-so-hidden cost of provisioning and operating multiple clusters is, of course, the management and operational burden of maintaining them…

Even a single Kubernetes cluster can be complex to operate. These complexities can be categorized as:

Lifecycle management: complex and heavyweight lifecycle management (provision, upgrade, backup, restore),
Integrations: involved integrations for basic components such as storage, networking, security, and others,
Secure access to the clusters: user authentication and authorization, network policies, ingress,
Logging & observability: centralized logging, metrics, tracing for apps and infrastructure.

The complexity and overhead multiply as you maintain more clusters. In addition to the concerns above, how do you bring consistency to the upgrades, security policies, and integrations you’re applying to similar-purpose clusters? Many organizations want to control the operating system, e.g., a corporate security-hardened image of RHEL for the worker nodes. Others may require certain security agents such as auditd to always run. Provisioning support for these components is already difficult without extensive custom automation, and supporting their complex upgrade and maintenance lifecycle adds another level of complexity to the equation.
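
One small piece of that custom automation is detecting drift between clusters that are supposed to look alike. The sketch below compares each cluster's reported settings against a desired baseline. How the settings are collected is out of scope here, and the function `find_drift`, the setting keys, the cluster names, and the version strings are all hypothetical.

```python
def find_drift(baseline, clusters):
    """Return, per cluster, the settings that differ from the baseline."""
    drift = {}
    for name, actual in clusters.items():
        diffs = {
            key: {"expected": expected, "actual": actual.get(key)}
            for key, expected in baseline.items()
            if actual.get(key) != expected  # a missing key counts as drift
        }
        if diffs:
            drift[name] = diffs
    return drift


# Hypothetical baseline for all production clusters.
baseline = {"kubernetes": "1.27", "node_image": "rhel-8-hardened", "auditd": True}

# Hypothetical inventory, e.g. gathered by a nightly job.
clusters = {
    "tenant-a-prod": {"kubernetes": "1.27", "node_image": "rhel-8-hardened", "auditd": True},
    "tenant-b-prod": {"kubernetes": "1.26", "node_image": "rhel-8-hardened"},
}

drift = find_drift(baseline, clusters)
for cluster, diffs in drift.items():
    print(cluster, diffs)
```

With two clusters this check is trivial; the effort lies in keeping the inventory accurate and acting on the report as the cluster count grows.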

Operating multiple Kubernetes clusters costs more… literally, there is an increase in overall cloud costs. Clusters rarely fully utilize their allocated CPU and memory; there’s always some spare capacity available to handle the atypical bursty workload. However, across all your clusters, the VM and infrastructure costs attributed to idle and underused CPU/memory resources add up — and would surely warrant a tear or two from your CIO. Larger, multi-tenant clusters are not immune to resource wastage, but may fare better compared to multiple clusters since:

Multi-tenant clusters’ workload capacity can be oversubscribed; apps across tenants rarely share the same traffic patterns and peak times,
Larger clusters make it more likely that “reserved/dedicated” instances are worthwhile, which can mean heavier discounts for sustained usage,
IT teams monitoring a smaller set of shared clusters have more opportunities to optimize cluster auto-scaling policies and configuration.
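
The first point can be illustrated with simple arithmetic: separate clusters must each be sized for their own tenant's peak, while a shared cluster only needs to cover the peak of the combined load. The sketch below uses made-up hourly CPU core demand for three hypothetical tenants and ignores headroom, per-cluster control-plane overhead, and scheduling fragmentation.

```python
def capacity_needed(load_by_tenant):
    """Compare the cores needed by dedicated clusters vs. one shared cluster.

    `load_by_tenant` maps tenant -> list of core demand per time slot
    (all lists must cover the same time slots).
    """
    # Dedicated clusters: each one is sized for its own tenant's peak.
    dedicated = sum(max(series) for series in load_by_tenant.values())
    # Shared cluster: sized for the busiest moment of the combined load.
    shared = max(sum(slot) for slot in zip(*load_by_tenant.values()))
    return dedicated, shared


# Illustrative demand in CPU cores over four time slots.
hourly_cores = {
    "storefront": [8, 2, 2, 2],
    "batch-reports": [2, 8, 2, 2],
    "ci": [1, 1, 6, 1],
}

dedicated, shared = capacity_needed(hourly_cores)
print(f"dedicated clusters: {dedicated} cores, shared cluster: {shared} cores")
```

The gap closes when tenants peak at the same time, so the saving depends entirely on how uncorrelated the workloads really are.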

That said, the added cloud costs of running multiple clusters tend to be minuscule compared to the operational costs of running Kubernetes clusters. And in practice, even with the higher operational costs, most organizations are choosing to deploy single-tenant clusters to reap the isolation and availability benefits.

Look for platforms that manage multiple clusters. Most Gartner customers, according to inquiries, are deploying multiple clusters as opposed to creating a large cluster consisting of multiple namespaces. Namespaces help avoid name clashing and create resource granularity across apps, but they do not provide the required isolation that creating multiple clusters does.

Assessing Kubernetes-Based Hybrid Container Management Platforms; G00377372, Simon Richard [Gartner]

Closing Thoughts

The design choice between multi-tenant and single-tenant clusters will invariably depend on the purpose and requirements of your applications. A single-tenant deployment design makes sense for most production applications (since it enables greater isolation and control); during development, however, the same applications might be deployed to shared multi-tenant dev/test clusters. There, isolation between tenants (aka other engineers) is less of a concern, and short cluster/infra downtimes are acceptable and may even spark an impromptu Netrek match (Federation Captain, at your service)!

Over time, as organizations continue to build and operate more Kubernetes clusters, they’ll look for tools and platforms to help support their multi-cluster deployments. Proper Kubernetes management platforms can help rein in the operational complexities, and provide simple, intuitive operations to manage Kubernetes clusters at scale.

Are there other killer use-cases or verticals which heavily favor multi-tenant clusters or single-tenant clusters? Let us know!

—

Looking for a pocket-card reference? Take a look here:

Multi-Tenant vs Single Tenant Clusters: Breakdown of multi-tenant vs single-tenant clusters (docs.google.com)