An overview of the fundamental decision factors to determine whether a single large Kubernetes cluster or many small Kubernetes clusters will address your workload needs.
Today, Kubernetes is the de facto standard for deploying and managing cloud-native applications. But as more organizations adopt Kubernetes as the base platform layer, infrastructure teams face the operational challenge of architecting their Kubernetes clusters to fit the needs of the teams they serve.
As a Kubernetes administrator, you need to determine:
- The number of Kubernetes clusters to manage
- The size of the nodes running on each of those clusters
- What workloads should run on which cluster
- How to isolate the applications for security and compliance measures
The answer of course depends on the use case and the unique constraints ranging from team size and compliance, to deployment model needs. And as Spectro Cloud research found, there’s no consensus out in the real world today:
[Image: 56-image-2.png]
Much like the decade-old question of how to separate workloads across AWS, Azure, or GCP accounts, deciding how to deploy applications onto Kubernetes presents administrators with a similar set of challenges. Some of the deciding factors for these architectural decisions include:
- Compliance requirements: for example, to physically isolate financial or healthcare data
- Primary purpose of the cluster: an internal developer platform may have different requirements than a cluster for production workloads
- Workload constraints or requirements: certain workloads require specialized hardware (e.g. GPU) that’s only available in certain regions
- Customer requirements: some applications need to also run on-prem instead of SaaS
- Size of the infrastructure or platform team able to support the clusters: can my team support 100s of clusters or do we need a more centralized solution?
- Scale of the workloads running on Kubernetes: is your organization a small startup serving a few customers or a large corporation needing 1000s of nodes?
With these considerations in mind, Kubernetes architects have a few options to address the needs of each use case. Because Kubernetes was originally designed as a single-tenant system, one valid approach is to simply have a cluster per environment or per use case. Another approach is to have a large, shared cluster and separate the workloads by namespaces or by custom resources such as virtual clusters (for example, vCluster). Ultimately, the “correct” architecture will depend on the unique circumstances of your team and use case, so we will walk through each of the options to help you weigh the pros and cons.
[Image: 56-image-3.png]
Option 1: One cluster per application
If multi-tenancy is not a requirement or an option due to customer request or compliance reasons, one valid approach is to separate each application by Kubernetes cluster.
With this approach, each deployment unit, whether an application bundle (e.g. frontend, backend, database) or some other logical grouping of workloads, is deployed to a separate cluster.
Some of the pros of this approach include:
- Strong isolation: Each application is physically separated from other workloads so none of the underlying resources such as CPU, memory, or other Kubernetes components are shared. This may be desirable for applications with sensitive data or strong data isolation requirements.
- User management: Access can be granted per cluster instead of relying on Kubernetes RBAC to separate users within a cluster. The advantage of this approach is that access maps onto your existing user management systems without having to model it in Kubernetes resources. If you have different teams who only need access to a small set of workloads (e.g. an external QA team who only needs a sandbox account), this may be easier than restricting access per namespace.
- Smaller learning curve: If you already have complex automation tied to a cloud account, then it may be easier to reuse those tools or principles rather than having to reinvent them with Kubernetes-native resources again. There is no need to retrain the staff on how to audit or enforce policies for access.
On the other hand, the cons for this approach are:
- Cost: Running Kubernetes comes with some overhead. Most managed Kubernetes providers charge a fee for each cluster's control plane, and even self-hosted clusters need to set aside some resources to run Kubernetes itself.
- Inefficient utilization: Following the same reasoning, it is harder to share compute resources when each application runs on a separate cluster. Even if each application is fitted to an optimal node size, the common utilities and tools needed on every cluster (e.g. monitoring agents, log forwarders, ingress controllers) will eat into the available resources.
- Management burden: Finally, without a mature CI/CD ecosystem, operating multiple clusters means a higher administrative burden of maintaining and upgrading each cluster. This also applies to each of the tools and applications deployed onto each cluster.
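To make the fixed overhead concrete, here is a minimal sketch that multiplies a per-cluster overhead by the number of clusters. The fee and the agent footprint are hypothetical placeholder values, not figures from any provider's price list, and the `ClusterOverhead` and `fleet_overhead` names are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterOverhead:
    """Fixed cost every cluster pays regardless of the workload it hosts."""
    control_plane_fee_per_hour: float  # assumed managed control plane fee (USD)
    agent_cpu_cores: float             # CPU consumed by per-cluster tooling
    agent_memory_gib: float            # memory consumed by per-cluster tooling


def fleet_overhead(per_cluster: ClusterOverhead, clusters: int, hours: float = 730.0) -> dict:
    """Return the total fixed overhead for `clusters` clusters over `hours` hours."""
    if clusters < 0:
        raise ValueError("clusters must be non-negative")
    return {
        "control_plane_usd": per_cluster.control_plane_fee_per_hour * hours * clusters,
        "agent_cpu_cores": per_cluster.agent_cpu_cores * clusters,
        "agent_memory_gib": per_cluster.agent_memory_gib * clusters,
    }


# Hypothetical numbers, for illustration only.
overhead = ClusterOverhead(
    control_plane_fee_per_hour=0.10,
    agent_cpu_cores=1.5,
    agent_memory_gib=3.0,
)

dedicated = fleet_overhead(overhead, clusters=20)  # one cluster per application
shared = fleet_overhead(overhead, clusters=1)      # one shared cluster

print("20 dedicated clusters:", dedicated)
print("1 shared cluster:     ", shared)
```

The overhead grows linearly with the cluster count, which is why the per-cluster tooling footprint matters as much as the control plane fee once you operate many small clusters.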
Option 2: One large shared cluster
At the opposite end of the spectrum, the other approach is to have a large, multi-tenant cluster in which tenants share the underlying compute resources but are isolated via namespaces or custom resources. With this approach, Kubernetes policies such as network policies, Pod Security Standards (the successor to the PodSecurityPolicy API, which was removed in Kubernetes 1.25), or other controller-based policies are used to enforce tenant isolation. Isolation can occur at the environment level (e.g. dev, testing, UAT, stage, prod) or at the application or customer level.
Some of the pros of this approach are:
- Efficient utilization: Not only do all workloads running on the same cluster use the same Kubernetes components (e.g. control plane, etcd, kube-proxy), but they also share other cluster-level resources such as the ingress controller, secrets manager, logging and monitoring agents, and custom resources for your applications or third-party plugins. You can make better use of compute resources by packing the same node with applications that have differing resource needs.
- Cheap: As a corollary to the above, it is generally cheaper to operate a smaller number of clusters. The obvious savings come from not having to pay a control plane management fee for every cluster if you are relying on managed Kubernetes services like GKE, EKS, or AKS. But there are also savings from being able to scale resources up or down with a view of the requests and limits of the entire workload.
- Centralized administration: A single cluster means that cluster-wide operations such as upgrading node images (e.g. AMIs) and Kubernetes versions, as well as upgrading the tooling associated with the cluster, happen only once. Since Kubernetes is a rapidly evolving space, the savings in terms of human capital can be significant, especially when the infrastructure team does not have a robust CI/CD system in place.
- Faster onboarding: It is easier to onboard new applications or use cases to the cluster, since provisioning a new tenant only requires replicating the existing tenant isolation model (e.g. namespace creation combined with policy enforcement). This is generally much faster than standing up a whole new cluster, which can take anywhere from a few minutes to hours.
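As an illustration of what replicating a tenant isolation model can look like, the sketch below generates a Namespace, a ResourceQuota, and a default-deny ingress NetworkPolicy for a new tenant. The `tenant_manifests` helper, the `tenant` label, the `team-a` name, and the quota values are all hypothetical; a real platform would likely add RBAC bindings and further policies on top.

```python
import json


def tenant_manifests(tenant: str, cpu: str = "4", memory: str = "8Gi") -> list:
    """Build the baseline objects for one tenant (hypothetical isolation model)."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": tenant, "labels": {"tenant": tenant}},
    }
    # Cap the total CPU and memory that pods in the namespace may request.
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{tenant}-quota", "namespace": tenant},
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory}},
    }
    # An empty podSelector selects every pod in the namespace; listing
    # "Ingress" without any ingress rules denies all inbound traffic.
    default_deny = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": tenant},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress"]},
    }
    return [namespace, quota, default_deny]


manifests = tenant_manifests("team-a")
# Kubernetes manifests can be written as JSON as well as YAML.
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

Because the same function produces the same set of objects for every tenant, onboarding becomes a matter of choosing a name and a quota rather than designing isolation from scratch each time.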
On the other hand, a single cluster also has its cons:
- Huge blast radius: A single cluster means that any problem with the cluster impacts the availability of all the applications running in it. For example, a botched Kubernetes control plane upgrade or a misbehaving CNI plugin could lead to prolonged downtime.
- Shared cluster-scoped resources: Even with policies in place, soft isolation via namespaces or custom resources does not provide the same level of hard security isolation as separate VMs. All of the containers in the cluster use the same cluster-wide components like external-dns, external-secrets, or CNI plugins. It is hard to lock down the IAM model tightly enough to prevent privilege escalation that may impact the entire cluster.
- Restricted access: Sometimes a use case may warrant installing cluster-wide resources. For example, if a development team wants to use Strimzi as its Kafka stack, they would need to involve an admin, because the CustomResourceDefinitions that Strimzi installs are cluster-scoped rather than namespaced. Depending on the purpose of the cluster, this behavior may not be desirable.
- Higher learning curve: Instead of using the familiar IAM roles and policies, administrators must now enforce rules via Kubernetes-native resources such as ResourceQuotas, LimitRanges, and Kubernetes RBAC objects. This may add an extra configuration step that previously might have been the responsibility of another team (e.g. InfoSec or IT).
- Cluster limitations: Kubernetes has upper bounds for the number of nodes and pods it can support (~5,000 nodes and 150k pods). As a cluster approaches those bounds, it becomes harder for the control plane to orchestrate workloads and respond in a timely manner.
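The bounds mentioned above lend themselves to a simple headroom check. The sketch below compares a cluster's node and pod counts against those two limits; the 80% threshold in `should_split` is an arbitrary assumption for illustration, not a Kubernetes recommendation, and both function names are made up.

```python
# Upper bounds quoted in the text above.
MAX_NODES = 5_000
MAX_PODS = 150_000


def headroom(nodes: int, pods: int) -> dict:
    """Report how close a cluster is to the node and pod bounds."""
    return {
        "node_utilization": nodes / MAX_NODES,
        "pod_utilization": pods / MAX_PODS,
        "within_limits": nodes <= MAX_NODES and pods <= MAX_PODS,
    }


def should_split(nodes: int, pods: int, threshold: float = 0.8) -> bool:
    """Flag a cluster once either dimension crosses the (assumed) threshold."""
    report = headroom(nodes, pods)
    return max(report["node_utilization"], report["pod_utilization"]) >= threshold


print(headroom(nodes=1_200, pods=30_000))
print(should_split(nodes=1_200, pods=30_000))
print(should_split(nodes=1_200, pods=130_000))
```

Note that either dimension can be the binding one: a cluster with relatively few nodes can still run out of pod headroom if the nodes are large and densely packed.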
Option 3: Hybrid approach
So far the options presented have been on two extreme ends of a spectrum. In reality, a hybrid approach can be used to address the unique requirements of each use case. For example, a large shared cluster can be used for an internal developer platform. A SaaS platform could also implement multi-tenancy in a controlled manner, since the entire infrastructure would be managed by an internal team. On the other hand, for critical or sensitive applications, a dedicated cluster can be created to reduce the blast radius.
A few popular approaches to thinking about cluster creation include:
- Separating by environments: a production cluster vs. a non-prod cluster. A bigger team may even split the non-prod cluster into dev, test, UAT, etc. For example, dev and test can be for internal purposes only, whereas a UAT cluster can be public-facing for integration work.
- Separating by products: some products may lend themselves nicely to a multi-tenant cluster, whereas others require hard isolation.
- Separating by teams: some teams like the platform team may need a multi-tenant cluster to undergo rapid integration work, whereas customer-facing teams may prefer a dedicated cluster for their use case.
Finally, even in multi-tenant clusters, hardware-level isolation is achievable by separating node groups with taints and tolerations. This approach is more applicable to environment-based isolation (e.g. putting performance testing on bigger machines and dev workloads on smaller ones), but can be used for other purposes as well.
Choosing the optimal operating model for your Kubernetes applications requires careful planning and thought. It ties directly into your plan for multi-tenancy as well as the requirements imposed by the use case. Having a cluster per application provides better isolation, but comes at a huge cost in terms of compute and human resources. On the other hand, a shared cluster is great for resource utilization, but may not satisfy various compliance requirements. Find the right balance between the two approaches so that your Kubernetes platform addresses the needs of your team.
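To show how taints and tolerations keep workloads apart, the following sketch models a simplified version of the matching rule: a pod can only be scheduled onto a node if it tolerates every `NoSchedule` or `NoExecute` taint on that node. This is a simplification of what the Kubernetes scheduler does, and the `workload=perf-test` taint is a hypothetical example.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified check of whether one toleration matches one taint."""
    # A toleration without an effect matches taints of any effect.
    effect = toleration.get("effect")
    if effect and effect != taint.get("effect"):
        return False
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        # "Exists" ignores the value; with no key it matches every taint.
        return key is None or key == taint.get("key")
    # "Equal" (the default operator) requires both key and value to match.
    return key == taint.get("key") and toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """A pod fits only if every hard taint on the node is tolerated."""
    hard = [t for t in node_taints if t.get("effect") in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations) for taint in hard)


# Hypothetical node group reserved for performance testing.
perf_node_taints = [{"key": "workload", "value": "perf-test", "effect": "NoSchedule"}]

perf_pod_tolerations = [
    {"key": "workload", "operator": "Equal", "value": "perf-test", "effect": "NoSchedule"}
]
dev_pod_tolerations = []  # an ordinary dev pod carries no tolerations

print(schedulable(perf_pod_tolerations, perf_node_taints))
print(schedulable(dev_pod_tolerations, perf_node_taints))
```

Taints only keep unwanted pods off a node group. To also steer the performance-testing pods onto those nodes rather than merely permitting them there, the toleration is typically paired with a node selector or node affinity rule.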