In modern computing, where applications and workloads are dynamic and interdependent, finding the sweet spot for resource allocation can be challenging.
Scaling resources by manually adjusting parameters is slow, inefficient — and risky.
Allocate too little and you’ll deliver subpar user experience or even application outages. But overprovisioning ‘just in case’ is wasteful and costly.
Autoscaling is the true cloud-native solution. It promises to deliver the right resources at the right time, saving you from underprovisioning and overprovisioning.
But, like almost every technology, the path to successful autoscaling has its share of hurdles.
In this blog, we explore:
- Why you need Kubernetes autoscalers
- The challenges and prerequisites you need to think about
- The many flavors of autoscaler — HPA, VPA and KEDA — and when to use them
Let’s get started.
Why is autoscaling important?
Autoscaling means dynamically allocating cluster resources, such as CPU and memory, to your applications based on real-time demand. This ensures that your applications have the right amount of resources to handle varying levels of load, so it directly improves application performance and availability. But there are many other benefits:
- Cost efficiency: Autoscaling can help you optimize infrastructure costs, because you only pay for the variable resources you need, instead of overprovisioning. You can learn more about optimizing costs in Kubernetes from this blog post.
- Positive environmental impact: Aligning resources more accurately to demand results in reduced power consumption and carbon emissions, helping support green corporate objectives.
- Time savings: Autoscaling essentially automates the manual task of adjusting resources. This can free up a huge amount of ops and SRE time, particularly in environments with rapidly changing workloads.
Challenges of autoscaling
Autoscaling, while powerful for optimizing resource allocation, comes with its own set of challenges. Here are some of the common challenges associated with autoscaling:
- Predictive scaling: Predicting the exact resource needs of an application can be tricky. Autoscaling decisions are often made based on past usage patterns or metrics, and unexpected traffic spikes or changes in usage can lead to under or over-scaling.
- Cost unpredictability: Autoscaling can lead to increased infrastructure costs, especially when left unchecked. Overprovisioning or rapidly scaling up in response to short-lived demand spikes can be costly.
- Resource contention: When many applications autoscale independently within a cluster, they may compete for finite resources, leading to resource contention and performance degradation.
- Complexity: Managing autoscaling configurations and rules for a growing number of microservices and applications can become complex and error-prone.
Managing these challenges effectively requires a combination of smart policies, automation, monitoring, and a deep understanding of the specific needs of your applications and workloads.
If you want to start exploring autoscaling options in your clusters, here’s what you’ll need.
- A basic understanding of Kubernetes, including Pods, Deployments, Services, and basic networking.
- A running Kubernetes cluster. In this tutorial, we will be using minikube.
- A kubectl installation configured to work with the cluster.
- An installation of Helm.
- An installation of the Metrics Server.
The Kubernetes Metrics Server
Before going into autoscaling, we need to understand what metrics and the Metrics Server are, and the role they play in autoscaling.
A metric in Kubernetes is a quantitative measurement or data point that provides information about the usage or behavior of a particular resource, such as CPU, memory, or custom application-specific metrics.
Metrics are used to assess the performance and health of Kubernetes objects like pods, nodes, and containers.
Some common examples of metrics in Kubernetes include:
- The percentage of CPU resources consumed by a container or pod (CPU usage)
- The amount of memory resources consumed by a container or pod (memory usage)
- The rate of data transfer over the network for a pod (network throughput)
Metrics are useful for making informed scaling and resource allocation decisions, ensuring optimal performance, and troubleshooting issues in Kubernetes environments.
The Metrics Server is a component of the Kubernetes cluster that collects resource metrics from each node's kubelet and exposes them through the Kubernetes Metrics API. It provides a way to query and access this data, allowing you to monitor the performance and resource usage of your cluster and its workloads.
The Metrics Server is important for Horizontal Pod Autoscaling (HPA), Vertical Pod Autoscaling (VPA), and autoscaling in general, because autoscaling relies on real-time resource utilization data, such as CPU and memory usage, to make informed scaling decisions.
Without the Metrics Server, the critical metrics needed for efficient and effective autoscaling of pods and resources in a Kubernetes cluster would not be available.
How to install the Metrics Server on a Kubernetes cluster
For most generic clusters, you can use the official Metrics Server deployment manifest from the Kubernetes GitHub repository.
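One common way to do this is to apply the manifest published with each metrics-server release (if you are following along on minikube, running minikube addons enable metrics-server achieves the same thing):

```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```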
Wait a few moments for the Metrics Server to start running, then check its status with kubectl.
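A typical status check (the Metrics Server deploys into the kube-system namespace by default):

```shell
kubectl get deployment metrics-server -n kube-system
```

The deployment is ready once the READY column shows 1/1.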
You can view CPU and memory usage for pods or resource usage by nodes by using the kubectl top command as shown.
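For example:

```shell
# Resource usage by node
kubectl top nodes

# CPU and memory usage for pods in the current namespace
kubectl top pods
```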
Horizontal Pod Autoscaling (HPA) vs Vertical Pod Autoscaling (VPA) vs Kubernetes Event-Driven Autoscaling (KEDA)
You have three main types of autoscaler to choose from in Kubernetes.
Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) are the two most commonly used patterns.
But the Kubernetes event driven autoscaler, or KEDA, is getting a lot of attention after it recently graduated as a CNCF project.
Let’s dive into the three options in more detail.
Horizontal Pod Autoscaling (HPA)
What is HPA?
Horizontal Pod Autoscaling (HPA) is a Kubernetes feature that automatically scales the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on certain metrics like CPU utilization or custom metrics. Horizontal scaling is the most basic autoscaling pattern in Kubernetes.
HPA sets two kinds of parameters: the target utilization level and the minimum and maximum number of replicas allowed. When the utilization of a pod exceeds the target, HPA will automatically scale up the number of replicas to handle the increased load. Conversely, when usage drops below the target, HPA will scale down the number of replicas to conserve resources.
When to use HPA
Use HPA when your primary concern is adjusting the number of pod replicas based on the application's workload and resource usage:
- Scalability Based on Workload: HPA is ideal when you want to scale the number of pod replicas up or down to handle varying levels of incoming traffic or workload.
- Traffic-Driven Scaling: If your application experiences load changes due to incoming traffic, HPA can help you automatically adjust the number of replicas to maintain optimal performance.
- Resource Utilization Scaling: HPA is well-suited for scenarios where you want to ensure that CPU or memory utilization of pods remains within a certain threshold.
Working with HPA
In this section, we will demonstrate how to use HPA. We will create a deployment, enable HPA, generate load, and monitor the HPA.
Create a Deployment:
Create a Deployment for your application. Start by creating a file for your deployment.
Use the template below as a guide. In place of your-app-image, you can add your app’s image if you want to work with a different image. For this example, we will be using an nginx image.
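Here is an example deployment manifest. This is a sketch: the name my-app and the specific resource values are illustrative, and the CPU request matters because the HPA's Utilization target is calculated as a percentage of the requested CPU.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: nginx  # replace with your-app-image to use a different image
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 100m     # required for the HPA's Utilization metric to work
            memory: 64Mi
          limits:
            cpu: 200m
            memory: 128Mi
```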
Run kubectl create to create the deployment.
Run kubectl get deploy to see the newly created deployment.
Also, run kubectl get pods to see the pods created by the deployment.
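Assuming the manifest was saved as deployment.yaml, the steps above look like this:

```shell
kubectl create -f deployment.yaml
kubectl get deploy
kubectl get pods
```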
To enable HPA, you need to define an HPA resource that specifies the scaling behavior. Run touch hpa-config.yaml to create the file, then fill it in using the template below.
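Here is a template matching the field-by-field breakdown that follows (the name my-app-hpa is a placeholder):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
```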
Here's a breakdown of the components in the YAML above:
- apiVersion: autoscaling/v2: Specifies the API version for the HorizontalPodAutoscaler resource.
- kind: HorizontalPodAutoscaler: Specifies the kind of resource, indicating that it's an HPA.
- metadata: Metadata for the HPA, including its name.
- scaleTargetRef: Defines the deployment that the HPA will scale.
- apiVersion: apps/v1: The API version of the resource the HPA will scale (Deployment in this case).
- kind: Deployment: The kind of resource the HPA will scale (Deployment in this case).
- name: my-app: The name of the Deployment that the HPA will scale.
- minReplicas: 2: Specifies the minimum number of replicas (pods) that the HPA should maintain. In this case, at least 2 replicas will be running even if there's low demand.
- maxReplicas: 10: Specifies the maximum number of replicas (pods) that the HPA should scale up to. If there's high demand, the HPA can scale up to a maximum of 10 replicas.
- metrics: Specifies the metrics used for autoscaling decisions.
- type: Resource: Indicates that the autoscaler should use resource metrics for scaling.
- resource: Specifies the resource metric type, which in this case is CPU.
- name: cpu: Specifies that the CPU resource metric will be used.
- target: Specifies the target utilization of the CPU metric.
- type: Utilization: Indicates that the target is based on resource utilization.
- averageUtilization: 50: Specifies that the target CPU utilization should be around 50%. The autoscaler will attempt to maintain this level of CPU utilization by adjusting the number of replicas.
Apply HPA Configuration
Apply the HPA configuration to your cluster.
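For example:

```shell
kubectl apply -f hpa-config.yaml
```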
If the HPA is enabled successfully, you can go ahead and generate load for your application.
Generate load on your application to trigger scaling based on the CPU metric.
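One simple approach, assuming the deployment has been exposed as a Service named my-app, is to run a temporary busybox pod that requests the service in a tight loop:

```shell
# Expose the deployment as a Service (skip if you already have one)
kubectl expose deployment my-app --port=80

# Run a disposable pod that hammers the service until you press Ctrl+C
kubectl run load-generator --rm -it --image=busybox --restart=Never -- \
  /bin/sh -c "while sleep 0.01; do wget -q -O- http://my-app; done"
```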
Monitor the HPA and the number of replicas to observe the automatic scaling based on CPU utilization.
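Assuming the HPA was named my-app-hpa:

```shell
kubectl get hpa my-app-hpa --watch
```

As CPU utilization climbs above the 50% target, the REPLICAS column should rise toward the configured maximum; once the load stops, it scales back down after the stabilization window.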
Vertical Pod Autoscaling (VPA)
What is VPA?
Vertical Pod Autoscaling (VPA) adjusts the resource requests and limits of individual containers within a pod to optimize resource allocation based on historical usage patterns. This helps pods get the right amount of resources without manual intervention.
VPA works from two kinds of parameters: a target resource level derived from observed usage, and the minimum and maximum amount of resources allowed for each pod. When a pod's usage exceeds the target, VPA will automatically increase the resources allocated to that pod; when usage drops below the target, VPA will reduce them. In this way, VPA allows you to scale the resources used by individual pods up and down quickly and efficiently.
When to use VPA
Use VPA when you're concerned about optimizing the resource utilization within individual pods to improve performance and efficiency:
- Fine-Tuning Resource Requests and Limits: VPA is beneficial when you want to automatically adjust resource requests and limits of containers within pods based on their observed resource usage.
- Efficient Resource Utilization: VPA is particularly valuable in scenarios where you need to prevent over-provisioning or underutilization of resources within pods.
- Optimizing Resource Allocation: If you want to maximize the utilization of CPU and memory resources within your pods, VPA can help you achieve that by dynamically adjusting resource settings.
Working with VPA
Install the Vertical Pod Autoscaler to your cluster. Clone the autoscaler repository from GitHub and run ./hack/vpa-up.sh in the vertical-pod-autoscaler directory to install VPA.
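The steps look like this:

```shell
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
```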
Apply VPA Policy
Create a VPA config file named vpa-config.yaml. Apply the VPA policy to specify the resource requirements. Use the template below to create the vpa-config.yaml file.
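A minimal template matching the breakdown that follows (the name my-app-vpa is a placeholder):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: "Deployment"
    name: "my-app"
  updatePolicy:
    updateMode: "Auto"
```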
Here's a breakdown of the key components of the YAML:
- apiVersion: autoscaling.k8s.io/v1: Specifies the API version for the VerticalPodAutoscaler resource.
- kind: VerticalPodAutoscaler: Specifies the kind of resource, indicating that it's a VPA.
- metadata: Metadata for the VPA, including its name.
- targetRef: Defines the target deployment for the VPA to analyze and adjust.
- apiVersion: "apps/v1": The API version of the target resource (Deployment in this case).
- kind: "Deployment": The kind of target resource (Deployment in this case).
- name: "my-app": The name of the Deployment that the VPA will analyze.
- updatePolicy: Specifies the update policy for the VPA.
- updateMode: "Auto": Indicates that the VPA should automatically update the resource requests and limits of the containers based on observed usage patterns. The VPA will analyze resource usage and make recommendations for changes.
Apply the VPA configuration to your cluster.
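For example:

```shell
kubectl apply -f vpa-config.yaml
```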
There are several ways in which you can generate load for your deployment. In this tutorial, we will use hey. Download hey by going through the installation guide on GitHub. If you use a Linux machine, you can use sudo apt install to install hey.
Generate load by using hey with the right arguments.
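For example, assuming the deployment is exposed as a Service named my-app and forwarded to a local port:

```shell
# Forward the service to localhost (assumes a Service named my-app exists)
kubectl port-forward svc/my-app 8080:80 &

# 5 minutes of load (-z 5m) with 10 concurrent workers (-c 10)
hey -z 5m -c 10 http://localhost:8080/
```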
This command generates load for 5 minutes with 10 concurrent requests. Adjust the parameters as needed for your testing.
Keep in mind that generating load from within the cluster is different from generating load externally. VPA might respond differently to load generated from within the cluster compared to external load.
Monitor the VPA to see the adjustments made to your pod's resource requests and limits.
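Assuming the VPA object was named my-app-vpa:

```shell
kubectl describe vpa my-app-vpa
```

Look for the Recommendation section in the output, which lists the lower bound, target, and upper bound for each container's resource requests.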
Kubernetes Event-Driven Autoscaling (KEDA)
KEDA, "Kubernetes-based Event-Driven Autoscaling," is an open-source project designed to provide event-driven autoscaling for container workloads in Kubernetes. The buzz around KEDA is well-founded. KEDA extends Kubernetes' native horizontal pod autoscaling capabilities to allow applications to scale automatically based on events coming from various sources, such as message queues, event buses, or custom metrics. This makes it easier to build and operate event-driven applications that can efficiently handle varying workloads.
Events can be external or internal signals that an application needs to scale based on workload demand. Some examples of event sources in KEDA are cron schedules, message queues, and HTTP requests.
A KEDA event should not be confused with an observability event. The key difference is that a KEDA event triggers autoscaling actions, whereas an observability event provides insights and data for monitoring and understanding your application's behavior. While they both deal with events, their focus and usage are distinct: KEDA events are about managing the runtime scaling of your application, while observability events are about monitoring and improving your system's performance and reliability.
When to use KEDA
Use KEDA when you want to autoscale based on event-driven patterns and sources beyond typical resource metrics:
- Event-Driven Scaling: KEDA is designed for scenarios where you want to scale your applications based on events such as messages from message queues, HTTP requests, or other custom event sources.
- Scale Beyond Resource Metrics: If your application's scaling decisions depend on events generated by external systems or services, KEDA provides a way to integrate autoscaling with these events.
- Cloud-Native Event Patterns: KEDA is suitable for cloud-native applications that leverage event-driven architectures and need to scale in response to events from various sources.
Working with KEDA
Install KEDA to your cluster using Helm.
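Using the official KEDA Helm chart:

```shell
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
```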
Create a Deployment And Service
Create a deployment that will be scaled using KEDA. Also, create an accompanying service for the deployment.
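A sketch, using an nginx image and the name http-app (which the ScaledObject in the next step refers to):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: http-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: http-app
  template:
    metadata:
      labels:
        app: http-app
    spec:
      containers:
      - name: http-app
        image: nginx
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: http-app
spec:
  selector:
    app: http-app
  ports:
  - port: 80
    targetPort: 80
```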
Create a ScaledObject
Create a ScaledObject resource that defines how your application scales based on an event source. For this example, let's use the HTTP scaler:
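Save the following as scaled-object.yaml. Note that KEDA's built-in scalers are mostly queue- and metric-based, and HTTP-driven scaling in practice relies on the separate KEDA HTTP add-on, so treat this trigger configuration as an illustrative sketch:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: http-scaler
spec:
  scaleTargetRef:
    name: http-app
  triggers:
  - type: http
    metadata:
      url: "http://http-app"
      port: "80"
      path: "/"
    authenticationRef:
      name: ""
```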
Let's break down the YAML for the ScaledObject:
- apiVersion: keda.sh/v1alpha1: The API version used for the resource definition, which indicates compatibility with the KEDA framework.
- kind: ScaledObject: Defines the type of resource being created, a "ScaledObject." A ScaledObject is a KEDA-specific resource that describes how to scale a deployment or workload based on specific triggers.
- metadata: Contains metadata information about the resource, such as its name and any additional labels or annotations.
- name: http-scaler: Specifies the name of the ScaledObject resource.
- spec: This section defines the specifications for the ScaledObject.
- scaleTargetRef: This is the target deployment or workload that needs to be scaled based on certain triggers.
- name: http-app: Refers to the name of the deployment or workload that the ScaledObject is associated with.
- triggers: Defines the triggers that will cause the scaling of the target workload. In this case, a single trigger is defined as "http," indicating that the scaling will be based on incoming HTTP requests.
- type: http: Specifies the type of trigger, an HTTP request trigger.
- metadata: Contains additional metadata for the HTTP trigger.
- url: "http://http-app": Specifies the URL of the HTTP endpoint that will be used as the trigger source. In this example, it's set to "http://http-app."
- port: "80": Specifies the port number for the HTTP requests (port 80 in this case).
- path: "/": Defines the path on the HTTP server that will trigger the scaling. In this example, requests to the root path ("/") will trigger scaling.
- authenticationRef: Specifies the authentication configuration for the HTTP trigger.
- name: "": The name of a reference to an authentication configuration. Here, no specific authentication configuration is being used.
Apply the ScaledObject configuration to your cluster.
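Assuming the manifest was saved as scaled-object.yaml:

```shell
kubectl apply -f scaled-object.yaml
```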
To generate load, you can use tools like hey or wrk. Install the tool you choose on your local machine and run it against the Nginx service. We will be using hey.
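For example, forwarding the service to a local port and generating two minutes of load with 50 concurrent workers (adjust the numbers as needed):

```shell
kubectl port-forward svc/http-app 8080:80 &
hey -z 2m -c 50 http://localhost:8080/
```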
Monitor the scaling behavior of your application based on the events coming from the specified event source.
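KEDA manages an HPA under the hood, so you can watch both the ScaledObject and the pods:

```shell
kubectl get scaledobject http-scaler
kubectl get pods --watch
```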
The ultimate goal of scaling is to effectively utilize and balance resources, as well as manage costs. You can look at how to properly analyze and save costs in your Kubernetes deployments by going through this blog post.
In this tutorial, we covered the basics of autoscaling in Kubernetes, including the Horizontal Pod Autoscaler (HPA), the Vertical Pod Autoscaler (VPA), and the Kubernetes Event-Driven Autoscaler (KEDA).
HPA allows you to automatically scale pods in a deployment or replica set based on observed metrics such as CPU or memory usage. You learned how to create an HPA using a YAML configuration. The HPA reacts to changes in the metric and adjusts the number of replicas accordingly.
VPA adjusts the resource requests and limits of containers within pods based on their observed resource usage patterns. You learned how to create a VPA using a YAML configuration. The VPA analyzes container resource usage and suggests or applies resource adjustments to optimize performance.
KEDA focuses on scaling an application based on events called triggers. You learned how to create a ScaledObject, generate load, and monitor a ScaledObject with hey.
Effective autoscaling requires careful monitoring, tuning of metrics, and testing to ensure that your applications can handle changes in load while maintaining stability.
To learn more about Kubernetes, or if you have other questions, you can join our Slack channel and we will be happy to help.