Best practices for running cost-optimized Kubernetes applications on GKE

This document discusses Google Kubernetes Engine (GKE) features and options, and the best practices for running cost-optimized applications on GKE to take advantage of the elasticity provided by Google Cloud. This document assumes that you are familiar with Kubernetes, Google Cloud, GKE, and autoscaling.

Introduction

As Kubernetes gains widespread adoption, a growing number of enterprises and platform-as-a-service (PaaS) and software-as-a-service (SaaS) providers are using multi-tenant Kubernetes clusters for their workloads. This means that a single cluster might be running applications that belong to different teams, departments, customers, or environments. The multi-tenancy provided by Kubernetes lets companies manage a few large clusters, instead of multiple smaller ones, with benefits such as appropriate resource utilization, simplified management control, and reduced fragmentation.

Over time, some of these companies with fast-growing Kubernetes clusters start to experience a disproportionate increase in cost. One reason is that some companies adopting cloud-based solutions like Kubernetes don't have developers and operators with cloud expertise. This lack of cloud readiness leads to applications becoming unstable during autoscaling (for example, traffic volatility during a regular period of the day) and during sudden bursts or spikes (such as TV commercials or peak-scale events like Black Friday and Cyber Monday). In an attempt to "fix" the problem, these companies tend to over-provision their clusters the way they used to in a non-elastic environment. Over-provisioning results in considerably higher CPU and memory allocation than what applications use for most of the day.

This document provides best practices for running cost-optimized Kubernetes workloads on GKE. The following diagram outlines this approach.

The approach to optimizing Kubernetes applications for cost.

The foundation of building cost-optimized applications is spreading the cost-saving culture across teams. Beyond moving cost discussions to the beginning of the development process, this approach forces you to better understand the environment that your applications are running in—in this context, the GKE environment.

In order to achieve low cost and application stability, you must correctly set or tune some features and configurations (such as autoscaling, machine types, and region selection). Another important consideration is your workload type because, depending on the workload type and your application's requirements, you must apply different configurations in order to further lower your costs. Finally, you must monitor your spending and create guardrails so that you can enforce best practices early in your development cycle.

The following table summarizes the challenges that GKE helps you solve. Although we encourage you to read the whole document, this table presents a map of what's covered.

Challenge | Action
I want to look at easy cost savings on GKE. | Select the appropriate region, sign up for committed-use discounts, and use E2 machine types.
I need to understand my GKE costs. | Observe your GKE clusters, watch for recommendations, and enable GKE usage metering.
I want to make the most out of GKE elasticity for my existing workloads. | Read about Horizontal Pod Autoscaler and Cluster Autoscaler, and understand the best practices for Autoscaler and over-provisioning.
I want to use the most efficient machine types. | Choose the right machine type for your workload.
Many nodes in my cluster are sitting idle. | Read the best practices for Cluster Autoscaler.
I need to improve cost savings in my batch jobs. | Read the best practices for batch workloads.
I need to improve cost savings in my serving workloads. | Read the best practices for serving workloads.
I don't know how to size my Pod resource requests. | Use Vertical Pod Autoscaler (VPA), but pay attention to the best practices for mixing Horizontal Pod Autoscaler (HPA) and VPA.
My applications are unstable during autoscaling and maintenance activities. | Prepare cloud-based applications for Kubernetes, and understand how Metrics Server works and how to monitor it.
How do I make my developers pay attention to their applications' resource usage? | Spread the cost-saving culture, consider using GKE Enterprise Policy Controller, design your CI/CD pipeline to enforce cost-saving practices, and use Kubernetes resource quotas.
What else should I consider to further reduce my ecosystem costs? | Review small development clusters, review your logging and monitoring strategies, and review inter-region egress traffic in regional and multi-zonal clusters.

GKE cost-optimization features and options

Cost-optimized Kubernetes applications rely heavily on GKE autoscaling. To balance cost, reliability, and scaling performance on GKE, you must understand how autoscaling works and what options you have. This section discusses GKE autoscaling and other useful cost-optimized configurations for both serving and batch workloads.

Fine-tune GKE autoscaling

Autoscaling is the strategy GKE uses to let Google Cloud customers pay only for what they need by minimizing infrastructure uptime. In other words, autoscaling saves costs by 1) making workloads and their underlying infrastructure start before demand increases, and 2) shutting them down when demand decreases.

The following diagram illustrates this concept. In Kubernetes, your workloads are containerized applications that are running inside Pods, and the underlying infrastructure, which is composed of a set of Nodes, must provide enough computing capacity to run the workloads.

Autoscaling saves cost by 1) making workloads and their underlying infrastructure start before demand increases, and 2) shutting them down when demand decreases.

As the following diagram shows, this environment has four scalability dimensions. The workload and infrastructure can scale horizontally by adding and removing Pods or Nodes, and they can scale vertically by increasing and decreasing Pod or Node size.

The four scalability dimensions of a cost-optimized environment.

GKE handles these autoscaling scenarios by using features like the following:

  • Horizontal Pod Autoscaler (HPA), for adding and removing Pods based on utilization metrics.
  • Vertical Pod Autoscaler (VPA), for sizing your Pods' CPU and memory requests.
  • Cluster Autoscaler (CA), for adding and removing nodes based on the scheduled workload.
  • Node auto-provisioning, for dynamically creating new node pools with nodes that match the needs of your Pods.

The following diagram illustrates these scenarios.

Using the HPA, VPA, CA, and node auto-provisioning scenarios.

The remainder of this section discusses these GKE autoscaling capabilities in more detail and covers other useful cost-optimized configurations for both serving and batch workloads.

Horizontal Pod Autoscaler

Horizontal Pod Autoscaler (HPA) is meant for scaling applications that are running in Pods based on metrics that express load. You can configure scaling on either CPU utilization or other custom metrics (for example, requests per second). In short, HPA adds and deletes Pod replicas, and it is best suited for stateless workers that can spin up quickly to react to usage spikes and shut down gracefully to avoid workload instability.

The HPA target utilization threshold lets you customize when to automatically trigger scaling.

As the preceding image shows, HPA requires a target utilization threshold, expressed as a percentage, which lets you customize when to automatically trigger scaling. In this example, the target CPU utilization is 70%. That means your workload has a 30% CPU buffer for handling requests while new replicas are spinning up. A small buffer prevents early scale-ups, but it can overload your application during spikes. However, a large buffer causes resource waste, increasing your costs. The exact target is application specific, and you must make sure the buffer is large enough to handle requests for two or three minutes during a spike. Even if you guarantee that your application can start up in a matter of seconds, this extra time is required when Cluster Autoscaler adds new nodes to your cluster or when Pods are throttled due to lack of resources.
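
For illustration, the following is a minimal HPA sketch that targets 70% average CPU utilization. The Deployment name frontend and the replica bounds are assumptions for this example; substitute your own workload and thresholds.

```yaml
# Minimal HPA sketch: keeps average CPU utilization of the "frontend"
# Deployment (hypothetical name) around 70%, leaving roughly a 30% buffer
# for handling traffic growth while new replicas spin up.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```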

The following are best practices for enabling HPA in your application:

  • Size your application correctly by setting appropriate resource requests and limits.
  • Set your target utilization to reserve a buffer that can handle requests during a spike.
  • Make sure your application starts as quickly as possible and shuts down according to Kubernetes expectations.
  • Set meaningful readiness and liveness probes.
  • Make sure Metrics Server is always up and running.

For more information, see Configuring a Horizontal Pod Autoscaler.

Vertical Pod Autoscaler

Unlike HPA, which adds and deletes Pod replicas for rapidly reacting to usage spikes, Vertical Pod Autoscaler (VPA) observes Pods over time and gradually finds the optimal CPU and memory resources required by the Pods. Setting the right resources is important for stability and cost efficiency. If your Pod resources are too small, your application can either be throttled or it can fail due to out-of-memory errors. If your resources are too large, you have waste and, therefore, larger bills. VPA is meant for stateless and stateful workloads not handled by HPA or when you don't know the proper Pod resource requests.

VPA detects that a Pod is consistently running at its limits and recreates the Pod with larger resources.

As the preceding image shows, VPA detects that the Pod is consistently running at its limits and recreates the Pod with larger resources. The opposite also happens when the Pod is consistently underutilized—a scale-down is triggered.

VPA can work in three different modes:

  • Off: In this mode, also known as recommendation mode, VPA does not apply any changes to your Pod. The recommendations are calculated and can be inspected in the VPA object.
  • Initial: VPA assigns resource requests only at Pod creation and never changes them later.
  • Auto: VPA updates CPU and memory requests during the life of a Pod. That means the Pod is deleted, CPU and memory are adjusted, and then a new Pod is started.

If you plan to use VPA, the best practice is to start with Off mode to gather recommendations. Make sure VPA has been running for at least 24 hours, ideally one week or more, before you act on the recommendations. Then, only when you feel confident, consider switching to either Initial or Auto mode.
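
As a starting point, the following is a minimal sketch of a VPA object in recommendation (Off) mode. The Deployment name frontend is an assumption for this example.

```yaml
# Minimal VPA sketch in recommendation (Off) mode: VPA only computes CPU and
# memory recommendations for the "frontend" Deployment (hypothetical name)
# and never modifies the running Pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: frontend-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  updatePolicy:
    updateMode: "Off"
```

You can then inspect the computed recommendations, for example with kubectl describe vpa frontend-vpa, before switching updateMode to Initial or Auto.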

Follow these best practices for enabling VPA, either in Initial or Auto mode, in your application:

If you are considering using Auto mode, make sure you also follow these practices:

  • Make sure your application can be restarted while receiving traffic.
  • Add a Pod Disruption Budget (PDB) to control how many Pods can be taken down at the same time (see the sketch after this list).
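
For reference, the following is a minimal PDB sketch. The name, labels, and threshold are assumptions for this example and must match your own workload.

```yaml
# Minimal PDB sketch: keeps at least 2 replicas of the "frontend" workload
# (hypothetical labels) available while VPA Auto mode, node upgrades, or
# Cluster Autoscaler evict Pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: frontend
```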

For more information, see Configuring Vertical Pod Autoscaling.

Mixing HPA and VPA

The official recommendation is that you must not mix VPA and HPA on either CPU or memory. However, you can mix them safely when using recommendation mode in VPA or custom metrics in HPA—for example, requests per second. When mixing VPA with HPA, make sure your deployments are receiving enough traffic—meaning, they are consistently running above the HPA min-replicas. This lets VPA understand your Pod's resource needs.
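
As one possible way to combine them, the following sketch drives HPA with a custom per-Pod metric instead of CPU, so that VPA can own the CPU and memory requests. The metric name http_requests_per_second and the target value are assumptions; the metric must be exported through a custom metrics adapter in your cluster.

```yaml
# Sketch of an HPA driven by a custom per-Pod metric rather than CPU,
# leaving CPU and memory sizing to VPA. The metric name and target value
# are hypothetical and depend on your custom metrics adapter.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa-rps
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
```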

For more information about VPA limitations, see Limitations for Vertical Pod autoscaling.

Cluster Autoscaler

Cluster Autoscaler (CA) automatically resizes the underlying compute infrastructure. CA provides nodes for Pods that don't have a place to run in the cluster and removes under-utilized nodes. CA is optimized for the cost of infrastructure. In other words, if there are two or more node types in the cluster, CA chooses the least expensive one that fits the given demand.

Unlike HPA and VPA, CA doesn't depend on load metrics. Instead, it's based on scheduling simulation and declared Pod requests. It's a best practice to enable CA whenever you are using either HPA or VPA. This practice ensures that if your Pod autoscalers determine that you need more capacity, your underlying infrastructure grows accordingly.

CA automatically adds and removes compute capacity to handle traffic spikes.

As the preceding diagram shows, CA automatically adds and removes compute capacity to handle traffic spikes and save you money when your customers are sleeping. It is a best practice to define a Pod Disruption Budget (PDB) for all your applications. PDBs are particularly important during the CA scale-down phase, when they control the number of replicas that can be taken down at one time.

Certain Pods can't be restarted by any autoscaler because restarting them would cause temporary disruption, so the nodes they run on can't be deleted. For example, system Pods (such as metrics-server and kube-dns) and Pods using local storage won't be restarted. However, you can change this behavior by defining PDBs for these system Pods and by setting the "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" annotation on Pods using local storage that are safe for the autoscaler to restart. Moreover, consider running long-lived Pods that can't be restarted on a separate node pool, so they don't block scale-down of other nodes. Finally, learn how to analyze CA events in the logs to understand why a particular scaling activity didn't happen as expected.
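
As a sketch, the annotation goes on the Pod template of the workload that uses local storage. The Deployment name, labels, and image below are assumptions for this example.

```yaml
# Sketch: marking a Pod that uses only disposable local storage (emptyDir)
# as safe for Cluster Autoscaler to evict, so its node can be scaled down.
# The Deployment name, labels, and image are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
      - name: worker
        image: gcr.io/my-project/batch-worker:latest  # hypothetical image
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
        volumeMounts:
        - name: scratch
          mountPath: /scratch
      volumes:
      - name: scratch
        emptyDir: {}
```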

If your workloads are resilient to nodes restarting inadvertently and to capacity losses, you can save more money by creating a cluster or node pool with preemptible VMs. For CA to work as expected, Pod resource requests need to be large enough for the Pod to function normally. If resource requests are too small, nodes might not have enough resources and your Pods might crash or have trouble during runtime.

The following is a summary of the best practices for enabling Cluster Autoscaler in your cluster:

  • Use either HPA or VPA to autoscale your workloads.
  • Make sure you are following the best practices described in the chosen Pod autoscaler.
  • Size your application correctly by setting appropriate resource requests and limits or use VPA.
  • Define a PDB for your applications.
  • Define PDBs for system Pods that might block your scale-down (for example, kube-dns). To avoid temporary disruption in your cluster, don't set a PDB for system Pods that have only 1 replica (such as metrics-server).
  • Run short-lived Pods and Pods that can be restarted in separate node pools, so that long-lived Pods don't block their scale-down.
  • Avoid over-provisioning by configuring idle nodes in your cluster. For that, you must know your minimum capacity—for many companies it's during the night—and set the minimum number of nodes in your node pools to support that capacity.
  • If you need extra capacity to handle requests during spikes, use pause Pods, which are discussed in Autoscaler and over-provisioning.

For more information, see Autoscaling a cluster.

Node auto-provisioning

Node auto-provisioning (NAP) is a mechanism of Cluster Autoscaler that automatically adds new node pools in addition to managing their size on the user's behalf. Without node auto-provisioning, GKE considers starting new nodes only from the set of user-created node pools. With node auto-provisioning, GKE can create and delete new node pools automatically.

Node auto-provisioning tends to reduce resource waste by dynamically creating node pools that best fit the scheduled workloads. However, the autoscaling latency can be slightly higher when new node pools need to be created. If your workloads are resilient to nodes restarting inadvertently and to capacity losses, you can further lower costs by configuring a toleration for preemptible VMs in your Pod.
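
As a sketch, a workload can opt into preemptible nodes created by node auto-provisioning by selecting and tolerating the GKE preemptible node label and taint. The Deployment name, labels, image, and resource values are assumptions for this example; use this only for workloads that can survive node preemption.

```yaml
# Sketch: a Deployment that selects and tolerates preemptible nodes so that
# node auto-provisioning can place it on cheaper, preemptible capacity.
# The name, labels, image, and resource requests are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 4
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        cloud.google.com/gke-preemptible: "true"
      tolerations:
      - key: cloud.google.com/gke-preemptible
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: worker
        image: gcr.io/my-project/batch-worker:latest  # hypothetical image
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
```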

The following are best practices for enabling node auto-provisioning:

  • Follow all the best practices for Cluster Autoscaler.
  • Set minimum and maximum resource sizes to avoid NAP making significant changes in your cluster when your application is not receiving traffic.
  • When using Horizontal Pod Autoscaler for serving workloads, consider reserving a slightly larger target utilization buffer because NAP might increase autoscaling latency in some cases.

For more information, see Using node auto-provisioning and Unsupported features.

Autoscaler and over-provisioning

In order to control your costs, we strongly recommend that you enable autoscalers as described in the previous sections. No single configuration fits all possible scenarios, so you must fine-tune the settings for your workload to ensure that autoscalers respond correctly to increases in traffic.

However, as noted in the Horizontal Pod Autoscaler section, scale-ups might take some time due to infrastructure provisioning. To visualize this difference in time and possible scale-up scenarios, consider the following image.

Visualizing the difference in time and possible scale-up scenarios.

When your cluster has enough room for deploying new Pods, one of the Workload scale-up scenarios is triggered. This means that if an existing node has never deployed your application, it must download the container images before starting the Pod (scenario 1). However, if the same node must start a new Pod replica of your application, the total scale-up time decreases because no image download is required (scenario 2).

When your cluster doesn't have enough room for deploying new Pods, one of the Infrastructure and Workload scale-up scenarios is triggered. This means that Cluster Autoscaler must provision new nodes and start the required software before your application Pods can be scheduled on them (scenario 1). If you use node auto-provisioning, depending on the workload scheduled, new node pools might be required. In this situation, the total scale-up time increases because Cluster Autoscaler has to provision both nodes and node pools (scenario 2).

For scenarios where new infrastructure is required, don't squeeze your cluster too much. In other words, you must over-provision, but only enough to reserve the buffer needed to handle the expected peak requests during scale-ups.

There are two main strategies for this kind of over-provisioning:

  • Fine-tune the HPA utilization target. The following equation is a simple and safe way to find a good CPU target:

    (1 - buff)/(1 + perc)

    • buff is a safety buffer that you can set to avoid reaching 100% CPU. This variable is useful because reaching 100% CPU means that the latency of request processing is much higher than usual.
    • perc is the percentage of traffic growth you expect in two or three minutes.

    For example, if you expect a growth of 30% in your requests and you want to avoid reaching 100% of CPU by defining a 10% safety buffer, your formula would look like this:

    (1 - 0.1)/(1 + 0.3) = 0.69

  • Configure pause Pods. There is no way to configure Cluster Autoscaler to spin up nodes upfront. Instead, you can set an HPA utilization target to provide a buffer to help handle spikes in load. However, if you expect large bursts, setting a small HPA utilization target might not be enough or might become too expensive.

    An alternative solution for this problem is to use pause Pods. Pause Pods are low-priority deployments that do nothing but reserve room in your cluster. Whenever a high-priority Pod is scheduled, pause Pods get evicted and the high-priority Pod immediately takes their place. The evicted pause Pods are then rescheduled, and if there is no room in the cluster, Cluster Autoscaler spins up new nodes for fitting them. It's a best practice to have only a single pause Pod per node. For example, if you are using 4 CPU nodes, configure the pause Pods