Troubleshoot cluster autoscaler not scaling up


This page shows you how to discover and resolve issues with cluster autoscaler not scaling up nodes in your Google Kubernetes Engine (GKE) clusters.

This page is for Application developers who want to resolve an unexpected or negative situation with their app or service and Platform admins and operators who want to prevent interruption to delivery of products and services.

Understand when cluster autoscaler scales up your nodes

Before you proceed to the troubleshooting steps, it can be helpful to understand when cluster autoscaler would try to scale up your nodes. Cluster autoscaler only adds nodes when existing resources are insufficient.

Every 10 seconds, cluster autoscaler checks if there are any Pods that are unschedulable. A Pod becomes unschedulable when the Kubernetes scheduler cannot place it on any existing node due to insufficient resources, node constraints, or unmet Pod requirements.

When cluster autoscaler finds unschedulable Pods, it evaluates whether adding a node would allow the Pod to be scheduled. If adding a node would let a Pod get scheduled, cluster autoscaler adds a new node to the managed instance group (MIG). The Kubernetes scheduler can then schedule the Pod on the newly provisioned node.

Check if you have unschedulable Pods

To determine if your cluster needs to scale up, check for unschedulable Pods:

  1. In the Google Cloud console, go to the Workloads page.

    Go to Workloads

  2. In the Filter field, enter unschedulable and press Enter.

    If there are any Pods listed, then you have unschedulable Pods. To troubleshoot unschedulable Pods, see Error: Pod unschedulable. Resolving the underlying cause of unschedulable Pods can often enable cluster autoscaler to scale up. To identify and resolve errors that are specific to cluster autoscaler, explore the following sections.

    If there are no Pods listed, cluster autoscaler doesn't need to scale up and is working as expected.
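
You can also run this check from the command line with kubectl. The following is a minimal sketch that lists Pods stuck in the Pending phase and then shows the scheduler events for one of them; POD_NAME and NAMESPACE are placeholders. Pending Pods also include Pods that are scheduled but still starting, so confirm that the events show a FailedScheduling reason:

    # List Pods that haven't been placed on a node yet.
    kubectl get pods --all-namespaces --field-selector=status.phase=Pending

    # Show the events for a specific Pod, including FailedScheduling messages.
    kubectl describe pod POD_NAME -n NAMESPACE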

Check if you previously had unschedulable Pods

If you're investigating what caused cluster autoscaler to fail in the past, check for previously unschedulable Pods:

  1. In the Google Cloud console, go to the Logs Explorer page.

    Go to Logs Explorer

  2. Specify a time range for the log entries that you want to view.

  3. In the query pane, enter the following query:

    logName="projects/PROJECT_ID/logs/events"
    jsonPayload.source.component="default-scheduler"
    jsonPayload.reason="FailedScheduling"
    

    Replace PROJECT_ID with your project ID.

  4. Click Run query.

    If there are any results listed, then you had unschedulable Pods in the time range that you specified.
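
You can run an equivalent search from the command line with the gcloud CLI. The following sketch assumes you want entries from roughly the last three days; adjust the freshness window, limit, and project ID for your situation:

    gcloud logging read '
      logName="projects/PROJECT_ID/logs/events"
      jsonPayload.source.component="default-scheduler"
      jsonPayload.reason="FailedScheduling"' \
        --project=PROJECT_ID \
        --freshness=3d \
        --limit=50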

Check if the issue is caused by a limitation

After you've confirmed that you have unschedulable Pods, make sure your issue with cluster autoscaler isn't caused by one of the limitations of cluster autoscaler.

View errors

You can often diagnose the cause of scale up issues by viewing error messages:

View errors in notifications

If the issue you observed happened less than 72 hours ago, view notifications about errors in the Google Cloud console. These notifications provide valuable insights into why cluster autoscaler didn't scale up and offer advice on how to resolve the error and view relevant logs for further investigation.

To view the notifications in the Google Cloud console, complete the following steps:

  1. In the Google Cloud console, go to the Kubernetes clusters page.

    Go to Kubernetes clusters

  2. Review the Notifications column. The following notifications are associated with scale up issues:

    • Can't scale up
    • Can't scale up pods
    • Can't scale up a node pool
  3. Click the relevant notification to see a pane with details about what caused the issue and recommended actions to resolve it.

  4. Optional: To view the logs for this event, click Logs. This action takes you to Logs Explorer with a pre-populated query to help you further investigate the scaling event. To learn more about how scale up events work, see View cluster autoscaler events.

If you're still experiencing issues after reviewing the advice in the notification, consult the error messages tables for further help.

View errors in events

If the issue you observed happened more than 72 hours ago, view events in Cloud Logging. When there has been an error, it's often recorded in an event.

To view cluster autoscaler logs in the Google Cloud console, complete the following steps:

  1. In the Google Cloud console, go to the Kubernetes clusters page.

    Go to Kubernetes clusters

  2. Select the name of the cluster that you want to investigate to view its Cluster details page.

  3. On the Cluster details page, click the Logs tab.

  4. On the Logs tab, click the Autoscaler Logs tab to view the logs.

  5. Optional: To apply more advanced filters to narrow the results, click the button with the arrow on the right side of the page to view the logs in Logs Explorer.

To learn more about how scale up events work, see View cluster autoscaler events. For an example of how to use Cloud Logging, see the following troubleshooting example.

Example: Troubleshoot an issue over 72 hours old

The following example shows you how you might investigate and resolve an issue with a cluster not scaling up.

Scenario: More than 72 hours ago, a Pod was marked as unschedulable, and cluster autoscaler didn't provision any new nodes to schedule the Pod.

Solution:

  1. Because the issue happened over 72 hours ago, you investigate the issue using Cloud Logging instead of looking at the notification messages.
  2. In Cloud Logging, you find the logging details for cluster autoscaler events, as described in View errors in events.
  3. You search for scaleUp events that list the Pod that you're investigating in the triggeringPods field. You can filter the log entries, including filtering by a particular JSON field value (a sample query follows this list). To learn more, see Advanced logs queries.

  4. You don't find any scaleUp events. However, if you did, you could look for an eventResult event that contains the same eventId as the scaleUp event. You could then check the errorMsg field and consult the list of possible scaleUp error messages.

  5. Because you didn't find any scaleUp events, you continue to search for noScaleUp events and review the following fields:

    • unhandledPodGroups: contains information about the Pod (or the Pod's controller).
    • reason: provides global reasons why scaling up might be blocked.
    • skippedMigs: provides reasons why some MIGs might be skipped.
  6. You find a noScaleUp event for your Pod, and all MIGs in the rejectedMigs field have the same reason message ID of "no.scale.up.mig.failing.predicate" with two parameters: "NodeAffinity" and "node(s) did not match node selector".
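
The following queries sketch what steps 3 and 5 might look like in Logs Explorer. The triggeringPods and noDecisionStatus.noScaleUp fields come from the event schema described on this page, but the exact sub-field path (such as .name) and the POD_NAME placeholder are assumptions to adapt to your cluster:

    resource.type="k8s_cluster"
    log_id("container.googleapis.com/cluster-autoscaler-visibility")
    jsonPayload.decision.scaleUp.triggeringPods.name="POD_NAME"

To search for noScaleUp events instead, you might replace the last line with a presence check on the noScaleUp field:

    resource.type="k8s_cluster"
    log_id("container.googleapis.com/cluster-autoscaler-visibility")
    jsonPayload.noDecisionStatus.noScaleUp:*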

Resolution:

After consulting the list of error messages, you discover that cluster autoscaler can't scale up a node pool because of a failing scheduling predicate for the pending Pods. The parameters are the name of the failing predicate and the reason why it failed.

To resolve the issue, you review the manifest of the Pod, and discover that it has a node selector that doesn't match any MIG in the cluster. You delete the selector from the manifest of the Pod and recreate the Pod. Cluster autoscaler adds a new node and the Pod is scheduled.

Resolve scale up errors

After you have identified your error, use the following tables to help you understand what caused the error and how to resolve it.

ScaleUp errors

You can find event error messages for scaleUp events in the corresponding eventResult event, in the resultInfo.results[].errorMsg field.

"scale.up.error.out.of.resources"
  Details: Resource errors occur when you try to request new resources in a zone that can't accommodate the request because a Compute Engine resource, such as GPUs or CPUs, is currently unavailable.
  Parameters: Failing MIG IDs.
  Mitigation: Follow the resource availability troubleshooting steps in the Compute Engine documentation.

"scale.up.error.quota.exceeded"
  Details: The scaleUp event failed because some of the MIGs couldn't be increased due to exceeded Compute Engine quota.
  Parameters: Failing MIG IDs.
  Mitigation: Check the Errors tab of the MIG in the Google Cloud console to see which quota is being exceeded. After you know which quota is being exceeded, follow the instructions to request a quota increase.

"scale.up.error.waiting.for.instances.timeout"
  Details: Scale up of a managed instance group failed due to timeout.
  Parameters: Failing MIG IDs.
  Mitigation: This message should be transient. If it persists, contact Cloud Customer Care for further investigation.

"scale.up.error.ip.space.exhausted"
  Details: Can't scale up because instances in some of the managed instance groups ran out of IP addresses. The cluster doesn't have enough unallocated IP address space to add new nodes or Pods.
  Parameters: Failing MIG IDs.
  Mitigation: Follow the troubleshooting steps in Not enough free IP address space for Pods.

"scale.up.error.service.account.deleted"
  Details: Can't scale up because the service account was deleted.
  Parameters: Failing MIG IDs.
  Mitigation: Try to undelete the service account. If that procedure is unsuccessful, contact Cloud Customer Care for further investigation.

Reasons for a noScaleUp event

A noScaleUp event is periodically emitted when there are unschedulable Pods in the cluster and cluster autoscaler cannot scale the cluster up to schedule the Pods. noScaleUp events are best-effort, and don't cover all possible cases.

NoScaleUp top-level reasons

Top-level reason messages for noScaleUp events appear in the noDecisionStatus.noScaleUp.reason field. The message contains a top-level reason for why cluster autoscaler cannot scale the cluster up.

"no.scale.up.in.backoff"
  Details: No scale up because scaling up is in a backoff period (temporarily blocked). This message can occur during scale up events with a large number of Pods.
  Mitigation: This message should be transient. Check this error after a few minutes. If the message persists, contact Cloud Customer Care for further investigation.

NoScaleUp top-level node auto-provisioning reasons

Top-level node auto-provisioning reason messages for noScaleUp events appear in the noDecisionStatus.noScaleUp.napFailureReason field. The message contains a top-level reason for why cluster autoscaler cannot provision new node pools.

"no.scale.up.nap.disabled"
  Details: Node auto-provisioning couldn't scale up because node auto-provisioning is not enabled at the cluster level. When node auto-provisioning is disabled, new nodes won't be automatically provisioned if the pending Pod has requirements that can't be satisfied by any existing node pool.
  Mitigation: Review the cluster configuration and consider enabling node auto-provisioning.
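
If you decide to enable node auto-provisioning, the command is sketched below. CLUSTER_NAME and LOCATION are placeholders, and the CPU and memory limits are illustrative values that you should size for your own workloads:

    gcloud container clusters update CLUSTER_NAME \
        --location=LOCATION \
        --enable-autoprovisioning \
        --min-cpu=1 --max-cpu=64 \
        --min-memory=1 --max-memory=256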

NoScaleUp MIG-level reasons

MIG-level reason messages for noScaleUp events appear in the noDecisionStatus.noScaleUp.skippedMigs[].reason and noDecisionStatus.noScaleUp.unhandledPodGroups[].rejectedMigs[].reason fields. The message contains a reason why cluster autoscaler can't increase the size of a particular MIG.

"no.scale.up.mig.skipped"
  Details: Cannot scale up a MIG because it was skipped during the simulation.
  Parameters: Reasons why the MIG was skipped (for example, missing a Pod requirement).
  Mitigation: Review the parameters included in the error message and address why the MIG was skipped.

"no.scale.up.mig.failing.predicate"
  Details: Can't scale up a node pool because of a failing scheduling predicate for the pending Pods.
  Parameters: Name of the failing predicate and reasons why it failed.
  Mitigation: Review the Pod requirements, such as affinity rules, taints or tolerations, and resource requirements.

NoScaleUp Pod-group-level node auto-provisioning reasons

Pod-group-level node auto-provisioning reason messages for noScaleUp events appear in the noDecisionStatus.noScaleUp.unhandledPodGroups[].napFailureReasons[] field. The message contains a reason why cluster autoscaler cannot provision a new node pool to schedule a particular Pod group.

"no.scale.up.nap.pod.gpu.no.limit.defined"
  Details: Node auto-provisioning couldn't provision any node group because a pending Pod has a GPU request, but GPU resource limits are not defined at the cluster level.
  Parameters: Requested GPU type.
  Mitigation: Review the pending Pod's GPU request, and update the cluster-level node auto-provisioning configuration for GPU limits.

"no.scale.up.nap.pod.gpu.type.not.supported"
  Details: Node auto-provisioning did not provision any node group for the Pod because it has requests for an unknown GPU type.
  Parameters: Requested GPU type.
  Mitigation: Check the pending Pod's configuration for the GPU type to ensure that it matches a supported GPU type.

"no.scale.up.nap.pod.zonal.resources.exceeded"
  Details: Node auto-provisioning did not provision any node group for the Pod in this zone because doing so would either violate the cluster-wide maximum resource limits or exceed the available resources in the zone, or because no machine type could fit the request.
  Parameters: Name of the considered zone.
  Mitigation: Review and update cluster-wide maximum resource limits, the Pod resource requests, or the available zones for node auto-provisioning.

"no.scale.up.nap.pod.zonal.failing.predicates"
  Details: Node auto-provisioning did not provision any node group for the Pod in this zone because of failing predicates.
  Parameters: Name of the considered zone and reasons why predicates failed.
  Mitigation: Review the pending Pod's requirements, such as affinity rules, taints, tolerations, or resource requirements.

Conduct further investigation

The following sections provide guidance on how to use Logs Explorer and gcpdiag to gain additional insights into your errors.

Investigate errors in Logs Explorer

If you want to further investigate your error message, view logs specific to your error:

  1. In the Google Cloud console, go to the Logs Explorer page.

    Go to Logs Explorer

  2. In the query pane, enter the following query:

    resource.type="k8s_cluster"
    log_id("container.googleapis.com/cluster-autoscaler-visibility")
    jsonPayload.resultInfo.results.errorMsg.messageId="ERROR_MESSAGE"
    

    Replace ERROR_MESSAGE with the message that you want to investigate. For example, scale.up.error.out.of.resources.

  3. Click Run query.

Debug some errors with gcpdiag

gcpdiag is an open source tool created with support from Google Cloud technical engineers. It isn't an officially supported Google Cloud product.

If you've experienced one of the following error messages, you can use gcpdiag to help troubleshoot the issue:

  • scale.up.error.out.of.resources
  • scale.up.error.quota.exceeded
  • scale.up.error.waiting.for.instances.timeout
  • scale.up.error.ip.space.exhausted
  • scale.up.error.service.account.deleted
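
A minimal invocation sketch, assuming you run gcpdiag from Cloud Shell or another environment where it's installed (PROJECT_ID is a placeholder):

    gcpdiag lint --project=PROJECT_ID

The lint command runs gcpdiag's rules, including GKE-related checks, against the project and reports any findings.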

For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.

Resolve complex scale up errors

The following sections offer guidance on resolving errors where the mitigations involve multiple steps and errors that don't have a cluster autoscaler event message associated with them.

Issue: Pod doesn't fit on node

A Pod can only be scheduled on a node that has sufficient resources, such as GPUs, memory, and storage, to meet the Pod's requirements, and cluster autoscaler only adds a node if the Pod would fit on it. To determine if this is why cluster autoscaler didn't scale up, compare the Pod's resource requests with the resources that your node pools provide.

The following example shows you how to check CPU resources, but the same steps apply to GPUs, memory, and storage. To compare CPU requests with the CPUs that your machine types provide, complete the following steps:

  1. In the Google Cloud console, go to the Workloads page.

    Go to Workloads

  2. Click the PodUnschedulable error message.

  3. In the Details pane, click the name of the Pod. If there are multiple Pods, start with the first Pod and repeat the following process for each Pod.

  4. In the Pod details page, go to the Events tab.

  5. From the Events tab, go to the YAML tab.

  6. Make a note of each container's resource requests in the Pod to find the total resource requests. For example, in the following Pod configuration, the Pod needs 2 vCPUs:

    resources:
      limits:
        cpu: "3"
      requests:
        cpu: "2"
    
  7. View the node pool details for the cluster with the unschedulable Pod:

    1. In the Google Cloud console, go to the Kubernetes clusters page.

      Go to Kubernetes clusters

    2. Click the name of the cluster that has the Pods unschedulable error message.

    3. In the Cluster details page, go to the Nodes tab.

  8. In the Node pools section, make note of the value in the Machine type column. For example, n1-standard-1.

  9. Compare the resource request with the vCPUs provided by the machine type. For example, if a Pod requests 2 vCPUs, but the available nodes have the n1-standard-1 machine type, the nodes would only have 1 vCPU. With a configuration like this, cluster autoscaler wouldn't trigger scale up because even if it added a new node, this Pod wouldn't fit on it. If you want to know more about available machine types, see Machine families resource and comparison guide in the Compute Engine documentation.

Also keep in mind that the allocatable resources of a node are less than the total resources, as a portion is needed to run system components. To learn more about how this is calculated, see Node allocatable resources.
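
You can also make this comparison with kubectl. The following sketch prints each container's CPU request in a Pod and a node's allocatable CPU; POD_NAME, NAMESPACE, and NODE_NAME are placeholders:

    # Print each container's CPU request in the Pod.
    kubectl get pod POD_NAME -n NAMESPACE \
        -o jsonpath='{range .spec.containers[*]}{.name}: {.resources.requests.cpu}{"\n"}{end}'

    # Print a node's allocatable CPU, which is less than its total capacity.
    kubectl get node NODE_NAME -o jsonpath='{.status.allocatable.cpu}{"\n"}'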

To resolve this issue, decide whether the resource requests defined for the workload are suitable for your needs. If the requests are accurate, create a node pool with a machine type that can support the Pod's request, as shown in the sketch that follows. If the Pod's resource requests aren't accurate, update the Pod's definition so that the Pods can fit on existing nodes.
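
The following command sketch adds a node pool with a larger machine type; the pool name, machine type, location, and autoscaling bounds are placeholder values to adapt:

    gcloud container node-pools create POOL_NAME \
        --cluster=CLUSTER_NAME \
        --location=LOCATION \
        --machine-type=n1-standard-4 \
        --enable-autoscaling --min-nodes=0 --max-nodes=3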

Issue: Unhealthy clusters preventing scale up

Cluster autoscaler might not scale up if it considers the cluster to be unhealthy. Cluster unhealthiness isn't based on the health of the control plane, but on the proportion of nodes that are healthy and ready. If more than 45% of the nodes in a cluster are unhealthy or not ready, cluster autoscaler halts all operations.

If this is why your cluster autoscaler isn't scaling up, there is an event on the cluster autoscaler ConfigMap with the type Warning and ClusterUnhealthy listed as the reason.

To view the ConfigMap, run the following command:

kubectl describe configmap cluster-autoscaler-status -n kube-system

To resolve this issue, decrease the number of unhealthy nodes.

It's also possible that some of the nodes are ready, though not considered ready by cluster autoscaler. This happens when a taint with the prefix ignore-taint.cluster-autoscaler.kubernetes.io/ is present on a node. Cluster autoscaler considers a node to be NotReady as long as that taint is present.

If the behavior is caused by a taint with the ignore-taint.cluster-autoscaler.kubernetes.io/ prefix, remove the taint.
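
The following sketch shows one way to find nodes that have such a taint and remove it with kubectl; NODE_NAME and the TAINT_NAME suffix are placeholders for whatever exists in your cluster:

    # List each node's taint keys so you can spot the ignore-taint prefix.
    kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

    # Remove the taint (the trailing "-" deletes it).
    kubectl taint nodes NODE_NAME ignore-taint.cluster-autoscaler.kubernetes.io/TAINT_NAME-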

What's next