This page describes the decisions the Google Kubernetes Engine (GKE) cluster autoscaler makes about autoscaling.
The GKE cluster autoscaler emits visibility events, which are available as log entries in Cloud Logging.
The events described in this guide are separate from the Kubernetes events produced by the cluster autoscaler.
Availability requirements
The ability to view logged events for cluster autoscaler is available in the following cluster versions:
Event type | Cluster version
---|---
`status`, `scaleUp`, `scaleDown`, `eventResult` | 1.15.4-gke.7 and later
`nodePoolCreated`, `nodePoolDeleted` | 1.15.4-gke.18 and later
`noScaleUp` | 1.16.6-gke.3 and later
`noScaleDown` | 1.16.8-gke.2 and later
To see autoscaler events, you must enable Cloud Logging in your cluster. The events won't be produced if Logging is disabled.
Viewing events
The visibility events for the cluster autoscaler are stored in a Cloud Logging log, in the same project where your GKE cluster is located. You can also view these events from the notifications in the Google Kubernetes Engine page in the Google Cloud console.
Viewing visibility event logs
To view the logs, perform the following:

1. In the Google Cloud console, go to the Kubernetes Clusters page.
2. Select the name of your cluster to view its Cluster Details page.
3. On the Cluster Details page, click the Logs tab.
4. On the Logs tab, click the Autoscaler Logs tab to view the logs.
5. (Optional) To apply more advanced filters to narrow the results, click the button with the arrow on the right side of the page to view the logs in Logs Explorer.
Viewing visibility event notifications
To view the visibility event notifications on the Google Kubernetes Engine page, perform the following:

1. Go to the Google Kubernetes Engine page in the Google Cloud console.
2. Check the Notifications column for specific clusters to find notifications related to scaling.
3. Click the notification for detailed information, recommended actions, and to access the logs for this event.
Types of events
All logged events use the JSON format and can be found in the `jsonPayload` field of a log entry. All timestamps in the events are UNIX second timestamps.
Here's a summary of the types of events emitted by the cluster autoscaler:
Event type | Description
---|---
`status` | Occurs periodically and describes the actual and target sizes of all autoscaled node pools as observed by the cluster autoscaler.
`scaleUp` | Occurs when the cluster autoscaler scales the cluster up.
`scaleDown` | Occurs when the cluster autoscaler scales the cluster down.
`eventResult` | Occurs when a scaleUp or a scaleDown event completes successfully or unsuccessfully.
`nodePoolCreated` | Occurs when the cluster autoscaler with node auto-provisioning enabled creates a new node pool.
`nodePoolDeleted` | Occurs when the cluster autoscaler with node auto-provisioning enabled deletes a node pool.
`noScaleUp` | Occurs when there are unschedulable Pods in the cluster and the cluster autoscaler cannot scale the cluster up to accommodate them.
`noScaleDown` | Occurs when there are nodes that are blocked from being deleted by the cluster autoscaler.
Status event
A `status` event is emitted periodically and describes the actual and target sizes of all autoscaled node pools as observed by the cluster autoscaler.
Example
The following log sample shows a `status` event:
{
"status": {
"autoscaledNodesCount": 4,
"autoscaledNodesTarget": 4,
"measureTime": "1582898536"
}
}
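Because each event arrives as the `jsonPayload` of a log entry, it is straightforward to decode programmatically. The following is a minimal sketch (Python chosen for illustration; the field names come from the example above, the variable names are ours) that converts the `measureTime` UNIX second timestamp and checks whether the observed node count matches the target:

```python
import json
from datetime import datetime, timezone

# The `status` payload from the example above (the jsonPayload of the log entry).
payload = json.loads("""
{
  "status": {
    "autoscaledNodesCount": 4,
    "autoscaledNodesTarget": 4,
    "measureTime": "1582898536"
  }
}
""")

status = payload["status"]
# measureTime is a UNIX second timestamp encoded as a string.
measured_at = datetime.fromtimestamp(int(status["measureTime"]), tz=timezone.utc)
# When the actual count matches the target, no scaling activity is pending.
in_sync = status["autoscaledNodesCount"] == status["autoscaledNodesTarget"]
print(measured_at.isoformat(), "in sync:", in_sync)
```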
ScaleUp event
A `scaleUp` event is emitted when the cluster autoscaler scales the cluster up. The autoscaler increases the size of the cluster's node pools by scaling up the underlying Managed Instance Groups (MIGs) for the node pools. To learn more about how scale up works, see How does scale up work? in the Kubernetes Cluster Autoscaler FAQ.

The event contains information on which MIGs were scaled up, by how many nodes, and which unschedulable Pods triggered the event.

The list of triggering Pods is truncated to 50 arbitrary entries. The actual number of triggering Pods can be found in the `triggeringPodsTotalCount` field.
Example
The following log sample shows a `scaleUp` event:
{
"decision": {
"decideTime": "1582124907",
"eventId": "ed5cb16d-b06f-457c-a46d-f75dcca1f1ee",
"scaleUp": {
"increasedMigs": [
{
"mig": {
"name": "test-cluster-default-pool-a0c72690-grp",
"nodepool": "default-pool",
"zone": "us-central1-c"
},
"requestedNodes": 1
}
],
"triggeringPods": [
{
"controller": {
"apiVersion": "apps/v1",
"kind": "ReplicaSet",
"name": "test-85958b848b"
},
"name": "test-85958b848b-ptc7n",
"namespace": "default"
}
],
"triggeringPodsTotalCount": 1
}
}
}
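As a sketch of how an event like this might be consumed (Python for illustration; the JSON field names are from the example above, helper names are ours), the following extracts the scaled MIGs and accounts for the 50-entry truncation of `triggeringPods`:

```python
import json

# The scaleUp decision payload from the example above.
payload = json.loads("""
{
  "decision": {
    "decideTime": "1582124907",
    "eventId": "ed5cb16d-b06f-457c-a46d-f75dcca1f1ee",
    "scaleUp": {
      "increasedMigs": [
        {
          "mig": {
            "name": "test-cluster-default-pool-a0c72690-grp",
            "nodepool": "default-pool",
            "zone": "us-central1-c"
          },
          "requestedNodes": 1
        }
      ],
      "triggeringPods": [
        {
          "controller": {"apiVersion": "apps/v1", "kind": "ReplicaSet", "name": "test-85958b848b"},
          "name": "test-85958b848b-ptc7n",
          "namespace": "default"
        }
      ],
      "triggeringPodsTotalCount": 1
    }
  }
}
""")

scale_up = payload["decision"]["scaleUp"]
# Which MIGs grew, and by how many nodes.
increases = {m["mig"]["name"]: m["requestedNodes"] for m in scale_up["increasedMigs"]}
# The triggeringPods list is truncated to 50 entries, so trust the total count field.
truncated = scale_up["triggeringPodsTotalCount"] > len(scale_up["triggeringPods"])
```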
ScaleDown event
A `scaleDown` event is emitted when the cluster autoscaler scales the cluster down. To learn more about how scale down works, see How does scale down work? in the Kubernetes Cluster Autoscaler FAQ.

The `cpuRatio` and `memRatio` fields describe the CPU and memory utilization of the node, as a percentage. This utilization is the sum of Pod requests divided by node allocatable, not real utilization.

The list of evicted Pods is truncated to 50 arbitrary entries. The actual number of evicted Pods can be found in the `evictedPodsTotalCount` field.
Use the following query to verify that the cluster autoscaler scaled down nodes:
resource.type="k8s_cluster" \
resource.labels.location=COMPUTE_REGION \
resource.labels.cluster_name=CLUSTER_NAME \
log_id("container.googleapis.com/cluster-autoscaler-visibility") \
( "decision" NOT "noDecisionStatus" )
Replace the following:

- `CLUSTER_NAME`: the name of the cluster.
- `COMPUTE_REGION`: the cluster's Compute Engine region, such as `us-central1`.
Example
The following log sample shows a `scaleDown` event:
{
"decision": {
"decideTime": "1580594665",
"eventId": "340dac18-8152-46ff-b79a-747f70854c81",
"scaleDown": {
"nodesToBeRemoved": [
{
"evictedPods": [
{
"controller": {
"apiVersion": "apps/v1",
"kind": "ReplicaSet",
"name": "kube-dns-5c44c7b6b6"
},
"name": "kube-dns-5c44c7b6b6-xvpbk"
}
],
"evictedPodsTotalCount": 1,
"node": {
"cpuRatio": 23,
"memRatio": 5,
"mig": {
"name": "test-cluster-default-pool-c47ef39f-grp",
"nodepool": "default-pool",
"zone": "us-central1-f"
},
"name": "test-cluster-default-pool-c47ef39f-p395"
}
}
]
}
}
}
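A minimal sketch of reading this event (Python for illustration; field names come from the example above, variable names are ours) that collects each removed node with its request-based utilization ratios:

```python
import json

# The scaleDown decision payload from the example above.
payload = json.loads("""
{
  "decision": {
    "decideTime": "1580594665",
    "eventId": "340dac18-8152-46ff-b79a-747f70854c81",
    "scaleDown": {
      "nodesToBeRemoved": [
        {
          "evictedPods": [
            {
              "controller": {"apiVersion": "apps/v1", "kind": "ReplicaSet", "name": "kube-dns-5c44c7b6b6"},
              "name": "kube-dns-5c44c7b6b6-xvpbk"
            }
          ],
          "evictedPodsTotalCount": 1,
          "node": {
            "cpuRatio": 23,
            "memRatio": 5,
            "mig": {
              "name": "test-cluster-default-pool-c47ef39f-grp",
              "nodepool": "default-pool",
              "zone": "us-central1-f"
            },
            "name": "test-cluster-default-pool-c47ef39f-p395"
          }
        }
      ]
    }
  }
}
""")

removed = []
for removal in payload["decision"]["scaleDown"]["nodesToBeRemoved"]:
    node = removal["node"]
    # cpuRatio and memRatio are percentages of Pod requests over node
    # allocatable, not real usage.
    removed.append((node["name"], node["cpuRatio"], node["memRatio"]))
```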
You can also view scale-down events for nodes that are running no workloads (typically only system Pods created by DaemonSets). Use the following query to see these event logs:
resource.type="k8s_cluster" \
resource.labels.project_id=PROJECT_ID \
resource.labels.location=COMPUTE_REGION \
resource.labels.cluster_name=CLUSTER_NAME \
severity>=DEFAULT \
logName="projects/PROJECT_ID/logs/events" \
("Scale-down: removing empty node")
Replace the following:

- `PROJECT_ID`: your project ID.
- `CLUSTER_NAME`: the name of the cluster.
- `COMPUTE_REGION`: the cluster's Compute Engine region, such as `us-central1`.
EventResult event
An `eventResult` event is emitted when a scaleUp or a scaleDown event completes successfully or unsuccessfully. The event contains a list of event IDs (from the `eventId` field in scaleUp or scaleDown events), along with error messages. An empty error message indicates that the event completed successfully. The eventResult events are aggregated in the `results` field.

To diagnose errors, consult the ScaleUp errors and ScaleDown errors sections.
Example
The following log sample shows an `eventResult` event:
{
"resultInfo": {
"measureTime": "1582878896",
"results": [
{
"eventId": "2fca91cd-7345-47fc-9770-838e05e28b17"
},
{
"errorMsg": {
"messageId": "scale.down.error.failed.to.delete.node.min.size.reached",
"parameters": [
"test-cluster-default-pool-5c90f485-nk80"
]
},
"eventId": "ea2e964c-49b8-4cd7-8fa9-fefb0827f9a6"
}
]
}
}
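Because an empty or absent `errorMsg` signals success, separating successful from failed results is a one-pass filter. A minimal sketch (Python for illustration; field names come from the example above, variable names are ours):

```python
import json

# The eventResult payload from the example above.
payload = json.loads("""
{
  "resultInfo": {
    "measureTime": "1582878896",
    "results": [
      {
        "eventId": "2fca91cd-7345-47fc-9770-838e05e28b17"
      },
      {
        "errorMsg": {
          "messageId": "scale.down.error.failed.to.delete.node.min.size.reached",
          "parameters": ["test-cluster-default-pool-5c90f485-nk80"]
        },
        "eventId": "ea2e964c-49b8-4cd7-8fa9-fefb0827f9a6"
      }
    ]
  }
}
""")

succeeded, failed = [], []
for result in payload["resultInfo"]["results"]:
    # An absent (or empty) errorMsg means the originating scaleUp or
    # scaleDown event completed successfully.
    if result.get("errorMsg"):
        failed.append((result["eventId"], result["errorMsg"]["messageId"]))
    else:
        succeeded.append(result["eventId"])
```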
NodePoolCreated event
A `nodePoolCreated` event is emitted when the cluster autoscaler with node auto-provisioning enabled creates a new node pool. The event contains the name of the created node pool and a list of its underlying MIGs. If the node pool was created because of a scaleUp event, the `eventId` of the corresponding scaleUp event is included in the `triggeringScaleUpId` field.
Example
The following log sample shows a `nodePoolCreated` event:
{
"decision": {
"decideTime": "1585838544",
"eventId": "822d272c-f4f3-44cf-9326-9cad79c58718",
"nodePoolCreated": {
"nodePools": [
{
"migs": [
{
"name": "test-cluster-nap-n1-standard--b4fcc348-grp",
"nodepool": "nap-n1-standard-1-1kwag2qv",
"zone": "us-central1-f"
},
{
"name": "test-cluster-nap-n1-standard--jfla8215-grp",
"nodepool": "nap-n1-standard-1-1kwag2qv",
"zone": "us-central1-c"
}
],
"name": "nap-n1-standard-1-1kwag2qv"
}
],
"triggeringScaleUpId": "d25e0e6e-25e3-4755-98eb-49b38e54a728"
}
}
}
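The following sketch maps each created node pool to the zones of its underlying MIGs and picks up the triggering scaleUp event, if any (Python for illustration; the payload is trimmed from the example above to only the fields used, and variable names are ours):

```python
# Trimmed nodePoolCreated payload based on the example above (only the fields
# used here are shown).
decision = {
    "eventId": "822d272c-f4f3-44cf-9326-9cad79c58718",
    "nodePoolCreated": {
        "nodePools": [
            {
                "name": "nap-n1-standard-1-1kwag2qv",
                "migs": [
                    {"name": "test-cluster-nap-n1-standard--b4fcc348-grp", "zone": "us-central1-f"},
                    {"name": "test-cluster-nap-n1-standard--jfla8215-grp", "zone": "us-central1-c"},
                ],
            }
        ],
        "triggeringScaleUpId": "d25e0e6e-25e3-4755-98eb-49b38e54a728",
    },
}

created = decision["nodePoolCreated"]
# Map each new node pool to the zones of its underlying MIGs.
pool_zones = {p["name"]: [m["zone"] for m in p["migs"]] for p in created["nodePools"]}
# triggeringScaleUpId is present only when the pool was created for a scaleUp event.
trigger = created.get("triggeringScaleUpId")
```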
NodePoolDeleted event
A `nodePoolDeleted` event is emitted when the cluster autoscaler with node auto-provisioning enabled deletes a node pool.
Example
The following log sample shows a `nodePoolDeleted` event:
{
"decision": {
"decideTime": "1585830461",
"eventId": "68b0d1c7-b684-4542-bc19-f030922fb820",
"nodePoolDeleted": {
"nodePoolNames": [
"nap-n1-highcpu-8-ydj4ewil"
]
}
}
}
NoScaleUp event
A `noScaleUp` event is periodically emitted when there are unschedulable Pods in the cluster and the cluster autoscaler cannot scale the cluster up to accommodate them.

- noScaleUp events are best-effort; that is, these events do not cover all possible reasons why the cluster autoscaler cannot scale up.
- noScaleUp events are throttled to limit the produced log volume. Each persisting reason is only emitted every couple of minutes.
- All of the reasons can be arbitrarily split across multiple events. For example, there is no guarantee that all rejected MIG reasons for a single Pod group will appear in the same event.
- The list of unhandled Pod groups is truncated to 50 arbitrary entries. The actual number of unhandled Pod groups can be found in the `unhandledPodGroupsTotalCount` field.
Reason fields
The following fields help to explain why scaling up did not occur:

- `reason`: provides a global reason why the cluster autoscaler is prevented from scaling up. Refer to the NoScaleUp top-level reasons section for details.
- `napFailureReason`: provides a global reason preventing the cluster autoscaler from provisioning additional node pools (for example, node auto-provisioning is disabled). Refer to the NoScaleUp top-level node auto-provisioning reasons section for details.
- `skippedMigs[].reason`: provides information about why a particular MIG was skipped. The cluster autoscaler skips some MIGs from consideration for any Pod during a scale-up attempt (for example, because adding another node would exceed cluster-wide resource limits). Refer to the NoScaleUp MIG-level reasons section for details.
- `unhandledPodGroups`: contains information about why a particular group of unschedulable Pods does not trigger scaling up. The Pods are grouped by their immediate controller. Pods without a controller are in groups by themselves. Each Pod group contains an arbitrary example Pod and the number of Pods in the group, as well as the following reasons:
  - `napFailureReasons`: reasons why the cluster autoscaler cannot provision a new node pool to accommodate this Pod group (for example, the Pods have affinity constraints). Refer to the NoScaleUp Pod-level node auto-provisioning reasons section for details.
  - `rejectedMigs[].reason`: per-MIG reasons why the cluster autoscaler cannot increase the size of a particular MIG to accommodate this Pod group (for example, the MIG's nodes are too small for the Pods). Refer to the NoScaleUp MIG-level reasons section for details.
Example
The following log sample shows a `noScaleUp` event:
{
"noDecisionStatus": {
"measureTime": "1582523362",
"noScaleUp": {
"skippedMigs": [
{
"mig": {
"name": "test-cluster-nap-n1-highmem-4-fbdca585-grp",
"nodepool": "nap-n1-highmem-4-1cywzhvf",
"zone": "us-central1-f"
},
"reason": {
"messageId": "no.scale.up.mig.skipped",
"parameters": [
"max cluster cpu limit reached"
]
}
}
],
"unhandledPodGroups": [
{
"napFailureReasons": [
{
"messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
"parameters": [
"us-central1-f"
]
}
],
"podGroup": {
"samplePod": {
"controller": {
"apiVersion": "v1",
"kind": "ReplicationController",
"name": "memory-reservation2"
},
"name": "memory-reservation2-6zg8m",
"namespace": "autoscaling-1661"
},
"totalPodCount": 1
},
"rejectedMigs": [
{
"mig": {
"name": "test-cluster-default-pool-b1808ff9-grp",
"nodepool": "default-pool",
"zone": "us-central1-f"
},
"reason": {
"messageId": "no.scale.up.mig.failing.predicate",
"parameters": [
"NodeResourcesFit",
"Insufficient memory"
]
}
}
]
}
],
"unhandledPodGroupsTotalCount": 1
}
}
}
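To see at a glance which MIGs rejected each unhandled Pod group and why, the nested structure can be flattened. A minimal sketch (Python for illustration; the payload is trimmed from the example above to only the fields used, and variable names are ours):

```python
# Trimmed noScaleUp payload based on the example above (only the fields used here).
no_scale_up = {
    "skippedMigs": [
        {
            "mig": {"name": "test-cluster-nap-n1-highmem-4-fbdca585-grp"},
            "reason": {"messageId": "no.scale.up.mig.skipped",
                       "parameters": ["max cluster cpu limit reached"]},
        }
    ],
    "unhandledPodGroups": [
        {
            "podGroup": {"samplePod": {"name": "memory-reservation2-6zg8m"},
                         "totalPodCount": 1},
            "rejectedMigs": [
                {
                    "mig": {"name": "test-cluster-default-pool-b1808ff9-grp"},
                    "reason": {"messageId": "no.scale.up.mig.failing.predicate",
                               "parameters": ["NodeResourcesFit", "Insufficient memory"]},
                }
            ],
        }
    ],
    "unhandledPodGroupsTotalCount": 1,
}

# For each unhandled Pod group, list which MIGs rejected it and why.
rejections = []
for group in no_scale_up["unhandledPodGroups"]:
    sample = group["podGroup"]["samplePod"]["name"]
    for rejected in group["rejectedMigs"]:
        reason = rejected["reason"]
        rejections.append((sample, rejected["mig"]["name"],
                           reason["messageId"], tuple(reason["parameters"])))
```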
NoScaleDown event
A `noScaleDown` event is periodically emitted when there are nodes that are blocked from being deleted by the cluster autoscaler.

- Nodes that cannot be removed because their utilization is high are not included in noScaleDown events.
- noScaleDown events are best-effort; that is, these events do not cover all possible reasons why the cluster autoscaler cannot scale down.
- noScaleDown events are throttled to limit the produced log volume. Each persisting reason is only emitted every couple of minutes.
- The list of nodes is truncated to 50 arbitrary entries. The actual number of nodes can be found in the `nodesTotalCount` field.
Reason fields
The following fields help to explain why scaling down did not occur:

- `reason`: provides a global reason why the cluster autoscaler is prevented from scaling down (for example, a backoff period after recently scaling up). Refer to the NoScaleDown top-level reasons section for details.
- `nodes[].reason`: provides per-node reasons why the cluster autoscaler is prevented from deleting a particular node (for example, there's no place to move the node's Pods to). Refer to the NoScaleDown node-level reasons section for details.
Example
The following log sample shows a `noScaleDown` event:
{
"noDecisionStatus": {
"measureTime": "1582858723",
"noScaleDown": {
"nodes": [
{
"node": {
"cpuRatio": 42,
"mig": {
"name": "test-cluster-default-pool-f74c1617-grp",
"nodepool": "default-pool",
"zone": "us-central1-c"
},
"name": "test-cluster-default-pool-f74c1617-fbhk"
},
"reason": {
"messageId": "no.scale.down.node.no.place.to.move.pods"
}
}
],
"nodesTotalCount": 1,
"reason": {
"messageId": "no.scale.down.in.backoff"
}
}
}
}
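A noScaleDown event carries both a cluster-wide reason and per-node reasons, which you can separate when processing the payload. A minimal sketch (Python for illustration; the payload is trimmed from the example above, and variable names are ours):

```python
# Trimmed noScaleDown payload based on the example above.
no_scale_down = {
    "nodes": [
        {
            "node": {"name": "test-cluster-default-pool-f74c1617-fbhk", "cpuRatio": 42},
            "reason": {"messageId": "no.scale.down.node.no.place.to.move.pods"},
        }
    ],
    "nodesTotalCount": 1,
    "reason": {"messageId": "no.scale.down.in.backoff"},
}

# The top-level reason applies to the whole cluster; per-node reasons explain
# why individual nodes are kept.
top_level_reason = no_scale_down["reason"]["messageId"]
blocked_nodes = {n["node"]["name"]: n["reason"]["messageId"]
                 for n in no_scale_down["nodes"]}
```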
Troubleshooting scaling issues
This section provides guidance for how to troubleshoot scaling events.
Cluster not scaling up
Scenario: I created a Pod in my cluster but it's stuck in the Pending state for the past hour. Cluster autoscaler did not provision any new nodes to accommodate the Pod.
Solution:
1. In the Logs Explorer, find the logging details for cluster autoscaler events, as described in the Viewing events section.
2. Search for `scaleUp` events that contain the desired Pod in the `triggeringPods` field. You can filter the log entries, including filtering by a particular JSON field value. Learn more in Advanced logs queries.
3. Find an `eventResult` event that contains the same `eventId` as the `scaleUp` event.
4. Look at the `errorMsg` field and consult the list of possible scaleUp error messages.

   ScaleUp error example: For a `scaleUp` event, you discover the error is `"scale.up.error.quota.exceeded"`, which indicates that "a scaleUp event failed because some of the MIGs could not be increased due to exceeded quota". To resolve the issue, you review your quota settings and increase the settings that are close to being exceeded. The cluster autoscaler adds a new node and the Pod is scheduled.
5. Otherwise, search for `noScaleUp` events and review the following fields:

   - `unhandledPodGroups`: contains information about the Pod (or the Pod's controller).
   - `reason`: provides global reasons indicating that scaling up could be blocked.
   - `skippedMigs`: provides reasons why some MIGs might be skipped.

   Refer to the following sections that contain possible reasons for noScaleUp events:

   - NoScaleUp top-level reasons
   - NoScaleUp top-level node auto-provisioning reasons
   - NoScaleUp MIG-level reasons
   - NoScaleUp Pod-group-level node auto-provisioning reasons

   NoScaleUp example: You found a `noScaleUp` event for your Pod, and all MIGs in the `rejectedMigs` field have the same reason message ID of `"no.scale.up.mig.failing.predicate"` with two parameters: `"NodeAffinity"` and `"node(s) did not match node selector"`. After consulting the list of error messages, you discover that you "cannot scale up a MIG because a predicate failed for it"; the parameters are the name of the failing predicate and the reason why it failed. To resolve the issue, you review the Pod spec and discover that it has a node selector that doesn't match any MIG in the cluster. You delete the selector from the Pod spec and recreate the Pod. The cluster autoscaler adds a new node and the Pod is scheduled.
6. If there are no `noScaleUp` events, use other debugging methods to resolve the issue.
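Once you have decoded the matching log entries, searching for the scaleUp events triggered by a specific Pod amounts to a simple filter over the `jsonPayload` values. A minimal sketch (Python; the function name and sample data are ours, the field names come from the event examples on this page, and the event IDs are shortened for illustration):

```python
def scale_up_events_for_pod(payloads, pod_name):
    """Return eventIds of scaleUp events triggered by the named Pod."""
    matches = []
    for payload in payloads:
        scale_up = payload.get("decision", {}).get("scaleUp")
        if not scale_up:
            continue
        if any(p["name"] == pod_name for p in scale_up.get("triggeringPods", [])):
            matches.append(payload["decision"]["eventId"])
    return matches

# Two decoded payloads: one scaleUp decision and one noScaleUp status.
sample_payloads = [
    {"decision": {"eventId": "ed5cb16d",
                  "scaleUp": {"triggeringPods": [{"name": "test-85958b848b-ptc7n"}]}}},
    {"noDecisionStatus": {"noScaleUp": {}}},
]
```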
Cluster not scaling down
Scenario: I have a node in my cluster that has utilized only 10% of its CPU and memory for the past couple of days. Despite the low utilization, cluster autoscaler did not delete the node as expected.
Solution:
1. In the Logs Explorer, find the logging details for cluster autoscaler events, as described in the Viewing events section.
2. Search for `scaleDown` events that contain the desired node in the `nodesToBeRemoved` field. You can filter the log entries, including filtering by a particular JSON field value. Learn more in Advanced logs queries.
3. For the `scaleDown` event, search for an `eventResult` event that contains the associated `eventId`.
4. Look at the `errorMsg` field and consult the list of possible scaleDown error messages.
5. Otherwise, search for `noScaleDown` events that have the desired node in the `nodes` field. Review the `reason` field for any global reasons indicating that scaling down could be blocked.

   Refer to the following sections that contain possible reasons for noScaleDown events:

   - NoScaleDown top-level reasons
   - NoScaleDown node-level reasons

   NoScaleDown example: You found a `noScaleDown` event that contains a per-node reason for your node. The message ID is `"no.scale.down.node.pod.has.local.storage"` and there is a single parameter: `"test-single-pod"`. After consulting the list of error messages, you discover this means that the "Pod is blocking scale down because it requests local storage". You consult the Kubernetes Cluster Autoscaler FAQ and find out that the solution is to add a `"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"` annotation to the Pod. After applying the annotation, the cluster autoscaler scales down the cluster correctly.
6. If there are no `noScaleDown` events, use other debugging methods to resolve the issue.
Messages
The events emitted by the cluster autoscaler use parameterized messages to provide explanations for the event. The `parameters` field accompanies the `messageId` field, as in the example log for a NoScaleUp event.

This section provides descriptions for the various `messageId` values and their corresponding parameters. However, this section does not contain all possible messages, and may be extended at any time.
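One way to use the tables below programmatically is to keep a lookup of message descriptions and render each message together with its parameters. A minimal sketch (Python; the description mapping and helper name are ours, populated here with only two of the messageIds documented on this page):

```python
# Assumed lookup: messageId -> short human-readable description, built from the
# tables on this page (only two entries shown for illustration).
DESCRIPTIONS = {
    "scale.up.error.quota.exceeded":
        "MIGs could not be increased due to exceeded Compute Engine quota",
    "no.scale.down.node.no.place.to.move.pods":
        "node runs a Pod that cannot be moved to another node",
}

def describe(error_msg):
    """Combine a messageId with its parameters into one readable line."""
    base = DESCRIPTIONS.get(error_msg["messageId"], error_msg["messageId"])
    params = error_msg.get("parameters", [])
    return f"{base} ({', '.join(params)})" if params else base
```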
ScaleUp errors
Error messages for `scaleUp` events are found in the corresponding `eventResult` event, in the `resultInfo.results[].errorMsg` field.
Message | Description | Mitigation
---|---|---
`"scale.up.error.out.of.resources"` | The scaleUp event failed because some of the MIGs could not be increased due to lack of resources. Parameters: failing MIG IDs. | Follow the resource availability troubleshooting steps.
`"scale.up.error.quota.exceeded"` | The scaleUp event failed because some of the MIGs could not be increased due to exceeded Compute Engine quota. Parameters: failing MIG IDs. | Check the Errors tab of the MIG in the Google Cloud console to see what quota is being exceeded. Follow the instructions to request a quota increase.
`"scale.up.error.waiting.for.instances.timeout"` | The scaleUp event failed because instances in some of the MIGs failed to appear in time. Parameters: failing MIG IDs. | This message is transient. If it persists, engage Google Cloud Support for further investigation.
`"scale.up.error.ip.space.exhausted"` | The scaleUp event failed because the cluster doesn't have enough unallocated IP address space to add new nodes or Pods. Parameters: failing MIG IDs. | Refer to the troubleshooting steps to address the lack of IP address space for the nodes or Pods.
`"scale.up.error.service.account.deleted"` | The scaleUp event failed because a service account used by the cluster autoscaler has been deleted. Parameters: failing MIG IDs. | Engage Google Cloud Support for further investigation.
ScaleDown errors
Error messages for `scaleDown` events are found in the corresponding `eventResult` event, in the `resultInfo.results[].errorMsg` field.
Message | Description | Mitigation
---|---|---
`"scale.down.error.failed.to.mark.to.be.deleted"` | The scaleDown event failed because a node could not be marked for deletion. Parameters: failing node name. | This message is transient. If it persists, engage Google Cloud Support for further investigation.
`"scale.down.error.failed.to.evict.pods"` | The scaleDown event failed because some of the Pods could not be evicted from a node. Parameters: failing node name. | Review best practices for Pod Disruption Budgets to ensure that the rules allow for eviction of application replicas when acceptable.
`"scale.down.error.failed.to.delete.node.min.size.reached"` | The scaleDown event failed because a node could not be deleted because the cluster is already at minimum size. Parameters: failing node name. | Review the minimum value set for node pool autoscaling and adjust the settings as necessary.
Reasons for a NoScaleUp event
NoScaleUp top-level reasons
Top-level reason messages for `noScaleUp` events appear in the `noDecisionStatus.noScaleUp.reason` field. The message contains a top-level reason why the cluster autoscaler cannot scale the cluster up.

Message | Description | Mitigation
---|---|---
`"no.scale.up.in.backoff"` | A noScaleUp occurred because scaling up is in a backoff period (temporarily blocked). This is a transient message that may occur during scale-up events with a large number of Pods. | If this message persists, engage Google Cloud Support for further investigation.
NoScaleUp top-level node auto-provisioning reasons
Top-level node auto-provisioning reason messages for `noScaleUp` events appear in the `noDecisionStatus.noScaleUp.napFailureReason` field. The message contains a top-level reason why the cluster autoscaler cannot provision new node pools.

Message | Description | Mitigation
---|---|---
`"no.scale.up.nap.disabled"` | Node auto-provisioning is not enabled at the cluster level. If node auto-provisioning is disabled, new nodes are not automatically provisioned when the pending Pod has requirements that can't be satisfied by any existing node pool. | Review the cluster configuration and see Enabling node auto-provisioning.
NoScaleUp MIG-level reasons
MIG-level reason messages for `noScaleUp` events appear in the `noDecisionStatus.noScaleUp.skippedMigs[].reason` and `noDecisionStatus.noScaleUp.unhandledPodGroups[].rejectedMigs[].reason` fields. The message contains a reason why the cluster autoscaler cannot increase the size of a particular MIG.

Message | Description | Mitigation
---|---|---
`"no.scale.up.mig.skipped"` | Cannot scale up a MIG because it was skipped during the simulation. Parameters: human-readable reasons why it was skipped (for example, missing a Pod requirement). | Review the parameters included in the error message and address why the MIG was skipped.
`"no.scale.up.mig.failing.predicate"` | Cannot scale up a MIG because it does not meet the predicate requirements for the pending Pods. Parameters: name of the failing predicate, human-readable reasons why it failed. | Review Pod requirements, such as affinity rules, taints or tolerations, and resource requirements.
NoScaleUp Pod-group-level node auto-provisioning reasons
Pod-group-level node auto-provisioning reason messages for `noScaleUp` events appear in the `noDecisionStatus.noScaleUp.unhandledPodGroups[].napFailureReasons[]` field. The message contains a reason why the cluster autoscaler cannot provision a new node pool to accommodate a particular Pod group.

Message | Description | Mitigation
---|---|---
`"no.scale.up.nap.pod.gpu.no.limit.defined"` | Node auto-provisioning could not provision any node group because a pending Pod has a GPU request, but GPU resource limits are not defined at the cluster level. Parameters: requested GPU type. | Review the pending Pod's GPU request, and update the cluster-level node auto-provisioning configuration for GPU limits.
`"no.scale.up.nap.pod.gpu.type.not.supported"` | Node auto-provisioning did not provision any node group for the Pod because it has requests for an unknown GPU type. Parameters: requested GPU type. | Check the pending Pod's configuration for the GPU type to ensure that it matches a supported GPU type.
`"no.scale.up.nap.pod.zonal.resources.exceeded"` | Node auto-provisioning did not provision any node group for the Pod in this zone because doing so would either violate the cluster-wide maximum resource limits or exceed the available resources in the zone, or because there is no machine type that could fit the request. Parameters: name of the considered zone. | Review and update the cluster-wide maximum resource limits, the Pod resource requests, or the available zones for node auto-provisioning.
`"no.scale.up.nap.pod.zonal.failing.predicates"` | Node auto-provisioning did not provision any node group for the Pod in this zone because of failing predicates. Parameters: name of the considered zone, human-readable reasons why predicates failed. | Review the pending Pod's requirements, such as affinity rules, taints, tolerations, or resource requirements.
Reasons for a NoScaleDown event
NoScaleDown top-level reasons
Top-level reason messages for `noScaleDown` events appear in the `noDecisionStatus.noScaleDown.reason` field. The message contains a top-level reason why the cluster autoscaler cannot scale the cluster down.

Message | Description | Mitigation
---|---|---
`"no.scale.down.in.backoff"` | A noScaleDown event occurred because scaling down is in a backoff period (temporarily blocked). This event should be transient, and may occur when there has been a recent scale-up event. | Follow the mitigation steps associated with the lower-level reasons for failure to scale down. When the underlying reasons are resolved, the cluster autoscaler exits backoff. If the message persists after addressing the underlying reasons, engage Google Cloud Support for further investigation.
`"no.scale.down.in.progress"` | A noScaleDown event occurred because scaling down is blocked until the previous node scheduled for removal is deleted. | This event should be transient, as the Pod will eventually be forcibly removed. If this message occurs frequently, review the `gracefulTerminationPeriod` value for the Pods blocking scale down. To speed up the resolution, you can also forcibly delete a Pod that is no longer needed.
NoScaleDown node-level reasons
Node-level reason messages for `noScaleDown` events appear in the `noDecisionStatus.noScaleDown.nodes[].reason` field. The message contains a reason why the cluster autoscaler cannot remove a particular node.

Message | Description | Mitigation
---|---|---
`"no.scale.down.node.scale.down.disabled.annotation"` | The node cannot be removed because it has a `scale-down-disabled` annotation. | Review the annotation that is preventing scale down by following the instructions in the Kubernetes Cluster Autoscaler FAQ.
`"no.scale.down.node.node.group.min.size.reached"` | The node cannot be removed because its node group is already at its minimum size. | Review and adjust the minimum value set for node pool autoscaling.
`"no.scale.down.node.minimal.resource.limits.exceeded"` | Scale down of an underutilized node is blocked because it would violate cluster-wide minimum resource limits set for node auto-provisioning. | Review the cluster-wide minimum resource limits.
`"no.scale.down.node.no.place.to.move.pods"` | Scale down of an underutilized node is blocked because it is running a Pod that can't be moved to another node in the cluster. | If you expect the Pod to be rescheduled, review the scheduling requirements of the Pods on the underutilized node to determine whether they can be moved to another node in the cluster. This message is expected if you do not expect the Pod to be rescheduled, because there are no other nodes on which it could be scheduled.
`"no.scale.down.node.pod.not.backed.by.controller"` | A Pod is blocking scale down of an underutilized node because the Pod doesn't have a controller known to the Kubernetes Cluster Autoscaler (ReplicationController, DaemonSet, Job, StatefulSet, or ReplicaSet). Learn more from the Kubernetes Cluster Autoscaler FAQ about what types of Pods can prevent the cluster autoscaler from removing a node. Parameters: name of the blocking Pod. | Set the annotation `"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"` on the Pod, or define a controller (ReplicationController, DaemonSet, Job, StatefulSet, or ReplicaSet).
`"no.scale.down.node.pod.has.local.storage"` | A Pod is blocking scale down because it requests local storage. Learn more from the Kubernetes Cluster Autoscaler FAQ about what types of Pods can prevent the cluster autoscaler from removing a node. Parameters: name of the blocking Pod. | Set the annotation `"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"` on the Pod if the data in the Pod's local storage is not critical.
`"no.scale.down.node.pod.not.safe.to.evict.annotation"` | A Pod is blocking scale down because it has a "not safe to evict" annotation. See the Kubernetes Cluster Autoscaler FAQ for more details. Parameters: name of the blocking Pod. | If the Pod can be safely evicted, update the annotation to `"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"`.
`"no.scale.down.node.pod.kube.system.unmovable"` | A Pod is blocking scale down because it is a non-DaemonSet, non-mirrored Pod without a PodDisruptionBudget in the kube-system namespace. Parameters: name of the blocking Pod. | Follow the instructions in the Kubernetes Cluster Autoscaler FAQ to set a PodDisruptionBudget that enables the cluster autoscaler to move Pods in the kube-system namespace.
`"no.scale.down.node.pod.not.enough.pdb"` | A Pod is blocking scale down because it doesn't have enough PodDisruptionBudget left. See the Kubernetes Cluster Autoscaler FAQ for more details. Parameters: name of the blocking Pod. | Review the PodDisruptionBudget for the Pod and see the best practices for PodDisruptionBudgets. You may be able to resolve the message by scaling the application or by changing the PodDisruptionBudget to allow for more unavailable Pods.
`"no.scale.down.node.pod.controller.not.found"` | A Pod is blocking scale down because its controller (for example, a Deployment or ReplicaSet) can't be found. | Review the logs to determine what actions were taken that left a Pod running after its controller was removed. To resolve the issue, you can manually delete the Pod.
`"no.scale.down.node.pod.unexpected.error"` | Scale down of an underutilized node is blocked because it has a Pod in an unexpected error state. | Engage Google Cloud Support for further investigation.
What's next
- Learn more about cluster autoscaler.
- Learn how to use node auto-provisioning.
- Learn about troubleshooting and resolving scaling issues.