Troubleshooting system metrics


This page shows you how to resolve system metrics-related issues on your Google Kubernetes Engine (GKE) clusters.

If you need additional assistance, reach out to Cloud Customer Care.

Confirm that the metrics agent has sufficient memory

In most cases, the default allocation of resources to the GKE metrics agent is sufficient. However, if the agent's Pods crash repeatedly, check the termination reason by following these steps:

  1. Get the names of the GKE metrics agent Pods:

    kubectl get pods -n kube-system -l component=gke-metrics-agent
    

    Find the Pod with the status CrashLoopBackOff.

    The output is similar to the following:

    NAME                    READY STATUS           RESTARTS AGE
    gke-metrics-agent-5857x 0/1   CrashLoopBackOff 6        12m
    
  2. Describe the Pod that has the status CrashLoopBackOff:

    kubectl describe pod POD_NAME -n kube-system
    

    Replace POD_NAME with the name of the Pod from the previous step.

    If the termination reason of the Pod is OOMKilled, the agent needs additional memory.

    The output is similar to the following:

      containerStatuses:
      ...
      lastState:
        terminated:
          ...
          exitCode: 1
          finishedAt: "2021-11-22T23:36:32Z"
          reason: OOMKilled
          startedAt: "2021-11-22T23:35:54Z"
    
  3. Add a node label to the node with the failing metrics agent. You can use either a persistent or temporary node label. We recommend that you start by adding 20 MB. If the agent keeps crashing, apply the label again with a value that requests more additional memory.

    To update a node pool with a persistent label, run the following command:

    gcloud container node-pools update NODEPOOL_NAME \
        --cluster=CLUSTER_NAME \
        --node-labels=ADDITIONAL_MEMORY_NODE_LABEL \
        --location=COMPUTE_LOCATION
    

    Replace the following:

    • NODEPOOL_NAME: the name of the node pool.
    • CLUSTER_NAME: the name of the existing cluster.
    • ADDITIONAL_MEMORY_NODE_LABEL: one of the following additional memory node labels:
      • To add 10 MB: cloud.google.com/gke-metrics-agent-scaling-level=10
      • To add 20 MB: cloud.google.com/gke-metrics-agent-scaling-level=20
      • To add 50 MB: cloud.google.com/gke-metrics-agent-scaling-level=50
      • To add 100 MB: cloud.google.com/gke-metrics-agent-scaling-level=100
      • To add 200 MB: cloud.google.com/gke-metrics-agent-scaling-level=200
      • To add 500 MB: cloud.google.com/gke-metrics-agent-scaling-level=500
    • COMPUTE_LOCATION: the Compute Engine location of the cluster.

    Alternatively, you can add a temporary node label that won't persist after an upgrade by using the following command:

    kubectl label node/NODE_NAME \
        ADDITIONAL_MEMORY_NODE_LABEL --overwrite
    

    Replace the following:

    • NODE_NAME: the name of the node of the affected metrics agent.
    • ADDITIONAL_MEMORY_NODE_LABEL: one of the additional memory node labels; use one of the values from the preceding example.
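
The preceding steps can be partly scripted. The following is a minimal sketch, not part of GKE tooling: the helper names crashlooping_pods and next_scaling_level are hypothetical. It assumes that the STATUS value is the third column of the kubectl get pods output, as in the sample output in step 1, and that the only valid scaling levels are the six values listed in step 3.

```shell
# Hypothetical helpers for illustration only; they are not GKE commands.

# Print the names of Pods whose STATUS column is CrashLoopBackOff.
# Reads `kubectl get pods` style output on stdin (NAME READY STATUS ...).
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Print the next larger scaling level from the documented set.
# Expects a numeric argument. Returns non-zero if no larger level exists.
next_scaling_level() {
  local current="$1" level
  for level in 10 20 50 100 200 500; do
    if [ "$level" -gt "$current" ]; then
      echo "$level"
      return 0
    fi
  done
  return 1
}

# Example usage against a live cluster (requires kubectl access):
#   kubectl get pods -n kube-system -l component=gke-metrics-agent --no-headers \
#       | crashlooping_pods
#
# Read only the last termination reason and the node name of a Pod:
#   kubectl get pod POD_NAME -n kube-system \
#       -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
#   kubectl get pod POD_NAME -n kube-system -o jsonpath='{.spec.nodeName}'
#
# If the agent still crashes at level 20, pick the next level and relabel:
#   kubectl label node/NODE_NAME \
#       "cloud.google.com/gke-metrics-agent-scaling-level=$(next_scaling_level 20)" \
#       --overwrite
```

The helpers only filter text and choose a value, so you can review each kubectl command before you run it. Relabeling with kubectl is temporary, as described in step 3. For a label that persists after an upgrade, pass the same label to the gcloud container node-pools update command instead.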
