Create snapshots to help diagnose cluster problems

When you experience a problem with one of your clusters, you can get help from Cloud Customer Care. Customer Care may ask you to take a 'snapshot' of the cluster, which they can use to diagnose the problem. A snapshot captures cluster and node configuration files, and packages that information into a single tar file.

This document describes how to create default snapshots or more customized snapshots of a cluster. It also explains how to create snapshots when a cluster is experiencing particular errors.

Default snapshots

The following sections describe what's in a standard snapshot and how to create one. For information about customized snapshots, see the section Customized snapshots.

What information does a default snapshot contain?

The snapshot of a cluster is a tar file of configuration files and logs about the cluster. Specifically, the default configuration of the bmctl check cluster --snapshot command captures the following information about your cluster:

  • Kubernetes version

  • Status of Kubernetes resources in the kube-system and gke-system namespaces: cluster, machine, nodes, Services, Endpoints, ConfigMaps, ReplicaSets, CronJobs, Pods, and the owners of those Pods, including Deployments, DaemonSets, and StatefulSets

  • Details about each node configuration, including IP addresses, iptables rules, mount points, file system, network connections, and running processes

  • Logs from the bmctl check cluster --snapshot command

A cluster's credential information is not included in the default snapshot. If Cloud Customer Care requests that information, see Retrieving cluster information.

For a comprehensive list of the information collected when you run the snapshot command, see the configuration file shown in the section The configuration file in detail. This configuration file shows which commands are run when taking a default snapshot.

How to create a default snapshot

The bmctl check cluster command takes a snapshot of a cluster. You can use this command to perform either of the following actions:

  • Create a snapshot and automatically upload it to a Cloud Storage bucket.
  • Create a snapshot of a cluster and save the snapshot file on the local machine on which you run the command.

Method #1: create default snapshot and automatically upload to Cloud Storage bucket

To create and upload a snapshot to a Cloud Storage bucket, do the following:

  1. Set up API and service account:

    1. Enable the Cloud Storage API within your Google Cloud project.
    2. Grant a storage.admin role to the service account so that the service account can upload data to Cloud Storage.
    3. Download the JSON key for the service account.

    See Enabling Google services and service accounts for details.

  2. Run the following bmctl command to create and automatically upload a snapshot to a Cloud Storage bucket:

    bmctl check cluster --snapshot --cluster=CLUSTER_NAME \
    --kubeconfig=KUBECONFIG_PATH \
    --upload-to BUCKET_NAME \
    [--service-account-key-file SERVICE_ACCOUNT_KEY_FILE]
    

    In the command, replace the following entries with information specific to your cluster environment:

    • CLUSTER_NAME: the name of the cluster you want to take a snapshot of.
    • KUBECONFIG_PATH: the path to the admin cluster kubeconfig file. (The path to the kubeconfig file is usually bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME-kubeconfig. However, if you specified your workspace with the WORKSPACE_DIR flag, the path is WORKSPACE_DIR/CLUSTER_NAME/CLUSTER_NAME-kubeconfig).
    • BUCKET_NAME: the name of a Cloud Storage bucket you own.
    • SERVICE_ACCOUNT_KEY_FILE: the path to the service account's JSON key file. If you don't provide the --service-account-key-file flag, bmctl tries to get the path to the service account's key file from the environment variable GOOGLE_APPLICATION_CREDENTIALS.
  3. Grant Cloud Customer Care read access to the bucket containing the snapshot:

    gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member=serviceAccount:service-PROJECT_ID@anthos-support.iam.gserviceaccount.com \
    --role=roles/storage.objectViewer

    In the command, replace PROJECT_ID with the ID of your Google Cloud project.
    

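The key-file fallback described in step 2 can be sketched as follows. This is an illustrative shell snippet, not a bmctl invocation; the key path is hypothetical, and it only models the documented lookup order (flag first, then GOOGLE_APPLICATION_CREDENTIALS):

```shell
# Sketch of the lookup order: a value passed via --service-account-key-file
# takes precedence; otherwise bmctl falls back to the
# GOOGLE_APPLICATION_CREDENTIALS environment variable.
flag_value=""                                   # imagine the flag was omitted
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/snapshot-sa.json"  # hypothetical path
key_file="${flag_value:-$GOOGLE_APPLICATION_CREDENTIALS}"
echo "using key file: $key_file"
```

If you rely on the environment variable, export it in the same shell session in which you run bmctl.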
Method #2: create default snapshot on local machine

You can capture the state of an existing cluster with the following command:

bmctl check cluster --snapshot --cluster=CLUSTER_NAME \
    --kubeconfig=KUBECONFIG_PATH

Replace the following:

  • CLUSTER_NAME: the name of the target cluster.

  • KUBECONFIG_PATH: the path to the admin cluster kubeconfig file. (The path to the kubeconfig file is usually bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME-kubeconfig. However, if you specified your workspace with the WORKSPACE_DIR flag, the path is WORKSPACE_DIR/CLUSTER_NAME/CLUSTER_NAME-kubeconfig).

This command outputs a tar file to your local machine. The name of this tar file is in the form snapshot-CLUSTER_NAME-TIMESTAMP.tar.gz, where TIMESTAMP indicates the date and time the file was created. This tar file includes relevant debug information about a cluster's system components and machines.
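Before sharing the tar file, you can inspect its contents. The snapshot name and contents below are stand-ins created locally so that the commands are runnable as a sketch; substitute your real snapshot-CLUSTER_NAME-TIMESTAMP.tar.gz file:

```shell
# Stand-in archive mimicking a snapshot tar file (names are hypothetical).
SNAPSHOT="snapshot-my-cluster-2024-01-01-120000.tar.gz"
mkdir -p demo/nodes && echo "node data" > demo/nodes/10.200.0.3.log
tar -czf "$SNAPSHOT" demo

# List the archive contents without extracting.
tar -tzf "$SNAPSHOT"

# Extract into a scratch directory for inspection.
mkdir -p inspect && tar -xzf "$SNAPSHOT" -C inspect
ls inspect/demo/nodes
```

Listing the contents first is a quick way to confirm the snapshot captured the nodes and logs you expect before you upload or share it.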

When you execute this command, information is gathered about Pods from the following namespaces: gke-system, gke-connect, capi-system, capi-webhook-system, cert-manager, and capi-kubeadm-bootstrap-system.

However, you can widen the scope of the diagnostic information collected by using the flag --snapshot-scenario all. This flag increases the scope of the diagnostic snapshot to include all the Pods in a cluster:

bmctl check cluster --snapshot --snapshot-scenario all \
    --cluster=CLUSTER_NAME \
    --kubeconfig=KUBECONFIG_PATH

Customized snapshots

You might want to create a customized snapshot of a cluster, for example, to capture the output of additional commands or the contents of additional files on specific nodes, or to take a snapshot when the admin cluster is unreachable.

How to create a customized snapshot

Creating a customized snapshot requires the use of a snapshot configuration file. The following steps explain how to create the configuration file, modify it, and use it to create a customized snapshot of a cluster:

  1. Create a snapshot configuration file by running the following command on your cluster and writing the output to a file:

    bmctl check cluster --snapshot --snapshot-dry-run \
        --cluster CLUSTER_NAME \
        --kubeconfig KUBECONFIG_PATH
    
  2. Define what kind of information you want to appear in your customized snapshot. To do that, modify the snapshot configuration file that you created in step 1. For example, if you want the snapshot to contain additional information, such as how long a particular node has been running, add the Linux command uptime to the relevant section of the configuration file. The following snippet of a configuration file shows how to make the snapshot command provide uptime information about node 10.200.0.3. This information doesn't appear in a standard snapshot.

    ...
    nodeCommands:
    - nodes:
      - 10.200.0.3
      commands:
      - uptime
    ...
    
  3. Once you have modified the configuration file to define the snapshot you want, create the customized snapshot by running the following command:

    bmctl check cluster --snapshot \
        --snapshot-config SNAPSHOT_CONFIG_FILE \
        --cluster CLUSTER_NAME \
        --kubeconfig KUBECONFIG_PATH
    

    The --snapshot-config flag directs the bmctl command to use the contents of the snapshot configuration file to define what information appears in the snapshot.

The configuration file in detail

The following sample snapshot configuration file shows the standard commands and files used for creating a snapshot, but you can add more commands and files when additional diagnostic information is needed:

numOfParallelThreads: 10
excludeWords:
- password
nodeCommands:
- nodes:
  - 10.200.0.3
  - 10.200.0.4
  commands:
  - uptime
  - df --all --inodes
  - ip addr
  - ip neigh
  - iptables-save --counters
  - mount
  - ip route list table all
  - top -bn1 || true
  - docker info || true
  - docker ps -a || true
  - crictl ps -a || true
  - docker ps -a | grep anthos-baremetal-haproxy | cut -d ' ' -f1 | head -n 1 | xargs
    sudo docker logs || true
  - docker ps -a | grep anthos-baremetal-keepalived | cut -d ' ' -f1 | head -n 1 |
    xargs sudo docker logs || true
  - crictl ps -a | grep anthos-baremetal-haproxy | cut -d ' ' -f1 | head -n 1 | xargs
    sudo crictl logs || true
  - crictl ps -a | grep anthos-baremetal-keepalived | cut -d ' ' -f1 | head -n 1 |
    xargs sudo crictl logs || true
  - ps -edF
  - ps -eo pid,tid,ppid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm,args,cgroup
  - conntrack --count
  - dmesg
  - systemctl status -l docker || true
  - journalctl --utc -u docker
  - journalctl --utc -u docker-monitor.service
  - systemctl status -l kubelet
  - journalctl --utc -u kubelet
  - journalctl --utc -u kubelet-monitor.service
  - journalctl --utc --boot --dmesg
  - journalctl --utc -u node-problem-detector
  - systemctl status -l containerd || true
  - journalctl --utc -u containerd
  - systemctl status -l docker.haproxy || true
  - journalctl --utc -u docker.haproxy
  - systemctl status -l docker.keepalived || true
  - journalctl --utc -u docker.keepalived
  - systemctl status -l container.haproxy || true
  - journalctl --utc -u container.haproxy
  - systemctl status -l container.keepalived || true
  - journalctl --utc -u container.keepalived
nodeFiles:
- nodes:
  - 10.200.0.3
  - 10.200.0.4
  files:
  - /proc/sys/fs/file-nr
  - /proc/sys/net/netfilter/nf_conntrack_max
  - /proc/sys/net/ipv4/conf/all/rp_filter
  - /lib/systemd/system/kubelet.service
  - /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
  - /lib/systemd/system/docker.service || true
  - /etc/systemd/system/containerd.service || true
  - /etc/docker/daemon.json || true
  - /etc/containerd/config.toml || true
  - /etc/systemd/system/container.keepalived.service || true
  - /etc/systemd/system/container.haproxy.service || true
  - /etc/systemd/system/docker.keepalived.service || true
  - /etc/systemd/system/docker.haproxy.service || true
nodeSSHKey: ~/.ssh/id_rsa # path to your ssh key file

The following entries in your configuration file likely differ from the ones appearing in the sample configuration file above:

  • The IP addresses of nodes in the nodeCommands and nodeFiles sections
  • The path to your cluster's nodeSSHKey

Fields in the configuration file

A snapshot configuration file is in YAML format. The configuration file includes the following fields:

  • numOfParallelThreads: the snapshot routine typically runs numerous commands. Multiple parallel threads help the routine execute faster. We recommend that you set numOfParallelThreads to 10 as shown in the preceding sample configuration file. If your snapshots take too long, increase this value.

  • excludeWords: the snapshot contains a large quantity of data for your cluster nodes. Use excludeWords to reduce security risks when you share your snapshot. For example, exclude password so that corresponding password strings can't be identified.

  • nodeCommands: this section specifies the following information:

    • nodes: a list of IP addresses for the cluster nodes from which you want to collect information. To create a snapshot when the admin cluster is not reachable, specify at least one node IP address.

    • commands: a list of commands (and arguments) to run on each node. The output of each command is included in the snapshot.

  • nodeFiles: this section specifies the following information:

    • nodes: a list of IP addresses of cluster nodes from which you want to collect files. To create a snapshot when the admin cluster is not reachable, specify at least one node IP address.

    • files: a list of files to retrieve from each node. When the specified files are found on a node, they are included in the snapshot.

  • nodeSSHKey: path to your SSH key file. When the admin cluster is unreachable, this field is required.
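The excludeWords field described above accepts multiple entries. For example, assuming the collected output on your nodes might also contain tokens or other secret strings (illustrative entries; review what's sensitive in your own environment), you could extend the list like this:

```yaml
excludeWords:
- password
- token
- secret
```

Extending the list this way helps keep the corresponding strings out of the snapshot you share with Cloud Customer Care.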

Creating snapshots when experiencing particular errors

How to create a default snapshot during stalled installs or upgrades

When installing or upgrading admin, hybrid, or standalone clusters, bmctl can sometimes stall and display output such as the following:

  • Waiting for cluster kubeconfig to become ready.
  • Waiting for cluster to become ready.
  • Waiting for node pools to become ready.
  • Waiting for upgrade to complete.

If you experience a stalled install or upgrade, you can still take a snapshot of the cluster by using the bootstrap cluster. Run the following command:

bmctl check cluster --snapshot --cluster=CLUSTER_NAME \
    --kubeconfig=WORKSPACE_DIR/.kindkubeconfig

How to create a customized snapshot during stalled installs or upgrades

The following steps show how to create a customized snapshot of a cluster when an install or upgrade is stalled:

  1. Retrieve a snapshot configuration file for the cluster from your records, such as one you generated earlier with the --snapshot-dry-run flag.

  2. Modify the snapshot configuration file so that the snapshot contains the information you want.

  3. Create the customized snapshot by running the following command:

    bmctl check cluster --snapshot \
        --snapshot-config=SNAPSHOT_CONFIG_FILE \
        --cluster=CLUSTER_NAME \
        --kubeconfig=WORKSPACE_DIR/.kindkubeconfig
    

How to create a customized snapshot when the admin cluster is unreachable

If your cluster produces an error stating that the admin cluster is unreachable, you can't take a default snapshot. That's because the default bmctl command tries, among other things, to retrieve information from the admin cluster, and the snapshot command fails when that attempt doesn't succeed.

When the admin cluster is unreachable, therefore, take a customized snapshot rather than a default one. You can configure a customized snapshot so that it doesn't request information from the faulty admin cluster.

The following steps show how to create a customized snapshot of a cluster when the admin cluster is unreachable:

  1. Retrieve a snapshot configuration file for the cluster from your records, such as one you generated earlier with the --snapshot-dry-run flag.

  2. In the nodes lists of the configuration file, specify the IP addresses of the nodes you want information from, making sure to exclude the IP address of the unreachable admin cluster node.

  3. Create the customized snapshot by running the following command:

    bmctl check cluster --snapshot \
        --snapshot-config=SNAPSHOT_CONFIG_FILE \
        --cluster=CLUSTER_NAME \
        --kubeconfig=KUBECONFIG_PATH
    

Collect logs for ingress or Cloud Service Mesh issues

The bmctl snapshots don't contain information to troubleshoot ingress or Cloud Service Mesh problems. For instructions on how to collect the relevant diagnostic logs, see Collecting Cloud Service Mesh logs in the Cloud Service Mesh documentation.