When you experience a problem with one of your clusters, you can get help from Cloud Customer Care. Customer Care may ask you to take a 'snapshot' of the cluster, which they can use to diagnose the problem. A snapshot captures cluster and node configuration files, and packages that information into a single tar file.
This document describes how to create default snapshots or more customized snapshots of a cluster. It also explains how to create snapshots when a cluster is experiencing particular errors.
Default snapshots
The following sections describe what's in a standard snapshot and how to create one. For information about customized snapshots, see the section Customized snapshots.
What information does a default snapshot contain?
The snapshot of a cluster is a tar file of configuration files and logs about the cluster. Specifically, the default configuration of the command captures the following information about your cluster:
Kubernetes version
Status of Kubernetes resources in the kube-system and gke-system namespaces: cluster, machine, nodes, Services, Endpoints, ConfigMaps, ReplicaSets, CronJobs, Pods, and the owners of those Pods, including Deployments, DaemonSets, and StatefulSets
Details about each node configuration including IP addresses, iptables rules, mount points, file system, network connections, and running processes
Logs from the bmctl check cluster --snapshot command
A cluster's credential information is not included in the default snapshot. If Cloud Customer Care requests that information, see Retrieving cluster information.
For a comprehensive list of the information collected when you run the snapshot command, see the configuration file shown in the section The configuration file in detail. This configuration file shows which commands are run when taking a default snapshot.
How to create a default snapshot
The bmctl check cluster command takes a snapshot of a cluster. You can use this command to perform either of the following actions:
* Create a snapshot and automatically upload that snapshot to a Cloud Storage bucket.
* Create a snapshot of a cluster and save the snapshot file on the local machine on which you are running the command.
Method #1: create default snapshot and automatically upload to Cloud Storage bucket
To create and upload a snapshot to a Cloud Storage bucket, do the following:
Set up the API and service account:
- Enable the Cloud Storage API in your Google Cloud project.
- Grant the storage.admin role to the service account so that the service account can upload data to Cloud Storage.
- Download the JSON key for the service account.
See Enabling Google services and service accounts for details.
Run the following bmctl command to create and automatically upload a snapshot to a Cloud Storage bucket:
bmctl check cluster --snapshot --cluster=CLUSTER_NAME \
    --kubeconfig=KUBECONFIG_PATH \
    --upload-to BUCKET_NAME \
    [--service-account-key-file SERVICE_ACCOUNT_KEY_FILE]
In the command, replace the following entries with information specific to your cluster environment:
- CLUSTER_NAME: the name of the cluster you want to take a snapshot of.
- KUBECONFIG_PATH: the path to the admin cluster kubeconfig file. The path to the kubeconfig file is usually bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME-kubeconfig. However, if you specified your workspace with the WORKSPACE_DIR flag, the path is WORKSPACE_DIR/CLUSTER_NAME/CLUSTER_NAME-kubeconfig.
- BUCKET_NAME: the name of a Cloud Storage bucket you own.
- SERVICE_ACCOUNT_KEY_FILE: the path to the service account's JSON key file. If you don't provide the --service-account-key-file flag, bmctl tries to get the path to the key file from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
Grant Cloud Customer Care read access to the bucket containing the snapshot:
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member=serviceAccount:service-PROJECT_ID@anthos-support.iam.gserviceaccount.com \
    --role=roles/storage.objectViewer
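If you upload snapshots regularly, you can supply the service account key through the environment instead of passing --service-account-key-file on every run. A minimal sketch, assuming a hypothetical key path:

```shell
# bmctl falls back to GOOGLE_APPLICATION_CREDENTIALS when the
# --service-account-key-file flag is omitted. The path below is a
# hypothetical example; point it at the JSON key you downloaded.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/snapshot-sa-key.json"
echo "$GOOGLE_APPLICATION_CREDENTIALS"
```

With the variable exported, the upload command can be run without the optional flag.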
Method #2: create default snapshot on local machine
You can capture the state of your created clusters with the following command:
bmctl check cluster --snapshot --cluster=CLUSTER_NAME \
--kubeconfig=KUBECONFIG_PATH
Replace the following:
CLUSTER_NAME: the name of the target cluster.
KUBECONFIG_PATH: the path to the admin cluster kubeconfig file. The path to the kubeconfig file is usually bmctl-workspace/CLUSTER_NAME/CLUSTER_NAME-kubeconfig. However, if you specified your workspace with the WORKSPACE_DIR flag, the path is WORKSPACE_DIR/CLUSTER_NAME/CLUSTER_NAME-kubeconfig.
This command outputs a tar file to your local machine. The name of this tar file
is in the form snapshot-CLUSTER_NAME-TIMESTAMP.tar.gz
, where TIMESTAMP
indicates the date and time the file was created. This tar file includes
relevant debug information about a cluster's system components and machines.
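Before sending a snapshot to Customer Care, you can list the tarball's contents without extracting it. The sketch below builds a stand-in archive purely so the listing can be demonstrated; with a real snapshot you would substitute the file name that bmctl printed:

```shell
# Stand-in archive that mimics a snapshot tarball (real snapshots are named
# snapshot-CLUSTER_NAME-TIMESTAMP.tar.gz; this file name is a demo).
SNAPSHOT=snapshot-demo.tar.gz
mkdir -p demo/nodes
echo "uptime output" > demo/nodes/10.200.0.3.log
tar -czf "$SNAPSHOT" demo

# List the archive's contents without extracting it.
tar -tzf "$SNAPSHOT"
```

Listing first lets you confirm the archive contains what you expect (and nothing you'd rather not share) before uploading it.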
When you execute this command, information is gathered about Pods from the following namespaces: gke-system, gke-connect, capi-system, capi-webhook-system, cert-manager, and capi-kubeadm-bootstrap-system.
However, you can widen the scope of the diagnostic information collected by
using the flag --snapshot-scenario all
. This flag increases the scope of the
diagnostic snapshot to include all the Pods in a cluster:
bmctl check cluster --snapshot --snapshot-scenario all \
--cluster=CLUSTER_NAME \
--kubeconfig=KUBECONFIG_PATH
Customized snapshots
You might want to create a customized snapshot of a cluster for the following reasons:
- To include more information about your cluster than what's provided in the default snapshot.
- To exclude some information that's in the default snapshot.
- To take a snapshot of a cluster when you encounter the admin cluster is unreachable error. For details, see the section How to create a customized snapshot when the admin cluster is unreachable.
How to create a customized snapshot
Creating a customized snapshot requires the use of a snapshot configuration file. The following steps explain how to create the configuration file, modify it, and use it to create a customized snapshot of a cluster:
Create a snapshot configuration file by running the following command on your cluster and writing the output to a file:
bmctl check cluster --snapshot --snapshot-dry-run --cluster CLUSTER_NAME --kubeconfig KUBECONFIG_PATH
Define what kind of information you want to appear in your customized snapshot. To do that, modify the snapshot configuration file that you created in step 1. For example, if you want the snapshot to contain additional information, such as how long a particular node has been running, add the Linux command uptime to the relevant section of the configuration file. The following snippet of a configuration file shows how to make the snapshot command provide uptime information about node 10.200.0.3. This information doesn't appear in a standard snapshot.
...
nodeCommands:
- nodes:
  - 10.200.0.3
  commands:
  - uptime
...
Once you have modified the configuration file to define what kind of snapshot you desire, create the customized snapshot by running the following command:
bmctl check cluster --snapshot --snapshot-config SNAPSHOT_CONFIG_FILE --cluster CLUSTER_NAME --kubeconfig KUBECONFIG_PATH
The --snapshot-config flag directs the bmctl command to use the contents of the snapshot configuration file to define what information appears in the snapshot.
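As an illustration of step 2, the uptime customization can be staged from the shell. This is only a sketch: snapshot-config.yaml is a hypothetical file name, and in practice you would edit the dry-run output from step 1 rather than write the file from scratch:

```shell
# Write a minimal nodeCommands section that adds uptime for one node.
# In practice, edit the file produced by --snapshot-dry-run instead of
# creating a new one; this stand-in just shows the shape of the change.
cat > snapshot-config.yaml <<'EOF'
nodeCommands:
- nodes:
  - 10.200.0.3
  commands:
  - uptime
EOF

# Confirm the extra command made it into the config.
grep -n 'uptime' snapshot-config.yaml
```

The resulting file is what you pass to --snapshot-config in step 3.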
The configuration file in detail
The following sample snapshot configuration file shows the standard commands and files used for creating a snapshot, but you can add more commands and files when additional diagnostic information is needed:
numOfParallelThreads: 10
excludeWords:
- password
nodeCommands:
- nodes:
- 10.200.0.3
- 10.200.0.4
commands:
- uptime
- df --all --inodes
- ip addr
- ip neigh
- iptables-save --counters
- mount
- ip route list table all
- top -bn1 || true
- docker info || true
- docker ps -a || true
- crictl ps -a || true
- docker ps -a | grep anthos-baremetal-haproxy | cut -d ' ' -f1 | head -n 1 | xargs
sudo docker logs || true
- docker ps -a | grep anthos-baremetal-keepalived | cut -d ' ' -f1 | head -n 1 |
xargs sudo docker logs || true
- crictl ps -a | grep anthos-baremetal-haproxy | cut -d ' ' -f1 | head -n 1 | xargs
sudo crictl logs || true
- crictl ps -a | grep anthos-baremetal-keepalived | cut -d ' ' -f1 | head -n 1 |
xargs sudo crictl logs || true
- ps -edF
- ps -eo pid,tid,ppid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:14,comm,args,cgroup
- conntrack --count
- dmesg
- systemctl status -l docker || true
- journalctl --utc -u docker
- journalctl --utc -u docker-monitor.service
- systemctl status -l kubelet
- journalctl --utc -u kubelet
- journalctl --utc -u kubelet-monitor.service
- journalctl --utc --boot --dmesg
- journalctl --utc -u node-problem-detector
- systemctl status -l containerd || true
- journalctl --utc -u containerd
- systemctl status -l docker.haproxy || true
- journalctl --utc -u docker.haproxy
- systemctl status -l docker.keepalived || true
- journalctl --utc -u docker.keepalived
- systemctl status -l container.haproxy || true
- journalctl --utc -u container.haproxy
- systemctl status -l container.keepalived || true
- journalctl --utc -u container.keepalived
nodeFiles:
- nodes:
- 10.200.0.3
- 10.200.0.4
files:
- /proc/sys/fs/file-nr
- /proc/sys/net/netfilter/nf_conntrack_max
- /proc/sys/net/ipv4/conf/all/rp_filter
- /lib/systemd/system/kubelet.service
- /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
- /lib/systemd/system/docker.service || true
- /etc/systemd/system/containerd.service || true
- /etc/docker/daemon.json || true
- /etc/containerd/config.toml || true
- /etc/systemd/system/container.keepalived.service || true
- /etc/systemd/system/container.haproxy.service || true
- /etc/systemd/system/docker.keepalived.service || true
- /etc/systemd/system/docker.haproxy.service || true
nodeSSHKey: ~/.ssh/id_rsa # path to your ssh key file
The following entries in your configuration file likely differ from the ones appearing in the preceding sample configuration file:
- The IP addresses of nodes in the nodeCommands and nodeFiles sections
- The path to your SSH key file specified in nodeSSHKey
Fields in the configuration file
A snapshot configuration file is in YAML format. The configuration file includes the following fields:
- numOfParallelThreads: the snapshot routine typically runs numerous commands. Multiple parallel threads help the routine execute faster. We recommend that you set numOfParallelThreads to 10 as shown in the preceding sample configuration file. If your snapshots take too long, increase this value.
- excludeWords: the snapshot contains a large quantity of data for your cluster nodes. Use excludeWords to reduce security risks when you share your snapshot. For example, exclude password so that corresponding password strings can't be identified.
- nodeCommands: this section specifies the following information:
  - nodes: a list of IP addresses for the cluster nodes from which you want to collect information. To create a snapshot when the admin cluster is not reachable, specify at least one node IP address.
  - commands: a list of commands (and arguments) to run on each node. The output of each command is included in the snapshot.
- nodeFiles: this section specifies the following information:
  - nodes: a list of IP addresses of cluster nodes from which you want to collect files. To create a snapshot when the admin cluster is not reachable, specify at least one node IP address.
  - files: a list of files to retrieve from each node. When the specified files are found on a node, they are included in the snapshot.
- nodeSSHKey: path to your SSH key file. When the admin cluster is unreachable, this field is required.
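To make the effect of excludeWords concrete: lines matching an excluded word are kept out of the shared snapshot. The sed pipeline below is only a sketch of that idea, not bmctl's actual implementation:

```shell
# Redact any line containing "password" before it would land in a snapshot.
# This mimics the intent of excludeWords; bmctl's own filtering may differ.
REDACTED=$(printf 'uptime: 12 days\npassword=hunter2\n' \
  | sed 's/.*password.*/[REDACTED]/')
echo "$REDACTED"
```

The sensitive line is replaced wholesale, so the secret string never appears in the shared output.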
Creating snapshots when experiencing particular errors
How to create a default snapshot during stalled installs or upgrades
When installing or upgrading admin, hybrid, or standalone clusters, bmctl can sometimes stall at points where the following outputs can be seen:
- Waiting for cluster kubeconfig to become ready.
- Waiting for cluster to become ready.
- Waiting for node pools to become ready.
- Waiting for upgrade to complete.
If you experience a stalled install or upgrade, you can nonetheless take a snapshot of a cluster, using the bootstrap cluster, by running the following command:
bmctl check cluster --snapshot --cluster=CLUSTER_NAME \
--kubeconfig=WORKSPACE_DIR/.kindkubeconfig
How to create a customized snapshot during stalled installs or upgrades
The following steps show how to create a customized snapshot of a cluster when an install or upgrade is stalled:
Retrieve from your archives a snapshot configuration file of the cluster.
Modify the snapshot configuration file so that the snapshot contains the information you want.
Create the customized snapshot by running the following command:
bmctl check cluster --snapshot --snapshot-config=SNAPSHOT_CONFIG_FILE --cluster=CLUSTER_NAME --kubeconfig=WORKSPACE_DIR/.kindkubeconfig
How to create a customized snapshot when the admin cluster is unreachable
If your cluster is producing an error stating that the admin cluster is unreachable, you can't take a default snapshot of the cluster. That's because the default bmctl command tries to, among other things, retrieve information from the admin cluster. When the default command attempts to retrieve information from an unreachable admin cluster, the snapshot command fails.
Therefore, when the admin cluster is unreachable, you should take a customized snapshot of the cluster rather than a default snapshot. That way, you can create a customized snapshot that doesn't request information from a faulty admin cluster.
The following steps show how to create a customized snapshot of a cluster when the admin cluster is unreachable:
Retrieve from your archives a snapshot configuration file of the cluster.
In the nodes section, list the IP addresses of nodes for which you want information, but make sure to exclude the IP address of the admin cluster node.
Create the customized snapshot by running the following command:
bmctl check cluster --snapshot --snapshot-config=SNAPSHOT_CONFIG_FILE --cluster=CLUSTER_NAME --kubeconfig=KUBECONFIG_PATH
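Step 2 above can be sketched as a quick filter over the nodes list. The admin node IP used here (10.200.0.2) is an assumed example, and grep -v is a blunt instrument, so review the result before running bmctl:

```shell
# Remove the (assumed) admin node IP from a nodes list so the snapshot
# doesn't try to contact the unreachable admin cluster.
NODES=$(printf -- '- 10.200.0.2\n- 10.200.0.3\n- 10.200.0.4\n' \
  | grep -v '10\.200\.0\.2')
echo "$NODES"
```

Paste the filtered list back into the nodes section of your snapshot configuration file, then run the customized snapshot command.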
Collect logs for ingress or Cloud Service Mesh issues
The bmctl snapshots don't contain information to troubleshoot ingress or Cloud Service Mesh problems. For instructions on how to collect the relevant diagnostic logs, see Collecting Cloud Service Mesh logs in the Cloud Service Mesh documentation.