This page describes how to back up and restore clusters created with Google Distributed Cloud. These instructions apply to all cluster types supported by Google Distributed Cloud.
Back up a cluster
The backup process has two parts. First, a snapshot is made of the etcd store. Then, the related PKI certificates are saved to a tar file. The etcd store is the Kubernetes backing store for all cluster data and contains all the Kubernetes objects and custom objects required to manage cluster state. The PKI certificates are used for authentication over TLS. This data is backed up from the cluster's control plane or, for a high-availability (HA) cluster, from one of the control planes.
We recommend that you back up your clusters regularly so that your snapshot data stays relatively current. How often to back up depends on how frequently significant changes occur in your clusters.
Make a snapshot of the etcd store
In Google Distributed Cloud, a Pod named etcd-CONTROL_PLANE_NAME
in the kube-system namespace runs etcd for that control plane. To back up the
cluster's etcd store, perform the following steps from your admin workstation:
- Use kubectl get po to identify the etcd Pod.
      kubectl --kubeconfig CLUSTER_KUBECONFIG get po -n kube-system \
          -l 'component=etcd,tier=control-plane'
  The response includes the etcd Pod name and its status.
- Use kubectl describe pod to see the containers running in the etcd Pod, including the etcd container.
      kubectl --kubeconfig CLUSTER_KUBECONFIG describe pod ETCD_POD_NAME -n kube-system
- Run a shell in the etcd container:
      kubectl --kubeconfig CLUSTER_KUBECONFIG exec -it \
          ETCD_POD_NAME --container etcd --namespace kube-system \
          -- /bin/sh
- From the shell within the etcd container, use etcdctl (version 3 of the API) to save a snapshot, snapshotDATESTAMP.db, of the etcd store.
      ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
          --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
          --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
          snapshot save snapshotDATESTAMP.db
  Replace DATESTAMP with the current date so that subsequent snapshots don't overwrite this one.
- Exit from the shell in the container and run the following command to copy the snapshot file to the admin workstation. Use the same file name that you chose in the previous step.
      kubectl --kubeconfig CLUSTER_KUBECONFIG cp \
          kube-system/ETCD_POD_NAME:snapshotDATESTAMP.db \
          --container etcd snapshotDATESTAMP.db
- Store the snapshot file in a location that is outside of the cluster and is not dependent on the cluster's operation. 
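The snapshot steps above can also be run without an interactive shell. The following is a minimal sketch rather than part of the documented procedure: the function names (snapshot_name, run, take_snapshot) and the DRY_RUN switch are illustrative assumptions, and the script assumes the etcd container provides /bin/sh, as the interactive step implies.

```shell
#!/bin/sh
# Sketch: take a date-stamped etcd snapshot and copy it to the workstation.
# Usage (placeholders as in the steps above):
#   take_snapshot CLUSTER_KUBECONFIG ETCD_POD_NAME

# Build a date-stamped file name such as snapshot20240131.db.
# An explicit stamp can be passed as $1; otherwise today's date is used.
snapshot_name() {
    printf 'snapshot%s.db\n' "${1:-$(date +%Y%m%d)}"
}

# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

take_snapshot() {
    kubeconfig=$1
    pod=$2
    file=$(snapshot_name "${3:-}")

    # Save the snapshot inside the etcd container, using the same
    # etcdctl flags as the manual step.
    run kubectl --kubeconfig "$kubeconfig" exec "$pod" \
        --container etcd --namespace kube-system -- \
        /bin/sh -c "ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
snapshot save $file" || return 1

    # Copy the snapshot from the container to the current directory.
    run kubectl --kubeconfig "$kubeconfig" cp \
        "kube-system/$pod:$file" --container etcd "$file"
}
```

Setting DRY_RUN=1 before calling take_snapshot prints the two kubectl commands instead of running them, which is a low-risk way to review them first.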
Archive the PKI certificates
The certificates to be backed up are located in the /etc/kubernetes/pki
directory of the control plane. The PKI certificates, together with the etcd
store snapshot file, are needed to recover a cluster in the event the
control plane goes down completely. The following steps create a tar file
containing the PKI certificates.
- Use ssh to connect to the cluster's control plane as root.
      ssh root@CONTROL_PLANE_NAME
- From the control plane, create a tar file, certs_backup.tar.gz, with the contents of the /etc/kubernetes/pki directory.
      tar -czvf certs_backup.tar.gz -C /etc/kubernetes/pki .
  Creating the tar file from within the control plane preserves all the certificate file permissions.
- Exit the control plane and, from the workstation, copy the tar file containing the certificates to a preferred location on the workstation.
      sudo scp root@CONTROL_PLANE_NAME:certs_backup.tar.gz BACKUP_PATH
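Before relying on the archive, it can be worth confirming that it is readable and recording a checksum so that a damaged copy can be detected later. The following is a minimal sketch, assuming tar with gzip support and sha256sum are available where the archive is stored; the function name verify_archive and the .sha256 file name are illustrative.

```shell
#!/bin/sh
# Sketch: confirm a certificate archive is readable and record its checksum.

verify_archive() {
    archive=$1

    # Listing the contents decompresses the whole archive, so a
    # truncated or corrupt file fails at this point.
    tar -tzf "$archive" > /dev/null || return 1

    # Store a checksum next to the archive so that later copies can be
    # compared against it.
    sha256sum "$archive" > "$archive.sha256"
}
```

For example, `verify_archive BACKUP_PATH/certs_backup.tar.gz` writes the checksum file, and `sha256sum -c BACKUP_PATH/certs_backup.tar.gz.sha256` rechecks the archive at restore time.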
Restore a cluster
Restoring a cluster from a backup is a last resort and should be used only when a cluster has failed catastrophically and cannot be returned to service any other way, for example, when the etcd data is corrupted or the etcd Pod is in a crash loop.
The cluster restore process has two parts. First, the PKI certificates are restored on the control plane. Then, the etcd store data is restored.
Restore PKI certificates
Assuming you have backed up PKI certificates as described in Archive the PKI certificates, the following steps describe how to restore the certificates from the tar file to a control plane.
- Copy the PKI certificates tar file, certs_backup.tar.gz, from the workstation to the cluster control plane.
      sudo scp -r BACKUP_PATH/certs_backup.tar.gz root@CONTROL_PLANE_NAME:~/
- Use ssh to connect to the cluster's control plane as root.
      ssh root@CONTROL_PLANE_NAME
- From the control plane, extract the contents of the tar file to the /etc/kubernetes/pki directory.
      tar -xzvf certs_backup.tar.gz -C /etc/kubernetes/pki/
- Exit the control plane. 
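After extraction, GNU tar can compare the archive against the files on disk, which gives a quick check that every certificate was restored. The following is a sketch, assuming GNU tar on the control plane (the --diff option is not available in every tar implementation); the function name restore_and_check is illustrative.

```shell
#!/bin/sh
# Sketch: extract a certificate archive and confirm the result matches it.
# Usage: restore_and_check certs_backup.tar.gz /etc/kubernetes/pki

restore_and_check() {
    archive=$1
    target=$2

    # -p restores the permissions recorded in the archive.
    tar -xzpf "$archive" -C "$target" || return 1

    # --diff exits non-zero if an archived file is missing from the
    # target directory or differs from the archived copy.
    tar --diff -zf "$archive" -C "$target"
}
```

A non-zero exit status means at least one file on disk does not match the backup and should be investigated before the etcd store is restored.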
Restore the etcd store
When restoring the etcd store, the process depends upon whether or not the cluster is running in high availability (HA) mode and, if so, whether or not quorum has been preserved. Use the following guidance to restore the etcd store for a given cluster failure situation:
- If the failed cluster is not running in HA mode, restore the etcd store on the control plane with the following steps. 
- If the cluster is running in HA mode and quorum is preserved, do nothing. As long as quorum is preserved, you don't need to restore failed clusters.
- If the cluster is running in HA mode and quorum is lost, repeat the following steps to restore the etcd store for each failed member. 
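In etcd, quorum means that a majority of members are healthy: floor(n/2) + 1 out of n. The following sketch restates the three cases above as shell functions. The function names are illustrative, treating a cluster with a single control plane as the non-HA case is an assumption, and the healthy-member count has to come from your own inspection of the cluster.

```shell
#!/bin/sh
# Sketch: decide which restore case applies from the member counts.

# quorum_size TOTAL: smallest number of healthy members that keeps quorum.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# restore_action TOTAL HEALTHY: print the case that applies.
restore_action() {
    total=$1
    healthy=$2
    if [ "$total" -eq 1 ]; then
        # Assumption: one control plane means the cluster is not HA.
        echo "non-HA: restore the etcd store on the control plane"
    elif [ "$healthy" -ge "$(quorum_size "$total")" ]; then
        echo "HA, quorum preserved: no restore needed"
    else
        echo "HA, quorum lost: restore the etcd store on each failed member"
    fi
}
```

With three control planes, quorum_size 3 is 2, so `restore_action 3 2` reports that quorum is preserved, while `restore_action 3 1` reports that quorum is lost.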
Follow these steps from the workstation to remove and restore the etcd store on a control plane for a failed cluster:
- Create a /backup directory in the root directory of the control plane.
      ssh root@CONTROL_PLANE_NAME "mkdir /backup"
  This step is not strictly required, but we recommend it. The following steps assume you have created a /backup directory.
- Copy the etcd snapshot file, snapshot.db, from the workstation to the /backup directory on the cluster control plane.
      sudo scp snapshot.db root@CONTROL_PLANE_NAME:/backup
- Use SSH to connect to the control plane node:
      ssh root@CONTROL_PLANE_NAME
- Stop the etcd and kube-apiserver static pods by moving their manifest files out of the /etc/kubernetes/manifests directory and into the /backup directory.
      sudo mv /etc/kubernetes/manifests/etcd.yaml /backup/etcd.yaml
      sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /backup/kube-apiserver.yaml
- Remove the etcd data directory.
      rm -rf /var/lib/etcd/
- Run etcdctl snapshot restore using docker.
      sudo docker run --rm -t \
          -v /var/lib:/var/lib \
          -v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
          -v /backup:/backup \
          --env ETCDCTL_API=3 \
          k8s.gcr.io/etcd:3.2.24 etcdctl \
          --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --cert=/etc/kubernetes/pki/etcd/server.crt \
          --key=/etc/kubernetes/pki/etcd/server.key \
          --data-dir=/var/lib/etcd \
          --name=CONTROL_PLANE_NAME \
          --initial-advertise-peer-urls=https://CONTROL_PLANE_IP:2380 \
          --initial-cluster=CONTROL_PLANE_NAME=https://CONTROL_PLANE_IP:2380 \
          snapshot restore /backup/snapshot.db
  The entries for --name, --initial-advertise-peer-urls, and --initial-cluster can be found in the etcd.yaml manifest file that was moved to the /backup directory.
- Ensure that /var/lib/etcd was recreated and that a new member is created in /var/lib/etcd/member.
- Move the etcd and kube-apiserver manifests back to the /etc/kubernetes/manifests directory so that the static pods can restart.
      sudo mv /backup/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
      sudo mv /backup/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
- Run a shell in the etcd container:
      kubectl --kubeconfig CLUSTER_KUBECONFIG exec -it \
          ETCD_POD_NAME --container etcd --namespace kube-system \
          -- /bin/sh
  Use etcdctl to confirm the added member is working properly.
      ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
          --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
          --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --endpoints=CONTROL_PLANE_IP:2379 \
          endpoint health
  If you are restoring multiple failed members, wait until all failed members have been restored, then run the command with the control plane IP addresses of all restored members in the --endpoints field.
  For example:
      ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
          --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
          --cacert=/etc/kubernetes/pki/etcd/ca.crt \
          --endpoints=10.200.0.3:2379,10.200.0.4:2379,10.200.0.5:2379 \
          endpoint health
  If the command reports success for each endpoint, your cluster should be working properly.
- Use