This page describes how to manually create and restore backups of GKE on-prem admin and user clusters' etcd key-value stores. This page also provides a script that you can use to automatically back up your clusters' etcd stores.
You should create backups for recovery from unforeseen disasters that might damage etcd data and Secrets. Store backups in a location that is outside of the cluster and that does not depend on the cluster's operation. To be safe, consider keeping a second copy of each backup as well.
While the etcd events Pod that runs in every cluster is not vital to the restoration of a user cluster, you can follow a similar process to back it up. Note that this procedure backs up only the etcd stores; PersistentVolumes are not backed up as part of this guide, so you should plan separate backup and restore procedures for them.
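As a minimal sketch of keeping a second copy, the function below copies a snapshot file to another directory and compares SHA-256 digests to confirm the copy is intact. The function name copy_backup and its arguments are hypothetical, and the sketch assumes a Linux workstation where sha256sum is available.

```shell
# Hypothetical helper: copy a backup file to a second location and verify it.
# Usage: copy_backup <source-file> <destination-directory>
copy_backup() {
  local src="$1" dest_dir="$2" dest src_sum dest_sum
  # Refuse to copy a missing or zero-length backup.
  [ -s "$src" ] || { echo "Error: $src is missing or empty" >&2; return 1; }
  mkdir -p "$dest_dir" || return 1
  dest="$dest_dir/$(basename "$src")"
  cp "$src" "$dest" || return 1
  # Compare SHA-256 digests of the original and the copy.
  src_sum=$(sha256sum "$src" | cut -d' ' -f1)
  dest_sum=$(sha256sum "$dest" | cut -d' ' -f1)
  if [ "$src_sum" != "$dest_sum" ]; then
    echo "Error: checksum mismatch for $dest" >&2
    return 1
  fi
  # Print the path of the verified copy.
  echo "$dest"
}
```

For example, copy_backup snapshot.db /mnt/offsite-backups would print the path of the verified copy, where /mnt/offsite-backups stands in for whatever external location you use.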
Restrictions
- Backing up application-specific data is out of scope for this feature.
- Secrets remain valid until you manually rotate them.
- Workloads scheduled after you create a backup aren't restored with that backup.
- Currently, you aren't able to restore from failed cluster upgrades.
- This procedure is not intended to restore a deleted cluster.
Known issues
When you run sudo commands, you might encounter the following error:
sudo: unable to resolve host gke-admin-master-[CLUSTER_ID]
If you do, add the following line to the /etc/hosts file:
127.0.0.1 gke-admin-master-[CLUSTER_ID]
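If you apply this workaround from a script, you probably want it to be idempotent. The sketch below appends the line only when it is not already present. The function name ensure_hosts_entry is hypothetical, and the file path is a parameter so that you can try it on a copy first.

```shell
# Hypothetical helper: add "127.0.0.1 <hostname>" to a hosts file once.
# Usage: ensure_hosts_entry <hosts-file> <hostname>
ensure_hosts_entry() {
  local hosts_file="$1" host="$2"
  # -x matches the whole line, -F treats the pattern as a fixed string.
  if grep -qxF "127.0.0.1 $host" "$hosts_file"; then
    return 0
  fi
  printf '127.0.0.1 %s\n' "$host" >> "$hosts_file"
}
```

Writing to /etc/hosts requires root, so on the node you would call the function from a root shell, for example ensure_hosts_entry /etc/hosts gke-admin-master-[CLUSTER_ID].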
User cluster backups
A user cluster backup contains a snapshot of the user cluster's etcd. A cluster's etcd contains, among other things, all of the Kubernetes objects and any custom objects required to manage cluster state. This snapshot contains the data required to recreate the cluster's components and workloads.
Backing up a user cluster
A user cluster's etcd is stored in its control plane node, which you can access using the admin cluster's kubeconfig.
To create a snapshot of etcd, execute the following steps:
Shell into the kube-etcd container:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] exec -it \
    -n [USER_CLUSTER_NAME] kube-etcd-0 -c kube-etcd -- bin/sh
where:
- [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.
- [USER_CLUSTER_NAME] is the name of the user cluster. Specifically, you're passing in a namespace in the admin cluster that is named after the user cluster.
From the shell, use etcdctl to create a backup named snapshot.db in the local directory:
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etcd.local.config/certificates/etcdCA.crt \
    --cert=/etcd.local.config/certificates/etcd.crt \
    --key=/etcd.local.config/certificates/etcd.key \
    snapshot save snapshot.db
Exit the container:
exit
Copy the backup out of the kube-etcd container using kubectl cp:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] cp \
    [USER_CLUSTER_NAME]/kube-etcd-0:snapshot.db [RELATIVE_DIRECTORY] -c kube-etcd
where [RELATIVE_DIRECTORY] is a path where you want to store your backup.
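Because every run of the steps above produces a file called snapshot.db, repeated backups overwrite one another unless you rename them. The following sketch checks that the copied file is not empty and moves it to a timestamped name. The function names and the naming scheme are hypothetical and not part of GKE on-prem.

```shell
# Hypothetical helper: build a name such as my-cluster_20240101T120000Z_snapshot.db.
# The second argument (a timestamp) is optional and defaults to the current UTC time.
snapshot_name() {
  local cluster="$1" stamp="${2:-$(date -u +%Y%m%dT%H%M%SZ)}"
  printf '%s_%s_snapshot.db\n' "$cluster" "$stamp"
}

# Hypothetical helper: move a copied snapshot into a backup directory
# under a timestamped name, refusing missing or zero-length files.
# Usage: archive_snapshot <snapshot-file> <cluster-name> <backup-directory>
archive_snapshot() {
  local src="$1" cluster="$2" dest_dir="$3" dest
  [ -s "$src" ] || { echo "Error: $src is missing or empty" >&2; return 1; }
  mkdir -p "$dest_dir" || return 1
  dest="$dest_dir/$(snapshot_name "$cluster")"
  # Print the new path on success.
  mv "$src" "$dest" && echo "$dest"
}
```

After kubectl cp completes, a call such as archive_snapshot [RELATIVE_DIRECTORY]/snapshot.db [USER_CLUSTER_NAME] [RELATIVE_DIRECTORY]/archive keeps each snapshot under its own name.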
Restoring a user cluster backup
Before you restore a backup, be sure to diagnose your cluster and resolve existing issues. Restoring a backup to a problematic cluster might recreate or exacerbate issues. Contact the GKE on-prem support team for further assistance on restoring your clusters.
If you created an HA user cluster, you should run these steps once per etcd cluster member. You can use the same snapshot when restoring each etcd member. Follow these steps only if all etcd Pods are crashlooping, which indicates data corruption.
Crashlooping etcd Pod
The following instructions explain how to restore a backup in cases where a user cluster's etcd data has become damaged and its etcd Pod is crashlooping. You can recover by deploying an etcd Pod to the existing Pod's volumes and overwriting the damaged data with the backup, assuming that the user cluster's API server is running and can schedule new Pods.
Copy the etcd Pod specification below to a file, restore-etcd.yaml, after populating the following placeholder values:
- [MEMBER_NUMBER] is the numbered Pod that you are restoring.
- [NODE_NAME] is the node on which the [MEMBER_NUMBER] Pod is running.
- [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.
- [USER_CLUSTER_NAME] is the name of the user cluster.
- [DEFAULT_TOKEN] is used for authentication. You can find this value by running the following command:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -n [USER_CLUSTER_NAME] get pods kube-etcd-0 \
    -o yaml | grep default-token
restore-etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    Component: restore-etcd-[MEMBER_NUMBER]
  name: restore-etcd-0
  namespace: [USER_CLUSTER_NAME]
spec:
  restartPolicy: Never
  containers:
  - command: ["/bin/sh"]
    args: ["-ec", "while :; do echo '.'; sleep 5 ; done"]
    image: gcr.io/gke-on-prem-release/etcd:v3.2.24-1-gke.0
    imagePullPolicy: IfNotPresent
    name: restore-etcd
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: data
    - mountPath: /etcd.local.config/certificates
      name: etcd-certs
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: [DEFAULT_TOKEN]
      readOnly: true
  dnsPolicy: ClusterFirst
  hostname: restore-etcd-0
  imagePullSecrets:
  - name: private-registry-creds
  nodeSelector:
    kubernetes.googleapis.com/cluster-name: [USER_CLUSTER_NAME]
    kubernetes.io/hostname: [NODE_NAME]
  priority: 0
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: restore-etcd
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-kube-etcd-[MEMBER_NUMBER]
  - name: etcd-certs
    secret:
      defaultMode: 420
      secretName: kube-etcd-certs
  - name: [DEFAULT_TOKEN]
    secret:
      defaultMode: 420
      secretName: [DEFAULT_TOKEN]
Deploy the Pod:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \ -n [USER_CLUSTER_NAME] create -f restore-etcd.yaml
Copy etcd's backup file, snapshot.db, to the new Pod. snapshot.db lives at the relative directory where you created the backup:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    cp [RELATIVE_DIRECTORY]/snapshot.db \
    [USER_CLUSTER_NAME]/restore-etcd-0:snapshot.db
Shell into the restore-etcd Pod:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -it -n [USER_CLUSTER_NAME] exec restore-etcd-0 -- bin/sh
Run the following command to create a new default.etcd folder containing the backup:
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etcd.local.config/certificates/etcdCA.crt \
    --cert=/etcd.local.config/certificates/etcd.crt \
    --key=/etcd.local.config/certificates/etcd.key \
    snapshot restore snapshot.db
Overwrite the damaged etcd data with the backup:
rm -r var/lib/etcd/*; cp -r default.etcd/* var/lib/etcd/
Exit the container:
exit
Delete the crashing etcd Pod:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -n [USER_CLUSTER_NAME] delete pod kube-etcd-0
Verify that the etcd Pod is no longer crashing.
Remove restore-etcd.yaml and delete the restore-etcd Pod:
rm restore-etcd.yaml; kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
    -n [USER_CLUSTER_NAME] delete pod restore-etcd-0
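Filling in the placeholders in restore-etcd.yaml by hand is easy to get wrong, especially when you repeat the procedure for each member of an HA cluster. The sketch below shows one way to do the substitution with sed and to pull the token name out of the Pod YAML. Both function names are hypothetical, and the sketch assumes the template still contains the bracketed placeholders exactly as shown above and that none of the values contain a slash.

```shell
# Hypothetical helper: read Pod YAML on stdin and print the first
# default-token Secret name it mentions.
extract_token_name() {
  grep -o 'default-token-[a-z0-9]*' | head -n 1
}

# Hypothetical helper: replace the bracketed placeholders in a template
# file and print the result on stdout.
# Usage: render_restore_pod <template> <member> <node> <cluster> <token>
render_restore_pod() {
  local template="$1" member="$2" node="$3" cluster="$4" token="$5"
  # The brackets are escaped so that sed treats them literally.
  sed -e "s/\[MEMBER_NUMBER\]/${member}/g" \
      -e "s/\[NODE_NAME\]/${node}/g" \
      -e "s/\[USER_CLUSTER_NAME\]/${cluster}/g" \
      -e "s/\[DEFAULT_TOKEN\]/${token}/g" \
      "$template"
}
```

You could pipe the output of the kubectl get pods kube-etcd-0 -o yaml command into extract_token_name, then redirect the output of render_restore_pod to restore-etcd.yaml. Note that the specification hard-codes the Pod name restore-etcd-0, which the substitution leaves unchanged.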
Admin cluster backups
An admin cluster backup contains the following:
- A snapshot of the admin cluster's etcd.
- Admin control plane's Secrets, which are required for authenticating to the admin and user clusters.
Complete the following steps before you create an admin cluster backup:
Find the admin cluster's external IP address, which is used to SSH in to the admin cluster control plane:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get nodes -n kube-system -o wide | grep master
where [ADMIN_CLUSTER_KUBECONFIG] is the admin cluster's kubeconfig file.
Create an SSH key called vsphere_tmp from the admin cluster's private key. You can find the private key in the admin cluster's Secrets:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get secrets sshkeys -n kube-system -o yaml
In the command output, you can find the private key in the vsphere_tmp field. Copy the private key to vsphere_tmp:
echo "[PRIVATE_KEY]" | base64 -d > vsphere_tmp; chmod 600 vsphere_tmp
Check that you can shell into the admin control plane using this private key:
ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
Exit the admin control plane node:
exit
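The key-decoding step can be wrapped so that the key file is never created with wider permissions than 0600. This is only a sketch; write_private_key is a hypothetical name, and the jsonpath expression in the comment assumes the key is stored under .data.vsphere_tmp in the sshkeys Secret, which you should confirm against the YAML output.

```shell
# Hypothetical helper: decode a base64-encoded private key into a file
# that only its owner can read or write.
# Usage: write_private_key <base64-string> <output-file>
write_private_key() {
  local encoded="$1" out="$2"
  (
    # Create the file as 0600 from the start instead of tightening it later.
    umask 077
    printf '%s' "$encoded" | base64 -d > "$out"
  ) || return 1
  # Also covers the case where the file already existed with wider permissions.
  chmod 600 "$out"
}

# Example call (assumption: the key lives under .data.vsphere_tmp):
#   write_private_key "$(kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] \
#       get secrets sshkeys -n kube-system \
#       -o jsonpath='{.data.vsphere_tmp}')" vsphere_tmp
```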
Backing up an admin cluster
You can back up an admin cluster's etcd and its control plane's Secrets.
etcd
To back up the admin cluster's etcd:
Get the etcd Pod's name:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] get pods \
    -n kube-system | grep etcd-gke-admin-master
Shell into the Pod's kube-etcd container:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] exec -it \
    -n kube-system [ADMIN_ETCD_POD] -- bin/sh
where [ADMIN_ETCD_POD] is the name of the etcd Pod.
From the shell, use etcdctl to create a backup named snapshot.db in the local directory:
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    snapshot save snapshot.db
Exit the container:
exit
Copy the backup out of the kube-etcd container using kubectl cp:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] cp \
    kube-system/[ADMIN_ETCD_POD]:snapshot.db [RELATIVE_DIRECTORY]
where [RELATIVE_DIRECTORY] is a path where you want to store your backup.
Secrets
To back up the admin control plane's Secrets:
Shell into the admin control plane node:
ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
where [EXTERNAL_IP] is the admin control plane's external IP address, which you gathered previously.
Create a local backup directory. (This is optional, but highly recommended. You need to change the backup Secrets' permissions to copy them out of the node):
mkdir backup
Locally copy the Secrets to the local backup directory:
sudo cp -r /etc/kubernetes/pki/* backup/
Change permissions of the backup Secrets:
sudo chmod -R +rw backup/
Exit the admin control plane node:
exit
Run scp to copy the backup folder out of the admin control plane node:
sudo scp -r -i vsphere_tmp ubuntu@[EXTERNAL_IP]:backup/ [RELATIVE_DIRECTORY]
where [RELATIVE_DIRECTORY] is a path where you want to store your backup.
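The copied folder contains private keys that the chmod step made readable, so you might want to tighten access once the files are on your workstation. One possible approach, sketched below, is to pack the folder into a single archive that only its owner can read. The function name archive_secrets and the archive file name are hypothetical.

```shell
# Hypothetical helper: pack a directory of backed-up Secrets into a
# gzip-compressed tar archive with 0600 permissions.
# Usage: archive_secrets <backup-directory> <archive-file>
archive_secrets() {
  local backup_dir="$1" archive="$2"
  [ -d "$backup_dir" ] || { echo "Error: $backup_dir is not a directory" >&2; return 1; }
  (
    # Create the archive without group or world access.
    umask 077
    # -C keeps paths inside the archive relative to the parent directory.
    tar -czf "$archive" -C "$(dirname "$backup_dir")" "$(basename "$backup_dir")"
  ) || return 1
  chmod 600 "$archive"
}
```

For example, archive_secrets [RELATIVE_DIRECTORY]/backup [RELATIVE_DIRECTORY]/pki-backup.tar.gz produces one file that you can move to your backup location. Write the archive outside the directory being archived.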
Restoring an admin cluster
The following procedure recreates a backed-up admin cluster and all of the user control planes it managed when its etcd snapshot was created.
Run scp to copy snapshot.db to the admin control plane:
sudo scp -i vsphere_tmp snapshot.db ubuntu@[EXTERNAL_IP]:
where [EXTERNAL_IP] is the admin control plane's external IP address, which you gathered previously.
Shell into the admin control plane:
sudo ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
Copy snapshot.db to /mnt:
sudo cp snapshot.db /mnt/
Make a temporary directory, such as backup:
mkdir backup
Exit the admin control plane:
exit
Copy the certificates to backup/:
sudo scp -r -i vsphere_tmp [BACKUP_CERT_FILE] ubuntu@[EXTERNAL_IP]:backup/
Shell into the admin control plane node:
ssh -i vsphere_tmp ubuntu@[EXTERNAL_IP]
where [EXTERNAL_IP] is the admin control plane's external IP address, which you gathered previously.
Run kubeadm reset. This stops anything still running in the admin cluster, deletes all etcd data, and deletes Secrets in /etc/kubernetes/pki/:
sudo kubeadm reset --ignore-preflight-errors=all
Copy the backup Secrets to /etc/kubernetes/pki/:
sudo cp -r backup/* /etc/kubernetes/pki/
Run etcdctl snapshot restore with Docker:
sudo docker run --rm \
    -v '/mnt:/backup' \
    -v '/var/lib/etcd:/var/lib/etcd' \
    --env ETCDCTL_API=3 'k8s.gcr.io/etcd-amd64:3.1.12' \
    /bin/sh -c "etcdctl snapshot restore '/backup/snapshot.db'; mv /default.etcd/member/ /var/lib/etcd/"
Run kubeadm init. This reuses all of the backup Secrets and restarts etcd with the restored snapshot:
sudo kubeadm init --config /etc/kubernetes/kubeadm_config.yaml --ignore-preflight-errors=DirAvailable--var-lib-etcd
Exit the admin control plane:
exit
Copy the newly generated kubeconfig file out of the admin node:
sudo scp -i vsphere_tmp ubuntu@[EXTERNAL_IP]:[HOME]/.kube/config kubeconfig
where:
- [EXTERNAL_IP] is the admin control plane's external IP address.
- [HOME] is the home directory on the admin node.
Now you can use this new kubeconfig file to access the restored cluster.
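Because kubeadm reset deletes etcd data and the Secrets in /etc/kubernetes/pki/, it can be worth confirming that the snapshot and certificates are in place on the node before you run it. The sketch below is one way to do that check. The function name check_restore_inputs is hypothetical, and the file list covers only the certificate paths that the etcdctl backup command on this page refers to, so a real check would likely include more files.

```shell
# Hypothetical pre-flight check to run on the admin control plane node
# before kubeadm reset.
# Usage: check_restore_inputs <snapshot-file> <certificate-backup-directory>
check_restore_inputs() {
  local snapshot="$1" cert_dir="$2" missing=0 f
  if [ ! -s "$snapshot" ]; then
    echo "Missing or empty snapshot: $snapshot" >&2
    missing=1
  fi
  # Certificate files named elsewhere in this guide, relative to the pki directory.
  for f in etcd/ca.crt etcd/healthcheck-client.crt etcd/healthcheck-client.key; do
    if [ ! -s "$cert_dir/$f" ]; then
      echo "Missing or empty file: $cert_dir/$f" >&2
      missing=1
    fi
  done
  # Returns 0 only when every file was found.
  return "$missing"
}
```

With the paths used in the procedure above, the call would be check_restore_inputs /mnt/snapshot.db backup.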
Backup script
You can use the script given here as an example of how to automatically back up your clusters. Note that the following script is not supported, and should be used only as a reference for writing a more robust and complete script. Before you run the script, fill in values for the variables at the beginning of the script:
- Set BACKUP_DIR to the path where you want to store the admin and user cluster backups. This path should not exist.
- Set ADMIN_CLUSTER_KUBECONFIG to the path of the admin cluster's kubeconfig file.
- Set USER_CLUSTER_NAMESPACE to the name of your user cluster. The name of your user cluster is a namespace in the admin cluster.
- Set EXTERNAL_IP to the VIP that you reserved for the admin control plane service.
- Set SSH_PRIVATE_KEY to the path of the SSH key you created when you set up your admin workstation.
- If you are using a private network, set JUMP_IP to your network's jump server's IP address.
#!/usr/bin/env bash
# Automates manual steps for taking backups of user and admin clusters.
# Fill in the variables below before running the script.
BACKUP_DIR="" # path to store user and admin cluster backups
ADMIN_CLUSTER_KUBECONFIG="" # path to admin cluster kubeconfig
USER_CLUSTER_NAMESPACE="" # user cluster namespace
EXTERNAL_IP="" # admin control plane node external ip - follow steps in documentation
SSH_PRIVATE_KEY="" # path to vsphere_tmp ssh private key - follow steps in documentation
JUMP_IP="" # network jump server IP - leave empty string if not using private network.
if [ -e "${BACKUP_DIR}" ]
then
  echo "Error: Backup directory $BACKUP_DIR exists already."
  exit 1
fi
mkdir -p "$BACKUP_DIR"
mkdir "$BACKUP_DIR/pki"
# USER CLUSTER BACKUP
# Snapshot user cluster etcd
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} exec -it -n ${USER_CLUSTER_NAMESPACE} kube-etcd-0 -c kube-etcd -- /bin/sh -ec "export ETCDCTL_API=3; etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etcd.local.config/certificates/etcdCA.crt --cert=/etcd.local.config/certificates/etcd.crt --key=/etcd.local.config/certificates/etcd.key snapshot save ${USER_CLUSTER_NAMESPACE}_snapshot.db"
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} cp -c kube-etcd ${USER_CLUSTER_NAMESPACE}/kube-etcd-0:${USER_CLUSTER_NAMESPACE}_snapshot.db $BACKUP_DIR/user-cluster_${USER_CLUSTER_NAMESPACE}_snapshot.db
# ADMIN CLUSTER BACKUP
# Set up ssh options
SSH_OPTS=(-oStrictHostKeyChecking=no -i ${SSH_PRIVATE_KEY})
if [ "${JUMP_IP}" != "" ]; then
  SSH_OPTS+=(-oProxyCommand="ssh -oStrictHostKeyChecking=no -i ${SSH_PRIVATE_KEY} -W %h:%p ubuntu@${JUMP_IP}")
fi
# Copy admin certs
ssh "${SSH_OPTS[@]}" ubuntu@${EXTERNAL_IP} 'sudo chmod -R +rw /etc/kubernetes/pki/*'
scp -r "${SSH_OPTS[@]}" ubuntu@${EXTERNAL_IP}:/etc/kubernetes/pki/* ${BACKUP_DIR}/pki/
# Snapshot admin cluster etcd
admin_etcd=$(kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} get pods -n kube-system -o=name | grep etcd | cut -c 5-)
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} exec -it -n kube-system ${admin_etcd} -- /bin/sh -ec "export ETCDCTL_API=3; etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save admin_snapshot.db"
kubectl --kubeconfig=${ADMIN_CLUSTER_KUBECONFIG} cp -n kube-system ${admin_etcd}:admin_snapshot.db $BACKUP_DIR/admin-cluster_snapshot.db
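The script above does not check that its variables were filled in, so an empty value only surfaces later as a confusing kubectl or ssh error. As one possible hardening step, the sketch below reports every unset variable up front. The function name require_vars is hypothetical. It uses eval for the variable lookup so that it also runs in shells other than bash, and the names it receives should always come from a fixed list, never from user input.

```shell
# Hypothetical helper: fail if any of the named variables is unset or empty.
# Usage: require_vars VAR_NAME [VAR_NAME ...]
require_vars() {
  local name value status=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Error: $name is not set" >&2
      status=1
    fi
  done
  return "$status"
}

# Example call, placed right after the variable assignments in the script:
#   require_vars BACKUP_DIR ADMIN_CLUSTER_KUBECONFIG USER_CLUSTER_NAMESPACE \
#       EXTERNAL_IP SSH_PRIVATE_KEY || exit 1
```

JUMP_IP is left out of the example call because the script treats an empty value as "no jump server".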
Troubleshooting
For more information, refer to Troubleshooting.
Diagnosing cluster issues using gkectl
Use gkectl diagnose commands to identify cluster issues and share cluster information with Google. See Diagnosing cluster issues.
Running gkectl commands verbosely
-v5
Logging gkectl errors to stderr
--alsologtostderr
Locating gkectl logs in the admin workstation
Even if you don't pass in its debugging flags, you can view gkectl logs in the following admin workstation directory:
/home/ubuntu/.config/gke-on-prem/logs
Locating Cluster API logs in the admin cluster
If a VM fails to start after the admin control plane has started, you can try debugging this by inspecting the Cluster API controllers' logs in the admin cluster:
Find the name of the Cluster API controllers Pod in the kube-system namespace, where [ADMIN_CLUSTER_KUBECONFIG] is the path to the admin cluster's kubeconfig file:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system get pods | grep clusterapi-controllers
Open the Pod's logs, where [POD_NAME] is the name of the Pod. Optionally, use grep or a similar tool to search for errors:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system logs [POD_NAME] vsphere-controller-manager
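As an illustration of the optional grep step, the sketch below filters log output for lines that look like failures and prefixes each match with its line number. The function name filter_errors and the keyword list are assumptions; adjust the pattern to whatever the controller actually logs in your environment.

```shell
# Hypothetical helper: read log text on stdin and print only lines that
# mention an error, a failure, or a timeout (case-insensitive).
filter_errors() {
  # -n prefixes each match with its line number in the input.
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -inE 'error|fail|timed out' || true
}
```

You would pipe the kubectl logs command above into it, for example by appending | filter_errors to that command.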
What's next
- Learn how to diagnose cluster issues
- Learn about auger, an open-source tool for restoring individual objects from etcd backups.