# Replacing a failed etcd replica

*Last updated 2025-08-25 UTC.*

This document describes how to replace a failed etcd replica in a high
availability (HA) user cluster.

The instructions given here apply to an HA user cluster that
uses
[kubeception](/anthos/clusters/docs/on-prem/1.15/how-to/create-user-cluster);
that is, a user cluster that does not have
[Controlplane V2](/anthos/clusters/docs/on-prem/1.15/how-to/user-cluster-configuration-file#enablecontrolplanev2-field)
enabled. If you need to replace an etcd replica in a user cluster that has
Controlplane V2 enabled, [contact Cloud Customer Care](/anthos/clusters/docs/on-prem/1.15/getting-support).

Before you begin
----------------

- Make sure the admin cluster is working correctly.

- Make sure the other two etcd members in the user cluster are working
  correctly. If more than one etcd member has failed, see [Recovery from etcd
  data corruption or loss](/anthos/clusters/docs/on-prem/1.15/concepts/high-availability-disaster-recovery#recovery_from_etcd_data_corruption_or_loss).

Replacing a failed etcd replica
-------------------------------

1. Back up a copy of the etcd PodDisruptionBudget (PDB) so you can restore it
   later.

   ```
   kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME get pdb kube-etcd-pdb -o yaml > PATH_TO_PDB_FILE
   ```

   Where:

   - ADMIN_CLUSTER_KUBECONFIG is the path to the kubeconfig file for the
     admin cluster.

   - USER_CLUSTER_NAME is the name of the user cluster that contains the
     failed etcd replica.

   - PATH_TO_PDB_FILE is the path where you want to save the etcd PDB file,
     for example `/tmp/etcdpdb.yaml`.

2. Delete the etcd PDB.

   ```
   kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME delete pdb kube-etcd-pdb
   ```
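Before deleting the PDB, you can sanity-check the file you saved in step 1. A minimal sketch in Python; the check and the sample manifest below are illustrative, not part of the official procedure:

```python
def looks_like_etcd_pdb(manifest_text: str) -> bool:
    """Return True if the text names a PodDisruptionBudget called kube-etcd-pdb."""
    # A naive substring check is enough here; a full YAML parser is not needed.
    return ("kind: PodDisruptionBudget" in manifest_text
            and "name: kube-etcd-pdb" in manifest_text)

# Made-up sample of what the saved backup might contain.
sample = """\
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-etcd-pdb
"""
print(looks_like_etcd_pdb(sample))  # True
```

If the check fails, repeat step 1 before deleting anything.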
3. Run the following command to open the kube-etcd
   [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/)
   in your text editor:

   ```
   kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME edit statefulset kube-etcd
   ```

   Change the value of the `--initial-cluster-state` flag to `existing`.

   ```
   containers:
   - name: kube-etcd
     ...
     args:
     - --initial-cluster-state=existing
     ...
   ```

4. Drain the failed etcd replica node.

   ```
   kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG drain NODE_NAME --ignore-daemonsets --delete-local-data
   ```

   Where NODE_NAME is the name of the failed etcd replica node.

5. Create a new shell in the container of one of the working kube-etcd Pods.

   ```
   kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG exec -it \
       KUBE_ETCD_POD --container kube-etcd --namespace USER_CLUSTER_NAME \
       -- /bin/sh
   ```

   Where KUBE_ETCD_POD is the name of a working kube-etcd Pod, for example
   `kube-etcd-0`.

   From this new shell, remove the failed replica from the etcd cluster:

   1. Set the etcdctl connection variables:

      ```
      export ETCDCTL_CACERT=/etcd.local.config/certificates/etcdCA.crt
      export ETCDCTL_CERT=/etcd.local.config/certificates/etcd.crt
      export ETCDCTL_KEY=/etcd.local.config/certificates/etcd.key
      export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
      ```

   2. List all the members of the etcd cluster:

      ```
      etcdctl member list -w table
      ```

      The output shows all the member IDs. Determine the member ID of the
      failed replica.

   3. Remove the failed replica:

      ```
      etcdctl member remove MEMBER_ID
      ```

      Where MEMBER_ID is the hex member ID of the failed etcd replica pod.
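Reading the hex IDs out of the member table by eye is error-prone. As a sketch, this Python helper extracts the ID column from `etcdctl member list -w table` output; the sample table below is made up for illustration:

```python
def member_ids_by_name(table_text: str) -> dict:
    """Map each etcd member name to its hex member ID, given the
    ASCII table printed by `etcdctl member list -w table`."""
    ids = {}
    for line in table_text.splitlines():
        if not line.startswith("|"):
            continue  # skip the +----+ border rows
        cells = [c.strip() for c in line.strip("|").split("|")]
        if cells[0] == "ID":
            continue  # skip the header row
        member_id, _status, name = cells[0], cells[1], cells[2]
        ids[name] = member_id
    return ids

# Made-up sample output (real tables also include peer and client addresses).
SAMPLE = """\
+------------------+---------+-------------+
|        ID        | STATUS  |    NAME     |
+------------------+---------+-------------+
| 8211f1d0f64f3269 | started | kube-etcd-0 |
| 91bc3c398fb3c146 | started | kube-etcd-1 |
+------------------+---------+-------------+
"""
print(member_ids_by_name(SAMPLE)["kube-etcd-1"])  # 91bc3c398fb3c146
```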
6. Follow steps 1-3 of
   [Deploying the utility Pods](/anthos/clusters/docs/on-prem/1.15/how-to/backing-up#deploy_utility_pods)
   to create a utility Pod in the admin cluster. This Pod is used to access the
   PersistentVolume (PV) of the failed etcd member in the user cluster.

7. Clean up the etcd data directory from within the utility Pod.

   ```
   kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG exec -it -n USER_CLUSTER_NAME etcd-utility-MEMBER_NUMBER -- /bin/bash -c 'rm -rf /var/lib/etcd/*'
   ```

8. Delete the utility Pod.

   ```
   kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG delete pod -n USER_CLUSTER_NAME etcd-utility-MEMBER_NUMBER
   ```

9. Uncordon the failed node.

   ```
   kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG uncordon NODE_NAME
   ```

10. Open the kube-etcd StatefulSet in your text editor.

    ```
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG -n USER_CLUSTER_NAME edit statefulset kube-etcd
    ```

    Change the value of the `--initial-cluster-state` flag back to `new`.

    ```
    containers:
    - name: kube-etcd
      ...
      args:
      - --initial-cluster-state=new
      ...
    ```

11. Restore the etcd PDB that you deleted in step 2.

    ```
    kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG apply -f PATH_TO_PDB_FILE
    ```

    Where PATH_TO_PDB_FILE is the path where you saved the PDB backup in
    step 1, for example `/tmp/etcdpdb.yaml`.
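Note that steps 3 and 10 edit the same `--initial-cluster-state` flag in opposite directions. The transformation applied to the container args amounts to the following sketch (illustrative Python, not a supported tool):

```python
def set_initial_cluster_state(args: list, state: str) -> list:
    """Return a copy of the kube-etcd container args with the
    --initial-cluster-state flag set to the given value."""
    prefix = "--initial-cluster-state="
    return [prefix + state if a.startswith(prefix) else a for a in args]

# Hypothetical args list from the kube-etcd container spec.
args = ["--name=kube-etcd-0", "--initial-cluster-state=new"]
print(set_initial_cluster_state(args, "existing"))
# ['--name=kube-etcd-0', '--initial-cluster-state=existing']
```

Step 10 is the inverse call, `set_initial_cluster_state(args, "new")`, restoring the StatefulSet to its original state.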