In Google Distributed Cloud, periodic health checking and automatic node repair are enabled by default.
The node auto repair feature continuously detects and repairs unhealthy nodes in a cluster.
Periodic health checks run every fifteen minutes. The checks are the same as
the ones performed by gkectl diagnose cluster. The results are surfaced as
logs and events on Cluster objects in the admin cluster.
Unhealthy conditions
The following conditions are indications that a node is unhealthy:
- The node condition NotReady is true for approximately 10 minutes.
- The machine state is Unavailable for approximately 10 minutes after successful creation.
- The machine state is not Available for approximately 30 minutes after VM creation.
- There is no node object (nodeRef is nil) corresponding to a machine in the Available state for approximately 10 minutes.
- The node condition DiskPressure is true for approximately 30 minutes.
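To check the node-level conditions from the preceding list by hand, a sketch like the following might help. It assumes that readiness is reported as a node condition of type Ready (a status of False or Unknown corresponds to NotReady). The function name show_node_conditions is hypothetical. The output shows only the current status, not how long the condition has held.

```shell
# Sketch: print the Ready and DiskPressure condition status of every node.
# Assumption: readiness is a condition of type "Ready"; a status of
# False or Unknown corresponds to the NotReady state described above.
show_node_conditions() {
  # $1: path to the cluster kubeconfig file
  kubectl --kubeconfig "$1" get nodes \
    -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,DISK_PRESSURE:.status.conditions[?(@.type=="DiskPressure")].status'
}

# Usage (requires access to a cluster):
#   show_node_conditions CLUSTER_KUBECONFIG
```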
Repair strategy
Google Distributed Cloud initiates a repair on a node if the node meets at least one of the conditions in the preceding list.
The repair drains the unhealthy node and creates a new VM. If the node draining is unsuccessful for one hour, the repair forces the drain and safely detaches the attached Kubernetes managed disks.
If there are multiple unhealthy nodes in the same MachineDeployment, the repair is performed on only one of those nodes at a time.
The number of repairs per hour for a node pool is limited to the maximum of:
- Three
- Ten percent of the number of nodes in the node pool
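As a worked example of this limit, the following sketch computes the hourly repair cap for a given node pool size. The function name max_repairs_per_hour is hypothetical, and the sketch assumes that ten percent is rounded down. The source does not state how fractional values are handled.

```shell
# Sketch: repairs allowed per hour for a node pool, following the rule
# above: the larger of 3 and 10% of the pool's node count.
# Assumption: 10% is rounded down (integer division).
max_repairs_per_hour() {
  # $1: number of nodes in the node pool
  ten_percent=$(( $1 / 10 ))
  if [ "$ten_percent" -gt 3 ]; then
    echo "$ten_percent"
  else
    echo 3
  fi
}

max_repairs_per_hour 20   # prints 3
max_repairs_per_hour 50   # prints 5
```

Under this assumption, the cap stays at three until a node pool has at least 40 nodes.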
Enabling node repair and health checking for a new cluster
In your admin or user cluster configuration file, set autoRepair.enabled to true:
autoRepair:
  enabled: true
Continue with the steps for creating your admin or user cluster.
Enabling node repair and health checking for an existing user cluster
In your user cluster configuration file, set autoRepair.enabled to true.
Update the cluster:
gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG
Replace the following:
- ADMIN_CLUSTER_KUBECONFIG: the path of your admin cluster kubeconfig file 
- USER_CLUSTER_CONFIG: the path of your user cluster configuration file 
Enabling node repair and health checking for an existing admin cluster
In your admin cluster configuration file, set autoRepair.enabled to true.
Update the cluster:
gkectl update admin --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG
Replace ADMIN_CLUSTER_CONFIG with the path of your admin cluster configuration file.
Viewing logs from a health checker
List all of the health checker Pods in the admin cluster:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get pods --all-namespaces | grep cluster-health-controller
The output is similar to this:
kube-system       cluster-health-controller-6c7df455cf-zlfh7   2/2   Running
my-user-cluster   cluster-health-controller-5d5545bb75-rtz7c   2/2   Running
To view the logs from a particular health checker, get the logs for the
cluster-health-controller container in one of the Pods. For example, to get
the logs for my-user-cluster shown in the preceding output:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace my-user-cluster logs \
    cluster-health-controller-5d5545bb75-rtz7c cluster-health-controller
Viewing events from a health checker
List all of the Cluster objects in your admin cluster:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get clusters --all-namespaces
The output is similar to this:
default           gke-admin-ldxh7   2d15h
my-user-cluster   my-user-cluster   2d12h
To view the events for a particular cluster, run kubectl describe cluster with
the --show-events flag. For example, to see the events for my-user-cluster
shown in the preceding output:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace my-user-cluster \
    describe --show-events cluster my-user-cluster
Example output:
Events:
  Type     Reason             Age   From                                 Message
  ----     ------             ----  ----                                 -------
  Warning  ValidationFailure  17s   cluster-health-periodics-controller  validator for Pod returned with status: FAILURE, reason: 1 pod error(s).
Disabling node repair and health checking for a user cluster
In your user cluster configuration file, set autoRepair.enabled to false.
Update the cluster:
gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG
Disabling node repair and health checking for an admin cluster
In your admin cluster configuration file, set autoRepair.enabled to false.
Update the cluster:
gkectl update admin --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG
Debugging node auto repair
You can investigate issues with node auto repair by describing the Machine and Node objects in the admin cluster. Here's an example:
List the machine objects:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get machines
Example output:
default   gke-admin-master-wcbrj
default   gke-admin-node-7458969ff8-5cg8d
default   gke-admin-node-7458969ff8-svqj7
default   xxxxxx-user-cluster-41-25j8d-567f9c848f-fwjqt
Describe one of the Machine objects:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG describe machine gke-admin-master-wcbrj
In the output, look for events from cluster-health-controller.
Similarly, you can list and describe node objects. For example:
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get nodes
...
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG describe node gke-admin-master-wcbrj
Manual node repair
If a node has problems that the auto repair logic does not cover, or if you have not enabled node auto repair, you can repair the node manually. A manual repair deletes and re-creates the node.
Get the name of the Machine object that corresponds to the node:
kubectl --kubeconfig CLUSTER_KUBECONFIG get machines
Replace CLUSTER_KUBECONFIG with the path of your admin or user cluster kubeconfig file.
Add the repair annotation to the Machine object:
kubectl annotate --kubeconfig CLUSTER_KUBECONFIG machine MACHINE_NAME onprem.cluster.gke.io/repair-machine=true
Replace MACHINE_NAME with the name of the Machine object.
Delete the Machine object:
kubectl delete --kubeconfig CLUSTER_KUBECONFIG machine MACHINE_NAME
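The annotate and delete steps can be combined into a small helper, sketched here. The function name repair_machine and the KUBECTL override variable are hypothetical and are not part of gkectl or kubectl. Setting KUBECTL=echo prints the arguments of each command instead of running it, which is one way to review the commands before touching a cluster.

```shell
# Sketch: annotate a Machine for repair, then delete it so that it is
# re-created. KUBECTL is a hypothetical override for the kubectl binary;
# set KUBECTL=echo to print each command's arguments instead of running it.
repair_machine() {
  kubeconfig="$1"   # path of the admin or user cluster kubeconfig file
  machine="$2"      # name of the Machine object to repair
  if [ -z "$kubeconfig" ] || [ -z "$machine" ]; then
    echo "usage: repair_machine CLUSTER_KUBECONFIG MACHINE_NAME" >&2
    return 1
  fi
  # Stop before deleting if the annotation step fails.
  ${KUBECTL:-kubectl} annotate --kubeconfig "$kubeconfig" machine "$machine" \
    onprem.cluster.gke.io/repair-machine=true || return 1
  ${KUBECTL:-kubectl} delete --kubeconfig "$kubeconfig" machine "$machine"
}

# Usage (requires access to a cluster):
#   repair_machine CLUSTER_KUBECONFIG MACHINE_NAME
```

The helper returns early if the annotation fails, so the Machine object is not deleted without the repair annotation in place.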