Automatic node repair and health checking

In Google Distributed Cloud, periodic health checking and automatic node repair are enabled by default.

The node auto repair feature continuously detects and repairs unhealthy nodes in a cluster.

Periodic health checks run every fifteen minutes. The checks are the same as the ones performed by gkectl diagnose cluster. The results are surfaced as logs and events on Cluster objects in the admin cluster.

Make sure that your admin and user clusters each have an extra IP address available for automatic node repair.

Unhealthy node conditions

The following conditions are indications that a node is unhealthy:

  • The node condition NotReady is true for approximately 10 minutes.

  • The machine state is Unavailable for approximately 10 minutes after successful creation.

  • The machine state is not Available for approximately 30 minutes after VM creation.

  • There is no node object (nodeRef is nil) corresponding to a machine in the Available state for approximately 10 minutes.

  • The node condition DiskPressure is true for approximately 30 minutes.
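To see the node-level conditions from the preceding list for yourself, a sketch like the following can help. It uses the standard kubectl custom-columns output to print the Ready and DiskPressure conditions of each node, along with the time the Ready condition last changed. Kubernetes reports readiness through the Ready condition, and a node whose Ready status is not True appears as NotReady in kubectl get nodes. The function name list_node_conditions is hypothetical, and the sketch does not evaluate the 10-minute or 30-minute thresholds or the machine states.

```shell
# Hypothetical helper: print the Ready and DiskPressure conditions per node.
# $1 is the path of the kubeconfig file for the cluster you want to inspect.
list_node_conditions() {
  kubectl --kubeconfig "$1" get nodes \
    -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,READY_SINCE:.status.conditions[?(@.type=="Ready")].lastTransitionTime,DISK_PRESSURE:.status.conditions[?(@.type=="DiskPressure")].status'
}
```

For example, run list_node_conditions CLUSTER_KUBECONFIG and look for nodes where READY is not True or DISK_PRESSURE is True.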

Node repair strategy

Google Distributed Cloud initiates a repair on a node if the node meets at least one of the conditions in the preceding list.

The repair drains the unhealthy node and creates a new VM. If the drain does not succeed within one hour, the repair forces the drain and safely detaches the attached Kubernetes-managed disks.

If there are multiple unhealthy nodes in the same MachineDeployment, the repair is performed on only one of those nodes at a time.

The number of repairs per hour for a node pool is limited to whichever of the following is greater:

  • Three
  • Ten percent of the number of nodes in the node pool
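As a worked illustration of the limit above, the following sketch computes the hourly repair cap for a node pool of a given size. The function name max_repairs_per_hour is hypothetical, and rounding ten percent down to a whole number is an assumption, because the rounding behavior is not stated here.

```shell
# Hypothetical helper: hourly repair cap for a node pool with $1 nodes.
max_repairs_per_hour() {
  nodes=$1
  # Assumption: ten percent is rounded down (shell integer division).
  ten_percent=$((nodes / 10))
  if [ "$ten_percent" -gt 3 ]; then
    echo "$ten_percent"
  else
    echo 3
  fi
}
```

Under that assumption, a pool of 20 nodes is capped at three repairs per hour, and a pool of 50 nodes is capped at five.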

Enabling node repair and health checking for a new cluster

In your admin or user cluster configuration file, set autoRepair.enabled to true:

autoRepair:
  enabled: true

Continue with the steps for creating your admin or user cluster.

Enabling node repair and health checking for an existing user cluster

In your user cluster configuration file, set autoRepair.enabled to true:

autoRepair:
  enabled: true

Update the cluster:

gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG

Replace the following:

  • ADMIN_CLUSTER_KUBECONFIG: the path of your admin cluster kubeconfig file

  • USER_CLUSTER_CONFIG: the path of your user cluster configuration file
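Before you run an update, it can be useful to confirm what a cluster configuration file currently says about auto repair. The following is a minimal sketch that reads the value with awk. The function name autorepair_setting is hypothetical, and the sketch assumes the simple two-line layout shown in this document (a top-level autoRepair key followed by an indented enabled key). It is not a general YAML parser.

```shell
# Hypothetical helper: print the autoRepair.enabled value found in the
# configuration file $1, or "unset" if the field is not present.
autorepair_setting() {
  awk '
    /^autoRepair:/ { in_block = 1; next }              # start of the block
    in_block && /^[^[:space:]#]/ { in_block = 0 }      # next top-level key ends it
    in_block && $1 == "enabled:" { print $2; found = 1; exit }
    END { if (!found) print "unset" }
  ' "$1"
}
```

For example, autorepair_setting USER_CLUSTER_CONFIG prints true, false, or unset.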

Enabling node repair and health checking for an existing admin cluster

In your admin cluster configuration file, set autoRepair.enabled to true:

autoRepair:
  enabled: true

Update the cluster:

gkectl update admin --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG

Replace ADMIN_CLUSTER_CONFIG with the path of your admin cluster configuration file.

Viewing logs from a health checker

List all of the health checker Pods in the admin cluster:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get pods --all-namespaces | grep cluster-health-controller

The output is similar to this:

kube-system       cluster-health-controller-6c7df455cf-zlfh7   2/2   Running
my-user-cluster   cluster-health-controller-5d5545bb75-rtz7c   2/2   Running

To view the logs from a particular health checker, get the logs for the cluster-health-controller container in one of the Pods. For example, to get the logs for my-user-cluster shown in the preceding output:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace my-user-cluster logs \
    cluster-health-controller-5d5545bb75-rtz7c cluster-health-controller
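If you have several clusters, the two steps above can be combined. The following sketch finds every health checker Pod and prints the most recent log lines from its cluster-health-controller container. The function name health_controller_logs and the default of 20 lines are hypothetical choices. The sketch assumes that the Pod names start with cluster-health-controller-, as in the preceding output.

```shell
# Hypothetical helper: show recent logs from every health checker Pod.
# $1 is the admin cluster kubeconfig path; $2 is the number of lines (default 20).
health_controller_logs() {
  kubeconfig=$1
  tail_lines=${2:-20}
  kubectl --kubeconfig "$kubeconfig" get pods --all-namespaces --no-headers |
    awk '$2 ~ /^cluster-health-controller-/ { print $1, $2 }' |
    while read -r namespace pod; do
      echo "=== $namespace/$pod ==="
      # Redirect stdin so that kubectl cannot consume the list being read.
      kubectl --kubeconfig "$kubeconfig" --namespace "$namespace" logs \
        --tail="$tail_lines" "$pod" cluster-health-controller < /dev/null
    done
}
```

For example, health_controller_logs ADMIN_CLUSTER_KUBECONFIG 50 prints the last 50 lines for each health checker.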

Viewing events from a health checker

List all of the Cluster objects in your admin cluster:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get clusters --all-namespaces

The output is similar to this:

default            gke-admin-ldxh7   2d15h
my-user-cluster    my-user-cluster   2d12h

To view the events for a particular cluster, run kubectl describe cluster with the --show-events flag. For example, to see the events for my-user-cluster shown in the preceding output:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG --namespace my-user-cluster \
    describe --show-events cluster my-user-cluster

Example output:

Events:
  Type     Reason             Age   From                                 Message
  ----     ------             ----  ----                                 -------
  Warning  ValidationFailure  17s   cluster-health-periodics-controller  validator for Pod returned with status: FAILURE, reason: 1 pod error(s).
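When a cluster has many events, you might want to see only the warnings. The following sketch uses the standard kubectl get events command with a field selector instead of kubectl describe. The function name cluster_warning_events is hypothetical, and the sketch assumes that the health checker records its events against an object of kind Cluster, as the preceding describe output suggests.

```shell
# Hypothetical helper: list only Warning events for one Cluster object.
# $1 = admin cluster kubeconfig path, $2 = namespace, $3 = Cluster object name.
cluster_warning_events() {
  kubectl --kubeconfig "$1" --namespace "$2" get events \
    --field-selector "involvedObject.kind=Cluster,involvedObject.name=$3,type=Warning"
}
```

For example: cluster_warning_events ADMIN_CLUSTER_KUBECONFIG my-user-cluster my-user-cluster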

Disabling node repair and health checking for a user cluster

In your user cluster configuration file, set autoRepair.enabled to false:

autoRepair:
  enabled: false

Update the cluster:

gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config USER_CLUSTER_CONFIG

Disabling node repair and health checking for an admin cluster

In your admin cluster configuration file, set autoRepair.enabled to false:

autoRepair:
  enabled: false

Update the cluster:

gkectl update admin --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config ADMIN_CLUSTER_CONFIG

Debugging node auto repair

You can investigate issues with node auto repair by describing the Machine and Node objects in the admin cluster. Here's an example:

List the Machine objects:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get machines

Example output:

default     gke-admin-master-wcbrj
default     gke-admin-node-7458969ff8-5cg8d
default     gke-admin-node-7458969ff8-svqj7
default     xxxxxx-user-cluster-41-25j8d-567f9c848f-fwjqt

Describe one of the Machine objects:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG describe machine gke-admin-master-wcbrj

In the output, look for events from cluster-health-controller.

Similarly, you can list and describe Node objects. For example:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get nodes
...
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG describe node gke-admin-master-wcbrj
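To avoid describing Machine objects one at a time, the following sketch loops over every Machine in the current namespace and prints only the lines that mention cluster-health-controller. The function name machine_health_events is hypothetical. The sketch relies on the standard -o name output, which prints each object as TYPE/NAME so that it can be passed directly to kubectl describe.

```shell
# Hypothetical helper: show cluster-health-controller lines for every Machine.
# $1 is the admin cluster kubeconfig path.
machine_health_events() {
  kubeconfig=$1
  kubectl --kubeconfig "$kubeconfig" get machines -o name |
    while read -r machine; do
      echo "=== $machine ==="
      kubectl --kubeconfig "$kubeconfig" describe "$machine" < /dev/null |
        grep cluster-health-controller ||
        echo "(no events from cluster-health-controller)"
    done
}
```

For example: machine_health_events ADMIN_CLUSTER_KUBECONFIG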

Manual node repair

Admin control plane node

The admin control plane node has a dedicated repair command, because the normal manual repair doesn't work for it.

Use gkectl repair admin-master to repair the admin control plane node.

Other nodes

If a node has problems that the auto repair logic does not cover, or if you have not enabled node auto repair, you can repair the node manually. A manual repair deletes and re-creates the node.

Get the name of the Machine object that corresponds to the node:

kubectl --kubeconfig CLUSTER_KUBECONFIG get machines

Replace CLUSTER_KUBECONFIG with the path of your admin or user cluster kubeconfig file.

Add the repair annotation to the Machine object:

kubectl annotate --kubeconfig CLUSTER_KUBECONFIG machine MACHINE_NAME onprem.cluster.gke.io/repair-machine=true

Replace MACHINE_NAME with the name of the Machine object.

Delete the Machine object:

kubectl delete --kubeconfig CLUSTER_KUBECONFIG machine MACHINE_NAME
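The annotate and delete steps above can be wrapped in a small function so that the Machine object is deleted only if the annotation succeeds. This is a sketch, not an official tool. The function name repair_machine and the initial existence check are hypothetical additions, and the two kubectl commands are the same ones shown in the preceding steps.

```shell
# Hypothetical helper: manually repair one node by annotating and then
# deleting its Machine object.
# $1 = kubeconfig path of the admin or user cluster, $2 = Machine object name.
repair_machine() {
  kubeconfig=$1
  machine=$2
  if [ -z "$kubeconfig" ] || [ -z "$machine" ]; then
    echo "usage: repair_machine CLUSTER_KUBECONFIG MACHINE_NAME" >&2
    return 2
  fi
  # Confirm that the Machine object exists before changing anything.
  kubectl --kubeconfig "$kubeconfig" get machine "$machine" > /dev/null || return 1
  # Add the repair annotation, and stop if that fails.
  kubectl annotate --kubeconfig "$kubeconfig" machine "$machine" \
    onprem.cluster.gke.io/repair-machine=true || return 1
  # Delete the Machine object.
  kubectl delete --kubeconfig "$kubeconfig" machine "$machine"
}
```

For example: repair_machine CLUSTER_KUBECONFIG MACHINE_NAME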