Configure a high availability policy for VMs

This document shows you how to configure the high-availability policy for virtual machines (VMs) that run using VM Runtime on Google Distributed Cloud.

When you enable VM Runtime on Google Distributed Cloud, the cluster creates a VMHighAvailabilityPolicy object named default. This object specifies the default recovery strategy in case a cluster node that is running a VM fails. Possible default recovery strategies are:

  • Reschedule: Reschedule the VM on another cluster node.
  • Ignore: Do nothing.

Initially, the default recovery strategy is set to Reschedule.

A default recovery strategy of Reschedule is appropriate when both of the following conditions are true:

  • Your cluster has at least two worker nodes.

  • Your VM disks are provisioned using a network-file-based storage class. That is, the storage class is based on a network file system that coordinates POSIX file locks across different clients. A storage class backed by Network File System (NFS) is an example of a network-file-based storage class.

If your VMs are using local storage or a block-based storage system, we recommend that you set the default recovery strategy to Ignore. We make this recommendation for the following reasons:

  • If your VMs use local storage, and a node fails, there is no way to recover the stored data and move it to a new node.

  • If your VMs use a block-based storage system, the storage might not have sufficient detachment guarantees. That could lead to concurrent disk access and data corruption when a VM is rescheduled.
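To decide which recovery strategy fits your cluster, it can help to see which provisioner backs each storage class. The following is a minimal sketch that uses standard kubectl output options. The function name list_storage_provisioners is hypothetical, and the output alone does not tell you whether a provisioner is network-file-based; confirm that in the documentation for your storage system.

```shell
# Hypothetical helper: list each StorageClass together with its provisioner.
# $1: path of your user cluster kubeconfig file.
list_storage_provisioners() {
  kubectl --kubeconfig "$1" get storageclass \
    --output custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner
}
```

To use it, call list_storage_provisioners USER_CLUSTER_KUBECONFIG and compare the listed storage classes with the ones that your VM disks use.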

Inspect the VMHighAvailabilityPolicy object

Verify that there is a VMHighAvailabilityPolicy object:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get VMHighAvailabilityPolicy --namespace vm-system

Replace USER_CLUSTER_KUBECONFIG with the path of your user cluster kubeconfig file.

The output shows that there is a VMHighAvailabilityPolicy object named default. In the output, you can also see the current value of defaultRecoveryStrategy. For example, the following output shows that the current value of defaultRecoveryStrategy is Reschedule:

vm-system   default   5m55s   Reschedule   15s   1m30s

Get a detailed view of the VMHighAvailabilityPolicy object:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG get VMHighAvailabilityPolicy \
    --namespace vm-system --output yaml

Example output:

apiVersion: vm.cluster.gke.io/v1alpha1
kind: VMHighAvailabilityPolicy
metadata:
  ...
  labels:
    app.kubernetes.io/component: kubevirt
    app.kubernetes.io/managed-by: virt-operator
    kubevirt.io: virt-api
  name: default
  namespace: vm-system
  ...
spec:
  defaultRecoveryStrategy: Reschedule
  nodeHeartbeatInterval: 15s
  nodeMonitorGracePeriod: 1m30s
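If you only need the current strategy, for example in a script, you can read the single field instead of the whole object. This sketch assumes the field path spec.defaultRecoveryStrategy shown in the example output above; the function name get_default_recovery_strategy is hypothetical.

```shell
# Hypothetical helper: print only the current default recovery strategy.
# $1: path of your user cluster kubeconfig file.
get_default_recovery_strategy() {
  kubectl --kubeconfig "$1" get VMHighAvailabilityPolicy default \
    --namespace vm-system \
    --output jsonpath='{.spec.defaultRecoveryStrategy}'
}
```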

Change the default recovery strategy

In certain situations, we recommend that you change the default recovery strategy. For example, if your VMs use local storage or storage that is not network-file-based, such as a block-based storage system, then we recommend that you change the value of defaultRecoveryStrategy to Ignore.

To change the value of defaultRecoveryStrategy, open the VMHighAvailabilityPolicy object for editing:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG edit VMHighAvailabilityPolicy \
    default --namespace vm-system

In your text editor, change the value of defaultRecoveryStrategy to a value of your choice: Reschedule or Ignore. Save your changes and close the text editor.
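If you prefer a non-interactive change, for example in automation, a JSON merge patch might work as an alternative to the editor. This is a sketch, not a documented procedure for this product: it relies only on the generic kubectl patch command with --type merge, and the function name set_default_recovery_strategy is hypothetical.

```shell
# Hypothetical helper: set the default recovery strategy without an editor.
# $1: path of your user cluster kubeconfig file.
# $2: Reschedule or Ignore.
set_default_recovery_strategy() {
  case "$2" in
    Reschedule|Ignore) ;;
    *) echo "strategy must be Reschedule or Ignore" >&2; return 1 ;;
  esac
  kubectl --kubeconfig "$1" patch VMHighAvailabilityPolicy default \
    --namespace vm-system --type merge \
    --patch "{\"spec\":{\"defaultRecoveryStrategy\":\"$2\"}}"
}
```

For example, set_default_recovery_strategy USER_CLUSTER_KUBECONFIG Ignore sends a patch that changes only the defaultRecoveryStrategy field.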

Override the default recovery strategy for a VM

The default recovery strategy applies to all VMs running in the cluster. However, you might need to override the default recovery strategy for individual VMs.

For example, suppose that most of your VMs are provisioned with a network-file-based storage class, but a few VMs are provisioned with a block-based storage class. For each VM that uses block-based storage, we recommend that you override the default recovery strategy by setting the recovery strategy for the individual VM to Ignore.

To override the default recovery strategy for a VM, add a vm.cluster.gke.io/vm-ha-recovery-strategy annotation to both the VirtualMachineInstance (VMI) object and the GVM object.

For example, these commands set the recovery strategy to Ignore for a VM named my-vm:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
  annotate vmi my-vm \
  vm.cluster.gke.io/vm-ha-recovery-strategy=Ignore --overwrite

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
  annotate gvm my-vm \
  vm.cluster.gke.io/vm-ha-recovery-strategy=Ignore --overwrite

If you want to remove the annotations later, append a hyphen to the annotation name. For example:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
  annotate vmi my-vm \
  vm.cluster.gke.io/vm-ha-recovery-strategy-

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG \
  annotate gvm my-vm \
  vm.cluster.gke.io/vm-ha-recovery-strategy-
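To check whether an override is currently in place, you can read the annotation back from both objects. This sketch uses the generic kubectl jsonpath output, in which the dots in the annotation key are escaped with backslashes. The function name show_vm_recovery_strategy is hypothetical.

```shell
# Hypothetical helper: print the recovery-strategy annotation found on the
# VMI and GVM objects of one VM. An empty value is reported as "<not set>".
# $1: path of your user cluster kubeconfig file.
# $2: VM name.
show_vm_recovery_strategy() {
  for kind in vmi gvm; do
    value=$(kubectl --kubeconfig "$1" get "$kind" "$2" \
      --output 'jsonpath={.metadata.annotations.vm\.cluster\.gke\.io/vm-ha-recovery-strategy}') \
      || return 1
    printf '%s: %s\n' "$kind" "${value:-<not set>}"
  done
}
```

If the two lines of output differ, one of the objects is missing the annotation, and the override should be reapplied to both.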

Advanced configuration

In addition to configuring the default recovery strategy, you can configure the following:

  • Node heartbeat interval: The time between heartbeats sent by each cluster node.

  • Node monitor grace period: The maximum amount of time a node can fail to send a heartbeat before it is considered unhealthy.

In most cases, the default values for heartbeat interval and grace period are appropriate. However, you might choose to adjust these values if you want to fine-tune the tradeoff between speed of recovery and overhead. A shorter heartbeat interval shortens recovery time, but also increases overhead. In a large cluster, you might choose to lengthen the heartbeat interval, because frequent heartbeats from many nodes could create an unacceptable load on the Kubernetes API server.

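Keep the heartbeat interval well below the grace period, so that the grace period spans several heartbeats. Otherwise, a single missed heartbeat can result in a node being deemed unhealthy. With the default values, the grace period of 1m30s spans six heartbeat intervals of 15s.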
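Before you apply new values, you can check how many heartbeat intervals fit in the grace period. The following sketch is plain shell arithmetic and does not contact the cluster. The function names are hypothetical, and the parser is an assumption about the duration format: it handles only whole-number h, m, and s components without leading zeros, as in 15s or 1m30s. How many intervals count as enough is a judgment call that this document does not specify.

```shell
# Convert a duration such as 15s, 2m, or 1m30s to whole seconds.
# Sketch only: integer h, m, and s components, no leading zeros.
duration_to_seconds() {
  rest=$1
  total=0
  if [ -z "$rest" ]; then return 1; fi
  while [ -n "$rest" ]; do
    num=${rest%%[hms]*}              # digits before the first unit letter
    case "$num" in ''|*[!0-9]*|0?*) return 1 ;; esac
    if [ "$num" = "$rest" ]; then return 1; fi   # no unit letter present
    rest=${rest#"$num"}
    unit=${rest%"${rest#?}"}         # first remaining character
    rest=${rest#?}
    case "$unit" in
      h) total=$((total + num * 3600)) ;;
      m) total=$((total + num * 60)) ;;
      s) total=$((total + num)) ;;
    esac
  done
  echo "$total"
}

# Print how many whole heartbeat intervals fit in the grace period.
# $1: nodeHeartbeatInterval; $2: nodeMonitorGracePeriod.
heartbeats_per_grace_period() {
  interval=$(duration_to_seconds "$1") || return 1
  grace=$(duration_to_seconds "$2") || return 1
  if [ "$interval" -le 0 ]; then return 1; fi
  echo $((grace / interval))
}
```

For the default values, heartbeats_per_grace_period 15s 1m30s prints 6. A result of 1 means that one missed heartbeat could be enough for a node to be considered unhealthy.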

Run kubectl edit to open the VMHighAvailabilityPolicy object for editing. In the spec section, set nodeHeartbeatInterval and nodeMonitorGracePeriod to values of your choice. The following example shows the default values:

spec:
  defaultRecoveryStrategy: Reschedule
  nodeHeartbeatInterval: 15s
  nodeMonitorGracePeriod: 1m30s
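The two timing fields can also be set together in a single non-interactive step. This is a sketch that relies only on the generic kubectl patch command with a JSON merge patch; it is not a documented procedure for this product, and the function name set_node_monitoring_timing is hypothetical.

```shell
# Hypothetical helper: set both timing fields in one merge patch.
# $1: path of your user cluster kubeconfig file.
# $2: new nodeHeartbeatInterval, for example 15s.
# $3: new nodeMonitorGracePeriod, for example 1m30s.
set_node_monitoring_timing() {
  kubectl --kubeconfig "$1" patch VMHighAvailabilityPolicy default \
    --namespace vm-system --type merge \
    --patch "{\"spec\":{\"nodeHeartbeatInterval\":\"$2\",\"nodeMonitorGracePeriod\":\"$3\"}}"
}
```

Because both fields travel in one patch, the policy is never left with a new heartbeat interval paired with the old grace period.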