Live migration process during maintenance events

During a planned maintenance event for the underlying hardware of a virtual machine (VM) instance or bare metal instance, the host server is unavailable. To keep an instance running during a host event, Compute Engine performs a live migration of the instance to another host server in the same zone. For more information about host events, see About host events.

Live migration lets Google Cloud perform maintenance without interrupting a workload, rebooting an instance, or modifying any of the instance's properties, such as IP addresses, metadata, block storage data, application state, or network settings.

Live migration keeps instances running during the following situations:

Infrastructure maintenance. Infrastructure maintenance includes host hardware, network and power grids in data centers, and host operating system (OS) and BIOS.
Security-related updates and system configuration changes. These include events such as installing security patches and changing the size of the host root partition for storage of the host OS image and packages.
Hardware failures. This includes failures in memory, CPUs, network interface cards, and disks. If the failure is detected before there is a complete server failure, then Compute Engine performs a preventative live migration of the instance to a new host server. If the hardware fails completely or otherwise prevents live migration, then the instance terminates and restarts automatically.

Compute Engine only performs a live migration of VMs that have the host maintenance policy set to migrate. For information about how to change the host maintenance policy, see Set VM host maintenance policy.

Live migration process and Local SSD disks

Compute Engine can live migrate instances with Local SSD disks attached (excluding Z3 instances with more than 18 TiB of attached Titanium SSD). Compute Engine moves the VM instances along with their Local SSD data to a new machine in advance of any planned maintenance.

Limitations

Live migration is not supported for the following VM types:

Bare metal instances. Instances created with a bare metal machine type don't support live migration. The maintenance behavior for these instances is set to TERMINATE and RESTART, respectively.
Most Confidential VM instances. Live migration for Confidential VM instances is only supported on N2D machine types with AMD EPYC Milan CPU platforms running AMD SEV. All other Confidential VM instances don't support live migration, and must be set to stop and optionally restart during a host maintenance event. See Live migration for more details.
VMs with GPUs attached. VM instances with GPUs attached must be set to stop and optionally restart. Compute Engine offers a notice before a VM instance with a GPU attached is stopped, depending on the GPU type:
- For most GPUs, Compute Engine provides a 60-minute notice.
- For GPU families running on AI Hypercomputer Cluster Director, Compute Engine provides a 10-minute notice.
To learn more about these maintenance event notices, read Query metadata server for maintenance event notices.

To learn more about handling host maintenance with GPUs, read Handling host maintenance in the GPUs documentation.
Cloud TPUs. Cloud TPUs don't support live migration.
Storage-optimized VMs. Z3 VMs with more than 18 TiB of attached Titanium SSD don't support live migration. The maintenance behavior for these VMs is set to TERMINATE and RESTART.Compute Engine preserves the data on Titanium SSD during the maintenance event, as described in Disk persistence following instance termination.

How does the live migration process work?

When a VM is scheduled to live migrate, Compute Engine provides a notification so that you can prepare your workloads and applications for this live migration disruption. During live migration, Google Cloud observes a minimum disruption time, which is typically much less than 1 second. If a VM is not set to live migrate, Compute Engine terminates the VM during host maintenance. VMs that are set to terminate during a host event stop and (optionally) restart.

When Google Cloud migrates a running VM from one host to another, it moves the complete state of the VM from the source to the destination in a way that is transparent to the guest OS and anything communicating with it. There are many components involved in making this work seamlessly, but the high-level steps are shown in the following illustration:

Migrating a VM and each of its resources to a new host system
without requiring the guest operating system to restart. — *Live migration components*

The process begins with a notification that a VM needs to be moved from its current host machine. The notification might start with a file change indicating that a new BIOS version is available, a hardware operation scheduling maintenance, or an automatic signal from an impending hardware failure.

Google Cloud's cluster management software constantly watches for these events and schedules them based on policies that control the data centers, such as capacity utilization rates and the number of VMs that a single customer can migrate at once.

After a VM is selected for migration, Google Cloud provides a notification to the guest that a migration is happening soon. After a waiting period, a target host is selected and the host is asked to set up a new, empty "target" VM to receive the migrating "source" VM. Authentication is used to establish a connection between the source and the target.

There are three stages involved in the VM's migration:

Source brownout. The VM is still executing on the source, while most state is sent from the source to the target. For example, Google Cloud copies all the guest memory to the target, while tracking the pages that have been changed on the source. The time spent in source brownout is a function of the size of the guest memory and the rate at which pages are being changed.
Blackout. A very brief moment when the VM is not running anywhere, the source VM is paused and all the remaining state required to begin running the VM on the target is sent. The VM enters the blackout stage when sending state changes during the source brownout stage reaches a point of diminishing returns. An algorithm is used that balances numbers of bytes of memory being sent against the rate at which the guest VM is making changes.

During blackout events, the system clock appears to jump forward, up to 5 seconds. If a blackout event exceeds 5 seconds, Google Cloud stops and synchronizes the clock using a daemon that is included as part of the VM guest packages.
Target brownout. The VM executes on the target VM. The source VM is present and might provide support for the target VM. For example, until the network fabric has caught up with the new location of the target VM, the source VM provides forwarding services for packets to and from the target VM.

Finally, the migration is complete and the system deletes the source VM. You can see that the migration took place in the Cloud Logging logs for your VM.

Live migration of sole-tenant VMs

As your workload runs, you might want to move VMs to a different sole-tenant node or node group. If you move a VM to a group of nodes, Compute Engine determines which node to place it on. For information about sole-tenancy, see Sole-tenancy overview.

To move sole-tenant VMs to a different node or node group, you can manually initiate a live migration. You can also manually initiate a live migration to move a VM on a multi-tenant host into a sole-tenant node. For more information, see Manually live migrate VMs.

What's next

Set VM host maintenance policy options to configure your instances to live migrate.
Learn how to get live migration notices so you can trigger tasks that you want to perform prior to a maintenance event.
Read tips for designing a robust system that can handle service disruptions.