Availability best practices

This page describes best practices for ensuring high availability for your Google Distributed Cloud installation. Distributed Cloud does not offer a service level agreement (SLA) and only provides the service level objective (SLO) described on this page.

Choose and implement the level of availability

You must choose the level of availability for your Distributed Cloud workloads that best suits your business requirements. For example, a self-checkout application at a retail store can typically tolerate far more downtime than an edge RAN deployment of a cellular network carrier.

Target availability increases with the amount of Distributed Cloud spare resource capacity that you reserve for emergencies. The following table describes this relationship. These estimates do not include the downtime scheduled with a maintenance window.

The Distributed Cloud connected software consumes some resources on each physical machine. The amount varies depending on the specific configuration of your Distributed Cloud connected deployment. Google recommends that you benchmark your Distributed Cloud connected deployment to measure this amount and account for it when planning your workload distribution.

Capacity in use    Reserved capacity    Target availability
83.33%             16.67%               99.9%
100%               0%                   93.5%

You might experience a sudden loss of capacity due to hardware failure or a node that requires a restart. To prepare for this, you must architect your workloads with resource quotas in mind so that you always have available capacity on each Distributed Cloud node that meets your chosen level of availability.

For example, to achieve 99.9% target availability, you must configure your workloads so that one of the six physical machines in each Distributed Cloud cluster is available as a backup.
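The arithmetic behind this example can be sketched as follows. This is a minimal illustration, not a Google-provided tool: the function names, the 32-core machine size, and the 4-core software overhead are all hypothetical values that you would replace with figures from your own benchmarks.

```python
from fractions import Fraction


def reserved_fraction(total_machines: int, spare_machines: int) -> Fraction:
    """Return the share of cluster capacity held back as spare machines."""
    if total_machines <= 0 or not 0 <= spare_machines <= total_machines:
        raise ValueError("need total_machines > 0 and 0 <= spare_machines <= total_machines")
    return Fraction(spare_machines, total_machines)


def schedulable_cpu(total_machines: int, spare_machines: int,
                    cpu_per_machine: int, system_overhead_cpu: int) -> int:
    """Return the CPU cores left for workloads on the non-spare machines.

    system_overhead_cpu is the per-machine amount consumed by the platform
    software; it is an assumed, benchmark-derived number.
    """
    active_machines = total_machines - spare_machines
    return active_machines * (cpu_per_machine - system_overhead_cpu)


# One spare machine out of six matches the 99.9% row of the table.
reserve = reserved_fraction(6, 1)
print(f"reserved: {float(reserve):.2%}, in use: {float(1 - reserve):.2%}")
# prints "reserved: 16.67%, in use: 83.33%"

# Hypothetical sizing: 32 cores per machine, 4 cores of measured overhead.
print(schedulable_cpu(6, 1, cpu_per_machine=32, system_overhead_cpu=4))  # prints 140
```

Keeping the total of your workload resource requests at or below the schedulable figure is one way to make sure that the loss of a single machine does not leave workloads without capacity.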

Use survivability mode

Distributed Cloud lets you create clusters that use a local control plane that runs on your Distributed Cloud hardware. Such clusters allow workloads to continue running when the connection to Google Cloud is lost. For more information, see Distributed Cloud survivability mode.

Understand software updates and maintenance windows

Google regularly updates the Distributed Cloud software. These software updates are mandatory and you cannot opt out of them. Distributed Cloud lets you specify individual maintenance windows for each of your Distributed Cloud clusters.

Maintenance windows let you control when automatic upgrades of control planes and nodes can occur, which helps you mitigate potential transient disruptions to your workloads. Maintenance windows are useful for the following types of scenarios, among others:

  • Off-peak hours: You want to minimize the chance of downtime by scheduling automatic upgrades during off-peak hours when traffic is reduced.
  • On-call: You want to ensure that upgrades happen during working hours so that someone can monitor the upgrades and manage any unanticipated issues.
  • Multi-cluster upgrades: You want to roll out upgrades across multiple clusters in different regions one at a time at specified intervals.
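For the multi-cluster scenario, a planning helper along the following lines can generate one window start time per cluster at a fixed interval. It is only a sketch: the cluster names, the start time, and the one-day interval are hypothetical, and the resulting times would still have to be applied to each cluster through the supported configuration tools.

```python
from datetime import datetime, timedelta, timezone


def staggered_starts(clusters: list[str], first_start: datetime,
                     interval: timedelta) -> dict[str, datetime]:
    """Assign each cluster a window start, spaced `interval` apart, in list order."""
    return {name: first_start + i * interval for i, name in enumerate(clusters)}


# Hypothetical clusters, upgraded one per day starting at 02:00 UTC.
plan = staggered_starts(
    ["store-east", "store-central", "store-west"],
    first_start=datetime(2024, 1, 8, 2, 0, tzinfo=timezone.utc),
    interval=timedelta(days=1),
)
for cluster, start in plan.items():
    print(cluster, start.isoformat())
```

Choosing an interval at least as long as the window itself keeps the windows of different clusters from overlapping.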

In addition to automatic upgrades, Google might occasionally need to perform other maintenance tasks. In those cases, it honors a cluster's maintenance window when possible.

If tasks run beyond the maintenance window, Distributed Cloud attempts to pause the tasks. It then attempts to resume those tasks during the next maintenance window.
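As a rough planning aid, you can estimate how many maintenance windows a long-running task might span. The sketch below assumes an idealized case in which the task uses each window fully and resumes exactly where it paused; the function name and the durations are hypothetical, and actual behavior can differ because pausing and resuming are only attempted.

```python
from datetime import timedelta


def windows_needed(task_duration: timedelta, window_length: timedelta) -> int:
    """Return how many windows a task spans if it pauses at each window's end."""
    if window_length <= timedelta(0):
        raise ValueError("window_length must be positive")
    if task_duration <= timedelta(0):
        return 0
    # divmod on timedeltas yields whole windows plus the leftover duration.
    full_windows, remainder = divmod(task_duration, window_length)
    return full_windows + (1 if remainder else 0)


# A hypothetical 14-hour task with six-hour windows spans three windows.
print(windows_needed(timedelta(hours=14), timedelta(hours=6)))  # prints 3
```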

Distributed Cloud reserves the right to roll out unplanned emergency upgrades outside of maintenance windows. Additionally, mandatory upgrades from deprecated or outdated software might automatically occur outside of maintenance windows.

You can also manually upgrade your cluster at any time. Manually initiated upgrades begin immediately and ignore any maintenance windows.

To learn how to set up a maintenance window for a new or existing cluster, see Configure a maintenance window.

Restrictions

Maintenance windows have the following restrictions:

  • One maintenance window per cluster. You can only configure a single maintenance window per cluster. Configuring a new maintenance window overwrites the previous one.

  • Time zones for maintenance windows. When configuring and viewing maintenance windows, times are shown differently depending on the tool that you are using, as detailed in the following sections.

When configuring maintenance windows

When you use the more generic --maintenance-window flag to configure a maintenance window, you cannot specify a time zone. When you use the Google Cloud CLI or the API, UTC is used to display times. The Google Cloud console uses the local time zone to display times.

When you use more granular flags, such as --maintenance-window-start, you can specify the time zone as part of the value. If you omit the time zone, your local time zone is used. Times are always stored in UTC.

When viewing maintenance windows

When you view information about your cluster, timestamps for maintenance windows can be shown in UTC or in your local time zone, depending on how you are viewing the information:

  • When you use the Google Cloud console to view information about your cluster, times are always displayed in your local time zone.
  • When you use the gcloud CLI to view information about your cluster, times are always shown in UTC.

In both cases, the RRULE is always in UTC. This means that any values in the rule, such as days of the week, refer to UTC days rather than days in your local time zone.
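Because the rule is evaluated in UTC, a window that starts late in the evening at your site can fall on the next day in UTC. The following sketch shows how to find the UTC day code for a local start time. The fixed UTC-8 offset and the helper name are assumptions for illustration; the two-letter codes are the standard iCalendar (RFC 5545) day abbreviations.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical site time zone: a fixed UTC-8 offset with no daylight saving.
SITE_TZ = timezone(timedelta(hours=-8))

# iCalendar day codes, indexed by datetime.weekday() (Monday is 0).
BYDAY_CODES = ["MO", "TU", "WE", "TH", "FR", "SA", "SU"]


def utc_byday(local_start: datetime) -> str:
    """Return the day code to use in the rule for a time-zone-aware local start."""
    if local_start.tzinfo is None:
        raise ValueError("local_start must be time-zone aware")
    return BYDAY_CODES[local_start.astimezone(timezone.utc).weekday()]


# Monday 22:00 at the site is Tuesday 06:00 in UTC.
print(utc_byday(datetime(2024, 1, 8, 22, 0, tzinfo=SITE_TZ)))  # prints "TU"
# Monday 10:00 at the site is still Monday in UTC.
print(utc_byday(datetime(2024, 1, 8, 10, 0, tzinfo=SITE_TZ)))  # prints "MO"
```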

Configure cluster maintenance windows

Distributed Cloud lets you specify a maintenance window for each of your Distributed Cloud clusters. This window tells Google to only update the Distributed Cloud software during the time and at the frequency that you specify.

The following rules govern Distributed Cloud cluster maintenance windows:

  • If you specify a maintenance window for a Distributed Cloud cluster, Google updates your Distributed Cloud software 48 hours after the update has been announced through the Distributed Cloud release notes. On the release notes page, you can subscribe to the Distributed Cloud release notes RSS feed to stay informed about software updates as they are released.
  • The minimum length of a maintenance window is six hours. You can specify a longer window based on the complexity of your Distributed Cloud installation and your business requirements.
  • The minimum frequency of software updates is once per week. You can specify either weekly or daily maintenance windows. You can include and exclude specific days.
  • You can change the maintenance window schedule for a cluster at any time, except when a maintenance window has already been scheduled or when a maintenance window is in progress.
  • If the software update does not complete within the specified time window, it pauses and then resumes during the next scheduled maintenance window.
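Two of the rules above, the six-hour minimum length and the weekly minimum frequency, lend themselves to a quick check before you apply a schedule. The following is a minimal sketch under one assumption: that "once per week" means consecutive windows start no more than seven days apart. The function and constant names are hypothetical and are not part of any Distributed Cloud API.

```python
from datetime import timedelta

MIN_WINDOW_LENGTH = timedelta(hours=6)   # from the rules above
MAX_WINDOW_INTERVAL = timedelta(days=7)  # assumed reading of "once per week"


def check_window(length: timedelta, interval_between_starts: timedelta) -> list[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    if length < MIN_WINDOW_LENGTH:
        problems.append("window is shorter than six hours")
    if interval_between_starts > MAX_WINDOW_INTERVAL:
        problems.append("windows recur less often than once per week")
    return problems


# A six-hour weekly window passes; a four-hour fortnightly window fails both checks.
print(check_window(timedelta(hours=6), timedelta(days=7)))   # prints []
print(check_window(timedelta(hours=4), timedelta(days=14)))
```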

For detailed instructions, see Configure a maintenance window for a cluster.

Repair of failed hardware

When Google detects a failure of the Distributed Cloud hardware, Google attempts to schedule a site visit within three business days. For a Google-authorized technician to perform the necessary diagnosis and repairs, you must grant them access to the Distributed Cloud hardware.

If a failure of Distributed Cloud hardware occurs and Google performs on-site repairs, all storage media are removed from the Distributed Cloud machine being serviced and are placed into your custody for the duration of the repair.

Other points of failure

You are responsible for maintaining the following aspects of your Distributed Cloud installation that are outside of Google's control and can affect the availability of Distributed Cloud:

What's next