Protect your data using backups

This page describes the AlloyDB Omni Standard Availability reference architecture, which provides data protection using backups.

Use cases

This reference architecture supports the following scenarios:

  • You have databases that can tolerate some downtime and the loss of data written since the last backup.
  • You want to protect your AlloyDB Omni database from user error, corruption, or physical failures at the database level (as opposed to relying on server or VM image snapshots).
  • You want to be able to recover your database in-place or remotely, optionally to a specific point in time.

How the reference architecture works

The Standard Availability reference architecture covers the backup and recovery of your AlloyDB Omni databases, whether they run as a standalone instance on a host server or virtual machine (Install AlloyDB Omni) or in a Kubernetes cluster (Install AlloyDB Omni on Kubernetes).

While Standard Availability is a basic implementation that minimizes the need for additional hardware or services, its Recovery Time Objective (RTO) grows with database size: the more data there is to back up, the longer it takes to restore and recover the database. Potential data loss depends on the type of backup. If only data files are backed up periodically, then a restore loses any data written since the last backup.

Reducing RPO

The PostgreSQL continuous archiving feature lets you achieve a low Recovery Point Objective (RPO) and enables point-in-time recovery from backups. This process involves archiving Write-Ahead Logging (WAL) files, or streaming WAL data, potentially to a remote storage location.
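
The following is a minimal sketch of enabling continuous WAL archiving; the archive directory path is a hypothetical placeholder:

```sh
# Minimal sketch: enable continuous WAL archiving. The archive directory
# /var/lib/postgresql/wal_archive is a hypothetical placeholder path.
psql -U postgres -c "ALTER SYSTEM SET archive_mode = 'on';"
psql -U postgres -c "ALTER SYSTEM SET archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f';"
# archive_mode only takes effect after a server restart.
```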

If WAL files are archived only when full or at specific intervals, a complete database loss (including the current WAL files) restricts recovery to the last archived WAL file, which means that the Recovery Point Objective (RPO) must account for the data written since that archive. Conversely, continuously streaming WAL data brings the potential data loss close to zero.

When you take continuous backups, you can perform recovery to a specific point in time. Point-in-time recovery lets you restore the database to a state before an error, such as an accidental table deletion or an incorrect batch update. However, because recovering the entire database to an earlier point discards all later transactions, this method affects the RPO unless you restore into a temporary auxiliary database and extract only the data that you need.
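
A minimal sketch of a point-in-time recovery, assuming a base backup has already been restored into the data directory and the server is stopped; the paths and target timestamp are hypothetical placeholders:

```sh
# Minimal sketch of point-in-time recovery, assuming a base backup has already
# been restored into the data directory and the server is stopped. Paths and
# the target timestamp are hypothetical placeholders.
cat >> /var/lib/postgresql/data/postgresql.conf <<'EOF'
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '2025-01-15 09:30:00'
recovery_target_action = 'promote'
EOF
# recovery.signal puts the server into targeted recovery mode at next start.
touch /var/lib/postgresql/data/recovery.signal
```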

Backup strategies

You can configure AlloyDB Omni Postgres-level backups to be stored on either local or remote storage. While local storage might be faster to back up to and recover from, remote storage is usually more robust because it survives the failure of an entire host or VM.

Backups in non-Kubernetes

For non-Kubernetes deployments, you can schedule backups using the following PostgreSQL tools:

  • pgBackRest
  • Barman

Alternatively, for small databases, you can perform a logical backup of the database (using pg_dump for a single database or pg_dumpall for the entire cluster). You can restore a pg_dump archive using pg_restore; pg_dumpall produces plain SQL, which you replay using psql.
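
For example, a minimal logical backup and restore sketch; the database and file names are placeholders:

```sh
# Logical backup of one database (custom format) and of the whole cluster.
# "mydb" and the file names are placeholders.
pg_dump -U postgres -Fc -f mydb.dump mydb
pg_dumpall -U postgres -f cluster.sql

# pg_restore reads the custom-format archive; pg_dumpall output is plain SQL,
# so it is replayed with psql instead.
pg_restore -U postgres -d mydb --clean mydb.dump
psql -U postgres -f cluster.sql
```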

Backups in Kubernetes using the AlloyDB Omni operator

For AlloyDB Omni deployed in a Kubernetes cluster, you can configure continuous backups using a backup plan for each database cluster. For more information, see Back up and restore in Kubernetes.
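
The following is a hypothetical sketch of a backup plan resource; the apiVersion, kind, and field names are assumptions, so verify them against the operator's CRD reference for your version:

```sh
# Hypothetical sketch of a backup plan for the AlloyDB Omni operator. The
# apiVersion, kind, and field names are assumptions; verify them against the
# operator's CRD reference for your version before use.
kubectl apply -f - <<'EOF'
apiVersion: alloydbomni.dbadmin.goog/v1
kind: BackupPlan
metadata:
  name: my-backupplan            # placeholder name
spec:
  dbclusterRef: my-dbcluster     # placeholder database cluster name
  backupSchedules:
    full: "0 1 * * 0"            # weekly full backup (cron format)
    incremental: "0 1 * * 1-6"   # daily incremental backups
  backupRetainDays: 14
EOF
```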

You can store AlloyDB Omni backups locally, or remotely in object storage such as Cloud Storage or an equivalent service from another cloud vendor. For more information, see Figure 1, which shows potential backup destinations.

Figure 1. AlloyDB Omni with backup options.

Backups can be written to either local or remote storage. Local backups tend to be quicker because they are limited only by local I/O throughput, whereas remote backups usually contend with higher latency and lower network bandwidth. However, remote backups provide the best protection, including against zonal failures.

You can further divide local backups into local or shared storage. Local storage offers no disaster recovery path when the database host fails, whereas shared storage can be relocated to another server and used for recovery there. This means that shared storage potentially offers a faster RTO.

For local and shared storage deployments, the following types of backup might be scheduled or taken manually on demand (see the pgBackRest sketch after this list):

  • Full backups: complete backups of all the database files needed for data recovery.
  • Differential backups: backups of only the file changes since the last full backup.
  • Incremental backups: backups of only the file changes since the last backup of any kind.
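
As a concrete illustration, pgBackRest exposes each of these backup types through its --type option; the stanza name demo is a placeholder:

```sh
# Each backup type on demand with pgBackRest; "demo" is a placeholder stanza
# name defined in pgbackrest.conf.
pgbackrest --stanza=demo --type=full backup   # all database files
pgbackrest --stanza=demo --type=diff backup   # changes since the last full backup
pgbackrest --stanza=demo --type=incr backup   # changes since the last backup of any kind
```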

Point-in-time recovery

Continuous backups of the PostgreSQL Write-Ahead Logging (WAL) files support point-in-time recovery. If, after a failure event, the WAL files are intact and usable, then you can use these files to recover with no data loss.

To control the writing of the WAL files, you can configure the following parameters:

  • wal_writer_delay: Specifies how often the WAL writer flushes the WAL to disk, unless it is woken sooner by an asynchronously committing transaction. The default value is 200ms. Increasing this value reduces the frequency of writes, but it might increase the amount of data lost if the server crashes.
  • wal_writer_flush_after: Specifies how much WAL data can accumulate before the WAL writer forces a flush to disk. The default value is 1MB. If set to zero, WAL data is flushed to disk immediately.
  • synchronous_commit: Specifies whether a commit waits for the WAL data to be flushed to disk before returning a response to the user. The default setting is on, which ensures that the transaction is durable: the commit has been written to disk before a success code is returned. If set to off, there can be a delay of up to three times wal_writer_delay before the transaction is written to disk.
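
A minimal sketch of setting these parameters; the values shown are the PostgreSQL defaults:

```sh
# Minimal sketch of setting the WAL writer parameters; the values shown are
# the PostgreSQL defaults.
psql -U postgres <<'EOF'
ALTER SYSTEM SET wal_writer_delay = '200ms';
ALTER SYSTEM SET wal_writer_flush_after = '1MB';
ALTER SYSTEM SET synchronous_commit = on;
SELECT pg_reload_conf();  -- these settings take effect on reload
EOF
```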

WAL usage monitoring

You can use the following methods to observe WAL usage:

  • pg_stat_wal: This standard view has the columns wal_write and wal_sync, which count the number of WAL writes and WAL syncs. When the configuration parameter track_wal_io_timing is enabled, the wal_write_time and wal_sync_time columns are also populated. Regular snapshots of this view can show WAL write and sync activity over time.
  • pg_current_wal_lsn(): This function returns the current log sequence number (LSN) position. When associated with a timestamp and collected as snapshots over time, it provides the bytes per second of WAL generated, using the function pg_wal_lsn_diff(lsn1, lsn2). This is a useful metric for understanding the transaction rate and how the WAL files are performing.
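
For example, the following sketch samples both methods from the shell; connection details are placeholders:

```sh
# Sample WAL write/sync counters (timings require track_wal_io_timing = on).
psql -U postgres -c "SELECT wal_write, wal_sync, wal_write_time, wal_sync_time FROM pg_stat_wal;"

# Measure WAL bytes generated over 60 seconds using two LSN snapshots.
LSN1=$(psql -U postgres -Atc "SELECT pg_current_wal_lsn();")
sleep 60
psql -U postgres -Atc "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '$LSN1');"
```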

Streaming WAL data to a remote location

When you use Barman, you can also set up WAL data to stream in real time to a remote location, so that there is little to no data loss on recovery. Even with real-time streaming, there is a small chance of losing committed transactions, because writes to the remote Barman server are asynchronous by default. However, you can set up WAL streaming in synchronous mode, where the remote server stores the WAL and sends a status response back to the source database before the commit completes. Keep in mind that this approach can slow down transactions that must wait for the remote write to complete.
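
A hypothetical barman.conf fragment that enables WAL streaming; the server name, host, and user are placeholders:

```sh
# Hypothetical barman.conf fragment enabling WAL streaming; the server name,
# host, and user are placeholders.
cat >> /etc/barman.d/pg.conf <<'EOF'
[pg]
streaming_archiver = on
slot_name = barman
streaming_conninfo = host=db.example.com user=streaming_barman
EOF
# Create the replication slot, then Barman's cron starts streaming WAL.
barman receive-wal --create-slot pg
# For synchronous mode, also set synchronous_standby_names = 'barman_receive_wal'
# on the source database.
```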

Backup schedules

In most environments, backups are scheduled on a weekly cycle. The following is a typical weekly schedule of backups (see the cron sketch after this list):

  • Sunday: full backup
  • Monday, Tuesday: incremental backup
  • Wednesday: differential backup
  • Thursday, Friday: incremental backup
  • Saturday: differential backup
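
One way to implement this schedule is with cron entries that call pgBackRest; the stanza name demo is a placeholder, and installing this replaces any existing crontab for the user:

```sh
# One way to implement the weekly schedule; "demo" is a placeholder stanza
# name, and this replaces any existing crontab for the postgres user.
cat <<'EOF' | crontab -u postgres -
0 1 * * 0   pgbackrest --stanza=demo --type=full backup
0 1 * * 1,2 pgbackrest --stanza=demo --type=incr backup
0 1 * * 3   pgbackrest --stanza=demo --type=diff backup
0 1 * * 4,5 pgbackrest --stanza=demo --type=incr backup
0 1 * * 6   pgbackrest --stanza=demo --type=diff backup
EOF
```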

Using this typical schedule, a rolling recovery window of one week requires storage space for up to three full backups, plus the associated incremental and differential backups. This covers the worst case: a failure during the Sunday full backup, where the recovery window must still extend back to before the previous Sunday's backup started.

To minimize RTO, at the cost of a potentially higher RPO, additional databases can operate in continuous recovery mode. This involves restoring backups and continuously updating the secondary environment by archiving and replaying new WAL files. The actual RPO, which reflects potential data loss, depends on transaction frequency, WAL file size, and the use of WAL streaming.
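
A minimal sketch of such a secondary in continuous recovery mode, assuming its data directory was restored from a backup of the primary and archived WAL arrives at a shared path (the path shown is hypothetical):

```sh
# Minimal sketch of a secondary kept in continuous recovery, assuming its data
# directory was restored from a backup of the primary and archived WAL arrives
# at the (hypothetical) path below.
cat >> /var/lib/postgresql/data/postgresql.conf <<'EOF'
restore_command = 'cp /mnt/wal_archive/%f %p'
EOF
# standby.signal keeps the server replaying WAL instead of opening for writes.
touch /var/lib/postgresql/data/standby.signal
```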

Restoring in non-Kubernetes

For non-Kubernetes deployments, restoring an AlloyDB Omni database involves either stopping the Docker container and restoring the data in place, or restoring the data to a different location and launching a new Docker container that uses the restored data. After the container starts, the database is accessible with the restored data.
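
A hypothetical in-place restore flow for a container deployment using pgBackRest; the container name, stanza, data path, and image name are placeholder assumptions:

```sh
# Hypothetical in-place restore flow for a container deployment; the container
# name, stanza, data path, and image name are placeholder assumptions.
docker stop my-omni && docker rm my-omni
pgbackrest --stanza=demo --delta restore --pg1-path=/var/lib/alloydb/data
docker run -d --name my-omni \
  -v /var/lib/alloydb/data:/var/lib/postgresql/data \
  -e POSTGRES_PASSWORD=change-me \
  google/alloydbomni
```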

For more information about recovery options, see Restore an AlloyDB Omni cluster using pgBackRest and Restore an AlloyDB Omni cluster using Barman.

Restoring in Kubernetes using the AlloyDB Omni operator

To restore a database in Kubernetes, the operator offers recovery into the same Kubernetes cluster and namespace, either from a named backup or as a clone from a point in time (PIT). To clone a database into a different Kubernetes cluster, use pgBackRest. For more information, see Back up and restore in Kubernetes and Clone a database cluster from a Kubernetes backup overview.

Implementation

When you choose an availability reference architecture, keep in mind the following benefits, limitations, and alternatives.

Benefits

  • Straightforward to use and manage, and suitable for non-critical databases with lenient RTO/RPO requirements.
  • Minimal additional hardware required.
  • Backups are always required for a full disaster recovery plan.
  • Recovery to any point in time within the recovery window is possible.

Limitations

  • Storage requirements that are possibly larger than the database itself, depending on the retention requirements.
  • Can be slow to recover and might result in higher RTO.
  • Can result in some data loss, depending on whether the current WAL data is available after a database failure, which might negatively affect RPO.

Alternatives

  • Consider the Enhanced Availability or Premium Availability reference architectures for improved availability and disaster recovery options.

What's next