Database migration: Concepts and principles (Part 1)

Last reviewed 2024-03-07 UTC

This document introduces concepts, principles, terminology, and architecture of near-zero downtime database migration for cloud architects who are migrating databases to Google Cloud from on-premises or other cloud environments.

This document is part 1 of two parts. Part 2 discusses setting up and executing the migration process, including failure scenarios.

Database migration is the process of migrating data from one or more source databases to one or more target databases by using a database migration service. When a migration is finished, the dataset in the source databases resides fully, though possibly restructured, in the target databases. Clients that accessed the source databases are then switched over to the target databases, and the source databases are turned down.

The following diagram illustrates this database migration process.

Flow of data from source to target databases through the migration service.

This document describes database migration from an architectural standpoint:

  • The services and technologies involved in database migration.
  • The differences between homogeneous and heterogeneous database migration.
  • The tradeoffs and selection of a migration downtime tolerance.
  • A setup architecture that supports a fallback if unforeseen errors occur during a migration.

This document does not describe how you set up a particular database migration technology. Rather, it introduces database migration in fundamental, conceptual, and principle terms.


The following diagram shows a generic database migration architecture.

Architecture of migration service accessing source and target databases.

A database migration service runs within Google Cloud and accesses both source and target databases. Two variants are represented: (a) shows the migration from a source database in an on-premises data center or a remote cloud to a managed database like Spanner; (b) shows a migration to a database on Compute Engine.

Even though the target databases are different in type (managed and unmanaged) and setup, the database migration architecture and configuration is the same for both cases.


The most important data migration terms for these documents are defined as follows:

source database: A database that contains data to be migrated to one or more target databases.

target database: A database that receives data migrated from one or more source databases.

database migration: A migration of data from source databases to target databases with the goal of turning down the source database systems after the migration completes. The entire dataset, or a subset, is migrated.

homogeneous migration: A migration from source databases to target databases where the source and target databases are of the same database management system from the same provider.

heterogeneous migration: A migration from source databases to target databases where the source and target databases are of different database management systems from different providers.

database migration system: A software system or service that connects to source databases and target databases and performs data migrations from source to target databases.

data migration process: A configured or implemented process executed by the data migration system to transfer data from source to target databases, possibly transforming the data during the transfer.

database replication: A continuous transfer of data from source databases to target databases without the goal of turning down the source databases. Database replication (sometimes called database streaming) is an ongoing process.

Classification of database migrations

There are different types of database migrations that belong to different classes. This section describes the criteria that defines those classes.

Replication versus migration

In a database migration, you move data from source databases to target databases. After the data is completely migrated, you delete source databases and redirect client access to the target databases. Sometimes you keep the source databases as a fallback measure if you encounter unforeseen issues with the target databases. However, after the target databases are reliably operating, you eventually delete the source databases.

With database replication, in contrast, you continuously transfer data from the source databases to the target databases without deleting the source databases. Sometimes database replication is referred to as database streaming. While there is a defined starting time, there is typically no defined completion time. The replication might be stopped or become a migration.

This document discusses only database migration.

Partial versus complete migration

Database migration is understood to be a complete and consistent transfer of data. You define the initial dataset to be transferred as either a complete database or a partial database (a subset of the data in a database) plus every change committed on the source database system thereafter.

Heterogeneous migration versus homogeneous migration

A homogeneous database migration is a migration between the source and target databases of the same database technology, for example, migrating from a MySQL database to a MySQL database, or from an Oracle® database to an Oracle database. Homogeneous migrations also include migrations between a self-hosted database system such as PostgreSQL to a managed version of it such as Cloud SQL for PostgreSQL or AlloyDB for PostgreSQL.

In a homogenous database migration, the schemas for the source and target databases are likely identical. If the schemas are different, the data from the source databases must be transformed during migration.

Heterogeneous database migration is a migration between source and target databases of different database technologies, for example, from an Oracle database to Spanner. Heterogeneous database migration can be between the same data models (for example, from relational to relational), or between different data models (for example, from relational to key-value).

Migrating between different database technologies doesn't necessarily involve different data models. For example, Oracle, MySQL, PostgreSQL, and Spanner all support the relational data model. However, multi-model databases like Oracle, MySQL, or PostgreSQL support different data models. Data stored as JSON documents in a multi-model database can be migrated to MongoDB with little or no transformation necessary, as the data model is the same in the source and the target database.

Although the distinction between homogeneous and heterogeneous migration is based on database technologies, an alternative categorization is based on database models involved. For example, a migration from an Oracle database to Spanner is homogeneous when both use the relational data model; a migration is heterogeneous if, for example, data stored as JSON objects in Oracle is migrated to a relational model in Spanner.

Categorizing migrations by data model more accurately expresses the complexity and effort required to migrate the data than basing the categorization on the database system involved. However, because the commonly used categorization in the industry is based on the database systems involved, the remaining sections are based on that distinction.

Migration downtime: zero versus minimal versus significant

After you successfully migrate a dataset from the source to the target database, you then switch client access over to the target database and delete the source database.

Switching clients from the source databases to the target databases involves several processes:

  • To continue processing, clients must close existing connections to the source databases and create new connections to the target databases. Ideally, closing connections is graceful, meaning that you don't unnecessarily roll back ongoing transactions.
  • After closing connections on the source databases, you must migrate remaining changes from the source databases to the target databases (called draining) to ensure that all changes are captured.
  • You might need to test target databases to ensure that these databases are functional and that clients are functional and operate within their defined service level objectives (SLOs).

In a migration, achieving truly zero downtime for clients is impossible; there are times when clients cannot process requests. However, you can minimize the duration that clients are unable to process requests in several ways (near-zero downtime):

  • You can start your test clients in read-only mode against the target databases long before you switch the clients over. With this approach, testing is concurrent with the migration.
  • You can configure the amount of data being migrated (that is, in flight between the source and target databases) to be as small as possible when the switch over period approaches. This step reduces the time for draining because there are fewer differences between the source databases and the target databases.
  • If new clients operating on the target databases can be started concurrently with existing clients operating on the source databases, you can shorten the switch over time because the new clients are ready to execute as soon as all data is drained.

While it's unrealistic to achieve zero downtime during a switch over, you can minimize the downtime by starting activities concurrently with the ongoing data migration when possible.

In some database migration scenarios, significant downtime is acceptable. Typically, this allowance is a result of business requirements. In such cases, you can simplify your approach. For example, with a homogeneous database migration, you might not require data modification; export and import or backup and restore are perfect approaches. With heterogeneous migrations, the database migration system does not have to deal with updates of source database systems during the migration.

However, you need to establish that the acceptable downtime is long enough for the database migration and follow-up testing to occur. If this downtime cannot be clearly established or is unacceptably long, you need to plan a migration that involves minimal downtime.

Database migration cardinality

In many situations database migration takes place between a single source database and a single target database. In such situations, the cardinality is 1:1 (direct mapping). That is, a source database is migrated without changes to a target database.

A direct mapping, however, is not the only possibility. Other cardinalities include the following:

  • Consolidation (n:1). In a consolidation, you migrate data from several source databases to a smaller number of target databases (or even one target). You might use this approach to simplify database management or employ a target database that can scale.
  • Distribution (1:n). In a distribution, you migrate data from one source database to several target databases. For example, you might use this approach when you need to migrate a large centralized database containing regional data to several regional target databases.
  • Re-distribution (n:m). In a re-distribution, you migrate data from several source databases to several target databases. You might use this approach when you have sharded source databases with shards of very different sizes. The re-distribution evenly distributes the sharded data over several target databases that represent the shards.

Database migration provides an opportunity to redesign and implement your database architecture in addition to merely migrating data.

Migration consistency

The expectation is that a database migration is consistent. In the context of migration, consistent means the following:

  • Complete. All data that is specified to be migrated is actually migrated. The specified data could be all data in a source database or a subset of the data.
  • Duplicate free. Each piece of data is migrated once, and only once. No duplicate data is introduced into the target database.
  • Ordered. The data changes in the source database are applied to the target database in the same order as the changes occurred in the source database. This aspect is essential to ensure data consistency.

An alternative way to describe migration consistency is that after a migration completes, the data state between the source and the target databases is equivalent. For example, in a homogenous migration that involves the direct mapping of a relational database, the same tables and rows must exist in the source and the target databases.

This alternative way of describing migration consistency is important because not all data migrations are based on sequentially applying transactions in the source database to the target database. For example, you might back up the source database and use the backup to restore the source database content into the target database (when significant downtime is possible).

Active-passive versus active-active migration

An important distinction is whether the source and target databases are both open to modifying query processing. In an active-passive database migration, the source databases can be modified during the migration, while the target databases allow only read-only access.

An active-active migration supports clients writing into both the source as well as the target databases during the migration. In this type of migration, conflicts can occur. For example, if the same data item in the source and target database is modified so as to conflict with each other semantically, you might need to run conflict resolution rules to resolve the conflict.

In an active-active migration, you must be able to resolve all data conflicts by using conflict resolution rules. If you cannot, you might experience data inconsistency.

Database migration architecture

A database migration architecture describes the various components required for executing a database migration. This section introduces a generic deployment architecture and treats the database migration system as a separate component. It also discusses the features of a database management system that support data migration as well as non-functional properties that are important for many use cases.

Deployment architecture

A database migration can occur between source and target databases located in any environment, like on-premises or different clouds. Each source and target database can be in a different environment; it is not necessary that all are collocated in the same environment.

The following diagram shows an example of a deployment architecture involving several environments.

Migration architecture involving cloud and on-premises data centers.

DB1 and DB2 are two source databases, and DB3 and Spanner are the target databases. Two clouds and two on-premises data centers are involved in this database migration. The arrows represent the invocation relationships: the database migration service invokes interfaces of all source and target databases.

A special case not discussed here is the migration of data from a database into the same database. This special case uses the database migration system for data transformation only, not for migrating data between different systems across different environments.

Fundamentally, there are three approaches to database migration, which this section discusses:

Database migration system

The database migration system is at the core of database migration. The system executes the actual data extraction from the source databases, transports the data to the target databases, and optionally modifies the data during transit. This section discusses the basic database migration system functionality in general. Examples of database migration systems include Database Migration Service, Striim, Debezium, tcVision and Cloud Data Fusion.

Data migration process

The core technical building block of a database migration system is the data migration process. The data migration process is specified by a developer and defines the source databases from which data is extracted, the target databases into which data is migrated, and any data modification logic applied to the data during the migration.

You can specify one or more data migration processes and execute them sequentially or concurrently depending on the needs of the migration. For example, if you migrate independent databases, the corresponding data migration processes can run in parallel.

Data extraction and insertion

You can detect changes (insertions, updates, deletions) in a database system in two ways: database-supported change data capture (CDC) based on a transaction log, and differential querying of data itself using the query interface of a database management system.

CDC based on a transaction log

Database-supported CDC is based on database management features that are separate from the query interface. One approach is based on transaction logs (for example the binary log in MySQL). A transaction log contains the changes made to data in the correct order. The transaction log is continuously read, and so every change can be observed. For database migration, this logging is extremely useful, as CDC ensures that each change is visible and is subsequently migrated to the target database without loss and in the correct order.

CDC is the preferred approach for capturing changes in a database management system. CDC is built into the database itself and has the least load impact on the system.

Differential querying

If no database management system feature exists that supports observing all changes in the correct order, you can use differential querying as an alternative. In this approach, each data item in a database gets an additional attribute that contains a timestamp or a sequence number. Every time the data item is changed, the change timestamp is added or the sequence number is increased. A polling algorithm reads all data items since the last time it executed or since the last sequence number it used. Once the polling algorithm determines the changes, it records the current time or sequence number into its internal state and then passes on the changes to the target database.

While this approach works without problems for inserts and updates, you need to carefully design deletes because a delete removes a data item from the database. After the data is deleted, it is impossible for the poller to detect that a deletion occurred. You implement a deletion by using an additional status field (a logical delete flag) that indicates the data is deleted. Alternatively, deleted data items can be collected into one or more tables, and the poller accesses those tables to determine if deletion occurred.

For variants on differential querying, see Change data capture.

Differential querying is the least preferred approach because it involves schema and functionality changes. Querying the database also adds a query load that does not relate to executing client logic.

Adapter and agent

The database migration system requires access to the source and to the database systems. Adapters are the abstraction that encapsulates the access functionality. In the simplest form, an adapter can be a JDBC driver for inserting data into a target database that supports JDBC. In a more complex case, an adapter is running in the environment of the target (sometimes called agent), accessing a built-in database interface like log files. In an even more complex case an adapter or agent interfaces with yet another software system, which in turn accesses the database. For example, an agent accesses Oracle GoldenGate, and that in turn accesses an Oracle database.

The adapter or agent that accesses a source database implements the CDC interface or the differential querying interface, depending on the design of the database system. In both cases, the adapter or agent provides changes to the database migration system, and the database migration system is unaware if the changes were captured by CDC or differential querying.

Data modification

In some use cases, data is migrated from source databases to target databases unmodified. These straight-through migrations are typically homogeneous.

Many use cases, however, require data to be modified during the migration process. Typically, modification is required when there are differences in schema, differences in data values, or opportunities to clean up data while it is in transition.

The following sections discuss several types of modifications that can be required in a data migration—data transformation, data enrichment or correlation, and data reduction or filtering.

Data transformation

Data transformation transforms some or all data values from the source database. Some examples include the following:

  • Data type transformation. Sometimes data types between the source and target databases are not equivalent. In these cases, data type transformation casts the source value into the target value based on type transformation rules. For example, a timestamp type from the source might be transformed into a string in the target.
  • Data structure transformation. Data structure transformation modifies the structure in the same database model or between different database models. For example, in a relational system, one source table might be split into two target tables, or several source tables might be denormalized into one target table by using a join. A 1:n relationship in the source database might be transformed into a parent and child relationship in Spanner. Documents from a source document database system might be decomposed into a set of relational rows in a target system.
  • Data value transformation. Data value transformation is separate from data type transformation. Data value transformation changes the value without changing the data type. For example, a local time zone is converted to Coordinated Universal Time (UTC). Or a short zip code (five digits) represented as a string is converted to a long zip code (five digits followed by a dash followed by 4 digits, also known as ZIP+4).
Data enrichment and correlation

Data transformation is applied on the existing data without reference to additional, related reference data. With data enrichment, additional data is queried to enrich source data before it's stored in the target database.

  • Data correlation. It is possible to correlate source data. For example, you can combine data from two tables in two source databases. In one target database, for instance, you might relate a customer to all open, fulfilled, and canceled orders whereby the customer data and the order data originate from two different source databases.
  • Data enrichment. Data enrichment adds reference data. For example, you might enrich records that only contain a zip code by adding the city name corresponding to the zip code. A reference table containing zip codes and the corresponding city names is a static dataset accessed for this use case. Reference data can be dynamic as well. For example, you might use a list of all known customers as reference data.
Data reduction and filtering

Another type of data transformation is reducing or filtering the source data before migrating it to a target database.

  • Data reduction. Data reduction removes attributes from a data item. For example, if a zip code is present in a data item, the corresponding city name might not be required and is dropped, because it can be recalculated or because it is not needed anymore. Sometimes this information is kept for historical reasons to record the name of the city as entered by the user, even if the city name changes in time.
  • Data filtering. Data filtering removes a data item altogether. For example, all canceled orders might be removed and not transferred to the target database.
Data combination or recombination

If data is migrated from different source databases to different target databases, it can be necessary to combine data differently between source and target databases.

Suppose that customers and orders are stored in two different source databases. One source database contains all orders, and a second source database contains all customers. After migration, customers and their orders are stored in a 1:n relationship within a single target database schema—not in a single target database, however, but several target databases where each contains a partition of the data. Each target database represents a region and contains all customers and their orders located in that region.

Target database addressing

Unless there is only one target database, each data item that is migrated needs to be sent to the correct target database. A couple of approaches to addressing the target database include the following:

  • Schema-based addressing. Schema-based addressing determines the target database based on the schema. For example, all data items of a customer collection or all rows of a customer table are migrated to the same target database storing customer information, even though this information was distributed in several source databases.
  • Content-based routing. Content-based routing (using a content-based router, for example) determines the target database based on data values. For example, all customers located in the Latin America region are migrated to a specific target database that represents that region.

You can use both types of addressing at the same time in a database migration. Regardless of the addressing type used, the target database must have the correct schema in place so that data items are stored.

Persistence of in-transit data

Database migration systems, or the environments on which they run, can fail during a migration, and in-transit data can be lost. When failures occur, you need to restart the database migration system and ensure that the data stored in the source database is consistently and completely migrated to the target databases.

As part of the recovery, the database migration system needs to identify the last successfully migrated data item to determine where to begin extracting from the source databases. To resume at the point of failure, the system needs to keep an internal state on the migration progress.

You can maintain state in several ways:

  • You can store all extracted data items within the database migration system before any database modification, and then remove the data item once its modified version is successfully stored in the target database. This approach ensures that the database migration system can exactly determine what is extracted and stored.
  • You can maintain a list of references to the data items in transit. One possibility is to store the primary keys or other unique identifiers of each data item together with a status attribute. After a failure, this state is the basis for recovering the system consistently.