Create a stream

In this section, you learn how to create a stream. Datastream uses this stream to transfer data from a source Oracle, MySQL, SQL Server, or PostgreSQL database into BigQuery or Cloud Storage.

Creating a stream includes:

  • Defining settings for the stream.
  • Selecting the connection profile that you created for your source Oracle, MySQL, SQL Server, or PostgreSQL database (the source connection profile), or creating a source connection profile if you haven't created one.
  • Configuring information about the source database for the stream by specifying the tables and schemas in the source database that Datastream:
    • Can transfer into the destination.
    • Is restricted from transferring into the destination.
  • Determining whether Datastream backfills historical data and streams ongoing changes into the destination, or streams only changes to the data. When you enable historical backfill, you can optionally specify schemas and tables in the source database that Datastream shouldn't backfill into the destination.

  • Selecting the connection profile that you created for BigQuery or Cloud Storage (the destination connection profile), or creating a destination connection profile if you haven't created one.

  • Configuring information about the destination for the stream. This information includes:

    • For BigQuery:
      • The datasets into which Datastream will replicate schemas, tables, and data from a source database.
    • For Cloud Storage:
      • The folder of the destination bucket into which Datastream will transfer schemas, tables, and data from a source database.
  • Validating the stream to ensure that it will run successfully. Validating a stream checks:

    • Whether the source is configured properly to allow Datastream to stream data from it.
    • Whether the stream can connect to both the source and the destination.
    • The end-to-end configuration of the stream.

Define settings for the stream

  1. Go to the Streams page in the Google Cloud Console.

  2. Click CREATE STREAM.

  3. Use the following field descriptions to populate the Define stream details section of the Create stream page:

    • Stream name: Enter the display name of the stream.
    • Stream ID: Datastream populates this field automatically based on the stream name that you enter. You can keep the ID that's auto-generated or change it.
    • Region: Select the region where the stream is stored. Streams, like all resources, are saved in a region. Region selection doesn't impact whether your stream can connect to the source database or the destination, but can impact availability if the region experiences downtime. We recommend that you keep all resources for the stream in the same region as your source data to optimize cost and performance.
    • Source type: Select the profile type that you specified when you created a connection profile for a source Oracle, MySQL, SQL Server, or PostgreSQL database. Or, if you haven't created a connection profile for the source database, you can create one now.
    • Destination type: Select the profile type that you specified when you created a connection profile for a BigQuery or Cloud Storage destination. Or, if you haven't created a connection profile for the destination, you can create one now.
    • Encryption: By default, your data is encrypted with a key that's managed by Google Cloud. If you want to manage your encryption, then you can use a customer-managed encryption key (CMEK):

      1. Select the Cloud KMS key checkbox.
      2. From the Key type drop-down menu, select Cloud KMS and then select your CMEK.

      If you don't see your key, then click ENTER KEY RESOURCE NAME to provide the resource name of the key that you want to use. For example, you can enter projects/my-project-name/locations/my-location/keyRings/my-keyring/cryptoKeys/my-key in the Key resource name field, and then click SAVE.

  4. Optionally, use labels to organize your Datastream resources.

    1. To create a label, click ADD LABEL, and then enter the key-value pair for the label.
    2. To remove the label, click the trashcan icon to the right of the row that contains the label.
  5. Optionally, add alerting policies for your stream. Alerting policies define when and how you want to be notified about stream failures.

    1. To create an alerting policy, click Add alerting policy.
    2. The Create alerting policy page appears in Cloud Monitoring. On this page, you define the alerting policy in case your stream fails.

    For more information about alerting policies, see Managing metric-based alerting policies.

  6. Review the required prerequisites that are generated automatically to reflect how the environment must be prepared for a stream. These prerequisites can include how to configure the source database and how to connect it to the destination. It's best to complete these prerequisites at this step, but you can complete them at any time before you test the stream or start it. For more information about these prerequisites, see Sources.

  7. Click CONTINUE. The Define connection profile panel of the Create stream page appears for your source database type.

Specify information about the source connection profile

  1. If you have created a source connection profile for Oracle, MySQL, SQL Server, or PostgreSQL, then select it from the list of connection profiles.

    If you haven't created a source connection profile, then create one by clicking CREATE CONNECTION PROFILE at the bottom of the drop-down list, and then perform the same steps as in Create connection profiles.

  2. Click RUN TEST to verify that the source database and Datastream can communicate with each other.

    If the test fails, then the issue associated with the connection profile appears. Refer to the Diagnose issues page for troubleshooting steps. Make the necessary changes to correct the issue, and then retest.

  3. Click CONTINUE. The Configure stream source panel of the Create stream page appears.

Configure information about the source database for the stream

Configure PostgreSQL source database

If you selected PostgreSQL as your source database, you need to define its replication properties.

  1. In the Replication properties section, specify the following properties:
    1. In the Replication slot name field, enter the name of the slot that you created specifically for this stream. The database server uses this slot to send events to Datastream.
    2. In the Publication name field, enter the name of the publication that you created in the database. A publication is a group of all the tables whose changes you want to replicate using this stream. For a sketch of how the slot and publication might be created, see the example after these steps.
  2. Follow the steps described in the Configure source databases section.
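
The replication slot and publication are objects that you create in the source PostgreSQL database before you configure the stream. The following is a minimal sketch only; the names datastream_publication and datastream_replication_slot, and the table list, are placeholders that you adapt to your environment:

```sql
-- Placeholder names; adjust for your environment.
-- Create a publication for the tables whose changes you want to replicate.
-- To limit the publication to specific tables, use, for example:
--   CREATE PUBLICATION datastream_publication FOR TABLE schema1.table1, schema1.table2;
CREATE PUBLICATION datastream_publication FOR ALL TABLES;

-- Create a logical replication slot that uses the pgoutput plugin.
-- The database server sends change events to Datastream through this slot.
SELECT pg_create_logical_replication_slot('datastream_replication_slot', 'pgoutput');
```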

Configure source databases

  1. In the Select objects to include section, use the Objects to include drop-down menu to specify the tables and schemas in the source database that Datastream can transfer into the destination.

    1. If you want Datastream to transfer all tables and schemas, then select All tables from all schemas.
    2. If you want Datastream to transfer only specific tables and schemas, select Specific schemas and tables, and then select the checkboxes for the schemas and tables that you want Datastream to pull.

    3. If you want to provide a textual definition of the tables and schemas that you want Datastream to transfer, select Custom, and then, in the Object matching criteria field, enter the schemas and tables that you want Datastream to pull.

      If your database has a large number of tables and schemas, we recommend using the Custom option, because some tables and schemas might not be included in the list of objects to pull.

  2. Optionally, expand the Select objects to exclude node. In the Objects to exclude field, enter the tables and schemas in the source database that you want to restrict Datastream from pulling. The Objects to exclude list takes precedence over the Objects to include list. If an object meets the criteria in both the include and exclude lists, the object is excluded from the stream.

  3. Depending on the source that you selected, specify the CDC method for your stream (for illustrative statements that check or prepare the source for each method, see the sketches after this procedure):

    • For SQL Server sources:
      • Transaction logs: select this method to process changes directly from the database logs. This method provides the best performance and is more efficient, but requires additional configuration steps.
      • Change tables: select this method to process changes from dedicated change tables. This method is easier to configure and has fewer limitations, but supports lower throughput and produces higher load on your database than the transaction logs method. For more information about configuring your source SQL Server database, see Configure a source SQL Server database.
    • For MySQL sources:
      • GTID (Global Transaction IDs)-based replication (Preview): select this method if you want Datastream to support failovers and provide seamless replication irrespective of the changes in your database cluster.
      • Binlog-based replication: select this method to read and replicate changes directly from the binary log files of a selected database instance. You can't switch a running stream to a different source instance or a replica using this CDC method.
    • For Oracle sources:
      • LogMiner: select this method to query archived redo logs using the LogMiner API. LogMiner supports most options available for Oracle, such as encryption and compression options.
      • Binary reader (Preview): select this method to extract changes directly from Oracle logs. This allows for faster replication of large Oracle databases and reduced latency and overhead. When you select this option, an additional drop-down is shown. Select the way you want to access your Oracle log files:
        • Automatic Storage Management (ASM): select this option if your database uses ASM. Make sure that you selected the Enable ASM access for binary reader checkbox and filled in the details of your ASM instance in your source connection profile.
        • Database directories: select this option if you don't use ASM for storing your log files. If you select database directories, you need to provide the redo log and archived log directory names.
  4. Optionally, expand the Choose backfill mode for historical data node, and then make one of the following selections:

    • Select the Automatic option to stream all existing data, in addition to changes to the data, from the source into the destination. In the Objects excluded from automatic backfill field, enter the tables and schemas in the source database that you want to restrict Datastream from backfilling into the destination.

    • Select the Manual option to stream only changes to the data into the destination.

  5. Click CONTINUE. The Define connection profile panel of the Create stream page appears for your destination type.
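
The following sketches illustrate how a source database might be checked or prepared for the CDC methods described in this procedure. They are illustrations only; object names such as my_database, dbo.my_table, and my_schema.my_table are placeholders, and the source configuration documentation remains the authoritative reference.

For a SQL Server source, change data capture typically needs to be enabled on the database and on the tables included in the stream, with statements like the following:

```sql
-- Run in the source database. Database, schema, and table names are placeholders.
USE my_database;

-- Enable change data capture at the database level.
EXEC sys.sp_cdc_enable_db;

-- Enable change data capture for each table included in the stream.
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'my_table',
    @role_name     = NULL;
```

For a MySQL source, you can check the server settings that the binlog-based and GTID-based methods rely on; the expected values reflect the validation checks described later on this page:

```sql
-- Settings relevant to binlog-based replication.
SHOW VARIABLES LIKE 'binlog_format';             -- expected: ROW
SHOW VARIABLES LIKE 'binlog_row_image';          -- expected: FULL

-- Settings relevant to GTID-based replication.
SHOW VARIABLES LIKE 'gtid_mode';                 -- expected: ON
SHOW VARIABLES LIKE 'enforce_gtid_consistency';  -- expected: ON
```

For an Oracle source, both CDC methods read from the redo and archived logs, so the database needs ARCHIVELOG mode and supplemental logging, as reflected in the validation checks later on this page:

```sql
-- Check that the database runs in ARCHIVELOG mode.
SELECT log_mode FROM v$database;

-- Check whether database-wide supplemental logging is enabled.
SELECT supplemental_log_data_min, supplemental_log_data_all FROM v$database;

-- Example of enabling supplemental logging for a specific table (placeholder names).
ALTER TABLE my_schema.my_table ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
```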

Select a destination connection profile

  1. If you have created a destination connection profile, then select it from the list of connection profiles.

    If you haven't created a destination connection profile, then create one by clicking CREATE CONNECTION PROFILE at the bottom of the drop-down list, and then perform the same steps as in Create connection profiles.

  2. Click CONTINUE. The Configure stream destination panel of the Create stream page appears.

Configure information about the destination for the stream

This section describes what information you need to provide to configure your BigQuery or Cloud Storage destination.

BigQuery destination

  1. In the Specify how Datastream should stream into a BigQuery dataset section, select one of the following options from the Schema grouping drop-down:

    • Dataset for each schema: Datastream creates a BigQuery dataset for each source schema, based on the schema name.

      If you select this option, then for every schema in the source database, Datastream creates a dataset in the project that contains the stream.

      In addition, if you select the Dataset for each schema option, you need to provide the following:

      1. In the Location type section, specify where you want to store your data in BigQuery. Select Region to specify a single geographical location, or Multi-region to specify a larger area that contains two or more regions. If your priority is to optimize costs and performance, select Region, and keep your datasets alongside other Google Cloud resources that you manage. If you want to achieve higher quota limits, select Multi-region to let BigQuery use multiple regions within a group. When you select Region, the field defaults to the region that you selected for the stream.
      2. (Optional) In the Dataset prefix field, define a prefix for the dataset created by the stream. The string that you specify is added to the source schema name. For example, if the schema is called <mySchema>, and you define the dataset prefix as <myPrefix>, Datastream creates a dataset called <myPrefix><mySchema>.
      3. Expand the Show advanced encryption options section.
      4. Specify how you want your data to be encrypted in BigQuery by selecting a Google-managed or customer-managed encryption key.
    • Single dataset for all schemas: Datastream creates all the tables inside the existing BigQuery dataset that you specify. Each table that Datastream creates is named using a combination of the source schema name and table name, separated by an underscore (for example, <mySchemaName>_<myTableName>). The datasets available are grouped by region. You can also create a new dataset.

  2. Specify the Stream write mode. The write mode defines how your data is written and processed in BigQuery:

    • Merge: select this option if you want your data to be synchronized with the source.
    • Append-only: select this option if you want to keep all historical data for your change events.
  3. Specify the limit of data staleness. This option is available only when you select the Merge write mode in the preceding step. Select a limit to balance BigQuery query performance and cost against data freshness. BigQuery applies the changes in the background on an ongoing basis, or at query run time, according to the staleness limit. Lower staleness (fresher data) can increase BigQuery processing costs. For an illustration of how this setting surfaces in BigQuery, see the example after these steps.

  4. Click CONTINUE. The Review stream details and create panel of the Create stream page appears.
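
The staleness limit that you select corresponds to the max_staleness option on the BigQuery tables that Datastream creates. As an illustration only (the dataset and table names are placeholders, and 15 minutes is an example value), the option can be inspected or adjusted directly in BigQuery with a statement like the following:

```sql
-- Placeholder dataset and table names; 15 minutes is an example value.
-- Set the maximum staleness of a replicated table to 15 minutes.
ALTER TABLE `my_dataset.my_schema_my_table`
SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);
```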

Cloud Storage destination

  1. Optionally, in the Stream path prefix field, enter the folder of the destination bucket into which Datastream will transfer schemas, tables, and data from a source database.

    For example, if you want Datastream to transfer data from the source database into the /root/example folder in the destination bucket, then enter /root/example in the Stream path prefix field.

  2. In the Output format field, choose the format of files written to Cloud Storage.

  3. Optional. If you select the JSON format, then two checkboxes appear:

    1. Include a Unified Types schema file in the file path: If you select this checkbox, then Datastream writes two files to Cloud Storage: a JSON data file and an Avro schema file. The schema file has the same name as the data file, with a .schema extension. If you don't select this checkbox, then Datastream writes only the JSON data file to Cloud Storage. By default, this checkbox isn't selected.
    2. Enable gzip compression: If you select this checkbox, then Datastream uses the gzip utility to compress the files that Datastream writes to Cloud Storage. If you don't select this checkbox, then Datastream writes the files to Cloud Storage without compressing them. By default, this checkbox is selected.
  4. Click CONTINUE. The Review stream details and create panel of the Create stream page appears.

Create the stream

  1. Verify details about the stream as well as the source and destination connection profiles that the stream will use to transfer data from a source database into a destination.
  2. Click RUN VALIDATION to validate the stream. By validating a stream, Datastream checks that the source is configured properly, validates that the stream can connect to both the source and the destination, and verifies the end-to-end configuration of the stream.

    1. If your source database is Oracle, then Datastream performs the following checks:

      • Objects list validity: Datastream verifies that the exclude list doesn't subsume the include list.
      • Forward SSH tunnel connectivity: Datastream verifies that it can connect to a bastion host through a Forward SSH tunnel.
      • Connectivity to Oracle database: Datastream verifies that it can connect to the source Oracle database.
      • Oracle user permissions: Datastream verifies that the user that it uses to connect to the source database has all of the necessary permissions to retrieve schemas, tables, and data from the database so that Datastream can stream this information into the destination.
      • Logging mode configuration: Datastream verifies that the logging mode for the Oracle database is set to ARCHIVELOG.
      • Supplemental logging configuration: Datastream verifies that supplemental logging is turned on for the database tables that are being streamed from the source into the destination.
      • Oracle CDC configuration: Datastream verifies that the CDC method that you selected complies with your database configuration.
      • Archive log files configuration: Datastream verifies that archive logging is configured in the source, and that archive log files are present.
      • Cloud Storage permissions (for Cloud Storage destinations): Datastream verifies that it has the necessary permissions to write to the destination bucket in Cloud Storage.
    2. If your source database is MySQL, then Datastream performs the following checks:

      • Replication permissions: Datastream verifies that it has permissions to replicate a MySQL database.
      • Database version: Datastream verifies that the MySQL database version is supported. For information about supported versions, see Versions.
      • Binary log format configuration: Datastream verifies that the binary log format of the MySQL database is set to ROW.
      • Binary logging is enabled: Datastream verifies that the binary log files are configured correctly, and that log files are present.
      • Binary log row image configuration: Datastream verifies that the binlog row image is set to FULL.
      • Binlog compression disabled: Datastream verifies that binlog compression is disabled.
      • Log replica or slave updates enabled: Datastream verifies that log replica or log slave updates are enabled.
      • GTID mode configuration (applies only to GTID-based CDC): Datastream verifies that the GTID mode is set to ON if GTID-based replication is used.
      • GtidSet doesn't contain tagged GTIDs (applies only to GTID-based CDC): Datastream verifies that the GTID set doesn't contain any tagged GTIDs.
      • Cloud Storage permissions (for Cloud Storage destinations): Datastream verifies that it has the necessary permissions to write to the destination bucket in Cloud Storage.

      Datastream also performs the following warning checks:

      • Expire logs days, expire logs seconds, or binlog retention hours configuration: Datastream verifies that expire_logs_days, binlog_expire_logs_seconds, and binlog retention hours are configured to values greater than 7, 604800, and 168, respectively.
      • Max allowed packet configuration: Datastream verifies that max_allowed_packet is set to 1 GB.
      • Enforce GTID consistency or GTID strict mode (applies only to GTID-based CDC): Datastream verifies that GTID consistency or GTID strict mode is enabled.
    3. If your source database is PostgreSQL, then Datastream performs the following checks:

      • Connectivity to PostgreSQL database: Datastream verifies that it can connect to the source PostgreSQL database.
      • Logical decoding is enabled: Datastream verifies that the wal_level parameter of the PostgreSQL database is set to logical.
      • Replication slot configuration: Datastream verifies whether the PostgreSQL replication slot exists and is active, whether Datastream has the permissions required to access it, and whether it's configured correctly.
      • Publication configuration: Datastream verifies that the publication exists and that the tables to replicate for the stream are included in the publication.
      • Backfill permissions: Datastream verifies that the tables required to backfill historical data exist and that Datastream has the necessary permissions to read them.
    4. If your source database is SQL Server, then Datastream performs the following checks:

      • SQL Server edition: Datastream verifies that the source SQL Server database edition is supported. For information about supported editions, see Versions.
      • SQL Server user permissions: Datastream verifies that the user has the permissions required to access the source SQL Server database.
      • SQL Server CDC is enabled: Datastream verifies whether CDC is enabled for the database and for all the tables included in the stream.
      • SQL Server database validations: Datastream checks whether the source database is configured properly, whether all replication requirements are met, and whether there are no unsupported data types in the columns.
      • Cloud Storage or BigQuery permissions: Datastream verifies that the user has the necessary permissions to write to the destination.
  3. After all validation checks pass, click CREATE & START to create the stream and start it immediately, or CREATE to create the stream without starting it immediately.

    If you don't start the stream now, then you can start it from the Streams page by clicking START.

After creating a stream, you can view high-level and detailed information about it.

What's next