Import metadata using a custom pipeline

This document describes how to import Dataplex Catalog metadata from a third-party system into Dataplex by using the metadata import API methods and your own pipeline. Dataplex Catalog metadata consists of entries and their aspects.

If you instead want to use a Google Cloud-managed orchestration pipeline to extract and import metadata, we suggest using a managed connectivity pipeline. With a managed connectivity pipeline, you bring your own connector that extracts metadata and generates output in a format that can be used as input by the metadata import API methods (the metadata import file). Then, you use Workflows to orchestrate the pipeline tasks.

High-level steps

To import metadata using the metadata import API, follow these high-level steps:

  1. Determine the job scope.

    Also, understand how Dataplex applies the comparison logic and the sync mode for entries and aspects.

  2. Create one or more metadata import files that define the data to import.

  3. Save the metadata import files in a Cloud Storage bucket.

  4. Run a metadata import job.

The steps on this page assume that you're familiar with Dataplex Catalog concepts, including entry groups, entry types, and aspect types. For more information, see Dataplex Catalog overview.

Before you begin

Before you import metadata, complete the tasks in this section.

Required roles

To ensure that the Dataplex service account has the necessary permissions to access the Cloud Storage bucket, ask your administrator to grant the Dataplex service account the Storage Object Viewer (roles/storage.objectViewer) IAM role and the storage.buckets.get permission on the bucket.
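For example, you can grant the role on the bucket with the Google Cloud CLI. The following sketch assumes that the Dataplex service agent has the form service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com and that the bucket is named my-metadata-bucket; confirm both values for your project:

# Grant the Dataplex service agent read access to objects in the bucket.
# The service agent address is an assumption; verify it in your project's IAM settings.
gcloud storage buckets add-iam-policy-binding gs://my-metadata-bucket \
    --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"

Note that the Storage Object Viewer role doesn't include the storage.buckets.get permission, so you might need to grant that permission separately, for example through a custom role.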

To get the permissions that you need to manage metadata jobs, ask your administrator to grant you the following IAM roles:

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Create Google Cloud resources

Prepare the following Google Cloud resources:

  1. Create an entry group for the entries that you want to import.
  2. Create aspect types for the aspects that you want to import.
  3. Create entry types for the entries that you want to import.
  4. Create a Cloud Storage bucket to store your metadata import files.
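For example, you can create the Cloud Storage bucket with the Google Cloud CLI. The bucket name and location in this sketch are placeholders; choose your own values:

# Create a bucket to store the metadata import files.
gcloud storage buckets create gs://my-metadata-bucket --location=us-central1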

Components of a metadata job

When you import metadata, consider the following components of a metadata job:

  • Job scope: the entry group, entry types, and aspect types to include in the job.
  • Sync mode: whether to perform a full update or an incremental update on the entries and aspects in the job.
  • Metadata import file: a file that defines the values to set for the entries and aspects in the job. You can provide multiple metadata import files in the same metadata job. You save the files in Cloud Storage.
  • Comparison logic: how Dataplex determines which entries and aspects to modify.

Job scope

The job scope defines the entry group, the entry types, and optionally the aspect types that you want to include in a metadata job. When you import metadata, you modify the entries and aspects that belong to resources within the job's scope.

To define the job scope, follow these guidelines:

  • Entry group: specify a single entry group to include in the job. The job modifies only the entries that belong to this entry group. The entry group and the job must be in the same region.

  • Entry types: specify one or more entry types to include in the job. The job modifies only the entries that belong to these entry types. The location of an entry type must either match the location of the job, or the entry type must be global.

  • Aspect types: optional. Specify one or more aspect types to include in the job. If you specify a scope for aspect types, the job modifies only the aspects that belong to these aspect types. The location of an aspect type must either match the location of the job, or the aspect type must be global.

You specify the job scope when you create a metadata job.
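For illustration, the scope portion of an import job specification might look like the following sketch. The project, location, and resource names are placeholders; the field names match the request body that is shown later in this document:

"scope": {
  "entryGroups": [
    "projects/my-project/locations/us-central1/entryGroups/my-entry-group"
  ],
  "entryTypes": [
    "projects/my-project/locations/us-central1/entryTypes/my-entry-type"
  ],
  "aspectTypes": [
    "projects/my-project/locations/us-central1/aspectTypes/my-aspect-type"
  ]
}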

Sync mode

The sync mode specifies whether to perform a full update or an incremental update on the entries and aspects in a metadata job.

  • FULL: supported for entries. All entries in the job's scope are modified.

    If an entry exists in Dataplex but isn't included in the metadata import file, the entry is deleted when you run the metadata job.

    Full sync isn't supported for aspects.

  • INCREMENTAL: supported for aspects. An aspect is modified only if the metadata import file includes a reference to the aspect in the updateMask field and the aspectKeys field. For more information about these fields in the metadata import file, see the Structure of an import item section of this document.

    Incremental sync isn't supported for entries.

You specify the sync mode when you create a metadata job.
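In the import job specification, the sync modes correspond to two fields. The following minimal sketch shows the combination that this document describes:

"entrySyncMode": "FULL",
"aspectSyncMode": "INCREMENTAL"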

Metadata import file

The metadata import file is a collection of the entries and aspects that you want to modify. It defines the values to set for all of the fields that belong to these entries and aspects. You prepare the file before you run a metadata job.

These general guidelines apply:

  • You can provide multiple metadata import files in the same metadata job.
  • The entries that you provide in the file completely replace all of the existing entries for any resources that are within the job's scope. This means that you must include values for all of the entries in a job, not just the values that you want to add or update. To get a list of the current entries in your project to use as a starting point, use the entries.list API method.

  • You must provide a metadata import file as part of a metadata job. If you want to delete all existing data for the entries that are within the job's scope, provide an empty metadata import file.

  • All of the entries and aspects that you include in the file must belong to the entry groups, entry types, and aspect types that you define in the job's scope.

Use the detailed guidelines in the following sections to create a metadata import file.

Structure of the file

Each line in the metadata import file contains a JSON object that corresponds to one import item. An import item is an object that describes the values to modify for an entry and its attached aspects.

You can provide multiple import items in a single metadata import file. However, don't provide the same import item more than once in a metadata job. Use a newline character (0x0a) to separate each import item.

A metadata import file with a newline character between each import item looks like the following example:

{ "entry": { "name": "entry 1", #Information about entry 1 }
{ "entry": { "name": "entry 2", #Information about entry 2 }

Structure of an import item

Each import item in the metadata import file includes the following fields (see ImportItem). The following example is formatted with line breaks for readability, but when you save the file, include a newline character only after each import item. Don't include line breaks between the fields of a single import item.

{
  "entry": {
    "name": "ENTRY_NAME",
    "entryType": "ENTRY_TYPE",
    "entrySource": {
      "resource": "RESOURCE",
      "system": "SYSTEM",
      "platform": "PLATFORM",
      "displayName": "DISPLAY_NAME",
      "description": "DESCRIPTION",
      "createTime": "ENTRY_CREATE_TIMESTAMP",
      "updateTime": "ENTRY_UPDATE_TIMESTAMP"
    },
    "aspects": {
      "ASPECT": {
        "data": {
          "KEY": "VALUE"
        },
        "aspectSource": {
          "createTime": "ASPECT_CREATE_TIMESTAMP"
          "updateTime": "ASPECT_UPDATE_TIMESTAMP"
        }
      },
      # Additional aspect maps
    },
    "parentEntry": "PARENT_ENTRY",
    "fullyQualifiedName": "FULLY_QUALIFIED_NAME"
  },
  "updateMask": "UPDATE_MASK_FIELDS",
  "aspectKeys": [
    "ASPECT_KEY",
    # Additional aspect keys
  ]
}

Replace the following:

  • entry: information about an entry and its attached aspects:

    • ENTRY_NAME: the relative resource name of the entry, in the format projects/PROJECT_ID_OR_NUMBER/locations/LOCATION_ID/entryGroups/ENTRY_GROUP_ID/entries/ENTRY_ID.
    • ENTRY_TYPE: the relative resource name of the entry type that was used to create this entry, in the format projects/PROJECT_ID_OR_NUMBER/locations/LOCATION_ID/entryTypes/ENTRY_TYPE_ID.
    • entrySource: information from the source system about the data resource that is represented by the entry:
      • RESOURCE: the name of the resource in the source system.
      • SYSTEM: the name of the source system.
      • PLATFORM: the platform containing the source system.
      • DISPLAY_NAME: a user-friendly display name.
      • DESCRIPTION: a description of the entry.
      • ENTRY_CREATE_TIMESTAMP: the time the entry was created in the source system.
      • ENTRY_UPDATE_TIMESTAMP: the time the entry was updated in the source system.
    • aspects: the aspects that are attached to the entry. The aspect object and its data are called an aspect map.

      • ASPECT: an aspect that is attached to the entry. Depending on how the aspect is attached to the entry, use one of the following formats:

        • If the aspect is attached directly to the entry, provide a reference to its aspect type, in the format PROJECT_ID_OR_NUMBER.LOCATION_ID.ASPECT_TYPE_ID.
        • If the aspect is attached to a path within the entry, provide the aspect type reference followed by the path, in the format PROJECT_ID_OR_NUMBER.LOCATION_ID.ASPECT_TYPE_ID@PATH.
      • KEY and VALUE: the content of the aspect, according to its aspect type metadata template. The content must be encoded as UTF-8. The maximum size of the field is 120 KB. The data dictionary is required, even if it is empty.

      • ASPECT_CREATE_TIMESTAMP: the time the aspect was created in the source system.

      • ASPECT_UPDATE_TIMESTAMP: the time the aspect was updated in the source system.

    • PARENT_ENTRY: the resource name of the parent entry.

    • FULLY_QUALIFIED_NAME: a name for the entry that can be referenced by an external system. See Fully qualified names.

  • UPDATE_MASK_FIELDS: the fields to update, in paths that are relative to the Entry resource. Separate each field with a comma.

    In FULL entry sync mode, Dataplex implicitly treats the update mask as including the paths of all of the fields of an entry that can be modified, including aspects.

    The updateMask field is ignored when an entry is created or re-created.

  • ASPECT_KEY: the aspects to modify. Supports the following syntaxes:

    • ASPECT_TYPE_REFERENCE: matches the aspect type for aspects that are attached directly to the entry.
    • ASPECT_TYPE_REFERENCE@PATH: matches the aspect type and the specified path.
    • ASPECT_TYPE_REFERENCE@*: matches the aspect type for all paths.

    Replace ASPECT_TYPE_REFERENCE with a reference to the aspect type, in the format PROJECT_ID_OR_NUMBER.LOCATION_ID.ASPECT_TYPE_ID.

    If you leave this field empty, it is treated as specifying exactly those aspects that are present within the specified entry.

    In FULL entry sync mode, Dataplex implicitly adds the keys for all of the required aspects of an entry.
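For illustration, the following sketch shows a single import item with the placeholders filled in. The project, entry group, entry type, aspect type, and aspect data are hypothetical examples. The item is formatted with line breaks for readability; in the actual file, the whole item is on one line:

{
  "entry": {
    "name": "projects/my-project/locations/us-central1/entryGroups/my-entry-group/entries/my-entry",
    "entryType": "projects/my-project/locations/us-central1/entryTypes/my-entry-type",
    "entrySource": {
      "resource": "my_database.my_table",
      "system": "my-source-system",
      "displayName": "My table",
      "updateTime": "2024-05-01T12:00:00Z"
    },
    "aspects": {
      "my-project.us-central1.my-aspect-type": {
        "data": {
          "owner": "data-team@example.com"
        },
        "aspectSource": {
          "updateTime": "2024-05-01T12:00:00Z"
        }
      }
    }
  },
  "updateMask": "aspects",
  "aspectKeys": [
    "my-project.us-central1.my-aspect-type"
  ]
}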

File requirements

The metadata import file has the following requirements:

  • The file must be formatted as a JSON Lines file, which is a newline-delimited JSON file. Use a newline character (0x0a) to separate each import item.
  • The file must use UTF-8 character encoding.
  • Supported file extensions are .jsonl and .json.
  • The file size of each metadata import file must be less than 1 GiB. The maximum total size for all data in the metadata job is 3 GB. This includes all files and metadata associated with the job.
  • The entries and aspects that you specify in the file must be part of the metadata job's scope.
  • The file must be uploaded to a Cloud Storage bucket. Don't save the file in a folder named CLOUD_STORAGE_URI/deletions/.

Comparison logic

Dataplex determines which entries and aspects to modify by comparing the values and timestamps that you provide in the metadata import file with the values and timestamps that exist in your project.

At a high level, Dataplex updates the values in your project when at least one change that is proposed in the metadata import file would change the state of your project without introducing out-of-date data. The proposed change must be referenced in the update mask field or the aspect keys field in the metadata import file.

For each entry that is part of the job's scope, Dataplex does one of the following things:

  • Creates an entry and attached aspects. If the metadata import file includes an entry that doesn't exist in your project, Dataplex creates the entry and attached aspects.
  • Deletes an entry and attached aspects. If an entry exists in your project, but the metadata import file doesn't include the entry, Dataplex deletes the entry and its attached aspects from your project.
  • Updates an entry and attached aspects. If an entry exists in both the metadata import file and in your project, Dataplex evaluates the entry source timestamps and the aspect source timestamps that are associated with the entry to determine which values to modify. Then, Dataplex does one or more of the following things:

    • Re-creates the entry. If the entry source create timestamp in the metadata import file is more recent than the corresponding timestamp in your project, Dataplex re-creates the entry in your project.
    • Updates the entry. If the entry source update timestamp in the metadata import file is more recent than the corresponding timestamp in your project, Dataplex updates the entry in your project.
    • Creates an aspect. If an aspect doesn't exist in your project, and is included in the entry object, the update mask field, and the aspect keys field in the metadata import file, Dataplex creates the aspect.
    • Deletes an aspect. If an aspect exists in your project, and is included in the update mask field and the aspect keys field in the metadata import file, but isn't included in the entry object, Dataplex deletes the aspect.
    • Updates an aspect. If an aspect exists in your project and is included in the entry object, the update mask field, and the aspect keys field in the metadata import file, and the aspect source update timestamp in the metadata import file is more recent than the corresponding timestamp in your project, Dataplex updates the aspect.

      If an aspect source update timestamp isn't provided in the metadata import file, but the corresponding entry is marked for an update, Dataplex also updates the aspect.

      However, if at least one aspect in the metadata import file has an older timestamp than the corresponding timestamp in your project, then Dataplex doesn't make any updates for the attached entry.

Create a metadata import file

Before you import metadata, create a metadata import file for your job. Follow these steps:

  1. Prepare a metadata import file by following the guidelines that are described previously in this document.
  2. Upload the file to a Cloud Storage bucket.

You can provide multiple metadata import files in the same metadata job. To provide multiple files, save the files in the same Cloud Storage bucket. When you run the job, you specify a bucket, not a specific file. Dataplex imports metadata from all of the files that are saved in the bucket, including files that are in subfolders.
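For example, you might upload the files with the Google Cloud CLI. The file and bucket names in this sketch are placeholders:

# Upload the metadata import files to the Cloud Storage bucket.
gcloud storage cp metadata-import-1.jsonl metadata-import-2.jsonl gs://my-metadata-bucket/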

Run a metadata import job

After you create a metadata import file, run a metadata import job by using the API.

REST

To import metadata, use the metadataJobs.create method.

Before using any of the request data, make the following replacements:

  • PROJECT_NUMBER: your Google Cloud project number or project ID.
  • LOCATION_ID: the Google Cloud location, such as us-central1.
  • METADATA_JOB_ID: optional. The metadata job ID.
  • CLOUD_STORAGE_URI: the URI of the Cloud Storage bucket or folder that contains the metadata import files. For more information about the file requirements, see Metadata import file.

  • ENTRY_GROUP: the relative resource name of the entry group that is in scope for the job, in the format projects/PROJECT_ID_OR_NUMBER/locations/LOCATION_ID/entryGroups/ENTRY_GROUP_ID. Provide only one entry group. For more information, see Job scope.
  • ENTRY_TYPE: the relative resource name of an entry type that is in scope for the job, in the format projects/PROJECT_ID_OR_NUMBER/locations/LOCATION_ID/entryTypes/ENTRY_TYPE_ID. For more information, see Job scope.

  • ASPECT_TYPE: optional. The relative resource name of an aspect type that is in scope for the job, in the format projects/PROJECT_ID_OR_NUMBER/locations/LOCATION_ID/aspectTypes/ASPECT_TYPE_ID. For more information, see Job scope.
  • LOG_LEVEL: the level of logs to capture, such as INFO or DEBUG. For more information, see View job logs and troubleshoot.

HTTP method and URL:

POST https://dataplex.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION_ID/metadataJobs?metadataJobId=METADATA_JOB_ID

Request JSON body:

{
  "type": "IMPORT",
  "importSpec": {
    "sourceStorageUri": "gs://CLOUD_STORAGE_URI/",
    "scope": {
      "entryGroups": [
        "ENTRY_GROUP"
      ],
      "entryTypes": [
        "ENTRY_TYPE"
      ],
      "aspectTypes": [
        "ASPECT_TYPE"
      ]
    },
    "entrySyncMode": "FULL",
    "aspectSyncMode": "INCREMENTAL",
    "logLevel": "LOG_LEVEL"
  }
}

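For example, you can send the request with curl. This sketch assumes that the request body is saved in a file named request.json and that you authenticate with an access token from the Google Cloud CLI:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @request.json \
    "https://dataplex.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION_ID/metadataJobs?metadataJobId=METADATA_JOB_ID"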

The response identifies a long-running operation.

Get details about a metadata job

To get information about a metadata job, such as the status of the job and the number of entries that were modified, use the following steps. For more information about how to troubleshoot a failed job, see the View job logs and troubleshoot section of this document.

REST

To get information about a metadata job, use the metadataJobs.get method.
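For example, assuming the metadata job resource name pattern from the create request, a request might look like the following:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataplex.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION_ID/metadataJobs/METADATA_JOB_ID"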

Get a list of metadata jobs

You can get a list of the most recent metadata jobs. Older jobs that have reached a terminal state are periodically deleted from the system.

REST

To get a list of the most recent metadata jobs, use the metadataJobs.list method.
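For example, a request might look like the following:

curl -X GET \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataplex.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION_ID/metadataJobs"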

Cancel a metadata job

You can cancel a metadata job that you don't want to run.

REST

To cancel a metadata job, use the metadataJobs.cancel method.
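For example, assuming that the method uses the standard :cancel custom verb on the metadata job resource, a request might look like the following:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataplex.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION_ID/metadataJobs/METADATA_JOB_ID:cancel"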

View job logs and troubleshoot

Use Cloud Logging to view logs for a metadata job. For more information, see Monitor Dataplex logs.

You configure the log level when you create a metadata job. The following log levels are available:

  • INFO: provides logs at the overall job level. Includes aggregate logs about import items, but doesn't specify which import item has an error.
  • DEBUG: provides detailed logs for each import item. Use debug-level logging to troubleshoot issues with specific import items. For example, use debug-level logging to identify resources that are missing from the job scope, entries or aspects that don't conform to the associated entry type or aspect type, or other misconfigurations with the metadata import file.

Validation errors

Dataplex validates the metadata import files against the current metadata in your project. If there is a validation issue, the job status might return one of the following states:

  • FAILED: happens when the metadata import file has an error. Dataplex doesn't import any metadata and the job fails. Examples of errors in the metadata import file include the following:
    • An item in the file can't be parsed into a valid import item
    • An entry or aspect in the file belongs to an entry group, entry type, or aspect type that isn't part of the job's scope
    • The same entry name is specified more than once in the job
    • An aspect type that is specified in the aspect map or the aspect keys doesn't use the format PROJECT_ID_OR_NUMBER.LOCATION_ID.ASPECT_TYPE_ID@OPTIONAL_PATH
  • SUCCEEDED_WITH_ERRORS: happens when the metadata import file can be successfully parsed, but importing an item in the file would cause an entry in your project to be in an inconsistent state. Dataplex ignores such entries, but imports the rest of the metadata from the file.

Use job logs to troubleshoot the error.

What's next