Transition to business glossary on Dataplex Universal Catalog

This document provides instructions to migrate from the preview version of business glossary, which supported Data Catalog metadata, to the generally available version of business glossary, which supports Dataplex Universal Catalog metadata. The transition process requires you to export glossaries, categories, terms, and links from Data Catalog, and then import them into Dataplex Universal Catalog.

To transition to business glossary on Dataplex Universal Catalog, follow these steps:

  1. Export glossaries, categories, and terms from Data Catalog.
  2. Import glossaries, categories, and terms to Dataplex Universal Catalog.
  3. Export links between terms from Data Catalog.
  4. Import links between terms to Dataplex Universal Catalog.
  5. Export links between terms and columns from Data Catalog.
  6. Import links between terms and columns to Dataplex Universal Catalog.

Required roles

To export a glossary from Data Catalog, you need the roles/datacatalog.glossaryOwner role on the projects in which the glossary is present. See the permissions required for this role.

To get the permissions that you need to import business glossary to Dataplex Universal Catalog, ask your administrator to grant you the Dataplex Administrator (roles/dataplex.admin) IAM role on the projects. For more information about granting roles, see Manage access to projects, folders, and organizations.
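
For example, an administrator can grant the role with the gcloud CLI. The following is a minimal sketch; the project ID and user email are hypothetical:

    gcloud projects add-iam-policy-binding my-project \
        --member="user:analyst@example.com" \
        --role="roles/dataplex.admin"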

This predefined role contains the permissions required to import business glossary to Dataplex Universal Catalog. The exact permissions that are required are listed in the following Required permissions section:

Required permissions

The following permissions are required to import business glossary to Dataplex Universal Catalog:

  • dataplex.glossaries.import on the glossary resource
  • dataplex.entryGroups.import on the Dataplex Universal Catalog entry group that's provided in the entry_groups field, and on the entry groups that contain the Data Catalog entries that are linked to the glossary terms
  • dataplex.entryGroups.useSynonymEntryLink on the Dataplex Universal Catalog entry group that's provided in the entry_groups field, and on the entry groups that contain the Data Catalog entries that are linked to the glossary terms
  • dataplex.entryGroups.useRelatedEntryLink on the Dataplex Universal Catalog entry group that's provided in the entry_groups field, and on the entry groups that contain the Data Catalog entries that are linked to the glossary terms
  • dataplex.entryLinks.reference on all the projects provided in the referenced_entry_scopes field

You might also be able to get these permissions with custom roles or other predefined roles.

Export glossaries, categories, and terms from Data Catalog

You can export one glossary at a time.

  1. Clone the dataplex-labs repository, and then change directories to the business-glossary-import subdirectory:

    git clone https://github.com/GoogleCloudPlatform/dataplex-labs.git
    cd dataplex-labs
    cd dataplex-quickstart-labs/00-resources/scripts/python/business-glossary-import
    
  2. Get your access token:

    export GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)
    
  3. Run the export script:

    python3 bg_import/business_glossary_export_v2.py \
    --user-project="PROJECT_ID" \
    --url="DATA_CATALOG_GLOSSARY_URL" \
    --export-mode=glossary_only

    Replace the following:

    • PROJECT_ID: the ID of the project that contains the glossary.
    • DATA_CATALOG_GLOSSARY_URL: the URL of the Data Catalog business glossary in the console.

    The script creates a JSON file that follows the same format as the metadata import file that's used for metadata import jobs. The names of the glossary, categories, and terms use the following formats:

    • Glossary: projects/PROJECT_NUMBER/locations/LOCATION_ID/entryGroups/@dataplex/entries/projects/PROJECT_NUMBER/locations/LOCATION_ID/glossaries/GLOSSARY_ID
    • Term: projects/PROJECT_NUMBER/locations/LOCATION_ID/entryGroups/@dataplex/entries/projects/PROJECT_NUMBER/locations/LOCATION_ID/glossaries/GLOSSARY_ID/terms/TERM_ID
    • Category: projects/PROJECT_NUMBER/locations/LOCATION_ID/entryGroups/@dataplex/entries/projects/PROJECT_NUMBER/locations/LOCATION_ID/glossaries/GLOSSARY_ID/categories/CATEGORY_ID

    Where GLOSSARY_ID, CATEGORY_ID, TERM_ID, PROJECT_NUMBER, and LOCATION_ID are the same as the values from the Data Catalog glossary.
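
    For example, with a hypothetical project number 123456789, location us-central1, and glossary ID sales-glossary, the exported glossary name would be:

    projects/123456789/locations/us-central1/entryGroups/@dataplex/entries/projects/123456789/locations/us-central1/glossaries/sales-glossary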

Import glossaries, categories, and terms to Dataplex Universal Catalog

Next, import into Dataplex Universal Catalog the glossaries, categories, and terms that you exported in the previous step. This section describes how to import them by using the metadata job API.

  1. Create a Cloud Storage bucket, and then upload the exported JSON file to the bucket.
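
    For example, a minimal sketch that uses the gcloud CLI; the bucket name, location, and exported file name are hypothetical:

    gcloud storage buckets create gs://my-glossary-import-bucket --location=us-central1
    gcloud storage cp EXPORTED_GLOSSARY_FILE.json gs://my-glossary-import-bucket/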

  2. Grant the Dataplex Universal Catalog service account read access to the Cloud Storage bucket.
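
    For example, you could grant the Storage Object Viewer role on the bucket. This sketch assumes the Dataplex Universal Catalog service agent has the form service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com:

    gcloud storage buckets add-iam-policy-binding gs://my-glossary-import-bucket \
        --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com" \
        --role="roles/storage.objectViewer"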

  3. Run a metadata import job to import the glossary.

    # Set GCURL alias
    alias gcurl='curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json"'

    # Import CURL command
    gcurl "https://dataplex.googleapis.com/v1/projects/PROJECT_NUMBER/locations/LOCATION_ID/metadataJobs?metadata_job_id=JOB_ID" -d "$(cat<<EOF
    {
      "type": "IMPORT",
      "import_spec": {
        "log_level": "DEBUG",
        "source_storage_uri": "gs://STORAGE_BUCKET/",
        "entry_sync_mode": "FULL",
        "aspect_sync_mode": "INCREMENTAL",
        "scope": {
          "glossaries": ["projects/PROJECT_NUMBER/locations/LOCATION_ID/glossaries/GLOSSARY_ID"]
        }
      }
    }
    EOF
    )"

    Replace the following:

    • PROJECT_NUMBER: the number of the project that contains the glossary.
    • LOCATION_ID: the location of the glossary, which is the same as in Data Catalog.
    • JOB_ID: (optional) a metadata import job ID, which you can use to track the job's status. If you don't provide an ID, Dataplex Universal Catalog generates a unique ID.
    • STORAGE_BUCKET: the URI of the Cloud Storage bucket or folder that contains the exported glossary file.
    • GLOSSARY_ID: the ID of the glossary, which is the same as in Data Catalog.
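
    For example, with a hypothetical project number 123456789, location us-central1, and job ID glossary-import-1, the request URL would be:

    https://dataplex.googleapis.com/v1/projects/123456789/locations/us-central1/metadataJobs?metadata_job_id=glossary-import-1
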
  4. Optional: To track the status of the metadata import job, use the metadataJobs.get method:

    gcurl -X GET https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/metadataJobs/JOB_ID

    If you get any errors in the metadata import job, they'll appear in the logs.
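
    If you want to block until the job finishes, one option is a polling loop. The following is a sketch that assumes jq is installed and that the job reports a terminal state, such as SUCCEEDED or FAILED, in the status.state field:

    while true; do
      STATE=$(gcurl -X GET "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/metadataJobs/JOB_ID" | jq -r '.status.state')
      echo "Job state: ${STATE}"
      if [[ "${STATE}" == "SUCCEEDED" || "${STATE}" == "FAILED" ]]; then
        break
      fi
      sleep 30
    done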

Export links between terms from Data Catalog

  1. Clone the dataplex-labs repository (if you haven't already), and then change directories to the business-glossary-import subdirectory:

    git clone https://github.com/GoogleCloudPlatform/dataplex-labs.git
    cd dataplex-labs
    cd dataplex-quickstart-labs/00-resources/scripts/python/business-glossary-import
    
  2. Get your access token:

    export GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)
    
  3. Run the export script:

    python3 bg_import/business_glossary_export_v2.py \
    --user-project=PROJECT_ID \
    --url="DATA_CATALOG_GLOSSARY_URL" \
    --export-mode=entry_links_only \
    --entrylinktype="related,synonym"

    The script creates a JSON file that contains the synonym and related-term links between terms. The exported files are in the Exported_Files folder in dataplex-quickstart-labs/00-resources/scripts/python/business-glossary-import. The name of the file is entrylinks_relatedsynonymGLOSSARY_ID.json.
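
    For example, for a hypothetical glossary with the ID sales-glossary, the exported file is named entrylinks_relatedsynonymsales-glossary.json.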

Import links between terms to Dataplex Universal Catalog

Next, import the links between terms that you exported in the previous step. This section describes how to import them by using the metadata job API.

  1. Create a new Cloud Storage bucket, and then upload the exported entry links file from the previous step into the bucket.
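
    For example, a minimal sketch with a hypothetical bucket name, run from the business-glossary-import directory:

    gcloud storage buckets create gs://my-entrylinks-bucket --location=us-central1
    gcloud storage cp Exported_Files/entrylinks_relatedsynonymGLOSSARY_ID.json gs://my-entrylinks-bucket/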

  2. Grant the Dataplex Universal Catalog service account read access to the Cloud Storage bucket.

  3. Run a metadata import job to import the entry links:

    # Import CURL command
    gcurl "https://dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/global/metadataJobs?metadata_job_id=JOB_ID" -d "$(cat<<EOF
    {
      "type": "IMPORT",
      "import_spec": {
        "log_level": "DEBUG",
        "source_storage_uri": "gs://STORAGE_BUCKET",
        "entry_sync_mode": "FULL",
        "aspect_sync_mode": "INCREMENTAL",
        "scope": {
          "entry_groups": ["projects/GLOSSARY_PROJECT_ID/locations/global/entryGroups/@dataplex"],
          "entry_link_types": ["projects/dataplex-types/locations/global/entryLinkTypes/synonym", "projects/dataplex-types/locations/global/entryLinkTypes/related"],
          "referenced_entry_scopes": [PROJECT_IDS]
        }
      }
    }
    EOF
    )"

    Replace the following:

    • GLOSSARY_PROJECT_ID: the ID of the project that contains the glossary.
    • PROJECT_IDS: if terms are linked across glossaries in different projects, the IDs of those projects.
    • JOB_ID: (optional) a metadata import job ID, which you can use to track the job's status. If you don't provide an ID, Dataplex Universal Catalog generates a unique ID.
    • STORAGE_BUCKET: the URI of the Cloud Storage bucket or folder that contains the exported entry links file.

    Note the following:

    • The entry_groups object contains the entry group where the entry links are created. This is the @dataplex system entry group in the same project and location as the glossary.
    • The entry_link_types object lets you import synonyms, related terms, or both:

      • Synonyms: projects/dataplex-types/locations/global/entryLinkTypes/synonym
      • Related terms: projects/dataplex-types/locations/global/entryLinkTypes/related
    • The referenced_entry_scopes object includes the project IDs of the referenced entries when entry links connect terms from different glossaries.
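
    For example, to import only synonym links, you could narrow the scope to a single link type; the rest of the command stays the same:

    "entry_link_types": ["projects/dataplex-types/locations/global/entryLinkTypes/synonym"]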

Export links between terms and columns from Data Catalog

After you export and import the glossaries and the links between terms, export the links between terms and columns. In the following command, the link type is set to definition so that the script exports links between terms and columns.

## Clone the repository and navigate to the directory
git clone https://github.com/GoogleCloudPlatform/dataplex-labs.git
cd dataplex-labs
cd dataplex-quickstart-labs/00-resources/scripts/python/business-glossary-import

## Get your access token
export GCLOUD_ACCESS_TOKEN=$(gcloud auth print-access-token)

## Run the export code
python3 bg_import/business_glossary_export_v2.py \
 --user-project="PROJECT_ID" \
 --url="DATA_CATALOG_GLOSSARY_URL" \
 --export-mode=entry_links_only \
 --entrylinktype="definition"

Import links between terms and columns to Dataplex Universal Catalog

Next, import the links between terms and columns that you exported in the previous step. This section describes how to import them by using the metadata job API.

  1. Upload each file exported in the preceding step to a Cloud Storage bucket, and grant the Dataplex Universal Catalog service account read access to the bucket, as described in the earlier import steps.

  2. Run a separate import command for each file uploaded in the Cloud Storage bucket. Each file corresponds to a unique entry group containing links between terms and columns of that entry group.

# Set GCURL alias
alias gcurl='curl -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json"'

# Import CURL Command
gcurl https://DATAPLEX_API/metadataJobs?metadata_job_id=JOB_ID -d "$(cat<<EOF
{
   "type":"IMPORT",
   "import_spec":{
      "log_level":"DEBUG",
      "source_storage_uri":"gs://STORAGE_BUCKET",
      "entry_sync_mode":"FULL",
      "aspect_sync_mode":"INCREMENTAL",
      "scope":{
         "entry_groups":[
            "projects/ENTRY_GROUP_PROJECT_ID/locations/ENTRY_GROUP_LOCATION_ID/entryGroups/ENTRY_GROUP_ID"
         ],
         "entry_link_types":[
            "projects/dataplex-types/locations/global/entryLinkTypes/definition",
         ],
         "referenced_entry_scopes":[
            PROJECT_IDS
         ]
      }
   }
}
EOF
)"

Replace DATAPLEX_API with dataplex.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID. Also replace ENTRY_GROUP_PROJECT_ID, ENTRY_GROUP_LOCATION_ID, and ENTRY_GROUP_ID with the project, location, and ID of the entry group that the file corresponds to, and replace JOB_ID, STORAGE_BUCKET, and PROJECT_IDS as in the previous import steps.

What's next