Use Tag Engine to create bulk tags in Data Catalog

Last reviewed 2022-11-17 UTC

This guide shows you how to use Tag Engine to create bulk tags in Data Catalog. Bulk tags are a collection of similar tags which are created and updated together as a unit. Creating tags in bulk can reduce the time and effort required to tag your organization's data assets. This document introduces Tag Engine and shows you how to use it to tag a collection of assets when those assets have common metadata attributes.

Tag Engine is an open source tool that helps you create metadata for BigQuery and Cloud Storage assets. The tool supports populating tags that have static metadata or dynamic metadata. Tag Engine also supports the autotagging of new assets and the refreshing of existing tags as the underlying data changes.
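As a rough illustration of the difference between static and dynamic metadata, the following Python sketch shows two hypothetical tag configurations. The keys (template_id, included_uris, query_expression, and so on) are illustrative only, not the exact Tag Engine configuration schema:

```python
import json

# Hypothetical sketch only: the keys below are illustrative, not the
# exact Tag Engine configuration schema.

# A static configuration writes fixed values into each matching tag.
static_config = {
    "template_id": "data_governance",
    "included_uris": "bigquery/project/PROJECT_ID2/dataset/cities_311/*",
    "fields": [
        {"field_id": "data_domain", "field_value": "OPERATIONS"},
    ],
}

# A dynamic configuration computes field values with SQL when the tag is
# created or refreshed, so tags can track the underlying data as it changes.
dynamic_config = {
    "template_id": "cities_311",
    "included_uris": "bigquery/project/PROJECT_ID2/dataset/cities_311/*",
    "fields": [
        {"field_id": "sum_total_requests",
         "query_expression": "select count(*) from $table"},
    ],
}

print(json.dumps(dynamic_config, indent=2))
```

The practical difference is that a static tag stays as written until you update it, while a dynamic tag can be re-evaluated on a schedule as the underlying data changes.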

This document is intended for data stewards who are responsible for creating metadata that accurately describes their organization's data assets.

This document walks through the initial setup of Tag Engine, which includes running some Terraform scripts. It uses two examples to illustrate bulk tagging for BigQuery assets. Both examples start by creating a small dataset in BigQuery.

Tag Engine architecture

Tag Engine can be deployed in any of the following configurations:

  • In the same project as Data Catalog and BigQuery, which is suitable for proof-of-concept and development environments.
  • In the same project as Data Catalog, but in a different project from BigQuery, which is suitable for higher-level environments, like production.
  • In its own project, separate from both Data Catalog and BigQuery, which is also suitable for higher-level environments.

Tag Engine architecture components.

The preceding diagram illustrates the shared deployment pattern, which is the pattern this document uses.

Objectives

  • Create bulk metadata for your BigQuery assets with Tag Engine, as a data steward.
  • Create bulk metadata for your BigQuery assets with the Tag Engine API, as a data engineer.
  • Create static tag configurations using the Tag Engine UI or its API.
  • Create dynamic tag configurations using the Tag Engine UI or its API.
  • Keep the change history of your tags in BigQuery.
  • Publish real-time tag updates to Pub/Sub and alert on critical changes.
  • Review your data estate with the Tag Engine coverage report and prioritize tagging tasks.

Costs

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Identity and Access Management, App Engine, and Firestore APIs.

    Enable the APIs

  5. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  6. Make sure that billing is enabled for your Google Cloud project.

  7. Enable the Identity and Access Management, App Engine, and Firestore APIs.

    Enable the APIs

  8. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

This document introduces two Google Cloud projects, one for running Tag Engine and another for storing data in BigQuery. For the remainder of this document, the first Google Cloud project is called the Tag Engine project. The second Google Cloud project is called the BigQuery project.

If you don't plan to keep the resources that you create in this document, create two projects instead of selecting existing ones.

Deploy and configure Tag Engine

In this section, you deploy Tag Engine on Google Cloud. First, you create the Tag Engine database on Firestore and use Google Cloud CLI commands to deploy the Tag Engine application on App Engine. Then, you use Terraform to configure Tag Engine.

Download and install Terraform before deploying Tag Engine on Google Cloud. The Terraform scripts grant IAM permissions, create database indexes, and configure the cloud task and scheduler entries.

  1. In Cloud Shell, set the required environment variables:

    export TAG_ENGINE_PROJECT=PROJECT_ID1
    export TAG_ENGINE_REGION=us-central
    export TAG_ENGINE_SUB_REGION=us-central1
    export BQ_PROJECT=PROJECT_ID2
    export BQ_REGION=us-central1
    export TAG_ENGINE_SA=${TAG_ENGINE_PROJECT}@appspot.gserviceaccount.com
    gcloud config set project ${TAG_ENGINE_PROJECT}
    

    Replace the following:

    • PROJECT_ID1: the ID of your Tag Engine project
    • PROJECT_ID2: the ID of your BigQuery project
  2. Clone the GitHub Tag Engine code repository:

    git clone https://github.com/GoogleCloudPlatform/datacatalog-tag-engine.git
    
  3. Set the Terraform variables by creating a variables.tfvars file:

    cd datacatalog-tag-engine
    cat > deploy/variables.tfvars << EOL
    tag_engine_project="${TAG_ENGINE_PROJECT}"
    bigquery_project="${BQ_PROJECT}"
    app_engine_region="${TAG_ENGINE_REGION}"
    app_engine_subregion="${TAG_ENGINE_SUB_REGION}"
    EOL
    
  4. Create the Tag Engine configuration file (tagengine.ini):

    cat > tagengine.ini << EOL
    [DEFAULT]
    TAG_ENGINE_PROJECT = ${TAG_ENGINE_PROJECT}
    QUEUE_REGION = ${TAG_ENGINE_SUB_REGION}
    INJECTOR_QUEUE = tag-engine-injector-queue
    WORK_QUEUE = tag-engine-work-queue
    BIGQUERY_REGION = ${BQ_REGION}
    EOL
    
  5. Create an App Engine application for Tag Engine:

    gcloud app create \
      --project=${TAG_ENGINE_PROJECT} \
      --region=${TAG_ENGINE_REGION}
    
  6. Create the database:

    gcloud firestore databases create \
      --project=${TAG_ENGINE_PROJECT} \
      --region=${TAG_ENGINE_REGION}
    
  7. Deploy Tag Engine with the default App Engine service account:

    gcloud app deploy app.yaml
    

    Make sure that your service account is assigned the Editor role on the Tag Engine project.

  8. (Optional) Deploy Tag Engine with a user-managed service account:

    export TAG_ENGINE_SA=SERVICE_ACCOUNT
    gcloud beta app deploy --service-account=$TAG_ENGINE_SA app.yaml
    

    Replace SERVICE_ACCOUNT with the email of your user-managed service account.

    Assign your user-managed service account the Editor role on the Tag Engine project.

  9. Secure App Engine using firewall rules:

    gcloud app firewall-rules create 100 \
      --action ALLOW \
      --source-range IP_RANGE
    gcloud app firewall-rules update default --action deny
    

    By default, App Engine allows traffic from the public internet. Use App Engine firewall rules to restrict network traffic to a range of IP addresses.

    Alternatively, you can manage access with IAM using Identity-Aware Proxy (IAP).

    To set up IAP to access Tag Engine, configure OAuth and IAP access.

  10. Run the following Terraform commands. Enter yes when prompted.

    cd deploy
    terraform init
    terraform apply -var-file variables.tfvars
    
  11. Launch Tag Engine in your browser:

    gcloud app browse
    

The Tag Engine browser interface is the primary way for data stewards to create bulk metadata tags. Tag Engine creates all metadata tags from tag configurations.

Create a sample dataset

In this scenario, the sample dataset used in both examples contains three tables. The tables hold publicly available service request records for the cities of Austin, New York, and San Francisco, collected through each city's non-emergency 311 telephone service. The source tables are located in the us multi-region, but this document queries them from the us-central1 region. To move them, you export the tables to Cloud Storage and then import them into a dataset in the correct region in BigQuery.

  1. In Cloud Shell, create a bucket in Cloud Storage in the us-central1 region:

    export BUCKET=cities-311
    gsutil mb -l ${BQ_REGION} gs://${BUCKET}
    
  2. Export the three 311_service_requests tables into the Cloud Storage bucket:

    bq extract --destination_format=AVRO \
      bigquery-public-data:austin_311.311_service_requests \
      gs://${BUCKET}/austin/311_service_requests*.avro
    
    bq extract --destination_format=AVRO \
      bigquery-public-data:new_york_311.311_service_requests \
      gs://${BUCKET}/new_york/311_service_requests*.avro
    
    bq extract --destination_format=AVRO \
      bigquery-public-data:san_francisco_311.311_service_requests \
      gs://${BUCKET}/san_francisco/311_service_requests*.avro
    
  3. In Cloud Shell, create the sample dataset:

    bq --location=${BQ_REGION} mk --dataset ${BQ_PROJECT}:cities_311
    
  4. Load the files into BigQuery:

    bq load --source_format=AVRO ${BQ_PROJECT}:cities_311.austin_311_service_requests \
      gs://${BUCKET}/austin/*.avro
    
    bq load --source_format=AVRO ${BQ_PROJECT}:cities_311.new_york_311_service_requests \
      gs://${BUCKET}/new_york/*.avro
    
    bq load --source_format=AVRO ${BQ_PROJECT}:cities_311.san_francisco_311_service_requests \
      gs://${BUCKET}/san_francisco/*.avro
    
  5. Delete the cities 311 files and bucket from Cloud Storage:

    gsutil -m rm -r gs://${BUCKET}
    

Create Data Catalog tag templates

Before you can create the tag configurations in Tag Engine, you must first create one or more Data Catalog tag templates. A tag template is a Data Catalog concept that defines the schema of one or more tags of the same type. The schema specifies a collection of field names, their data types, and their sequence in the template. Every Data Catalog tag is instantiated from a tag template.
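To make the schema idea concrete, here is a minimal Python sketch that models a template as a set of typed fields with required flags, and checks whether a candidate tag supplies every required field. The field names echo the cities_311 template used in this document, but the validation helper is hypothetical, not part of Data Catalog:

```python
# Minimal illustration of a tag template as a schema: each field declares a
# type and whether it is required. A tag instantiated from the template must
# supply every required field. This helper is illustrative only, not a
# Data Catalog API.
template = {
    "sum_total_requests": {"type": "double", "required": True},
    "tag_snapshot_time": {"type": "datetime", "required": True},
    "avg_daily_total_requests": {"type": "double", "required": False},
}

def missing_required(template, tag):
    """Return the required template fields that the tag does not set."""
    return [name for name, spec in template.items()
            if spec["required"] and name not in tag]

tag = {"sum_total_requests": 1204.0}
print(missing_required(template, tag))  # → ['tag_snapshot_time']
```

Data Catalog enforces the same rule: creating a tag that omits a required template field fails.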

To annotate each 311 service request table using key performance metrics, create a tag template:

  1. In Cloud Shell, clone the GitHub datacatalog-templates.git repository and change into the datacatalog-templates folder:

    git clone https://github.com/GoogleCloudPlatform/datacatalog-templates.git
    cd datacatalog-templates
    
  2. Create and activate a virtual environment, and install the Python dependencies:

    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    
  3. Create the cities_311 tag template:

    python create_template.py $TAG_ENGINE_PROJECT $TAG_ENGINE_SUB_REGION \
      cities_311.yaml
    

    The create_template.py script creates a tag template based on the contents of the cities_311.yaml file. The cities_311 tag template contains 11 fields, 2 of which are required. You must provide both required fields when you create tags from this tag template.

    Field name Field type Required field
    avg_daily_total_requests double false
    avg_daily_open_requests double false
    avg_daily_closed_requests double false
    avg_daily_unknown_requests double false
    sum_total_requests double true
    unique_total_requests double false
    closed_total_requests double false
    open_total_requests double false
    unknown_total_requests double false
    unique_total_complaints double false
    tag_snapshot_time datetime true
  4. Open Data Catalog in Google Cloud console to view the Cities 311 Service Requests tag template. It should look similar to the following screenshot:

    cities_311 tag template.

  5. Create the data_governance tag template:

    python create_template.py ${TAG_ENGINE_PROJECT} ${TAG_ENGINE_SUB_REGION} \
        data_governance.yaml
    

    The data_governance template also contains 11 fields, 2 of which are required. You must provide both of the required fields when you create tags from this template.

    Field name Type Required
    data_domain enum true
    broad_data_category enum true
    environment enum false
    data_origin enum false
    data_creation datetime false
    data_ownership string false
    data_asset_owner string false
    data_asset_expert string false
    data_confidentiality enum false
    data_retention enum false
    data_asset_documentation string false
  6. In Google Cloud console, open Data Catalog to view the data_governance tag template. It should look similar to the following screenshot:

    Data governance tag template.

Tag Engine preferences

Tag Engine gives you the option to save a default tag template for convenience. There are also options to save a copy of each tag to BigQuery and to publish tag updates to Pub/Sub. Each of these options is detailed in this section. You can start tagging without setting them, but until you do, your tags aren't copied to BigQuery or published to Pub/Sub. If you want to skip these options, go directly to the Tagging overview section.

Save a default tag template

To avoid entering the tag template ID, project ID, and region each time you create or edit your tags, Tag Engine lets you save a default tag template.

Follow these steps to set your default tag template:

  1. Click the Set Default Tag Template link from the Tag Engine home page.
  2. Enter the following details:

    • Template ID: cities_311
    • Template Project: PROJECT_ID
    • Template Region: us-central1

    Replace PROJECT_ID with the ID of your Tag Engine project.

  3. Click Save Settings.

    The parameters on the home page are now pre-populated with your default tag template details.

Select a tag history option

Through the tag history option, you can save a copy of every tag that Tag Engine creates to a BigQuery table. This action lets you keep a change history of all of your tags. Every tag is written as a new table record in BigQuery with all tag values from one tag template stored in the same table. The table schema includes the creation time of the tag, the asset name, and every field from the tag template. The tag history table for the cities_311 template is created with the following fields:

Field name Field type Description
event_time datetime Timestamp of the tagging operation
asset_name string Name of the asset in BigQuery—for example, warehouse-337221/dataset/cities_311/table/new_york_311_service_requests
avg_daily_total_requests numeric Value for the field avg_daily_total_requests
avg_daily_open_requests numeric Value for the field avg_daily_open_requests
avg_daily_closed_requests numeric Value for the field avg_daily_closed_requests
avg_daily_unknown_requests numeric Value for the field avg_daily_unknown_requests
sum_total_requests numeric Value for the field sum_total_requests
unique_total_requests numeric Value for the field unique_total_requests
closed_total_requests numeric Value for the field closed_total_requests
open_total_requests numeric Value for the field open_total_requests
unknown_total_requests numeric Value for the field unknown_total_requests
unique_total_complaints numeric Value for the field unique_total_complaints
tag_snapshot_time datetime Value for the field tag_snapshot_time

The tag history table for the cities_311 template is named tag_history.cities_311, where tag_history is the BigQuery dataset in which you store your tag history tables.
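Because every tagging operation appends a new row, the current value of a tag can be reconstructed as the latest row per asset. The following local Python sketch illustrates that idea with made-up history rows; in BigQuery itself you would typically use a window function over event_time instead:

```python
# Tag history keeps every tag write as a separate row; the current tag value
# for an asset is the row with the latest event_time. The rows below are
# made-up sample data for illustration.
from datetime import datetime

history = [
    {"event_time": datetime(2022, 11, 1), "asset_name": "austin",
     "sum_total_requests": 100},
    {"event_time": datetime(2022, 11, 8), "asset_name": "austin",
     "sum_total_requests": 140},
    {"event_time": datetime(2022, 11, 1), "asset_name": "new_york",
     "sum_total_requests": 55},
]

# Walk the rows in time order; later writes overwrite earlier ones.
latest = {}
for row in sorted(history, key=lambda r: r["event_time"]):
    latest[row["asset_name"]] = row

print({a: r["sum_total_requests"] for a, r in latest.items()})
# → {'austin': 140, 'new_york': 55}
```

Keeping the full row history, rather than only the latest value, is what lets you audit how a tag changed over time.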

  1. To enable the tag history feature, create a BigQuery dataset for storing your tag history tables:

    bq mk --project_id ${BQ_PROJECT} tag_history
    
  2. Grant the Tag Engine service account edit permissions on the dataset:

     bq query --nouse_legacy_sql \
    "GRANT \`roles/bigquery.dataEditor\` ON SCHEMA \`${BQ_PROJECT}.tag_history\`
    TO 'serviceAccount:${TAG_ENGINE_PROJECT}@appspot.gserviceaccount.com'"
    
  3. In Tag Engine, click Turn on/off Tag History on the home page. Enter the following details:

    • Enable Tag History: