External DAGs migration from v4.2 to v5.0

This guide outlines the steps necessary to relocate output tables from external Directed Acyclic Graphs (DAGs), such as Weather and Trends, to their new locations within the Cortex Data Foundation v5.0 architecture. It is intended for users who implemented External DAGs in Cortex Data Foundation version 4.2 and are now upgrading to version 5.0. If you haven't used External DAGs or haven't deployed SAP, this guide is not applicable.

Context

Cortex Data Foundation versions prior to 4.2 used a _GEN_EXT flag to manage the deployment of external data sources, with some sources tied to specific workloads (like currency conversion for SAP). However, with version 5.0, this flag has been removed. Now, there's a new module dedicated to managing DAGs that can serve multiple workloads. This guide outlines steps to adjust your existing data pipelines to work with this new structure.

Cross-workload reusable DAGs

Cortex Data Foundation v5.0 introduces K9, a new component responsible for ingesting, processing, and modeling reusable data elements that are shared across various data sources. Reporting views now reference the K9_PROCESSING dataset to access these reusable components, streamlining data access and reducing redundancy. The following external data sources are now deployed as part of K9, into the K9_PROCESSING dataset:

  • date_dimension
  • holiday_calendar
  • trends
  • weather
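
For example, once v5.0 is deployed, these sources can be queried directly from the K9_PROCESSING dataset. The project ID below is a placeholder:

```sql
SELECT *
FROM `your-project.K9_PROCESSING.holiday_calendar`
LIMIT 10;
```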

SAP-dependent DAGs

The following SAP-dependent DAGs are still triggered by the generate_external_dags.sh script, but they now execute during the reporting build step and write into the SAP reporting dataset instead of the CDC (Change Data Capture) stage:

  • currency_conversion
  • inventory_snapshots
  • prod_hierarchy_texts
  • hier_reader

Migration Guide

This guide outlines the steps to upgrade your Cortex Data Foundation to version 5.0.

Deploy Cortex Data Foundation v5.0

First, deploy the newest version (v5.0) of Cortex Data Foundation to your projects, following these guidelines:

  1. Use your existing RAW and CDC datasets from prior development or staging deployments as the RAW and CDC datasets of this deployment; no modifications are made to them during deployment.
  2. Set both testData and SAP.deployCDC to False in config/config.json.
  3. Create a new SAP Reporting project separate from your existing v4.2 environment for testing purposes. This lets you safely evaluate the upgrade process without impacting your current operations.
  4. Optional. If you have active Airflow DAGs running for your previous Cortex Data Foundation version, pause them before proceeding with the migration. You can do this through the Airflow UI; for detailed instructions, see the Open Airflow UI from Composer and Pause the DAG documentation.

By following these steps, you can safely transition to Cortex Data Foundation version 5.0 and validate the new features and functionalities.
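
As a sketch, the settings from step 2 correspond to a config/config.json fragment like the following (all other keys are omitted, and the exact structure may vary between releases):

```json
{
  "testData": false,
  "SAP": {
    "deployCDC": false
  }
}
```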

Migrate existing tables

To migrate your existing tables to their new locations, use jinja-cli to render the provided migration script template, then execute the resulting script.

  1. Install jinja-cli with the following command:

    pip install jinja-cli
    
  2. Identify the following parameters from your existing version 4.2 deployment and your new version 5.0 deployment:

    • project_id_src: Source Google Cloud project where your existing SAP CDC dataset from the version 4.2 deployment is located. The K9_PROCESSING dataset is also created in this project.
    • project_id_tgt: Target Google Cloud project where your newly deployed SAP Reporting dataset from the version 5.0 deployment is located. This might be different from the source project.
    • dataset_cdc_processed: CDC BigQuery dataset where the CDC process lands the latest available records. This might be the same as the source dataset.
    • dataset_reporting_tgt: Target BigQuery reporting dataset where the Data Foundation for SAP predefined data models are deployed.
    • k9_datasets_processing: K9 BigQuery dataset where the K9 components (augmented data sources) are deployed.
  3. Create a JSON file with the required input data. Make sure to remove any DAGs you don't want to migrate from the migrate_list section:

    cat <<EOF > data.json
    {
      "project_id_src": "your-source-project",
      "project_id_tgt": "your-target-project",
      "dataset_cdc_processed": "your-cdc-processed-dataset",
      "dataset_reporting_tgt": "your-reporting-target-dataset-OR-SAP_REPORTING",
      "k9_datasets_processing": "your-k9-processing-dataset-OR-K9_PROCESSING",
      "migrate_list":
        [
            "holiday_calendar",
            "trends",
            "weather",
            "currency_conversion",
            "inventory_snapshots",
            "prod_hierarchy_texts",
            "hier_reader"
        ]
    }
    EOF
    

    For example, if you want to remove weather and trends, the file would look like the following:

    {
      "project_id_src": "kittycorn-demo",
      "project_id_tgt": "kittycorn-demo",
      "dataset_cdc_processed": "CDC_PROCESSED",
      "dataset_reporting_tgt": "SAP_REPORTING",
      "k9_datasets_processing": "K9_PROCESSING",
      "migrate_list":
        [
            "holiday_calendar",
            "currency_conversion",
            "inventory_snapshots",
            "prod_hierarchy_texts",
            "hier_reader"
        ]
    }
    
  4. Create an output folder with the following command:

      mkdir output
    
  5. Generate the parsed migration script with the following command (this command assumes you are at the root of the repository):

      jinja -d data.json -o output/migrate_external_dags.sql docs/external_dag_migration/scripts/migrate_external_dags.sql
    
  6. Examine the output SQL file and execute it in BigQuery to migrate your tables to their new locations.
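
Before rendering, it can help to sanity-check data.json. The following is a small, stdlib-only Python sketch; it is not part of the repository, and the key names and DAG list simply follow the example above:

```python
import json

# Required top-level keys in data.json (from the parameter list above).
REQUIRED_KEYS = {
    "project_id_src", "project_id_tgt", "dataset_cdc_processed",
    "dataset_reporting_tgt", "k9_datasets_processing", "migrate_list",
}

# DAG tables the migration template can move (from the example above).
KNOWN_DAGS = {
    "holiday_calendar", "trends", "weather", "currency_conversion",
    "inventory_snapshots", "prod_hierarchy_texts", "hier_reader",
}

def check_migration_input(path):
    """Load data.json and raise ValueError on missing keys or unknown DAGs."""
    with open(path) as f:
        data = json.load(f)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError("missing keys: %s" % sorted(missing))
    unknown = set(data["migrate_list"]) - KNOWN_DAGS
    if unknown:
        raise ValueError("unknown DAGs in migrate_list: %s" % sorted(unknown))
    return sorted(data["migrate_list"])
```

Running check_migration_input("data.json") before step 5 surfaces typos in keys or DAG names before they reach the rendered SQL.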

Update and unpause the Airflow DAGs

Back up the current DAG files in your Airflow bucket. Then, replace them with the newly generated files from your Cortex Data Foundation version 5.0 deployment. For detailed instructions, see the following documentation:

Validation and cleanup

The migration is now complete. You can now validate that all reporting views in the new v5.0 Reporting deployment are working correctly. If everything works properly, go through the process again, this time targeting your production Reporting dataset with the v5.0 deployment. Afterwards, you can remove all tables from their old locations by generating and executing the cleanup script:

    jinja -d data.json -o output/delete_old_dag_tables.sql docs/external_dag_migration/scripts/delete_old_dag_tables.sql