Step 5: Configure deployment

This page describes the fifth step to deploy Cortex Data Foundation, the core of Cortex Framework. In this step, you modify the configuration file in the Cortex Data Foundation repository to match your requirements.

Configuration file

The behavior of the deployment is controlled by the configuration file config.json in the Cortex Data Foundation repository. This file contains global configuration and configuration specific to each workload. Edit the config.json file according to your needs with the following steps:

  1. Open the file config.json from Cloud Shell.
  2. Edit the config.json file according to the following parameters (a minimal example sketch follows these steps):

    Parameter | Meaning | Default Value | Description
    --- | --- | --- | ---
    testData | Deploy Test Data | true | Deploy test data into the source datasets (from the project specified in testDataProject).
    deploySAP | Deploy SAP | true | Execute the deployment for the SAP workload (ECC or S/4HANA).
    deploySFDC | Deploy Salesforce | true | Execute the deployment for the Salesforce workload.
    deployMarketing | Deploy Marketing | true | Execute the deployment for the Marketing sources (Google Ads, CM360, and TikTok).
    deployDataMesh | Deploy Data Mesh | true | Execute the deployment for Data Mesh. For more information, see the Data Mesh User Guide.
    turboMode | Deploy in Turbo mode | true | Execute all view builds as a step in the same Cloud Build process, in parallel, for a faster deployment. If set to false, each reporting view is generated in its own sequential build step. We recommend setting it to true only when using test data or after any mismatch between reporting columns and the source data has been resolved.
    projectIdSource | Source Project ID | - | Project where the source dataset is and the build runs.
    projectIdTarget | Target Project ID | - | Target project for user-facing datasets (reporting and ML datasets).
    targetBucket | Target bucket to store generated DAG scripts | - | Bucket created previously where DAGs (and Dataflow temporary files) are generated. Avoid using the actual Airflow bucket.
    location | Location or Region | "US" | Location where the BigQuery dataset and Cloud Storage buckets are. See the restrictions listed under BigQuery dataset locations.
    languages | Filtering languages | ["E", "S"] | If not using test data, enter a single language (for example, ["E"]) or multiple languages (for example, ["E", "S"]) as relevant to your business. These values are used to replace placeholders in SQL in analytics models where available (SAP only for now; see the ERD).
    currencies | Filtering currencies | ["USD"] | If not using test data, enter a single currency (for example, ["USD"]) or multiple currencies (for example, ["USD", "CAD"]) as relevant to your business. These values are used to replace placeholders in SQL in analytics models where available (SAP only).
    testDataProject | Source for test harness | kittycorn-public | Source of the test data for demo deployments. Applies when testData is true. Don't change this value unless you have your own test harness.
    k9.datasets.processing | K9 datasets - Processing | "K9_PROCESSING" | Execute cross-workload templates (for example, date dimension) as defined in the K9 configuration file. These templates are normally required by the downstream workloads.
    k9.datasets.reporting | K9 datasets - Reporting | "K9_REPORTING" | Execute cross-workload templates and external data sources (for example, weather) as defined in the K9 configuration file. Commented out by default.
    DataMesh.deployDescriptions | Data Mesh - Asset descriptions | true | Deploy BigQuery asset schema descriptions.
    DataMesh.deployLakes | Data Mesh - Lakes & Zones | false | Deploy Dataplex lakes and zones that organize tables by processing layer; requires configuration before enabling.
    DataMesh.deployCatalog | Data Mesh - Catalog Tags and Templates | false | Deploy Data Catalog tags that allow custom metadata on BigQuery assets or fields; requires configuration before enabling.
    DataMesh.deployACLs | Data Mesh - Access Control | false | Deploy asset, row, or column level access control on BigQuery assets; requires configuration before enabling.
  3. Configure each specific workload as needed. You don't need to configure a workload if its deployment parameter (for example, deploySAP or deployMarketing) is set to false. For more information, see Step 3: Determine integration mechanism.
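
As an orientation aid, the following is a minimal sketch of what an edited config.json might look like, using only the global parameters from the preceding table. The project IDs and bucket name are placeholders, workload-specific sections are omitted, and the nesting of the k9 and DataMesh blocks is inferred from the dotted parameter names (k9.datasets.processing, DataMesh.deployDescriptions). Refer to the config.json file shipped in the repository for the authoritative layout.

  {
    "testData": true,
    "deploySAP": true,
    "deploySFDC": false,
    "deployMarketing": false,
    "deployDataMesh": true,
    "turboMode": true,
    "projectIdSource": "my-source-project",
    "projectIdTarget": "my-target-project",
    "targetBucket": "my-dags-bucket",
    "location": "US",
    "languages": ["E", "S"],
    "currencies": ["USD"],
    "testDataProject": "kittycorn-public",
    "k9": {
      "datasets": {
        "processing": "K9_PROCESSING",
        "reporting": "K9_REPORTING"
      }
    },
    "DataMesh": {
      "deployDescriptions": true,
      "deployLakes": false,
      "deployCatalog": false,
      "deployACLs": false
    }
  }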

To further customize your deployment, see the following optional steps:

Performance optimization for reporting views

Reporting artifacts can be created as views, or as tables that are refreshed regularly through DAGs. Views compute the data on each execution of a query, which keeps the results always fresh. Tables run the computation once, and the results can be queried multiple times without incurring higher compute costs, with faster runtimes. Each customer creates their own configuration according to their needs.

Materialized results are updated into a table. These tables can be further fine-tuned by adding Partitioning and Clustering properties to these tables.
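
For illustration only, this choice shows up as a small change in the settings entry for a reporting artifact (the entry format is described in detail under Customizing reporting settings file; the SQL file name here is hypothetical):

  # Created as a view: results are always fresh, but every query recomputes them.
  - sql_file: sales_orders.sql
    type: view

  # Materialized as a table: a Composer DAG refreshes it on the given schedule,
  # so repeated queries are cheaper and faster.
  - sql_file: sales_orders.sql
    type: table
    load_frequency: "@daily"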

The configuration files for each workload are located in the following paths within the Cortex Data Foundation repository:

Data Source | Settings file
--- | ---
Operational - SAP | src/SAP/SAP_REPORTING/reporting_settings_ecc.yaml
Operational - Salesforce Sales Cloud | src/SFDC/config/reporting_settings.yaml
Marketing - Google Ads | src/marketing/src/GoogleAds/config/reporting_settings.yaml
Marketing - CM360 | src/marketing/src/CM360/config/reporting_settings.yaml
Marketing - Meta | src/marketing/src/Meta/config/reporting_settings.yaml
Marketing - Salesforce Marketing Cloud | src/marketing/src/SFMC/config/reporting_settings.yaml
Marketing - TikTok | src/marketing/src/TikTok/config/reporting_settings.yaml

Customizing reporting settings file

The reporting_settings file drives how the BigQuery objects (tables or views) are created for the reporting datasets. Customize your file using the following parameter descriptions. Consider that this file contains two sections:

  1. bq_independent_objects: All BigQuery objects that can be created independently, without any other dependencies. When Turbo mode is enabled, these BigQuery objects are created in parallel during the deployment time, speeding up the deployment process.
  2. bq_dependent_objects: All BigQuery objects that need to be created in a specific order due to dependencies on other BigQuery objects. Turbo mode does not apply to this section.

The deployer first creates all the BigQuery objects listed in bq_independent_objects, and then all the objects listed in bq_dependent_objects. Define the following properties for each object (see the sketch after this list):

  1. sql_file: Name of the SQL file that creates a given object.
  2. type: Type of BigQuery object. Possible values:
    • view: If you want the object to be a BigQuery view.
    • table: If you want the object to be a BigQuery table.
    • script: To create other types of objects (for example, BigQuery functions and stored procedures).
  3. If type is set to table, the following optional properties can be defined:
    • load_frequency: Frequency at which a Composer DAG is executed to refresh this table. See Airflow documentation for details on possible values.
    • partition_details: How the table should be partitioned. This value is optional. For more information, see section Table partition.
    • cluster_details: How the table should be clustered. This value is optional. For more information, see section Cluster settings.
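
Putting these properties together, a reporting_settings.yaml file might look like the following sketch. The SQL file names are hypothetical, and the table properties are shown at the level described in this list; check the settings files shipped with each workload for the exact layout.

  bq_independent_objects:
    # A view with no dependencies; created in parallel when Turbo mode is enabled.
    - sql_file: currency_conversion.sql
      type: view

    # A standalone helper, for example a BigQuery function.
    - sql_file: fiscal_date_utils.sql
      type: script

  bq_dependent_objects:
    # A materialized table that depends on the objects above; refreshed daily by a Composer DAG.
    - sql_file: sales_orders.sql
      type: table
      load_frequency: "@daily"
      partition_details: {
        column: "erdat", partition_type: "time", time_grain: "day" }
      cluster_details: {columns: ["vkorg"]}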

Table partition

Certain settings files let you configure materialized tables with custom clustering and partitioning options. This can significantly improve query performance for large datasets. This option applies only for SAP cdc_settings.yaml and all reporting_settings.yaml files.

Table partitioning can be enabled by specifying partition_details, as in the following example:

- base_table: vbap
  load_frequency: "@daily"
  partition_details: {
    column: "erdat", partition_type: "time", time_grain: "day" }

Use the following parameters to control partitioning details for a given table:

Property | Description | Value
--- | --- | ---
column | Column by which the CDC table is partitioned. | Column name.
partition_type | Type of partition. | "time" for time-based partition (for more information, see Timestamp partitioned tables). "integer_range" for integer-based partition (for more information, see Integer range documentation).
time_grain | Time part to partition with. Required when partition_type = "time". | "hour", "day", "month", or "year".
integer_range_bucket | Bucket range. Required when partition_type = "integer_range". | "start" = Start value, "end" = End value, and "interval" = Interval of range.
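
For an integer-based partition, the same structure applies with partition_type set to "integer_range" and an integer_range_bucket block. The table and column below are hypothetical and only illustrate the shape of the settings:

  - base_table: bseg
    load_frequency: "@daily"
    partition_details: {
      column: "gjahr", partition_type: "integer_range",
      integer_range_bucket: { start: 2000, end: 2030, interval: 1 } }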

For more information about options and related limitations, see BigQuery Table Partition.

Cluster settings

Table clustering can be enabled by specifying cluster_details:

  - base_table: vbak
    load_frequency: "@daily"
    cluster_details: {columns: ["vkorg"]}

Use the following parameters to control cluster details for a given table:

Property | Description | Value
--- | --- | ---
columns | Columns by which a table is clustered. | List of column names. For example, "mjahr" and "matnr".
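
Partitioning and clustering can be combined for the same table. The following sketch reuses the columns from the earlier examples; whether they suit your data depends on your source system:

  - base_table: vbak
    load_frequency: "@daily"
    partition_details: {
      column: "erdat", partition_type: "time", time_grain: "day" }
    cluster_details: {columns: ["vkorg"]}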

For more information about options and related limitations, see Table cluster documentation.

Next steps

After you complete this step, move on to the next step in the deployment sequence:

  1. Establish workloads.
  2. Clone repository.
  3. Determine integration mechanism.
  4. Set up components.
  5. Configure deployment (this page).
  6. Execute deployment.