Gathering Cloud Composer settings

This page describes how to gather Cloud Composer settings to automate data processing within the Google Cloud Cortex Framework. If Cloud Composer is available, you need to create connections within Cloud Composer that point to the source project where your data resides. These connections act as bridges for Cloud Composer to access and interact with the data in the source project. For more information, see Creating new Airflow connections.

Create connections with the following names for DAG execution, based on the workload to deploy. Note that the SFDC Raw Ingestion module uses the same Airflow connection as the SFDC CDC module. For details about workloads, see Data sources and workloads. If you are creating tables in the Reporting layer, make sure to create separate connections for Reporting DAGs. For one way to create these connections programmatically, see the sketch after the table.

Deploying workload | Create for Raw         | Create for CDC   | Create for Reporting
SAP                | N/A                    | sap_cdc_bq       | sap_reporting_bq
SFDC               | sfdc_cdc_bq            | sfdc_cdc_bq      | sfdc_reporting_bq
Google Ads         | googleads_raw_dataflow | googleads_cdc_bq | googleads_reporting_bq
CM360              | cm360_raw_dataflow     | cm360_cdc_bq     | cm360_reporting_bq
TikTok             | tiktok_raw_dataflow    | tiktok_cdc_bq    | tiktok_reporting_bq
LiveRamp           | N/A                    | liveramp_cdc_bq  | N/A
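
For illustration, the following is a minimal sketch of creating one of these connections from inside the Cloud Composer environment (for example, from a one-off maintenance task), writing directly to the Airflow metadata database. The connection ID sap_cdc_bq comes from the table above; the project ID and the extra-field key are assumptions that depend on your Airflow and provider versions. In most cases you would create the connection through the Airflow UI or CLI instead.

import json

from airflow import settings
from airflow.models import Connection

def create_bq_connection(conn_id="sap_cdc_bq", project_id="your-source-project"):
    """Creates a _bq connection pointing at the source project, if it doesn't exist yet."""
    conn = Connection(
        conn_id=conn_id,
        conn_type="google_cloud_platform",
        # Extra-field keys vary across Airflow/provider versions; adjust as needed.
        extra=json.dumps({"extra__google_cloud_platform__project": project_id}),
    )
    session = settings.Session()
    if not session.query(Connection).filter(Connection.conn_id == conn_id).first():
        session.add(conn)
        session.commit()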

Connection Naming Conventions

Consider the following specifications for connection naming conventions:

  • Connection suffixes: The connection names include suffixes that indicate their intended purpose:
    • _bq: Used for accessing BigQuery data.
    • _dataflow: Used for running Dataflow jobs.
  • Raw data connections: You only need to create connections for Raw data if you are using the data ingestion modules provided by Cortex.
  • Multiple data sources: If you are deploying multiple data sources (for example, both SAP and Salesforce), it's recommended to create separate connections for each, assuming security limitations are applied to individual service accounts. Alternatively, you can modify the connection name in the template before deployment to use the same connection for writing to BigQuery.

Security Best Practices

  • Avoid Default Connections: Using the default connections and service accounts offered by Airflow is not recommended, especially in production environments. This aligns with the principle of least privilege, which emphasizes granting only the minimum access permissions necessary.
  • Secret Manager Integration: If you have Secret Manager enabled for Airflow, you can create these connections within Secret Manager using the same names. Connections stored in Secret Manager take precedence over those defined directly in Airflow.
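
For example, with the Secret Manager backend enabled and its default prefix, the connection sap_cdc_bq can be stored as a secret named airflow-connections-sap_cdc_bq. The following minimal sketch uses the google-cloud-secret-manager client; the project ID and connection URI are placeholders, and the exact payload format (URI or JSON) depends on your Airflow version.

from google.cloud import secretmanager

def store_connection_secret(project_id="your-composer-project",
                            conn_id="sap_cdc_bq",
                            conn_uri="google-cloud-platform://"):
    """Stores an Airflow connection definition as a Secret Manager secret."""
    client = secretmanager.SecretManagerServiceClient()
    # Default naming convention for the Secret Manager backend: airflow-connections-<conn_id>.
    secret = client.create_secret(
        request={
            "parent": f"projects/{project_id}",
            "secret_id": f"airflow-connections-{conn_id}",
            "secret": {"replication": {"automatic": {}}},
        }
    )
    client.add_secret_version(
        request={"parent": secret.name, "payload": {"data": conn_uri.encode("utf-8")}}
    )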

The Cloud Storage bucket structure for some of the template DAGs expects the folders to be in /data/bq_data_replication, as in the following example. You can modify this path prior to deployment. If you don't have a Cloud Composer environment available yet, you can create one afterwards and move the files into the DAG bucket; a sketch for copying files into this path follows the example below.

# Example imports for the Airflow 1.x-style operators used below; the
# ${...} tokens are replaced at deployment time.
import airflow
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

# default_dag_args is defined earlier in the generated template.
with airflow.DAG("CDC_BigQuery_${base table}",
                 template_searchpath=['/home/airflow/gcs/data/bq_data_replication/'],  # example
                 default_args=default_dag_args,
                 schedule_interval="${load_frequency}") as dag:
    start_task = DummyOperator(task_id="start")
    copy_records = BigQueryOperator(
        task_id='merge_query_records',
        sql="${query_file}",
        create_disposition='CREATE_IF_NEEDED',
        bigquery_conn_id="sap_cdc_bq",  # example
        use_legacy_sql=False)
    stop_task = DummyOperator(task_id="stop")
    start_task >> copy_records >> stop_task
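
To place the generated files under that path, one option is to copy them into the environment's bucket with the Cloud Storage client; the bucket's data/ folder is mounted at /home/airflow/gcs/data/ on the Airflow workers. The bucket and file names below are placeholders.

from google.cloud import storage

def upload_to_composer_bucket(bucket_name="your-composer-bucket",
                              local_file="cdc_sap_merge.sql",
                              destination="data/bq_data_replication/cdc_sap_merge.sql"):
    """Copies a local file into the data/ folder of the Composer environment bucket."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(destination).upload_from_filename(local_file)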

The scripts that process data in Airflow or Cloud Composer are purposefully generated separately from the Airflow-specific scripts. This lets you port those scripts to another tool of your choice.