同時執行多項 BigQuery 工作


BigQuery 會代管多個公開資料集,供一般大眾查詢。在本教學課程中,您將建立工作流程,以便同時執行多個 BigQuery 查詢工作,並比較以串列方式逐一執行工作,與此相比,您可以看到效能有所提升。

目標

教學課程內容:

  1. Wikipedia 公開資料集執行查詢,找出特定月份瀏覽次數最多的標題。
  2. 部署及執行工作流程,以便依序逐一執行多項 BigQuery 查詢工作。
  3. 部署並執行工作流程,以平行疊代的方式執行 BigQuery 工作,並在平行執行一般 for 迴圈。

您可以在 Google Cloud 控制台中執行下列指令,或是在終端機或 Cloud Shell 中使用 Google Cloud CLI。

費用

In this document, you use the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

事前準備

貴機構定義的安全性限制,可能會導致您無法完成下列步驟。如需疑難排解資訊,請參閱「在受限的 Google Cloud 環境中開發應用程式」。

控制台

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Workflows API.

    Enable the API

  5. Create a service account:

    1. In the Google Cloud console, go to the Create service account page.

      Go to Create service account
    2. Select your project.
    3. In the Service account name field, enter a name. The Google Cloud console fills in the Service account ID field based on this name.

      In the Service account description field, enter a description. For example, Service account for quickstart.

    4. Click Create and continue.
    5. Grant the following roles to the service account: BigQuery > BigQuery Job User, Logging > Logs Writer.

      To grant a role, find the Select a role list, then select the role.

      To grant additional roles, click Add another role and add each additional role.

    6. Click Continue.
    7. Click Done to finish creating the service account.

  6. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  7. Make sure that billing is enabled for your Google Cloud project.

  8. Enable the Workflows API.

    Enable the API

  9. Create a service account:

    1. In the Google Cloud console, go to the Create service account page.

      Go to Create service account
    2. Select your project.
    3. In the Service account name field, enter a name. The Google Cloud console fills in the Service account ID field based on this name.

      In the Service account description field, enter a description. For example, Service account for quickstart.

    4. Click Create and continue.
    5. Grant the following roles to the service account: BigQuery > BigQuery Job User, Logging > Logs Writer.

      To grant a role, find the Select a role list, then select the role.

      To grant additional roles, click Add another role and add each additional role.

    6. Click Continue.
    7. Click Done to finish creating the service account.

gcloud

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Install the Google Cloud CLI.

  3. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  4. To initialize the gcloud CLI, run the following command:

    gcloud init
  5. Create or select a Google Cloud project.

    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project you are creating.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  6. Make sure that billing is enabled for your Google Cloud project.

  7. Enable the Workflows API:

    gcloud services enable workflows.googleapis.com
  8. Set up authentication:

    1. Create the service account:

      gcloud iam service-accounts create SERVICE_ACCOUNT_NAME

      Replace SERVICE_ACCOUNT_NAME with a name for the service account.

    2. Grant roles to the service account. Run the following command once for each of the following IAM roles: roles/bigquery.jobUser, roles/logging.logWriter:

      gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com" --role=ROLE

      Replace the following:

      • SERVICE_ACCOUNT_NAME: the name of the service account
      • PROJECT_ID: the project ID where you created the service account
      • ROLE: the role to grant
  9. Install the Google Cloud CLI.

  10. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  11. To initialize the gcloud CLI, run the following command:

    gcloud init
  12. Create or select a Google Cloud project.

    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project you are creating.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  13. Make sure that billing is enabled for your Google Cloud project.

  14. Enable the Workflows API:

    gcloud services enable workflows.googleapis.com
  15. Set up authentication:

    1. Create the service account:

      gcloud iam service-accounts create SERVICE_ACCOUNT_NAME

      Replace SERVICE_ACCOUNT_NAME with a name for the service account.

    2. Grant roles to the service account. Run the following command once for each of the following IAM roles: roles/bigquery.jobUser, roles/logging.logWriter:

      gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com" --role=ROLE

      Replace the following:

      • SERVICE_ACCOUNT_NAME: the name of the service account
      • PROJECT_ID: the project ID where you created the service account
      • ROLE: the role to grant

執行 BigQuery 查詢工作

在 BigQuery 中,您可以執行互動式 (隨選) 查詢工作。如需更多資訊,請參閱「執行互動式和批次查詢工作」。

控制台

  1. 前往 Google Cloud 控制台的「BigQuery」頁面。

    前往 BigQuery

  2. 在「Query editor」(查詢編輯器) 文字區域中輸入下列 BigQuery SQL 查詢:

    SELECT TITLE, SUM(views)
    FROM `bigquery-samples.wikipedia_pageviews.201207h`
    GROUP BY TITLE
    ORDER BY SUM(views) DESC
    LIMIT 100
    
  3. 按一下「執行」

bq

在終端機中輸入下列 bq query 指令,即可使用標準 SQL 語法執行互動式查詢:

bq query \
--use_legacy_sql=false \
'SELECT
  TITLE, SUM(views)
FROM
  `bigquery-samples.wikipedia_pageviews.201207h`
GROUP BY
  TITLE
ORDER BY
  SUM(views) DESC
LIMIT 100'

這會執行查詢,傳回特定月份內觀看次數最多的前 100 個 Wikipedia 標題,並將輸出內容寫入臨時資料表。

請注意查詢執行所需的時間。

部署可依序執行多項查詢的工作流程

工作流程定義由一系列步驟組成,這些步驟會使用 Workflows 語法加以描述。建立工作流程後,您可以部署工作流程,讓系統執行。部署步驟也會驗證來源檔案是否可執行。

以下工作流程定義了五個資料表清單,以便使用工作流程 BigQuery 連接器執行查詢。查詢會依序逐一執行,每個資料表中觀看次數最多的書籍會儲存至結果地圖。

控制台

  1. 前往 Google Cloud 控制台的「Workflows」頁面:

    前往「Workflows」頁面

  2. 按一下 [建立]。

  3. 輸入新工作流程的名稱,例如 workflow-serial-bqjobs

  4. 選擇適當的地區,例如 us-central1

  5. 選取先前建立的服務帳戶

    您應該已將 BigQuery > BigQuery 工作使用者記錄 > 記錄寫入器 IAM 角色授予服務帳戶。

  6. 點按「Next」

  7. 在工作流程編輯器中,輸入工作流程的定義:

    main:
        steps:
        - init:
            assign:
                - results : {} # result from each iteration keyed by table name
                - tables:
                    - 201201h
                    - 201202h
                    - 201203h
                    - 201204h
                    - 201205h
        - runQueries:
            for:
                value: table
                in: ${tables}
                steps:
                - logTable:
                    call: sys.log
                    args:
                        text: ${"Running query for table " + table}
                - runQuery:
                    call: googleapis.bigquery.v2.jobs.query
                    args:
                        projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                        body:
                            useLegacySql: false
                            useQueryCache: false
                            timeoutMs: 30000
                            # Find top 100 titles with most views on Wikipedia
                            query: ${
                                "SELECT TITLE, SUM(views)
                                FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                WHERE LENGTH(TITLE) > 10
                                GROUP BY TITLE
                                ORDER BY SUM(VIEWS) DESC
                                LIMIT 100"
                                }
                    result: queryResult
                - returnResult:
                    assign:
                        # Return the top title from each table
                        - results[table]: {}
                        - results[table].title: ${queryResult.rows[0].f[0].v}
                        - results[table].views: ${queryResult.rows[0].f[1].v}
        - returnResults:
            return: ${results}
  8. 按一下 [Deploy] (部署)

gcloud

  1. 開啟終端機,然後為工作流程建立原始碼檔案:

    touch workflow-serial-bqjobs.yaml
  2. 將以下工作流程複製到原始碼檔案中:

    main:
        steps:
        - init:
            assign:
                - results : {} # result from each iteration keyed by table name
                - tables:
                    - 201201h
                    - 201202h
                    - 201203h
                    - 201204h
                    - 201205h
        - runQueries:
            for:
                value: table
                in: ${tables}
                steps:
                - logTable:
                    call: sys.log
                    args:
                        text: ${"Running query for table " + table}
                - runQuery:
                    call: googleapis.bigquery.v2.jobs.query
                    args:
                        projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                        body:
                            useLegacySql: false
                            useQueryCache: false
                            timeoutMs: 30000
                            # Find top 100 titles with most views on Wikipedia
                            query: ${
                                "SELECT TITLE, SUM(views)
                                FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                WHERE LENGTH(TITLE) > 10
                                GROUP BY TITLE
                                ORDER BY SUM(VIEWS) DESC
                                LIMIT 100"
                                }
                    result: queryResult
                - returnResult:
                    assign:
                        # Return the top title from each table
                        - results[table]: {}
                        - results[table].title: ${queryResult.rows[0].f[0].v}
                        - results[table].views: ${queryResult.rows[0].f[1].v}
        - returnResults:
            return: ${results}
  3. 輸入下列指令來部署工作流程:

    gcloud workflows deploy workflow-serial-bqjobs \
       --source=workflow-serial-bqjobs.yaml \
       --service-account=MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com

    MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com 替換為您先前建立的服務帳戶電子郵件地址。

    您應該已將 roles/bigquery.jobUserroles/logging.logWriter IAM 角色授予服務帳戶。

執行工作流程並依序執行多個查詢

執行工作流程會執行與工作流程相關聯的目前工作流程定義。

控制台

  1. 前往 Google Cloud 控制台的「Workflows」頁面:

    前往「Workflows」頁面

  2. 在「Workflows」頁面中,選取「workflow-serial-bqjobs」工作流程,前往該工作流程的詳細資料頁面。

  3. 在「Workflow Details」(工作流程詳細資料) 頁面中,按一下 「Execute」(執行)

  4. 再次按一下「執行」

  5. 在「Output」窗格中查看工作流程的結果。

gcloud

  1. 開啟終端機。

  2. 執行工作流程:

     gcloud workflows run workflow-serial-bqjob

工作流程執行作業大約需要一分鐘或前次執行時間的五倍時間。結果會包含每個資料表,如下所示:

{
  "201201h": {
    "title": "Special:Search",
    "views": "14591339"
  },
  "201202h": {
    "title": "Special:Search",
    "views": "132765420"
  },
  "201203h": {
    "title": "Special:Search",
    "views": "123316818"
  },
  "201204h": {
    "title": "Special:Search",
    "views": "116830614"
  },
  "201205h": {
    "title": "Special:Search",
    "views": "131357063"
  }
}

部署並執行可並行執行多個查詢的工作流程

您可以進行一些變更,讓系統並行執行這五個查詢,而非依序執行:

 - runQueries:
    parallel:
        shared: [results]
        for:
            value: table
            in: ${tables}
  • parallel 步驟可讓 for 迴圈的每個疊代都以平行方式執行。
  • results 變數會宣告為 shared,讓分支可以寫入,且每個分支的結果可附加至其中。

控制台

  1. 前往 Google Cloud 控制台的「Workflows」頁面:

    前往「Workflows」頁面

  2. 按一下 [建立]。

  3. 輸入新工作流程的名稱,例如 workflow-parallel-bqjobs

  4. 選擇適當的地區,例如 us-central1

  5. 選取先前建立的服務帳戶

  6. 點按「Next」

  7. 在工作流程編輯器中,輸入工作流程的定義:

    main:
        steps:
        - init:
            assign:
                - results : {} # result from each iteration keyed by table name
                - tables:
                    - 201201h
                    - 201202h
                    - 201203h
                    - 201204h
                    - 201205h
        - runQueries:
            parallel:
                shared: [results]
                for:
                    value: table
                    in: ${tables}
                    steps:
                    - logTable:
                        call: sys.log
                        args:
                            text: ${"Running query for table " + table}
                    - runQuery:
                        call: googleapis.bigquery.v2.jobs.query
                        args:
                            projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                            body:
                                useLegacySql: false
                                useQueryCache: false
                                timeoutMs: 30000
                                # Find top 100 titles with most views on Wikipedia
                                query: ${
                                    "SELECT TITLE, SUM(views)
                                    FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                    WHERE LENGTH(TITLE) > 10
                                    GROUP BY TITLE
                                    ORDER BY SUM(VIEWS) DESC
                                    LIMIT 100"
                                    }
                        result: queryResult
                    - returnResult:
                        assign:
                            # Return the top title from each table
                            - results[table]: {}
                            - results[table].title: ${queryResult.rows[0].f[0].v}
                            - results[table].views: ${queryResult.rows[0].f[1].v}
        - returnResults:
            return: ${results}
  8. 按一下 [Deploy] (部署)

  9. 在「Workflow Details」(工作流程詳細資料) 頁面中,按一下 「Execute」(執行)

  10. 再次按一下「執行」

  11. 在「Output」窗格中查看工作流程的結果。

gcloud

  1. 開啟終端機,然後為工作流程建立原始碼檔案:

    touch workflow-parallel-bqjobs.yaml
  2. 將以下工作流程複製到原始碼檔案中:

    main:
        steps:
        - init:
            assign:
                - results : {} # result from each iteration keyed by table name
                - tables:
                    - 201201h
                    - 201202h
                    - 201203h
                    - 201204h
                    - 201205h
        - runQueries:
            parallel:
                shared: [results]
                for:
                    value: table
                    in: ${tables}
                    steps:
                    - logTable:
                        call: sys.log
                        args:
                            text: ${"Running query for table " + table}
                    - runQuery:
                        call: googleapis.bigquery.v2.jobs.query
                        args:
                            projectId: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
                            body:
                                useLegacySql: false
                                useQueryCache: false
                                timeoutMs: 30000
                                # Find top 100 titles with most views on Wikipedia
                                query: ${
                                    "SELECT TITLE, SUM(views)
                                    FROM `bigquery-samples.wikipedia_pageviews." + table + "`
                                    WHERE LENGTH(TITLE) > 10
                                    GROUP BY TITLE
                                    ORDER BY SUM(VIEWS) DESC
                                    LIMIT 100"
                                    }
                        result: queryResult
                    - returnResult:
                        assign:
                            # Return the top title from each table
                            - results[table]: {}
                            - results[table].title: ${queryResult.rows[0].f[0].v}
                            - results[table].views: ${queryResult.rows[0].f[1].v}
        - returnResults:
            return: ${results}
  3. 輸入下列指令來部署工作流程:

    gcloud workflows deploy workflow-parallell-bqjobs \
       --source=workflow-parallel-bqjobs.yaml \
       --service-account=MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com

    MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com 替換為您先前建立的服務帳戶電子郵件地址。

  4. 執行工作流程:

     gcloud workflows run workflow-parallel-bqjobs

結果會與先前的輸出內容相似,但工作流程執行時間應約為二十秒或更短!

清除所用資源

如果您是為了這個教學課程建立新專案,請刪除專案。如果您使用現有的專案,且希望保留該專案而不採用本教學課程中新增的變更,請刪除為教學課程建立的資源

刪除專案

如要避免付費,最簡單的方法就是刪除您為了本教學課程所建立的專案。

如要刪除專案:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

刪除教學課程資源

刪除在本教學課程中建立的工作流程:

gcloud workflows delete WORKFLOW_NAME

後續步驟