BigQuery 會代管多個公開資料集,供一般大眾查詢。在本教學課程中,您將建立工作流程,以便同時執行多個 BigQuery 查詢工作,並比較以串列方式逐一執行工作,與此相比,您可以看到效能有所提升。
目標
教學課程內容:- 對 Wikipedia 公開資料集執行查詢,找出特定月份瀏覽次數最多的標題。
- 部署及執行工作流程,以便依序逐一執行多項 BigQuery 查詢工作。
- 部署並執行工作流程,以平行疊代的方式執行 BigQuery 工作,並在平行執行一般
for
迴圈。
您可以在 Google Cloud 控制台中執行下列指令,或是在終端機或 Cloud Shell 中使用 Google Cloud CLI。
費用
In this document, you use the following billable components of Google Cloud:
To generate a cost estimate based on your projected usage,
use the pricing calculator.
事前準備
貴機構定義的安全性限制,可能會導致您無法完成下列步驟。如需疑難排解資訊,請參閱「在受限的 Google Cloud 環境中開發應用程式」。
控制台
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the Workflows API.
-
Create a service account:
-
In the Google Cloud console, go to the Create service account page.
Go to Create service account - Select your project.
-
In the Service account name field, enter a name. The Google Cloud console fills in the Service account ID field based on this name.
In the Service account description field, enter a description. For example,
Service account for quickstart
. - Click Create and continue.
-
Grant the following roles to the service account: BigQuery > BigQuery Job User, Logging > Logs Writer.
To grant a role, find the Select a role list, then select the role.
To grant additional roles, click
Add another role and add each additional role. - Click Continue.
-
Click Done to finish creating the service account.
-
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the Workflows API.
-
Create a service account:
-
In the Google Cloud console, go to the Create service account page.
Go to Create service account - Select your project.
-
In the Service account name field, enter a name. The Google Cloud console fills in the Service account ID field based on this name.
In the Service account description field, enter a description. For example,
Service account for quickstart
. - Click Create and continue.
-
Grant the following roles to the service account: BigQuery > BigQuery Job User, Logging > Logs Writer.
To grant a role, find the Select a role list, then select the role.
To grant additional roles, click
Add another role and add each additional role. - Click Continue.
-
Click Done to finish creating the service account.
-
gcloud
-
Sign in to your Google Account.
If you don't already have one, sign up for a new account.
-
Install the Google Cloud CLI.
-
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
-
To initialize the gcloud CLI, run the following command:
gcloud init
-
Create or select a Google Cloud project.
-
Create a Google Cloud project:
gcloud projects create PROJECT_ID
Replace
PROJECT_ID
with a name for the Google Cloud project you are creating. -
Select the Google Cloud project that you created:
gcloud config set project PROJECT_ID
Replace
PROJECT_ID
with your Google Cloud project name.
-
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the Workflows API:
gcloud services enable workflows.googleapis.com
-
Set up authentication:
-
Create the service account:
gcloud iam service-accounts create SERVICE_ACCOUNT_NAME
Replace
SERVICE_ACCOUNT_NAME
with a name for the service account. -
Grant roles to the service account. Run the following command once for each of the following IAM roles:
roles/bigquery.jobUser, roles/logging.logWriter
:gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com" --role=ROLE
Replace the following:
SERVICE_ACCOUNT_NAME
: the name of the service accountPROJECT_ID
: the project ID where you created the service accountROLE
: the role to grant
-
-
Install the Google Cloud CLI.
-
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
-
To initialize the gcloud CLI, run the following command:
gcloud init
-
Create or select a Google Cloud project.
-
Create a Google Cloud project:
gcloud projects create PROJECT_ID
Replace
PROJECT_ID
with a name for the Google Cloud project you are creating. -
Select the Google Cloud project that you created:
gcloud config set project PROJECT_ID
Replace
PROJECT_ID
with your Google Cloud project name.
-
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the Workflows API:
gcloud services enable workflows.googleapis.com
-
Set up authentication:
-
Create the service account:
gcloud iam service-accounts create SERVICE_ACCOUNT_NAME
Replace
SERVICE_ACCOUNT_NAME
with a name for the service account. -
Grant roles to the service account. Run the following command once for each of the following IAM roles:
roles/bigquery.jobUser, roles/logging.logWriter
:gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com" --role=ROLE
Replace the following:
SERVICE_ACCOUNT_NAME
: the name of the service accountPROJECT_ID
: the project ID where you created the service accountROLE
: the role to grant
-
執行 BigQuery 查詢工作
在 BigQuery 中,您可以執行互動式 (隨選) 查詢工作。如需更多資訊,請參閱「執行互動式和批次查詢工作」。
控制台
前往 Google Cloud 控制台的「BigQuery」頁面。
在「Query editor」(查詢編輯器) 文字區域中輸入下列 BigQuery SQL 查詢:
SELECT TITLE, SUM(views) FROM `bigquery-samples.wikipedia_pageviews.201207h` GROUP BY TITLE ORDER BY SUM(views) DESC LIMIT 100
按一下「執行」。
bq
在終端機中輸入下列 bq query
指令,即可使用標準 SQL 語法執行互動式查詢:
bq query \
--use_legacy_sql=false \
'SELECT
TITLE, SUM(views)
FROM
`bigquery-samples.wikipedia_pageviews.201207h`
GROUP BY
TITLE
ORDER BY
SUM(views) DESC
LIMIT 100'
這會執行查詢,傳回特定月份內觀看次數最多的前 100 個 Wikipedia 標題,並將輸出內容寫入臨時資料表。
請注意查詢執行所需的時間。
部署可依序執行多項查詢的工作流程
工作流程定義由一系列步驟組成,這些步驟會使用 Workflows 語法加以描述。建立工作流程後,您可以部署工作流程,讓系統執行。部署步驟也會驗證來源檔案是否可執行。
以下工作流程定義了五個資料表清單,以便使用工作流程 BigQuery 連接器執行查詢。查詢會依序逐一執行,每個資料表中觀看次數最多的書籍會儲存至結果地圖。
控制台
前往 Google Cloud 控制台的「Workflows」頁面:
按一下 [建立]。
輸入新工作流程的名稱,例如
workflow-serial-bqjobs
。選擇適當的地區,例如 us-central1。
選取先前建立的服務帳戶。
您應該已將 BigQuery > BigQuery 工作使用者 和 記錄 > 記錄寫入器 IAM 角色授予服務帳戶。
點按「Next」。
在工作流程編輯器中,輸入工作流程的定義:
按一下 [Deploy] (部署)。
gcloud
開啟終端機,然後為工作流程建立原始碼檔案:
touch workflow-serial-bqjobs.yaml
將以下工作流程複製到原始碼檔案中:
輸入下列指令來部署工作流程:
gcloud workflows deploy workflow-serial-bqjobs \ --source=workflow-serial-bqjobs.yaml \ --service-account=MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com
將
MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com
替換為您先前建立的服務帳戶電子郵件地址。您應該已將
roles/bigquery.jobUser
和roles/logging.logWriter
IAM 角色授予服務帳戶。
執行工作流程並依序執行多個查詢
執行工作流程會執行與工作流程相關聯的目前工作流程定義。
控制台
前往 Google Cloud 控制台的「Workflows」頁面:
在「Workflows」頁面中,選取「workflow-serial-bqjobs」工作流程,前往該工作流程的詳細資料頁面。
在「Workflow Details」(工作流程詳細資料) 頁面中,按一下 play_arrow「Execute」(執行)。
再次按一下「執行」。
在「Output」窗格中查看工作流程的結果。
gcloud
開啟終端機。
執行工作流程:
gcloud workflows run workflow-serial-bqjob
工作流程執行作業大約需要一分鐘或前次執行時間的五倍時間。結果會包含每個資料表,如下所示:
{
"201201h": {
"title": "Special:Search",
"views": "14591339"
},
"201202h": {
"title": "Special:Search",
"views": "132765420"
},
"201203h": {
"title": "Special:Search",
"views": "123316818"
},
"201204h": {
"title": "Special:Search",
"views": "116830614"
},
"201205h": {
"title": "Special:Search",
"views": "131357063"
}
}
部署並執行可並行執行多個查詢的工作流程
您可以進行一些變更,讓系統並行執行這五個查詢,而非依序執行:
- runQueries:
parallel:
shared: [results]
for:
value: table
in: ${tables}
parallel
步驟可讓for
迴圈的每個疊代都以平行方式執行。results
變數會宣告為shared
,讓分支可以寫入,且每個分支的結果可附加至其中。
控制台
前往 Google Cloud 控制台的「Workflows」頁面:
按一下 [建立]。
輸入新工作流程的名稱,例如
workflow-parallel-bqjobs
。選擇適當的地區,例如 us-central1。
選取先前建立的服務帳戶。
點按「Next」。
在工作流程編輯器中,輸入工作流程的定義:
按一下 [Deploy] (部署)。
在「Workflow Details」(工作流程詳細資料) 頁面中,按一下 play_arrow「Execute」(執行)。
再次按一下「執行」。
在「Output」窗格中查看工作流程的結果。
gcloud
開啟終端機,然後為工作流程建立原始碼檔案:
touch workflow-parallel-bqjobs.yaml
將以下工作流程複製到原始碼檔案中:
輸入下列指令來部署工作流程:
gcloud workflows deploy workflow-parallell-bqjobs \ --source=workflow-parallel-bqjobs.yaml \ --service-account=MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com
將
MY_SERVICE_ACCOUNT@MY_PROJECT.iam.gserviceaccount.com
替換為您先前建立的服務帳戶電子郵件地址。執行工作流程:
gcloud workflows run workflow-parallel-bqjobs
結果會與先前的輸出內容相似,但工作流程執行時間應約為二十秒或更短!
清除所用資源
如果您是為了這個教學課程建立新專案,請刪除專案。如果您使用現有的專案,且希望保留該專案而不採用本教學課程中新增的變更,請刪除為教學課程建立的資源。
刪除專案
如要避免付費,最簡單的方法就是刪除您為了本教學課程所建立的專案。
如要刪除專案:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
刪除教學課程資源
刪除在本教學課程中建立的工作流程:
gcloud workflows delete WORKFLOW_NAME
後續步驟
- 如要進一步瞭解並行步驟,請參閱「執行並行步驟」。
- 如要進一步瞭解 Workflows 連接器,請參閱「瞭解連接器」。
- 如要進一步瞭解 Workflows,請參閱「Workflows 總覽」。