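This page shows you how to investigate and resolve issues related to GKE logging.
Cluster logs missing in Cloud Logging
Verify that logging is enabled in the project
List the enabled services:
gcloud services list --enabled --filter="NAME=logging.googleapis.com"
The following output indicates that logging is enabled for the project:
NAME                    TITLE
logging.googleapis.com  Cloud Logging API
Optional: Check the logs in the Logs Explorer to determine who disabled the API and when:
protoPayload.methodName="google.api.serviceusage.v1.ServiceUsage.DisableService"
protoPayload.response.services="logging.googleapis.com"
If logging is disabled, enable it: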
gcloud services enable logging.googleapis.com
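The check and the enable step above can be combined so that the API is enabled only when it is missing. The following is a minimal sketch rather than part of the documented procedure: the function name logging_api_enabled is hypothetical, and it assumes the enabled service names arrive on standard input, one per line. The `config.name` field key shown in the usage comment is also an assumption about the `gcloud services list` output.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when logging.googleapis.com appears on stdin
# as a complete line (exact match, so partial names do not count).
logging_api_enabled() {
  grep -qxF "logging.googleapis.com"
}

# Example use (assumes an authenticated gcloud CLI; not executed here):
#   if ! gcloud services list --enabled --format="value(config.name)" | logging_api_enabled; then
#     gcloud services enable logging.googleapis.com
#   fi
```

Matching the whole line with `grep -x` avoids treating a similarly named service as the Cloud Logging API.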
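Verify that logging is enabled on the cluster
List the clusters:
gcloud container clusters list \
    --project=PROJECT_ID \
    '--format=value(name,loggingConfig.componentConfig.enableComponents)' \
    --sort-by=name | column -t
Replace the following:
PROJECT_ID: your Google Cloud project ID.
The output is similar to the following:
cluster-1  SYSTEM_COMPONENTS
cluster-2  SYSTEM_COMPONENTS;WORKLOADS
cluster-3
If the value for a cluster is empty, logging is disabled. For example, cluster-3 in this output has logging disabled. If logging is disabled for a cluster (the value is empty or set to NONE), enable cluster logging:
gcloud container clusters update CLUSTER_NAME \
    --logging=SYSTEM,WORKLOAD \
    --location=COMPUTE_LOCATION
Replace the following:
CLUSTER_NAME: the name of the cluster.
COMPUTE_LOCATION: the Compute Engine location of the cluster.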
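When a project has many clusters, scanning the list by eye is error-prone. The following sketch filters the listing shown above down to the clusters that have logging disabled. The function name clusters_without_logging is hypothetical, and the sketch assumes the input has the same two-column shape as the sample output: a cluster name followed by an optional component list.

```shell
#!/bin/bash
# Hypothetical helper: reads "NAME [COMPONENTS]" lines on stdin and prints the
# names of clusters whose component list is empty or NONE.
clusters_without_logging() {
  awk 'NF >= 1 && ($2 == "" || $2 == "NONE") { print $1 }'
}

# Example use (assumes an authenticated gcloud CLI; not executed here):
#   gcloud container clusters list \
#       --project=PROJECT_ID \
#       '--format=value(name,loggingConfig.componentConfig.enableComponents)' \
#     | clusters_without_logging
```

Each name that the helper prints is a candidate for the `gcloud container clusters update` command shown earlier.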
Verify that nodes in the node pools have a Cloud Logging access scope
Nodes must have one of the following scopes to write logs to Cloud Logging:
https://www.googleapis.com/auth/logging.write
https://www.googleapis.com/auth/cloud-platform
https://www.googleapis.com/auth/logging.admin
Check the scopes configured on each node pool in the cluster:
gcloud container node-pools list --cluster=CLUSTER_NAME \
    --format="table(name,config.oauthScopes)" \
    --location COMPUTE_LOCATION
Replace the following:
CLUSTER_NAME: the name of the cluster.
COMPUTE_LOCATION: the Compute Engine location of the cluster.
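The scope check can also be scripted. The following is a minimal sketch under stated assumptions: the function name has_logging_scope is hypothetical, and the scope list is passed as a single string in which the scopes are separated by semicolons, commas, or whitespace (the semicolon form matches how the `value(...)` format joins lists in the sample cluster output earlier on this page).

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the scope list in $1 contains at least one
# of the three scopes that allow writing logs to Cloud Logging.
has_logging_scope() {
  local IFS=$' \t\n;,'   # split the list on whitespace, semicolons, and commas
  local scope
  for scope in $1; do    # intentionally unquoted so that IFS splitting applies
    case "$scope" in
      https://www.googleapis.com/auth/logging.write | \
      https://www.googleapis.com/auth/cloud-platform | \
      https://www.googleapis.com/auth/logging.admin)
        return 0
        ;;
    esac
  done
  return 1
}

# Example use with a literal scope list:
#   has_logging_scope "https://www.googleapis.com/auth/logging.write" && echo "ok"
```

Each scope is compared as a whole token rather than as a substring, so a scope that merely begins with one of the required URLs, such as https://www.googleapis.com/auth/cloud-platform.read-only, is not accepted.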
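If a node pool is missing the required scope, create a new node pool with the correct logging scope:
gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=COMPUTE_LOCATION \
    --scopes="gke-default"
Replace the following:
NODE_POOL_NAME: the name of the new node pool.
CLUSTER_NAME: the name of the cluster.
COMPUTE_LOCATION: the Compute Engine location of the cluster.
Then migrate your workloads from the old node pool to the newly created node pool and monitor the progress.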
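Identify clusters with node service accounts that are missing critical permissions
To identify clusters with node service accounts that are missing critical permissions, use GKE recommendations of the NODE_SA_MISSING_PERMISSIONS recommender subtype:
- Use the Google Cloud console. Go to the Kubernetes clusters page. In the Notifications column for a specific cluster, look for the "Grant critical permissions" recommendation.
- Use the gcloud CLI or the Recommender API, and specify the NODE_SA_MISSING_PERMISSIONS recommender subtype. To query for this recommendation, run the following command:
gcloud recommender recommendations list \
    --recommender=google.container.DiagnosisRecommender \
    --location LOCATION \
    --project PROJECT_ID \
    --format yaml \
    --filter="recommenderSubtype:NODE_SA_MISSING_PERMISSIONS"
To implement this recommendation, grant the roles/container.defaultNodeServiceAccount role to the node's service account.
You can run a script that searches the Standard and Autopilot clusters in your project for any node service accounts that don't have the permissions that GKE requires. The script uses the gcloud CLI and the jq utility. The script is as follows: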
#!/bin/bash

# Set your project ID
project_id=PROJECT_ID
project_number=$(gcloud projects describe "$project_id" --format="value(projectNumber)")
declare -a all_service_accounts
declare -a sa_missing_permissions

# Function to check if a service account has a specific permission
# $1: project_id
# $2: service_account
# $3: permission
service_account_has_permission() {
  local project_id="$1"
  local service_account="$2"
  local permission="$3"
  local roles
  roles=$(gcloud projects get-iam-policy "$project_id" \
    --flatten="bindings[].members" \
    --format="table[no-heading](bindings.role)" \
    --filter="bindings.members:\"$service_account\"")
  for role in $roles; do
    if role_has_permission "$role" "$permission"; then
      echo "Yes" # Has permission
      return
    fi
  done
  echo "No" # Does not have permission
}

# Function to check if a role has the specific permission
# $1: role
# $2: permission
role_has_permission() {
  local role="$1"
  local permission="$2"
  # Print one permission per line and match the whole line exactly,
  # so that a permission is never matched as a substring of another.
  gcloud iam roles describe "$role" --format="json" | \
    jq -r '.includedPermissions[]?' | \
    grep -qxF "$permission"
}

# Function to add $1 into the service account array all_service_accounts
# $1: service account
add_service_account() {
  local service_account="$1"
  all_service_accounts+=( "${service_account}" )
}

# Function to add service accounts into the global array all_service_accounts for a Standard GKE cluster
# $1: project_id
# $2: location
# $3: cluster_name
add_service_accounts_for_standard() {
  local project_id="$1"
  local cluster_location="$2"
  local cluster_name="$3"
  while read -r nodepool; do
    nodepool_name=$(echo "$nodepool" | awk '{print $1}')
    if [[ "$nodepool_name" == "" ]]; then
      # Skip the empty line that `gcloud container node-pools list` can emit
      continue
    fi
    while read -r nodepool_details; do
      service_account=$(echo "$nodepool_details" | awk '{print $1}')
      if [[ "$service_account" == "default" ]]; then
        service_account="${project_number}-compute@developer.gserviceaccount.com"
      fi
      if [[ -n "$service_account" ]]; then
        printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" "$service_account" "$project_id" "$cluster_name" "$cluster_location" "$nodepool_name"
        add_service_account "${service_account}"
      else
        printf "cannot find service account for node pool %s\t%s\t%s\t%s\n" "$project_id" "$cluster_name" "$cluster_location" "$nodepool_details"
      fi
    done <<< "$(gcloud container node-pools describe "$nodepool_name" --cluster "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](config.serviceAccount)")"
  done <<< "$(gcloud container node-pools list --cluster "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](name)")"
}

# Function to add service accounts into the global array all_service_accounts for an Autopilot GKE cluster
# Autopilot cluster only has one node service account.
# $1: project_id
# $2: location
# $3: cluster_name
add_service_account_for_autopilot() {
  local project_id="$1"
  local cluster_location="$2"
  local cluster_name="$3"
  while read -r service_account; do
    if [[ "$service_account" == "default" ]]; then
      service_account="${project_number}-compute@developer.gserviceaccount.com"
    fi
    if [[ -n "$service_account" ]]; then
      # Autopilot clusters have no per-node-pool service account, so the node pool column is "-"
      printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" "$service_account" "$project_id" "$cluster_name" "$cluster_location" "-"
      add_service_account "${service_account}"
    else
      printf "cannot find service account for cluster %s\t%s\t%s\n" "$project_id" "$cluster_name" "$cluster_location"
    fi
  done <<< "$(gcloud container clusters describe "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](autoscaling.autoprovisioningNodePoolDefaults.serviceAccount)")"
}

# Function to check whether the cluster is an Autopilot cluster or not
# $1: project_id
# $2: location
# $3: cluster_name
is_autopilot_cluster() {
  local project_id="$1"
  local cluster_location="$2"
  local cluster_name="$3"
  autopilot=$(gcloud container clusters describe "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](autopilot.enabled)")
  echo "$autopilot"
}

echo "--- 1. List all service accounts in all GKE node pools"
printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" "service_account" "project_id" "cluster_name" "cluster_location" "nodepool_name"
while read -r cluster; do
  cluster_name=$(echo "$cluster" | awk '{print $1}')
  cluster_location=$(echo "$cluster" | awk '{print $2}')
  # Determine whether the cluster is a Standard cluster or an Autopilot cluster
  autopilot=$(is_autopilot_cluster "$project_id" "$cluster_location" "$cluster_name")
  if [[ "$autopilot" == "True" ]]; then
    add_service_account_for_autopilot "$project_id" "$cluster_location" "$cluster_name"
  else
    add_service_accounts_for_standard "$project_id" "$cluster_location" "$cluster_name"
  fi
done <<< "$(gcloud container clusters list --project "$project_id" --format="value(name,location)")"

echo "--- 2. Check if service accounts have permissions"
unique_service_accounts=($(echo "${all_service_accounts[@]}" | tr ' ' '\n' | sort -u | tr '\n' ' '))
echo "Service accounts: ${unique_service_accounts[@]}"
printf "%-60s| %-40s| %-40s| %-20s\n" "service_account" "has_logging_permission" "has_monitoring_permission" "has_performance_hpa_metric_write_permission"
for sa in "${unique_service_accounts[@]}"; do
  logging_permission=$(service_account_has_permission "$project_id" "$sa" "logging.logEntries.create")
  time_series_create_permission=$(service_account_has_permission "$project_id" "$sa" "monitoring.timeSeries.create")
  metric_descriptors_create_permission=$(service_account_has_permission "$project_id" "$sa" "monitoring.metricDescriptors.create")
  if [[ "$time_series_create_permission" == "No" || "$metric_descriptors_create_permission" == "No" ]]; then
    monitoring_permission="No"
  else
    monitoring_permission="Yes"
  fi
  performance_hpa_metric_write_permission=$(service_account_has_permission "$project_id" "$sa" "autoscaling.sites.writeMetrics")
  printf "%-60s| %-40s| %-40s| %-20s\n" "$sa" "$logging_permission" "$monitoring_permission" "$performance_hpa_metric_write_permission"
  if [[ "$logging_permission" == "No" || "$monitoring_permission" == "No" || "$performance_hpa_metric_write_permission" == "No" ]]; then
    sa_missing_permissions+=( "${sa}" )
  fi
done

echo "--- 3. List all service accounts that don't have the above permissions"
if [[ "${#sa_missing_permissions[@]}" -gt 0 ]]; then
  printf "Grant roles/container.defaultNodeServiceAccount to the following service accounts: %s\n" "${sa_missing_permissions[@]}"
else
  echo "All service accounts have the above permissions"
fi
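Identify node service accounts that are missing critical permissions in a cluster
GKE uses IAM service accounts that are attached to your nodes to run system tasks such as logging and monitoring. At a minimum, these node service accounts must have the Kubernetes Engine Default Node Service Account (roles/container.defaultNodeServiceAccount) role on your project. By default, GKE uses the Compute Engine default service account, which is created automatically in your project, as the node service account.
If your organization enforces the iam.automaticIamGrantsForDefaultServiceAccounts organization policy constraint, the default Compute Engine service account in your project might not automatically get the permissions that GKE requires.
To check whether logging permissions are missing, look for 401 errors in the cluster logs:
[[ $(kubectl logs -l k8s-app=fluentbit-gke -n kube-system -c fluentbit-gke | grep -cw "Received 401") -gt 0 ]] && echo "true" || echo "false"
If the output is true, the system workload is experiencing 401 errors, which indicate missing permissions. If the output is false, skip the rest of these steps and try a different troubleshooting procedure. To identify all missing critical permissions, run the script in the preceding section.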
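- Find the name of the service account that your nodes use:
Console
- Go to the Kubernetes clusters page.
- In the cluster list, click the name of the cluster that you want to inspect.
- Depending on the cluster mode of operation, do one of the following:
  - For Autopilot mode clusters, in the Security section, find the Service account field.
  - For Standard mode clusters, do the following:
    - Click the Nodes tab.
    - In the Node pools table, click a node pool name. The Node pool details page opens.
    - In the Security section, find the Service account field.
If the value in the Service account field is default, your nodes use the Compute Engine default service account. If the value in this field is not default, your nodes use a custom service account. To grant the required role to a custom service account, see "Use least privilege IAM service accounts".
gcloud
For Autopilot mode clusters, run the following command:
gcloud container clusters describe CLUSTER_NAME \
    --location=LOCATION \
    --flatten=autoscaling.autoprovisioningNodePoolDefaults.serviceAccount
For Standard mode clusters, run the following command:
gcloud container clusters describe CLUSTER_NAME \
    --location=LOCATION \
    --format="table(nodePools.name,nodePools.config.serviceAccount)"
If the output is default, your nodes use the Compute Engine default service account. If the output is not default, your nodes use a custom service account. To grant the required role to a custom service account, see "Use least privilege IAM service accounts".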
- To grant the roles/container.defaultNodeServiceAccount role to the Compute Engine default service account, complete the following steps:
Console
- Go to the Welcome page.
- In the Project number field, click Copy to clipboard.
- Go to the IAM page.
- Click Grant access.
- In the New principals field, specify the following value:
PROJECT_NUMBER-compute@developer.gserviceaccount.com
Replace PROJECT_NUMBER with the project number that you copied.
- In the Select a role menu, select the Kubernetes Engine Default Node Service Account role.
- Click Save.
gcloud
- Find your Google Cloud project number:
gcloud projects describe PROJECT_ID \
    --format="value(projectNumber)"
Replace PROJECT_ID with your project ID. The output is similar to the following:
12345678901
- Grant the roles/container.defaultNodeServiceAccount role to the Compute Engine default service account:
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/container.defaultNodeServiceAccount"
Replace PROJECT_NUMBER with the project number from the previous step.
- Verify that the node service accounts have the required permissions. To verify, run the script that searches for node service accounts with missing permissions.
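The two gcloud steps above depend on substituting the project number into the service account address correctly. The following sketch builds the `--member` value from a project number and rejects anything that is not purely numeric, which guards against passing the project ID by mistake. The function name default_node_sa_member is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: prints the IAM member string for the Compute Engine
# default service account of the project with the given project number.
default_node_sa_member() {
  local project_number="$1"
  # A project number is numeric; a project ID (for example "my-project") is not.
  if [[ ! "$project_number" =~ ^[0-9]+$ ]]; then
    echo "project number must be numeric: '$project_number'" >&2
    return 1
  fi
  printf 'serviceAccount:%s-compute@developer.gserviceaccount.com\n' "$project_number"
}

# Example use (assumes an authenticated gcloud CLI; not executed here):
#   member=$(default_node_sa_member "$(gcloud projects describe PROJECT_ID --format='value(projectNumber)')")
#   gcloud projects add-iam-policy-binding PROJECT_ID \
#       --member="$member" \
#       --role="roles/container.defaultNodeServiceAccount"
```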
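Verify that Cloud Logging write API quotas have not been reached
Verify that you have not reached the API write quotas for Cloud Logging.
Go to the Quotas page in the Google Cloud console.
Filter the table by Cloud Logging API.
Confirm that you have not reached any of the quotas.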
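Debug GKE logging issues with gcpdiag
If logs from your GKE cluster are missing or incomplete, use the gcpdiag tool to troubleshoot.
gcpdiag is an open source tool. It is not an officially supported Google Cloud product. You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.
The checks focus on the following configuration settings:
- Project-level logging: Ensures that the Google Cloud project that hosts the GKE cluster has the Cloud Logging API enabled.
- Cluster-level logging: Verifies that logging is explicitly enabled in the configuration of the GKE cluster.
- Node pool permissions: Confirms that the nodes in the cluster's node pools have the Cloud Logging Write scope enabled, which lets them send log data.
- Service account permissions: Validates that the service account used by the node pools has the IAM permissions needed to interact with Cloud Logging. Specifically, the roles/logging.logWriter role is typically required.
- Cloud Logging API write quotas: Verifies that the Cloud Logging API write quotas have not been exceeded within the specified time frame.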
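Google Cloud console
- Complete and then copy the gcpdiag command that follows these steps.
- Open the Google Cloud console and activate Cloud Shell.
- Paste the copied command.
- Run the gcpdiag command, which downloads the gcpdiag Docker image and then performs diagnostic checks. If applicable, follow the instructions in the output to fix failed checks.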
gcpdiag runbook gke/logs \
--parameter project_id=PROJECT_ID \
--parameter name=GKE_NAME \
--parameter location=LOCATION
Docker
You can run gcpdiag by using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.
- Copy and run the following command on your local workstation.
curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
- Run the gcpdiag command.
./gcpdiag runbook gke/logs \
    --parameter project_id=PROJECT_ID \
    --parameter name=GKE_NAME \
    --parameter location=LOCATION
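View the available parameters for this runbook.
Replace the following:
- PROJECT_ID: the ID of the project that contains the resource.
- GKE_NAME: the name of the GKE cluster.
- LOCATION: the region or zone of the GKE cluster.
Useful flags:
- --universe-domain: if applicable, the Trusted Partner Sovereign Cloud domain that hosts the resource
- --parameter or -p: runbook parameters
For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.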
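Because the runbook command takes three placeholder values, it can help to assemble and review the command before running it. The following is a minimal sketch: the function name build_gcpdiag_logs_command is hypothetical, and it only prints the command line from the runbook invocation shown above without executing gcpdiag.

```shell
#!/bin/bash
# Hypothetical helper: prints the gcpdiag runbook command for the given
# project ID, cluster name, and location. It does not run gcpdiag.
build_gcpdiag_logs_command() {
  local project_id="$1" cluster_name="$2" location="$3"
  if [[ -z "$project_id" || -z "$cluster_name" || -z "$location" ]]; then
    echo "usage: build_gcpdiag_logs_command PROJECT_ID GKE_NAME LOCATION" >&2
    return 1
  fi
  local cmd=(gcpdiag runbook gke/logs
    --parameter "project_id=${project_id}"
    --parameter "name=${cluster_name}"
    --parameter "location=${location}")
  # Join the array elements with single spaces for display.
  printf '%s\n' "${cmd[*]}"
}

# Example use:
#   build_gcpdiag_logs_command PROJECT_ID GKE_NAME LOCATION
```

Printing the command first makes it easy to confirm that all three parameters are filled in before you paste it into Cloud Shell or a local terminal.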
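What's next
If you can't find a solution to your problem in the documentation, see Get support for further help, including advice on the following topics:
- Opening a support case by contacting Cloud Customer Care.
- Getting support from the community by asking questions on StackOverflow and using the google-kubernetes-engine tag to search for similar issues. You can also join the #kubernetes-engine Slack channel for more community support.
- Opening bugs or feature requests by using the public issue tracker.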