本页面可帮助您解决 Google Kubernetes Engine (GKE) 日志记录流水线本身的问题,例如日志未显示在 Cloud Logging 中。如需详细了解如何使用日志排查工作负载和集群问题,请参阅 GKE 问题排查简介。
Cloud Logging 中缺少集群日志
验证项目中是否已启用日志记录
列出已启用的服务:
gcloud services list --enabled --filter="NAME=logging.googleapis.com"
以下输出表明已为项目启用日志记录:
NAME TITLE logging.googleapis.com Cloud Logging API
可选:查看日志查看器中的日志,以确定谁停用了 API 以及何时停用 API:
protoPayload.methodName="google.api.serviceusage.v1.ServiceUsage.DisableService" protoPayload.response.services="logging.googleapis.com"
如果日志记录已停用,请启用日志记录:
gcloud services enable logging.googleapis.com
验证集群上是否已启用日志记录
列出集群:
gcloud container clusters list \ --project=PROJECT_ID \ '--format=value(name,loggingConfig.componentConfig.enableComponents)' \ --sort-by=name | column -t
替换以下内容:
PROJECT_ID
:您的 Google Cloud 项目 ID。
输出内容类似如下:
cluster-1 SYSTEM_COMPONENTS cluster-2 SYSTEM_COMPONENTS;WORKLOADS cluster-3
如果集群的值为空,则停用日志记录。例如,此输出中的
cluster-3
已停用日志记录。如果设置为
NONE
,则启用集群日志记录:gcloud container clusters update CLUSTER_NAME \ --logging=SYSTEM,WORKLOAD \ --location=COMPUTE_LOCATION
替换以下内容:
CLUSTER_NAME
:您的集群的名称。COMPUTE_LOCATION
:集群的 Compute Engine 位置。
验证您已在项目和集群中启用日志记录后,不妨考虑使用 Gemini Cloud Assist 调查服务来深入了解日志并相应解决问题。如需详细了解如何使用 Logs Explorer 以不同方式发起调查,请参阅 Gemini 文档中的使用 Gemini Cloud Assist 调查服务排查问题。
验证节点池中的节点是否具有 Cloud Logging 访问权限范围
节点需要以下范围之一才能将日志写入 Cloud Logging:
https://www.googleapis.com/auth/logging.write
https://www.googleapis.com/auth/cloud-platform
https://www.googleapis.com/auth/logging.admin
检查为集群中的每个节点池配置的范围:
gcloud container node-pools list --cluster=CLUSTER_NAME \ --format="table(name,config.oauthScopes)" \ --location COMPUTE_LOCATION
替换以下内容:
CLUSTER_NAME
:您的集群的名称。COMPUTE_LOCATION
:集群的 Compute Engine 位置。
将工作负载从旧节点池迁移到新创建的节点池,并监控进度。
创建具有正确日志记录范围的新节点池:
gcloud container node-pools create NODE_POOL_NAME \ --cluster=CLUSTER_NAME \ --location=COMPUTE_LOCATION \ --scopes="gke-default"
替换以下内容:
CLUSTER_NAME
:您的集群的名称。COMPUTE_LOCATION
:集群的 Compute Engine 位置。
识别节点服务账号缺少关键权限的集群
如需识别节点服务账号缺少关键权限的集群,请使用 NODE_SA_MISSING_PERMISSIONS
Recommender 子类型的 GKE 建议:
- 使用 Google Cloud 控制台。 前往 Kubernetes 集群页面。在特定集群的通知列中,查看授予关键权限建议。
使用 gcloud CLI,或者通过指定
NODE_SA_MISSING_PERMISSIONS
Recommender 子类型使用 Recommender API。如需查询此建议,请运行以下命令:
gcloud recommender recommendations list \ --recommender=google.container.DiagnosisRecommender \ --location LOCATION \ --project PROJECT_ID \ --format yaml \ --filter="recommenderSubtype:NODE_SA_MISSING_PERMISSIONS"
如需实施此建议,请向节点的服务账号授予 roles/container.defaultNodeServiceAccount
角色。
您可以运行一个脚本,在项目 Standard 集群和 Autopilot 集群的节点池中搜索任何不具备 GKE 所需权限的节点服务账号。此脚本使用 gcloud CLI 和 jq
实用程序。如需查看脚本,请展开以下部分:
查看脚本
#!/bin/bash
# Set your project ID
project_id=PROJECT_ID
project_number=$(gcloud projects describe "$project_id" --format="value(projectNumber)")
declare -a all_service_accounts
declare -a sa_missing_permissions
# Function to check if a service account has a specific permission
# $1: project_id
# $2: service_account
# $3: permission
service_account_has_permission() {
local project_id="$1"
local service_account="$2"
local permission="$3"
local roles=$(gcloud projects get-iam-policy "$project_id" \
--flatten="bindings[].members" \
--format="table[no-heading](bindings.role)" \
--filter="bindings.members:\"$service_account\"")
for role in $roles; do
if role_has_permission "$role" "$permission"; then
echo "Yes" # Has permission
return
fi
done
echo "No" # Does not have permission
}
# Function to check if a role has the specific permission
# $1: role
# $2: permission
role_has_permission() {
local role="$1"
local permission="$2"
gcloud iam roles describe "$role" --format="json" | \
jq -r ".includedPermissions" | \
grep -q "$permission"
}
# Function to add $1 into the service account array all_service_accounts
# $1: service account
add_service_account() {
local service_account="$1"
all_service_accounts+=( ${service_account} )
}
# Function to add service accounts into the global array all_service_accounts for a Standard GKE cluster
# $1: project_id
# $2: location
# $3: cluster_name
add_service_accounts_for_standard() {
local project_id="$1"
local cluster_location="$2"
local cluster_name="$3"
while read nodepool; do
nodepool_name=$(echo "$nodepool" | awk '{print $1}')
if [[ "$nodepool_name" == "" ]]; then
# skip the empty line which is from running `gcloud container node-pools list` in GCP console
continue
fi
while read nodepool_details; do
service_account=$(echo "$nodepool_details" | awk '{print $1}')
if [[ "$service_account" == "default" ]]; then
service_account="${project_number}-compute@developer.gserviceaccount.com"
fi
if [[ -n "$service_account" ]]; then
printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" $service_account $project_id $cluster_name $cluster_location $nodepool_name
add_service_account "${service_account}"
else
echo "cannot find service account for node pool $project_id\t$cluster_name\t$cluster_location\t$nodepool_details"
fi
done <<< "$(gcloud container node-pools describe "$nodepool_name" --cluster "$cluster_name" --zone "$cluster_location" --project "$project_id" --format="table[no-heading](config.serviceAccount)")"
done <<< "$(gcloud container node-pools list --cluster "$cluster_name" --zone "$cluster_location" --project "$project_id" --format="table[no-heading](name)")"
}
# Function to add service accounts into the global array all_service_accounts for an Autopilot GKE cluster
# Autopilot cluster only has one node service account.
# $1: project_id
# $2: location
# $3: cluster_name
add_service_account_for_autopilot(){
local project_id="$1"
local cluster_location="$2"
local cluster_name="$3"
while read service_account; do
if [[ "$service_account" == "default" ]]; then
service_account="${project_number}-compute@developer.gserviceaccount.com"
fi
if [[ -n "$service_account" ]]; then
printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" $service_account $project_id $cluster_name $cluster_location $nodepool_name
add_service_account "${service_account}"
else
echo "cannot find service account" for cluster "$project_id\t$cluster_name\t$cluster_location\t"
fi
done <<< "$(gcloud container clusters describe "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](autoscaling.autoprovisioningNodePoolDefaults.serviceAccount)")"
}
# Function to check whether the cluster is an Autopilot cluster or not
# $1: project_id
# $2: location
# $3: cluster_name
is_autopilot_cluster() {
local project_id="$1"
local cluster_location="$2"
local cluster_name="$3"
autopilot=$(gcloud container clusters describe "$cluster_name" --location "$cluster_location" --format="table[no-heading](autopilot.enabled)")
echo "$autopilot"
}
echo "--- 1. List all service accounts in all GKE node pools"
printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" "service_account" "project_id" "cluster_name" "cluster_location" "nodepool_name"
while read cluster; do
cluster_name=$(echo "$cluster" | awk '{print $1}')
cluster_location=$(echo "$cluster" | awk '{print $2}')
# how to find a cluster is a Standard cluster or an Autopilot cluster
autopilot=$(is_autopilot_cluster "$project_id" "$cluster_location" "$cluster_name")
if [[ "$autopilot" == "True" ]]; then
add_service_account_for_autopilot "$project_id" "$cluster_location" "$cluster_name"
else
add_service_accounts_for_standard "$project_id" "$cluster_location" "$cluster_name"
fi
done <<< "$(gcloud container clusters list --project "$project_id" --format="value(name,location)")"
echo "--- 2. Check if service accounts have permissions"
unique_service_accounts=($(echo "${all_service_accounts[@]}" | tr ' ' '\n' | sort -u | tr '\n' ' '))
echo "Service accounts: ${unique_service_accounts[@]}"
printf "%-60s| %-40s| %-40s| %-20s\n" "service_account" "has_logging_permission" "has_monitoring_permission" "has_performance_hpa_metric_write_permission"
for sa in "${unique_service_accounts[@]}"; do
logging_permission=$(service_account_has_permission "$project_id" "$sa" "logging.logEntries.create")
time_series_create_permission=$(service_account_has_permission "$project_id" "$sa" "monitoring.timeSeries.create")
metric_descriptors_create_permission=$(service_account_has_permission "$project_id" "$sa" "monitoring.metricDescriptors.create")
if [[ "$time_series_create_permission" == "No" || "$metric_descriptors_create_permission" == "No" ]]; then
monitoring_permission="No"
else
monitoring_permission="Yes"
fi
performance_hpa_metric_write_permission=$(service_account_has_permission "$project_id" "$sa" "autoscaling.sites.writeMetrics")
printf "%-60s| %-40s| %-40s| %-20s\n" $sa $logging_permission $monitoring_permission $performance_hpa_metric_write_permission
if [[ "$logging_permission" == "No" || "$monitoring_permission" == "No" || "$performance_hpa_metric_write_permission" == "No" ]]; then
sa_missing_permissions+=( ${sa} )
fi
done
echo "--- 3. List all service accounts that don't have the above permissions"
if [[ "${#sa_missing_permissions[@]}" -gt 0 ]]; then
printf "Grant roles/container.defaultNodeServiceAccount to the following service accounts: %s\n" "${sa_missing_permissions[@]}"
else
echo "All service accounts have the above permissions"
fi
识别集群中缺少关键权限的节点服务账号
GKE 使用关联到节点的 IAM 服务账号来运行日志记录和监控等系统任务。这些节点服务账号必须至少拥有项目的 Kubernetes Engine Default Node Service Account (roles/container.defaultNodeServiceAccount
) 角色。默认情况下,GKE 会将 Compute Engine 默认服务账号(在您的项目中自动创建)用作节点服务账号。
如果您的组织强制执行 iam.automaticIamGrantsForDefaultServiceAccounts
组织政策限制,则项目中的默认 Compute Engine 服务账号可能无法自动获得 GKE 所需的权限。
如需验证是否缺少日志记录权限,请检查集群的日志记录中是否存在
401
错误:[[ $(kubectl logs -l k8s-app=fluentbit-gke -n kube-system -c fluentbit-gke | grep -cw "Received 401") -gt 0 ]] && echo "true" || echo "false"
如果输出为
true
,则表示系统工作负载发生401
错误,这表明缺少权限。如果输出为false
,请跳过其余步骤,并尝试其他问题排查步骤。如需确定所有缺失的关键权限,请查看脚本。
-
找到节点使用的服务账号的名称:
控制台
- 前往 Kubernetes 集群页面:
- 在集群列表中,点击您要检查的集群的名称。
- 根据操作的集群模式,执行以下操作之一:
- 对于 Autopilot 模式集群,在安全部分中,找到服务账号字段。
- 对于 Standard 模式集群,请执行以下操作:
- 点击节点标签页。
- 在节点池表格中,点击节点池名称。此时会打开节点池详情页面。
- 在安全部分中,找到服务账号字段。
如果服务账号字段中的值为
default
,则表示节点使用 Compute Engine 默认服务账号。如果此字段中的值不是default
,则表示节点使用自定义服务账号。如需向自定义服务账号授予所需角色,请参阅使用最小权限 IAM 服务账号。gcloud
对于 Autopilot 模式集群,请运行以下命令:
gcloud container clusters describe
CLUSTER_NAME
\ --location=LOCATION
\ --flatten=autoscaling.autoprovisioningNodePoolDefaults.serviceAccount对于 Standard 模式集群,请运行以下命令:
gcloud container clusters describe
CLUSTER_NAME
\ --location=LOCATION
\ --format="table(nodePools.name,nodePools.config.serviceAccount)"如果输出为
default
,则表示节点使用 Compute Engine 默认服务账号。如果输出不是default
,则表示节点使用自定义服务账号。如需向自定义服务账号授予所需角色,请参阅使用最小权限 IAM 服务账号。 -
如需向 Compute Engine 默认服务账号授予
roles/container.defaultNodeServiceAccount
角色,请完成以下步骤:控制台
gcloud
- 找到您的 Google Cloud 项目编号:
gcloud projects describe PROJECT_ID \ --format="value(projectNumber)"
将
PROJECT_ID
替换为您的项目 ID。输出内容类似如下:
12345678901
- 将
roles/container.defaultNodeServiceAccount
角色授予 Compute Engine 默认服务账号:gcloud projects add-iam-policy-binding PROJECT_ID \ --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \ --role="roles/container.defaultNodeServiceAccount"
将
PROJECT_NUMBER
替换为上一步中的项目编号。
- 找到您的 Google Cloud 项目编号:
- 验证节点服务账号是否具有所需权限。检查脚本以进行验证。
用于确定 GKE 节点服务账号缺少哪些权限的脚本
您可以运行一个脚本,在项目 Standard 集群和 Autopilot 集群的节点池中搜索任何不具备 GKE 所需权限的节点服务账号。此脚本使用 gcloud CLI 和 jq
实用程序。如需查看脚本,请展开以下部分:
查看脚本
#!/bin/bash
# Set your project ID
project_id=PROJECT_ID
project_number=$(gcloud projects describe "$project_id" --format="value(projectNumber)")
declare -a all_service_accounts
declare -a sa_missing_permissions
# Function to check if a service account has a specific permission
# $1: project_id
# $2: service_account
# $3: permission
service_account_has_permission() {
local project_id="$1"
local service_account="$2"
local permission="$3"
local roles=$(gcloud projects get-iam-policy "$project_id" \
--flatten="bindings[].members" \
--format="table[no-heading](bindings.role)" \
--filter="bindings.members:\"$service_account\"")
for role in $roles; do
if role_has_permission "$role" "$permission"; then
echo "Yes" # Has permission
return
fi
done
echo "No" # Does not have permission
}
# Function to check if a role has the specific permission
# $1: role
# $2: permission
role_has_permission() {
local role="$1"
local permission="$2"
gcloud iam roles describe "$role" --format="json" | \
jq -r ".includedPermissions" | \
grep -q "$permission"
}
# Function to add $1 into the service account array all_service_accounts
# $1: service account
add_service_account() {
local service_account="$1"
all_service_accounts+=( ${service_account} )
}
# Function to add service accounts into the global array all_service_accounts for a Standard GKE cluster
# $1: project_id
# $2: location
# $3: cluster_name
add_service_accounts_for_standard() {
local project_id="$1"
local cluster_location="$2"
local cluster_name="$3"
while read nodepool; do
nodepool_name=$(echo "$nodepool" | awk '{print $1}')
if [[ "$nodepool_name" == "" ]]; then
# skip the empty line which is from running `gcloud container node-pools list` in GCP console
continue
fi
while read nodepool_details; do
service_account=$(echo "$nodepool_details" | awk '{print $1}')
if [[ "$service_account" == "default" ]]; then
service_account="${project_number}-compute@developer.gserviceaccount.com"
fi
if [[ -n "$service_account" ]]; then
printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" $service_account $project_id $cluster_name $cluster_location $nodepool_name
add_service_account "${service_account}"
else
echo "cannot find service account for node pool $project_id\t$cluster_name\t$cluster_location\t$nodepool_details"
fi
done <<< "$(gcloud container node-pools describe "$nodepool_name" --cluster "$cluster_name" --zone "$cluster_location" --project "$project_id" --format="table[no-heading](config.serviceAccount)")"
done <<< "$(gcloud container node-pools list --cluster "$cluster_name" --zone "$cluster_location" --project "$project_id" --format="table[no-heading](name)")"
}
# Function to add service accounts into the global array all_service_accounts for an Autopilot GKE cluster
# Autopilot cluster only has one node service account.
# $1: project_id
# $2: location
# $3: cluster_name
add_service_account_for_autopilot(){
local project_id="$1"
local cluster_location="$2"
local cluster_name="$3"
while read service_account; do
if [[ "$service_account" == "default" ]]; then
service_account="${project_number}-compute@developer.gserviceaccount.com"
fi
if [[ -n "$service_account" ]]; then
printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" $service_account $project_id $cluster_name $cluster_location $nodepool_name
add_service_account "${service_account}"
else
echo "cannot find service account" for cluster "$project_id\t$cluster_name\t$cluster_location\t"
fi
done <<< "$(gcloud container clusters describe "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](autoscaling.autoprovisioningNodePoolDefaults.serviceAccount)")"
}
# Function to check whether the cluster is an Autopilot cluster or not
# $1: project_id
# $2: location
# $3: cluster_name
is_autopilot_cluster() {
local project_id="$1"
local cluster_location="$2"
local cluster_name="$3"
autopilot=$(gcloud container clusters describe "$cluster_name" --location "$cluster_location" --format="table[no-heading](autopilot.enabled)")
echo "$autopilot"
}
echo "--- 1. List all service accounts in all GKE node pools"
printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" "service_account" "project_id" "cluster_name" "cluster_location" "nodepool_name"
while read cluster; do
cluster_name=$(echo "$cluster" | awk '{print $1}')
cluster_location=$(echo "$cluster" | awk '{print $2}')
# how to find a cluster is a Standard cluster or an Autopilot cluster
autopilot=$(is_autopilot_cluster "$project_id" "$cluster_location" "$cluster_name")
if [[ "$autopilot" == "True" ]]; then
add_service_account_for_autopilot "$project_id" "$cluster_location" "$cluster_name"
else
add_service_accounts_for_standard "$project_id" "$cluster_location" "$cluster_name"
fi
done <<< "$(gcloud container clusters list --project "$project_id" --format="value(name,location)")"
echo "--- 2. Check if service accounts have permissions"
unique_service_accounts=($(echo "${all_service_accounts[@]}" | tr ' ' '\n' | sort -u | tr '\n' ' '))
echo "Service accounts: ${unique_service_accounts[@]}"
printf "%-60s| %-40s| %-40s| %-20s\n" "service_account" "has_logging_permission" "has_monitoring_permission" "has_performance_hpa_metric_write_permission"
for sa in "${unique_service_accounts[@]}"; do
logging_permission=$(service_account_has_permission "$project_id" "$sa" "logging.logEntries.create")
time_series_create_permission=$(service_account_has_permission "$project_id" "$sa" "monitoring.timeSeries.create")
metric_descriptors_create_permission=$(service_account_has_permission "$project_id" "$sa" "monitoring.metricDescriptors.create")
if [[ "$time_series_create_permission" == "No" || "$metric_descriptors_create_permission" == "No" ]]; then
monitoring_permission="No"
else
monitoring_permission="Yes"
fi
performance_hpa_metric_write_permission=$(service_account_has_permission "$project_id" "$sa" "autoscaling.sites.writeMetrics")
printf "%-60s| %-40s| %-40s| %-20s\n" $sa $logging_permission $monitoring_permission $performance_hpa_metric_write_permission
if [[ "$logging_permission" == "No" || "$monitoring_permission" == "No" || "$performance_hpa_metric_write_permission" == "No" ]]; then
sa_missing_permissions+=( ${sa} )
fi
done
echo "--- 3. List all service accounts that don't have the above permissions"
if [[ "${#sa_missing_permissions[@]}" -gt 0 ]]; then
printf "Grant roles/container.defaultNodeServiceAccount to the following service accounts: %s\n" "${sa_missing_permissions[@]}"
else
echo "All service accounts have the above permissions"
fi
验证是否未达到 Cloud Logging Write API 配额
确认未达到 Cloud Logging 的 API 写入配额。
前往 Google Cloud 控制台中的配额页面。
按“Cloud Logging API”过滤表。
确认未达到任何配额。
使用 gcpdiag
调试 GKE 日志记录问题
如果您的 GKE 集群缺少日志或是获取的日志不完整,可使用 gcpdiag
工具进行问题排查。
gcpdiag
是一种开源工具,不是官方支持的 Google Cloud 产品。您可以使用 gcpdiag
工具来帮助识别和修复 Google Cloud项目问题。如需了解详情,请参阅 GitHub 上的 gcpdiag 项目。
- 项目级日志记录:确保托管 GKE 集群的 Google Cloud项目已启用 Cloud Logging API。
- 集群级日志记录:确认是否在 GKE 集群的配置中明确启用了日志记录。
- 节点池权限:确认集群节点池中的节点已启用“Cloud Logging 写入”范围,从而允许它们发送日志数据。
- 服务账号权限:确认节点池使用的服务账号是否具有与 Cloud Logging 交互所需的 IAM 权限。具体而言,“roles/logging.logWriter”角色通常是必需的。
- Cloud Logging API 写入配额:确认是否未超出指定时间范围内的 Cloud Logging API 写入配额。
Docker
您可以使用封装容器运行 gcpdiag
,以在 Docker 容器中启动 gcpdiag
。必须安装 Docker 或 Podman。
- 在本地工作站上复制并运行以下命令。
curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
- 执行
gcpdiag
命令:./gcpdiag runbook gke/logs \ --parameter project_id=PROJECT_ID \ --parameter name=GKE_NAME \ --parameter location=LOCATION
查看此 Runbook 的可用参数。
替换以下内容:
- PROJECT_ID:资源所在项目的 ID。
- GKE_NAME:GKE 集群的名称。
- LOCATION:GKE 集群所在的可用区或区域。
实用标志:
--universe-domain
:如果适用,则为托管资源的可信合作伙伴主权云网域--parameter
或-p
:Runbook 参数
如需查看所有 gcpdiag
工具标志的列表和说明,请参阅 gcpdiag
使用说明。
后续步骤
如果您在文档中找不到问题的解决方案,请参阅获取支持以获取进一步的帮助,包括以下主题的建议:
- 请与 Cloud Customer Care 联系,以提交支持请求。
- 通过在 StackOverflow 上提问并使用
google-kubernetes-engine
标记搜索类似问题,从社区获得支持。您还可以加入#kubernetes-engine
Slack 频道,以获得更多社区支持。 - 使用公开问题跟踪器提交 bug 或功能请求。