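Troubleshoot logging in GKE


This page helps you resolve issues with the Google Kubernetes Engine (GKE) logging pipeline itself, such as logs not appearing in Cloud Logging. To learn how to use logs to troubleshoot workload and cluster issues, see Introduction to GKE troubleshooting.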

Cluster logs are missing in Cloud Logging

Verify that logging is enabled in the project

  1. List the enabled services:

    gcloud services list --enabled --filter="NAME=logging.googleapis.com"
    

    The following output indicates that logging is enabled for the project:

    NAME                    TITLE
    logging.googleapis.com  Cloud Logging API
    

    Optional: check the logs in Logs Explorer to determine who disabled the API and when:

    protoPayload.methodName="google.api.serviceusage.v1.ServiceUsage.DisableService"
    protoPayload.response.services="logging.googleapis.com"
    
  2. If logging is disabled, enable logging:

    gcloud services enable logging.googleapis.com
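
The check-then-enable flow in the preceding steps can be scripted. The following is a minimal sketch rather than an official tool: the function names logging_api_enabled and ensure_logging_api are hypothetical, and the sketch assumes that the gcloud CLI is installed and authenticated. It reuses only the two commands shown above, together with the global --project flag.

```shell
#!/bin/bash

# Hypothetical helper: succeeds when the Cloud Logging API appears in the
# list of enabled services for the project passed as $1.
logging_api_enabled() {
  gcloud services list --enabled --project="$1" \
      --filter="NAME=logging.googleapis.com" 2>/dev/null |
    grep -q "logging.googleapis.com"
}

# Hypothetical helper: enables the API only when it is not already enabled.
ensure_logging_api() {
  if logging_api_enabled "$1"; then
    echo "Cloud Logging API is already enabled in $1"
  else
    gcloud services enable logging.googleapis.com --project="$1"
  fi
}
```

For example, after sourcing the file, run ensure_logging_api PROJECT_ID. Defining functions instead of top-level commands keeps the sketch safe to source without side effects.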
    

Verify that logging is enabled on the cluster

  1. List the clusters:

    gcloud container clusters list \
        --project=PROJECT_ID \
        '--format=value(name,loggingConfig.componentConfig.enableComponents)' \
        --sort-by=name | column -t
    

    Replace the following:

    • PROJECT_ID: your Google Cloud project ID.

    The output is similar to the following:

    cluster-1              SYSTEM_COMPONENTS
    cluster-2              SYSTEM_COMPONENTS;WORKLOADS
    cluster-3
    

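    If the value for a cluster is empty, logging is disabled. For example, cluster-3 in this output has logging disabled.

  2. If logging is disabled (set to NONE), enable logging on the cluster: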

    gcloud container clusters update CLUSTER_NAME  \
        --logging=SYSTEM,WORKLOAD \
        --location=COMPUTE_LOCATION
    

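    Replace the following:

    • CLUSTER_NAME: the name of your cluster.
    • COMPUTE_LOCATION: the Compute Engine location of your cluster.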

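After you verify that logging is enabled in your project and on your cluster, consider using Gemini Cloud Assist investigations to gain additional insights into your logs and help resolve issues. For more information about the different ways to start an investigation by using Logs Explorer, see Troubleshoot issues with Gemini Cloud Assist investigations in the Gemini documentation.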
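
To pick out only the clusters that have logging disabled, the cluster list from step 1 of this section can be filtered. The following is a minimal sketch: the function name list_clusters_without_logging is hypothetical, and the sketch assumes that the value format separates the two fields with a tab, so a cluster with no enabled components yields a line with a single field.

```shell
#!/bin/bash

# Hypothetical helper: prints the name of every cluster in project $1 whose
# enabled-components field is empty, which indicates that logging is disabled.
list_clusters_without_logging() {
  gcloud container clusters list \
      --project="$1" \
      --format='value(name,loggingConfig.componentConfig.enableComponents)' |
    awk 'NF == 1 { print $1 }'  # Only the name field is present.
}
```

With the sample output shown earlier in this section, list_clusters_without_logging PROJECT_ID would print only cluster-3.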

Verify that nodes in the node pools have the Cloud Logging access scope

Nodes require one of the following scopes to write logs to Cloud Logging:

  • https://www.googleapis.com/auth/logging.write
  • https://www.googleapis.com/auth/cloud-platform
  • https://www.googleapis.com/auth/logging.admin
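
  1. Check the scopes configured for each node pool in the cluster:

    gcloud container node-pools list --cluster=CLUSTER_NAME \
        --format="table(name,config.oauthScopes)" \
        --location=COMPUTE_LOCATION

    Replace the following:

    • CLUSTER_NAME: the name of your cluster.
    • COMPUTE_LOCATION: the Compute Engine location of your cluster.

  2. If a node pool is missing the required scope, create a new node pool with the correct logging scope:

    gcloud container node-pools create NODE_POOL_NAME \
        --cluster=CLUSTER_NAME \
        --location=COMPUTE_LOCATION \
        --scopes="gke-default"

    Replace NODE_POOL_NAME with a name for the new node pool.

  3. Migrate your workloads from the old node pool to the newly created node pool, and monitor the progress.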
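
The scope requirement at the start of this section can be expressed as a small check. The following is a minimal sketch: the function name has_logging_scope is hypothetical, and the sketch assumes that the scopes arrive as a single string with entries separated by semicolons, which is how the value format joined list entries in the cluster list output earlier on this page.

```shell
#!/bin/bash

# Hypothetical helper: succeeds when the ';'-separated scope list in $1
# contains at least one scope that allows writing logs to Cloud Logging.
has_logging_scope() {
  local prefix="https://www.googleapis.com/auth"
  local scope
  local -a scopes
  # Split the single input string on ';' into an array of scopes.
  IFS=';' read -r -a scopes <<< "$1"
  for scope in "${scopes[@]}"; do
    case "$scope" in
      "$prefix/logging.write" | "$prefix/cloud-platform" | "$prefix/logging.admin")
        return 0
        ;;
    esac
  done
  return 1
}
```

A possible use is to capture the scopes of one node pool with gcloud container node-pools describe and --format="value(config.oauthScopes)", then call has_logging_scope "$scopes" and create a replacement node pool only when the check fails.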

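Identify clusters with node service accounts that are missing critical permissions

To identify clusters with node service accounts that are missing critical permissions, use GKE recommendations of the NODE_SA_MISSING_PERMISSIONS recommender subtype:

  • Use the Google Cloud console. Go to the Kubernetes clusters page. In the Notifications column for specific clusters, check for the Grant critical permissions recommendation.
  • Use the gcloud CLI or the Recommender API by specifying the NODE_SA_MISSING_PERMISSIONS recommender subtype.

    To query for this recommendation, run the following command: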

    gcloud recommender recommendations list \
        --recommender=google.container.DiagnosisRecommender \
        --location LOCATION \
        --project PROJECT_ID \
        --format yaml \
        --filter="recommenderSubtype:NODE_SA_MISSING_PERMISSIONS"
    

To implement this recommendation, grant the roles/container.defaultNodeServiceAccount role to the node's service account.
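
Granting the role can be wrapped in a small function so that it is easy to repeat for several service accounts. The following is a minimal sketch: the function name grant_default_node_role is hypothetical, and the sketch assumes that the caller has permission to change the project's IAM policy.

```shell
#!/bin/bash

# Hypothetical helper: grants roles/container.defaultNodeServiceAccount on
# project $1 to the service account email passed as $2.
grant_default_node_role() {
  if [ "$#" -ne 2 ]; then
    echo "usage: grant_default_node_role PROJECT_ID SERVICE_ACCOUNT_EMAIL" >&2
    return 2
  fi
  gcloud projects add-iam-policy-binding "$1" \
      --member="serviceAccount:$2" \
      --role="roles/container.defaultNodeServiceAccount"
}
```

For example, grant_default_node_role PROJECT_ID SERVICE_ACCOUNT_EMAIL grants the role to a single node service account.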

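Identify node service accounts that are missing critical permissions in a cluster

GKE uses IAM service accounts that are attached to your nodes to run system tasks like logging and monitoring. At a minimum, these node service accounts must have the Kubernetes Engine Default Node Service Account (roles/container.defaultNodeServiceAccount) role on your project. By default, GKE uses the Compute Engine default service account, which is automatically created in your project, as the node service account.

If your organization enforces the iam.automaticIamGrantsForDefaultServiceAccounts organization policy constraint, the default Compute Engine service account in your project might not automatically get the permissions that GKE requires.

  • To verify whether logging permissions are missing, check for 401 errors in your cluster's logging: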

    [[ $(kubectl logs -l k8s-app=fluentbit-gke -n kube-system -c fluentbit-gke | grep -cw "Received 401") -gt 0 ]] && echo "true" || echo "false"
    

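    If the output is true, the system workload is experiencing 401 errors, which indicates that permissions are missing. If the output is false, skip the rest of these steps and try other troubleshooting steps. To identify all missing critical permissions, see the script on this page.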

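  1. Find the name of the service account that your nodes use:

    Console

    1. Go to the Kubernetes clusters page in the Google Cloud console.
    2. In the cluster list, click the name of the cluster that you want to inspect.
    3. Depending on the cluster mode of operation, do one of the following:
      • For Autopilot mode clusters, in the Security section, find the Service account field.
      • For Standard mode clusters, do the following:
        1. Click the Nodes tab.
        2. In the Node pools table, click a node pool name. The Node pool details page opens.
        3. In the Security section, find the Service account field.

    If the value in the Service account field is default, your nodes use the Compute Engine default service account. If the value in this field is not default, your nodes use a custom service account. To grant the required role to a custom service account, see Use least privilege IAM service accounts.

    gcloud

    For Autopilot mode clusters, run the following command: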

    gcloud container clusters describe CLUSTER_NAME \
        --location=LOCATION \
        --flatten=autoscaling.autoprovisioningNodePoolDefaults.serviceAccount

    For Standard mode clusters, run the following command:

    gcloud container clusters describe CLUSTER_NAME \
        --location=LOCATION \
        --format="table(nodePools.name,nodePools.config.serviceAccount)"

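    If the output is default, your nodes use the Compute Engine default service account. If the output is not default, your nodes use a custom service account. To grant the required role to a custom service account, see Use least privilege IAM service accounts.

  2. To grant the roles/container.defaultNodeServiceAccount role to the Compute Engine default service account, complete the following steps:

    Console

    1. Go to the Welcome page in the Google Cloud console.
    2. In the Project number field, click Copy to clipboard.
    3. Go to the IAM page.
    4. Click Grant access.
    5. In the New principals field, specify the following value:
      PROJECT_NUMBER-compute@developer.gserviceaccount.com
      Replace PROJECT_NUMBER with the project number that you copied.
    6. In the Select a role menu, select the Kubernetes Engine Default Node Service Account role.
    7. Click Save.

    gcloud

    1. Find your Google Cloud project number: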
      gcloud projects describe PROJECT_ID \
          --format="value(projectNumber)"

      Replace PROJECT_ID with your project ID.

      The output is similar to the following:

      12345678901
      
    2. Grant the roles/container.defaultNodeServiceAccount role to the Compute Engine default service account:
      gcloud projects add-iam-policy-binding PROJECT_ID \
          --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
          --role="roles/container.defaultNodeServiceAccount"

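      Replace PROJECT_NUMBER with the project number from the previous step.

  • Verify that the node service account has the required permissions. To verify, run the script on this page that identifies missing permissions.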
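
Two pieces of the procedure above lend themselves to small reusable helpers: the 401 check and the name of the Compute Engine default service account. The following is a minimal sketch: both function names are hypothetical. The first reads log text from standard input and uses the same "Received 401" word match as the command at the start of this procedure. The second only formats the email address from a project number.

```shell
#!/bin/bash

# Hypothetical helper: succeeds when the log text on stdin contains at least
# one whole-word occurrence of "Received 401".
has_401_errors() {
  grep -qw "Received 401"
}

# Hypothetical helper: prints the Compute Engine default service account
# email for the project number passed as $1.
default_compute_service_account() {
  printf '%s-compute@developer.gserviceaccount.com\n' "$1"
}
```

For example, pipe the output of kubectl logs -l k8s-app=fluentbit-gke -n kube-system -c fluentbit-gke into has_401_errors and continue with the remaining steps only when the function succeeds.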

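Script to identify missing permissions for GKE node service accounts

You can run a script that searches the node pools in your project's Standard and Autopilot clusters for any node service accounts that don't have the permissions that GKE requires. The script uses the gcloud CLI and the jq utility: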

#!/bin/bash

# Set your project ID
project_id=PROJECT_ID
project_number=$(gcloud projects describe "$project_id" --format="value(projectNumber)")
declare -a all_service_accounts
declare -a sa_missing_permissions

# Function to check if a service account has a specific permission
# $1: project_id
# $2: service_account
# $3: permission
service_account_has_permission() {
  local project_id="$1"
  local service_account="$2"
  local permission="$3"

  local roles=$(gcloud projects get-iam-policy "$project_id" \
          --flatten="bindings[].members" \
          --format="table[no-heading](bindings.role)" \
          --filter="bindings.members:\"$service_account\"")

  for role in $roles; do
    if role_has_permission "$role" "$permission"; then
      echo "Yes" # Has permission
      return
    fi
  done

  echo "No" # Does not have permission
}

# Function to check if a role has the specific permission
# $1: role
# $2: permission
role_has_permission() {
  local role="$1"
  local permission="$2"
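  # Emit one permission per line and require an exact, fixed-string match so
  # that a permission name cannot match as a substring of a longer one.
  gcloud iam roles describe "$role" --format="json" | \
  jq -r '.includedPermissions[]?' | \
  grep -qxF "$permission"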
}

# Function to add $1 into the service account array all_service_accounts
# $1: service account
add_service_account() {
  local service_account="$1"
  all_service_accounts+=( ${service_account} )
}

# Function to add service accounts into the global array all_service_accounts for a Standard GKE cluster
# $1: project_id
# $2: location
# $3: cluster_name
add_service_accounts_for_standard() {
  local project_id="$1"
  local cluster_location="$2"
  local cluster_name="$3"

  while read -r nodepool; do
    nodepool_name=$(echo "$nodepool" | awk '{print $1}')
    if [[ "$nodepool_name" == "" ]]; then
      # skip the empty line which is from running `gcloud container node-pools list` in GCP console
      continue
    fi
    while read -r nodepool_details; do
      service_account=$(echo "$nodepool_details" | awk '{print $1}')

      if [[ "$service_account" == "default" ]]; then
        service_account="${project_number}-compute@developer.gserviceaccount.com"
      fi
      if [[ -n "$service_account" ]]; then
        printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" "$service_account" "$project_id" "$cluster_name" "$cluster_location" "$nodepool_name"
        add_service_account "${service_account}"
      else
        # Use printf so that \t is expanded; plain echo prints it literally.
        printf "cannot find service account for node pool %s\t%s\t%s\t%s\n" "$project_id" "$cluster_name" "$cluster_location" "$nodepool_name"
      fi
    done <<< "$(gcloud container node-pools describe "$nodepool_name" --cluster "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](config.serviceAccount)")"
  done <<< "$(gcloud container node-pools list --cluster "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](name)")"

}

# Function to add service accounts into the global array all_service_accounts for an Autopilot GKE cluster
# Autopilot cluster only has one node service account.
# $1: project_id
# $2: location
# $3: cluster_name
add_service_account_for_autopilot(){
  local project_id="$1"
  local cluster_location="$2"
  local cluster_name="$3"

  while read -r service_account; do
      if [[ "$service_account" == "default" ]]; then
        service_account="${project_number}-compute@developer.gserviceaccount.com"
      fi
      if [[ -n "$service_account" ]]; then
        # Autopilot clusters have no node pool name to report, so print "-"
        # instead of a stale value left over from a Standard cluster.
        printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" "$service_account" "$project_id" "$cluster_name" "$cluster_location" "-"
        add_service_account "${service_account}"
      else
        printf "cannot find service account for cluster %s\t%s\t%s\n" "$project_id" "$cluster_name" "$cluster_location"
      fi
  done <<< "$(gcloud container clusters describe "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](autoscaling.autoprovisioningNodePoolDefaults.serviceAccount)")"
}


# Function to check whether the cluster is an Autopilot cluster or not
# $1: project_id
# $2: location
# $3: cluster_name
is_autopilot_cluster() {
  local project_id="$1"
  local cluster_location="$2"
  local cluster_name="$3"
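  # Pass --project so the lookup does not depend on the active gcloud configuration.
  autopilot=$(gcloud container clusters describe "$cluster_name" --location "$cluster_location" --project "$project_id" --format="table[no-heading](autopilot.enabled)")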
  echo "$autopilot"
}


echo "--- 1. List all service accounts in all GKE node pools"
printf "%-60s| %-40s| %-40s| %-10s| %-20s\n" "service_account" "project_id" "cluster_name" "cluster_location" "nodepool_name"
while read cluster; do
  cluster_name=$(echo "$cluster" | awk '{print $1}')
  cluster_location=$(echo "$cluster" | awk '{print $2}')
  # how to find a cluster is a Standard cluster or an Autopilot cluster
  autopilot=$(is_autopilot_cluster "$project_id" "$cluster_location" "$cluster_name")
  if [[ "$autopilot" == "True" ]]; then
    add_service_account_for_autopilot "$project_id" "$cluster_location"  "$cluster_name"
  else
    add_service_accounts_for_standard "$project_id" "$cluster_location"  "$cluster_name"
  fi
done <<< "$(gcloud container clusters list --project "$project_id" --format="value(name,location)")"

echo "--- 2. Check if service accounts have permissions"
unique_service_accounts=($(echo "${all_service_accounts[@]}" | tr ' ' '\n' | sort -u | tr '\n' ' '))

echo "Service accounts: ${unique_service_accounts[@]}"
printf "%-60s| %-40s| %-40s| %-20s\n" "service_account" "has_logging_permission" "has_monitoring_permission" "has_performance_hpa_metric_write_permission"
for sa in "${unique_service_accounts[@]}"; do
  logging_permission=$(service_account_has_permission "$project_id" "$sa" "logging.logEntries.create")
  time_series_create_permission=$(service_account_has_permission "$project_id" "$sa" "monitoring.timeSeries.create")
  metric_descriptors_create_permission=$(service_account_has_permission "$project_id" "$sa" "monitoring.metricDescriptors.create")
  if [[ "$time_series_create_permission" == "No" || "$metric_descriptors_create_permission" == "No" ]]; then
    monitoring_permission="No"
  else
    monitoring_permission="Yes"
  fi
  performance_hpa_metric_write_permission=$(service_account_has_permission "$project_id" "$sa" "autoscaling.sites.writeMetrics")
  printf "%-60s| %-40s| %-40s| %-20s\n" $sa $logging_permission $monitoring_permission $performance_hpa_metric_write_permission

  if [[ "$logging_permission" == "No" || "$monitoring_permission" == "No" || "$performance_hpa_metric_write_permission" == "No" ]]; then
    sa_missing_permissions+=( ${sa} )
  fi
done

echo "--- 3. List all service accounts that don't have the above permissions"
if [[ "${#sa_missing_permissions[@]}" -gt 0 ]]; then
  # Print the instruction once, then one service account per line.
  echo "Grant roles/container.defaultNodeServiceAccount to the following service accounts:"
  printf "  %s\n" "${sa_missing_permissions[@]}"
else
  echo "All service accounts have the above permissions"
fi

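Verify that Cloud Logging API write quotas have not been reached

Confirm that you have not reached the API write quotas for Cloud Logging.

  1. Go to the Quotas page in the Google Cloud console.

  2. Filter the table by "Cloud Logging API".

  3. Confirm that you have not reached any of the quotas.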

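Debug GKE logging issues with gcpdiag

If logs are missing from your GKE cluster or the logs that you get are incomplete, use the gcpdiag tool for troubleshooting.

gcpdiag is an open source tool. It is not an officially supported Google Cloud product. You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.

When logs from the GKE cluster are missing or incomplete, investigate potential causes by checking the following core configuration settings, which are essential for logging:

  • Project-level logging: ensure that the Google Cloud project that hosts the GKE cluster has the Cloud Logging API enabled.
  • Cluster-level logging: verify that logging is explicitly enabled in the configuration of the GKE cluster.
  • Node pool permissions: confirm that the nodes in the cluster's node pools have the Cloud Logging write scope enabled, which allows them to send log data.
  • Service account permissions: verify that the service account used by the node pools has the IAM permissions that it needs to interact with Cloud Logging. Specifically, the roles/logging.logWriter role is typically required.
  • Cloud Logging API write quotas: verify that the Cloud Logging API write quotas have not been exceeded within the specified time frame.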

Docker

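You can run gcpdiag by using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.

  1. Copy and run the following command on your local workstation.
    curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
  2. Execute the gcpdiag command: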
    ./gcpdiag runbook gke/logs \
        --parameter project_id=PROJECT_ID \
        --parameter name=GKE_NAME \
        --parameter location=LOCATION

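View the available parameters for this runbook.

Replace the following:

  • PROJECT_ID: the ID of the project that contains the resource.
  • GKE_NAME: the name of the GKE cluster.
  • LOCATION: the zone or region of the GKE cluster.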
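
When the runbook is run repeatedly, for example against several clusters, the invocation can be wrapped in a function that rejects a missing parameter before gcpdiag starts. The following is a minimal sketch: the function name run_gke_logs_runbook and the GCPDIAG variable are hypothetical conveniences, not part of gcpdiag. The variable defaults to the ./gcpdiag wrapper downloaded in the preceding steps.

```shell
#!/bin/bash

# Hypothetical helper: runs the gke/logs runbook for project $1, cluster $2,
# and location $3. Set GCPDIAG to override the path to the wrapper script.
run_gke_logs_runbook() {
  if [ "$#" -ne 3 ]; then
    echo "usage: run_gke_logs_runbook PROJECT_ID GKE_NAME LOCATION" >&2
    return 2
  fi
  "${GCPDIAG:-./gcpdiag}" runbook gke/logs \
      --parameter project_id="$1" \
      --parameter name="$2" \
      --parameter location="$3"
}
```

For example, run_gke_logs_runbook PROJECT_ID GKE_NAME LOCATION runs the same command as the preceding step.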

Useful flags: for a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.

What's next