Monitor Config Sync with Prometheus

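This page describes how to send Config Sync metrics to Prometheus and view them there. For other ways to export metrics, see "Monitor Config Sync with Cloud Monitoring" or "Monitor Config Sync with custom monitoring".

Config Sync automatically collects metrics and exports them to Prometheus. You can configure Cloud Monitoring to pull custom metrics from Prometheus, so that the same custom metrics are visible in both Prometheus and Monitoring. For details, see "Using Prometheus" in the GKE documentation.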

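Scrape the metrics

All Prometheus metrics are available for scraping on port 8675. Before you can scrape them, configure your cluster for Prometheus in one of the following ways:

Follow the Prometheus documentation to configure your cluster for scraping, or

Use the Prometheus Operator with the following manifests, which scrape all Config Sync metrics every 10 seconds.

Create a temporary directory to hold the manifest files.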
```
mkdir config-sync-monitor
cd config-sync-monitor
```
Download the Prometheus Operator manifest from the CoreOS repository with the curl command:
```
curl -o bundle.yaml https://raw.githubusercontent.com/coreos/prometheus-operator/master/bundle.yaml
```
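This manifest is configured to use the default namespace, which is not recommended. The next step modifies the configuration to use a namespace called monitoring. To use a different namespace, substitute it wherever monitoring appears in the remaining steps.

Create a file that updates the namespace of the ClusterRoleBinding in the bundle above.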

```
# patch-crb.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-operator
subjects:
- kind: ServiceAccount
  name: prometheus-operator
  namespace: monitoring # we are patching from default namespace
```

Create a kustomization.yaml file that applies the patch and changes the namespace of the other resources in the manifest.

```
# kustomization.yaml
resources:
- bundle.yaml

namespace: monitoring

patchesStrategicMerge:
- patch-crb.yaml
```

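Create the monitoring namespace if it does not exist. You can use a different namespace name, but if you do, also change the namespace value in the YAML manifests from the previous steps.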
```
kubectl create namespace monitoring
```

Apply the Kustomize manifest with the following commands:

```
kubectl apply -k .

until kubectl get customresourcedefinitions servicemonitors.monitoring.coreos.com ; \
do date; sleep 1; echo ""; done
```

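The second command blocks until the CRDs are available on the cluster.

Create the manifest for the resources needed to configure a Prometheus server that scrapes metrics from Config Sync.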

```
# config-sync-monitoring.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-config-sync
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-config-sync
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-config-sync
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-config-sync
subjects:
- kind: ServiceAccount
  name: prometheus-config-sync
  namespace: monitoring
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: config-sync
  namespace: monitoring
  labels:
    prometheus: config-sync
spec:
  replicas: 2
  serviceAccountName: prometheus-config-sync
  serviceMonitorSelector:
    matchLabels:
      prometheus: config-management
  alerting:
    alertmanagers:
    - namespace: default
      name: alertmanager
      port: web
  resources:
    requests:
      memory: 400Mi
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-config-sync
  namespace: monitoring
  labels:
    prometheus: config-sync
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 31900
    port: 9190
    protocol: TCP
    targetPort: web
  selector:
    prometheus: config-sync
```
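
The Prometheus resource in the manifest selects ServiceMonitor objects labeled prometheus: config-management, but no ServiceMonitor appears in the manifest as shown here, so nothing would be selected for scraping. The following is a minimal sketch of one, wrapped in a shell function so that the target namespace and port name can be swapped. The function name, the object name, the monitored: "true" Service label, the metrics port name, and the config-management-monitoring default are assumptions that this page does not confirm. Check the labels and port names on the metrics Service in your cluster before applying it.

```shell
# Hypothetical helper: print a ServiceMonitor manifest to stdout.
#   $1 = namespace of the Service that exposes Config Sync metrics (assumed default)
#   $2 = name of the Service port that serves metrics on 8675 (assumed default)
print_config_sync_service_monitor() {
  local target_ns="${1:-config-management-monitoring}"
  local port_name="${2:-metrics}"
  cat <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-config-sync-service   # hypothetical name
  namespace: monitoring
  labels:
    prometheus: config-management        # must match serviceMonitorSelector
spec:
  selector:
    matchLabels:
      monitored: "true"                  # assumed label on the metrics Service
  namespaceSelector:
    matchNames:
    - ${target_ns}
  endpoints:
  - port: ${port_name}                   # assumed Service port name
    interval: 10s
EOF
}

# Example (not run here):
#   print_config_sync_service_monitor | kubectl apply -f -
```

Printing the manifest instead of applying it directly leaves room to review the output, or to append it to config-sync-monitoring.yaml before the apply step.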

Apply the manifest with the following commands:

```
kubectl apply -f config-sync-monitoring.yaml

until kubectl rollout status statefulset/prometheus-config-sync -n monitoring; \
do sleep 1; done
```

The second command blocks until the Pods are running.

To verify the installation, forward the web port of the Prometheus server to your local machine.
```
kubectl -n monitoring port-forward svc/prometheus-config-sync 9190
```
You can now access the Prometheus web UI at http://localhost:9190.

Remove the temporary directory.
```
cd ..
rm -rf config-sync-monitor
```
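
While the port-forward from the verification step is running, the same check can be scripted against the Prometheus HTTP API instead of the web UI. The following is a minimal sketch, assuming the server is reachable at http://localhost:9190. The function name prom_instant_query is hypothetical; /api/v1/query is the standard Prometheus instant-query endpoint.

```shell
# Hypothetical helper: run an instant query against the Prometheus HTTP API.
#   $1 = PromQL expression
#   $2 = base URL (defaults to the port-forwarded address used on this page)
prom_instant_query() {
  local expr="$1"
  local base_url="${2:-http://localhost:9190}"
  # -G sends the URL-encoded data as a GET query string.
  curl -sS -G "${base_url}/api/v1/query" --data-urlencode "query=${expr}"
}

# Example (requires the port-forward to be active):
#   prom_instant_query 'config_sync_reconciler_errors'
```

The response is JSON. A "status" of "success" with a non-empty "result" array suggests that Config Sync metrics are reaching the server.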

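Available Prometheus metrics

Config Sync collects the following metrics and makes them available to Prometheus. The Labels column lists all labels that apply to each metric. A metric without labels represents a single measurement over time, while a metric with labels represents multiple measurements, one for each combination of label values.

If this table falls out of sync, you can filter metrics by prefix in the Prometheus user interface. The metrics start with the prefix config_sync_ (the last three rows of the table are shown with a bare kustomize_ prefix).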

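Name	Type	Labels	Description
`config_sync_api_duration_seconds_bucket`	Histogram	status, operation	Latency distribution of API server calls (distributed into buckets by the duration of each cycle)
`config_sync_api_duration_seconds_count`	Histogram	status, operation	Latency distribution of API server calls (ignoring duration)
`config_sync_api_duration_seconds_sum`	Histogram	status, operation	Sum of the durations of all API server calls
`config_sync_apply_duration_seconds_bucket`	Histogram	commit, status	Latency distribution of applying resources declared in the source of truth to a cluster (distributed into buckets by the duration of each cycle)
`config_sync_apply_duration_seconds_count`	Histogram	commit, status	Latency distribution of applying resources declared in the source of truth to a cluster (ignoring duration)
`config_sync_apply_duration_seconds_sum`	Histogram	commit, status	Sum of the durations of all latencies of applying resources declared in the source of truth to a cluster
`config_sync_apply_operations_total`	Counter	operation, status, controller	Number of operations performed to sync resources from the source of truth to a cluster
`config_sync_cluster_scoped_resource_count`	Gauge	resourcegroup	Number of cluster-scoped resources in a ResourceGroup
`config_sync_crd_count`	Gauge	resourcegroup	Number of CRDs in a ResourceGroup
`config_sync_declared_resources`	Gauge	commit	Number of declared resources parsed from Git
`config_sync_internal_errors_total`	Counter	source	Number of internal errors triggered by Config Sync. The metric might not appear if no internal error has occurred
`config_sync_kcc_resource_count`	Gauge	resourcegroup	Number of Config Connector resources in a ResourceGroup
`config_sync_last_apply_timestamp`	Gauge	commit, status	Timestamp of the most recent apply operation
`config_sync_last_sync_timestamp`	Gauge	commit, status	Timestamp of the most recent sync from Git
`config_sync_parser_duration_seconds_bucket`	Histogram	status, trigger, source	Latency distribution of the different stages involved in syncing from the source of truth to a cluster
`config_sync_parser_duration_seconds_count`	Histogram	status, trigger, source	Latency distribution of the different stages involved in syncing from the source of truth to a cluster (ignoring duration)
`config_sync_parser_duration_seconds_sum`	Histogram	status, trigger, source	Sum of the latencies of the different stages involved in syncing from the source of truth to a cluster
`config_sync_pipeline_error_observed`	Gauge	name, reconciler, component	Status of RootSync and RepoSync custom resources. A value of 1 indicates a failure
`config_sync_ready_resource_count`	Gauge	resourcegroup	Total number of ready resources in a ResourceGroup
`config_sync_reconcile_duration_seconds_bucket`	Histogram	status	Latency distribution of reconcile events handled by the reconciler manager (distributed into buckets by the duration of each call)
`config_sync_reconcile_duration_seconds_count`	Histogram	status	Latency distribution of reconcile events handled by the reconciler manager (ignoring duration)
`config_sync_reconcile_duration_seconds_sum`	Histogram	status	Sum of the durations of all latencies of reconcile events handled by the reconciler manager
`config_sync_reconciler_errors`	Gauge	component, errorclass	Number of errors encountered while syncing resources from the source of truth to a cluster
`config_sync_remediate_duration_seconds_bucket`	Histogram	status	Latency distribution of remediator reconciliation events (distributed into buckets by duration)
`config_sync_remediate_duration_seconds_count`	Histogram	status	Latency distribution of remediator reconciliation events (ignoring duration)
`config_sync_remediate_duration_seconds_sum`	Histogram	status	Sum of the durations of all latencies of remediator reconciliation events
`config_sync_resource_count`	Gauge	resourcegroup	Number of resources tracked by a ResourceGroup
`config_sync_resource_conflicts_total`	Counter	commit	Number of resource conflicts resulting from a mismatch between the cached resources and cluster resources. The metric might not appear if no resource conflict has occurred
`config_sync_resource_fights_total`	Counter		Number of resources that are being synced too frequently. The metric might not appear if no resource fight has occurred
`config_sync_resource_group_total`	Gauge		Number of ResourceGroup CRs
`config_sync_resource_ns_count`	Gauge	resourcegroup	Number of namespaces used by resources in a ResourceGroup
`config_sync_rg_reconcile_duration_seconds_bucket`	Histogram	stallreason	Time distribution of reconciling a ResourceGroup CR (distributed into buckets by duration)
`config_sync_rg_reconcile_duration_seconds_count`	Histogram	stallreason	Time distribution of reconciling a ResourceGroup CR (ignoring duration)
`config_sync_rg_reconcile_duration_seconds_sum`	Histogram	stallreason	Sum of all time spent reconciling a ResourceGroup CR
`config_sync_kustomize_build_latency_bucket`	Histogram		Latency distribution of `kustomize build` execution time (distributed into buckets by the duration of each operation)
`config_sync_kustomize_build_latency_count`	Histogram		Latency distribution of `kustomize build` execution time (ignoring duration)
`config_sync_kustomize_build_latency_sum`	Histogram		Sum of all `kustomize build` execution times
`config_sync_kustomize_ordered_top_tier_metrics`	Gauge	top_tier_field	Usage of Resources, Generators, SecretGenerator, ConfigMapGenerator, Transformers, and Validators
`config_sync_kustomize_builtin_transformers`	Gauge	k8s_builtin_transformer	Usage of built-in transformers related to Kubernetes object metadata
`config_sync_kustomize_resource_count`	Gauge		Number of resources output by `kustomize build`
`config_sync_kustomize_field_count`	Gauge	field_name	Number of times a particular field is used in the kustomization files
`config_sync_kustomize_patch_count`	Gauge	patch_field	Number of patches in the `patches`, `patchesStrategicMerge`, and `patchesJson6902` fields
`config_sync_kustomize_base_count`	Gauge	base_source	Number of remote and local bases
`kustomize_deprecating_field_count`	Gauge	deprecating_field	Usage of fields that might become deprecated
`kustomize_simplification_adoption_count`	Gauge	simplification_field	Usage of the simplification transformers images, replicas, and replacements
`kustomize_helm_inflator_count`	Gauge	helm_inflator	Usage of Helm in Kustomize, whether through the built-in fields or the custom function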
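
The prefix filter mentioned before the table can also be applied outside the Prometheus UI. The sketch below reads metrics in the Prometheus text exposition format on standard input and prints the unique metric names that start with a given prefix. The function name list_metric_names is hypothetical, and the usage comment assumes that the metrics are served at the /metrics path on port 8675, which this page does not state.

```shell
# Hypothetical helper: list unique metric names with a given prefix.
# Reads Prometheus text-format metrics on stdin.
#   $1 = name prefix (defaults to config_sync_)
list_metric_names() {
  local prefix="${1:-config_sync_}"
  # Drop comment lines (# HELP / # TYPE), strip labels and sample values,
  # keep names that start with the prefix, then de-duplicate.
  grep -v '^#' \
    | sed 's/[{ ].*$//' \
    | grep "^${prefix}" \
    | sort -u
}

# Example (assumes a reachable scrape endpoint; the address is a placeholder):
#   curl -s http://COLLECTOR_ADDRESS:8675/metrics | list_metric_names
```

Comparing this output with the table is one way to spot metrics that have been added or renamed since the table was written.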

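Example debugging procedures for Prometheus

The following examples illustrate patterns for using Prometheus metrics, object status fields, and object annotations to detect and diagnose problems related to Config Sync. They show how to start with high-level monitoring that detects a problem, and then progressively narrow the search to find the root cause.

Query configs by status

The reconciler process provides high-level metrics that give an overall view of how Config Sync is operating on the cluster. You can see whether any errors have occurred, and you can set up alerts for them.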

```
config_sync_reconciler_errors
```

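Query metrics by reconciler

If you use the Config Sync RootSync and RepoSync APIs, you can monitor the RootSync and RepoSync objects. These objects are instrumented with high-level metrics that show how Config Sync is operating on the cluster. Almost all metrics are tagged with the reconciler name, so you can see whether any errors have occurred and set up alerts for them in Prometheus.

For filtering, see the full list of available metric labels.

In Prometheus, you can use the following filters for a RootSync or RepoSync: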

```
# Querying RootSync
config_sync_reconciler_errors{configsync_sync_name="ROOT_SYNC_NAME"}

# Querying RepoSync
config_sync_reconciler_errors{configsync_sync_name="REPO_SYNC_NAME"}
```

Query import and sync operations by status

In Prometheus, you can use the following queries:

```
# Check for errors that occurred when sourcing configs.
config_sync_reconciler_errors{component="source"}

# Check for errors that occurred when syncing configs to the cluster.
config_sync_reconciler_errors{component="sync"}
```

You can also check the metrics for the source and sync processes themselves:

```
config_sync_parser_duration_seconds{status="error"}
config_sync_apply_duration_seconds{status="error"}
config_sync_remediate_duration_seconds{status="error"}
```
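
Each of these duration metrics is a histogram, so Prometheus stores it as the _bucket, _count, and _sum series listed in the metrics table. A common PromQL pattern for the average duration over a time window is rate(_sum) divided by rate(_count). The sketch below only builds that expression as a string. The function name histogram_avg_query and the 5m default window are illustrative assumptions, not part of Config Sync.

```shell
# Hypothetical helper: print a PromQL expression for the average duration of a
# Prometheus histogram over a window, computed as rate(_sum) / rate(_count).
#   $1 = histogram base name (without _sum/_count)
#   $2 = label matchers to place inside the braces (may be empty)
#   $3 = range window (defaults to 5m)
histogram_avg_query() {
  local base="$1"
  local selector="$2"
  local window="${3:-5m}"
  printf 'rate(%s_sum{%s}[%s]) / rate(%s_count{%s}[%s])\n' \
    "$base" "$selector" "$window" "$base" "$selector" "$window"
}

# Example:
#   histogram_avg_query config_sync_apply_duration_seconds 'status="error"'
```

The printed expression can be pasted into the Prometheus expression browser, or adapted for an alerting rule.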

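Monitor resources with Google Cloud Managed Service for Prometheus

Google Cloud Managed Service for Prometheus is Google Cloud's fully managed, multi-cloud solution for Prometheus metrics. It supports two modes of data collection: managed collection (the recommended mode) and self-deployed data collection. Complete the following steps to set up Config Sync monitoring with Google Cloud Managed Service for Prometheus in managed collection mode.

Enable Managed Prometheus on your cluster by following the instructions in "Set up managed collection".

Save the following sample manifest as pod-monitoring-config-sync-monitoring.yaml. This manifest configures a PodMonitoring resource to scrape Config Sync metrics on port 8675 of the otel-collector-* Pod in the config-management-monitoring namespace. The PodMonitoring resource uses a Kubernetes label selector to find the otel-collector-* Pod.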

```
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: config-sync-monitoring
  namespace: config-management-monitoring
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-collector
  endpoints:
  - port: 8675
    interval: 10s
```

Apply the manifest to the cluster:

```
kubectl apply -f pod-monitoring-config-sync-monitoring.yaml
```

Verify that your Prometheus data is being exported by using the Cloud Monitoring Metrics Explorer page in the Google Cloud console, following the instructions in "Managed Service for Prometheus data in Cloud Monitoring".

What's next

Use Prometheus alerting rules with Config Sync SLIs