Node Exporter

本文件說明如何設定 Google Kubernetes Engine 部署作業,以便使用 Google Cloud Managed Service for Prometheus 收集 Node Exporter 的指標。本文件說明如何執行下列操作:

  • 設定 Node Exporter 以回報指標。
  • 為 Managed Service for Prometheus 設定 PodMonitoring 資源,以便收集匯出的指標。
  • 在 Cloud Monitoring 中存取資訊主頁,查看指標。
  • 設定快訊規則來監控指標。

只有在您將 代管收集作業與 Managed Service for Prometheus 搭配使用時,才適用這些操作說明。如果您使用的是自行部署的收集,請參閱 Node Exporter 的來源存放區,瞭解安裝資訊。

這些操作說明僅供參考,應可在大多數 Kubernetes 環境中運作。如果您因安全性限制或機構政策而無法順利安裝應用程式或匯出程式,建議您參閱開放原始碼文件尋求支援。

事前準備

如要使用 Managed Service for Prometheus 和代管收集作業,從 Node Exporter 收集指標,您的部署作業必須符合下列規定:

  • 您的叢集必須執行 Google Kubernetes Engine 1.21.4-gke.300 以上版本。
  • 您必須在啟用代管收集作業的情況下,執行 Managed Service for Prometheus。詳情請參閱「 開始使用代管集合」。

  • 如要使用 Cloud Monitoring 提供的整合資訊主頁,您必須使用 node_exporter 1.3.1 以上版本。

    如要進一步瞭解可用的資訊主頁,請參閱「安裝資訊主頁」。

這些指標已為 GKE Autopilot 叢集中的節點啟用。

安裝 Node Exporter

您可以使用下列設定安裝 Node Exporter:

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: gmp-public
  name: node-exporter
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/version: 1.8.2
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-exporter
        app.kubernetes.io/version: 1.8.2
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - arm64
                - amd64
              - key: kubernetes.io/os
                operator: In
                values:
                - linux
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:v1.8.2
        args:
        - --web.listen-address=:8080
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.wifi
        - --no-collector.hwmon
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*)$
        - --collector.netdev.device-exclude=^(veth.*)$
        ports:
        - name: metrics
          containerPort: 8080
        resources:
          limits:
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        volumeMounts:
        - mountPath: /host/sys
          mountPropagation: HostToContainer
          name: sys
          readOnly: true
        - mountPath: /host/root
          mountPropagation: HostToContainer
          name: root
          readOnly: true
      hostNetwork: true
      hostPID: true
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      volumes:
      - hostPath:
          path: /sys
        name: sys
      - hostPath:
          path: /
        name: root
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  namespace: gmp-public
  name: node-exporter
  labels:
    app.kubernetes.io/name: node-exporter
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter
  endpoints:
  - port: metrics
    interval: 30s

如要套用本機檔案中的設定變更,請執行下列指令:

kubectl apply -f FILE_NAME

您也可以使用 Terraform 管理設定。

定義規則和快訊

您可以使用下列 Rules 設定,針對指標定義警示:

# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: gmp-public
  name: node-exporter-rules
  labels:
    app.kubernetes.io/component: rules
    app.kubernetes.io/name: node-exporter-rules
    app.kubernetes.io/part-of: google-cloud-managed-prometheus
spec:
  groups:
  - name: node-exporter
    interval: 30s
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 24 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 15
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24 * 60 * 60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up fast.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup
        summary: Filesystem is predicted to run out of space within the next 4 hours.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 10
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
        summary: Filesystem has less than 5% space left.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 30m
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
        summary: Filesystem has less than 3% space left.
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 30m
      labels:
        severity: critical
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left and is filling
          up.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup
        summary: Filesystem is predicted to run out of inodes within the next 24 hours.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left and is filling
          up fast.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup
        summary: Filesystem is predicted to run out of inodes within the next 4 hours.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 5% inodes left.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
        summary: Filesystem has less than 3% inodes left.
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeNetworkReceiveErrs
      annotations:
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
          {{ printf "%.0f" $value }} receive errors in the last two minutes.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworkreceiveerrs
        summary: Network interface is reporting many receive errors.
      expr: |
        rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
      for: 1h
      labels:
        severity: warning
    - alert: NodeNetworkTransmitErrs
      annotations:
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
          {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworktransmiterrs
        summary: Network interface is reporting many transmit errors.
      expr: |
        rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
      for: 1h
      labels:
        severity: warning
    - alert: NodeHighNumberConntrackEntriesUsed
      annotations:
        description: '{{ $value | humanizePercentage }} of conntrack entries are used.'
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodehighnumberconntrackentriesused
        summary: Number of conntrack are getting close to the limit.
      expr: |
        (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75
      labels:
        severity: warning
    - alert: NodeTextFileCollectorScrapeError
      annotations:
        description: Node Exporter text file collector failed to scrape.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodetextfilecollectorscrapeerror
        summary: Node Exporter text file collector failed to scrape.
      expr: |
        node_textfile_scrape_error{job="node-exporter"} == 1
      labels:
        severity: warning
    - alert: NodeClockSkewDetected
      annotations:
        description: Clock on {{ $labels.instance }} is out of sync by more than 300s.
          Ensure NTP is configured correctly on this host.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodeclockskewdetected
        summary: Clock skew detected.
      expr: |
        (
          node_timex_offset_seconds{job="node-exporter"} > 0.05
        and
          deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) >= 0
        )
        or
        (
          node_timex_offset_seconds{job="node-exporter"} < -0.05
        and
          deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) <= 0
        )
      for: 10m
      labels:
        severity: warning
    - alert: NodeClockNotSynchronising
      annotations:
        description: Clock on {{ $labels.instance }} is not synchronising. Ensure
          NTP is configured on this host.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodeclocknotsynchronising
        summary: Clock not synchronising.
      expr: |
        min_over_time(node_timex_sync_status{job="node-exporter"}[5m]) == 0
        and
        node_timex_maxerror_seconds{job="node-exporter"} >= 16
      for: 10m
      labels:
        severity: warning
    - alert: NodeRAIDDegraded
      annotations:
        description: RAID array '{{ $labels.device }}' on {{ $labels.instance }} is
          in degraded state due to one or more disks failures. Number of spare drives
          is insufficient to fix issue automatically.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/noderaiddegraded
        summary: RAID Array is degraded
      expr: |
        node_md_disks_required{job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"} - ignoring (state) (node_md_disks{state="active",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}) > 0
      for: 15m
      labels:
        severity: critical
    - alert: NodeRAIDDiskFailure
      annotations:
        description: At least one device in RAID array on {{ $labels.instance }} failed.
          Array '{{ $labels.device }}' needs attention and possibly a disk swap.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/noderaiddiskfailure
        summary: Failed device in RAID array
      expr: |
        node_md_disks{state="failed",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"} > 0
      labels:
        severity: warning
    - alert: NodeFileDescriptorLimit
      annotations:
        description: File descriptors limit at {{ $labels.instance }} is currently
          at {{ printf "%.2f" $value }}%.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit
        summary: Kernel is predicted to exhaust file descriptors limit soon.
      expr: |
        (
          node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 70
        )
      for: 15m
      labels:
        severity: warning
    - alert: NodeFileDescriptorLimit
      annotations:
        description: File descriptors limit at {{ $labels.instance }} is currently
          at {{ printf "%.2f" $value }}%.
        runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit
        summary: Kernel is predicted to exhaust file descriptors limit soon.
      expr: |
        (
          node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 90
        )
      for: 15m
      labels:
        severity: critical
  - name: node-exporter.rules
    interval: 30s
    rules:
    - expr: |
        count without (cpu, mode) (
          node_cpu_seconds_total{job="node-exporter",mode="idle"}
        )
      record: instance:node_num_cpu:sum
    - expr: |
        1 - avg without (cpu) (
          sum without (mode) (rate(node_cpu_seconds_total{job="node-exporter", mode=~"idle|iowait|steal"}[5m]))
        )
      record: instance:node_cpu_utilisation:rate5m
    - expr: |
        (
          node_load1{job="node-exporter"}
        /
          instance:node_num_cpu:sum{job="node-exporter"}
        )
      record: instance:node_load1_per_cpu:ratio
    - expr: |
        1 - (
          (
            node_memory_MemAvailable_bytes{job="node-exporter"}
            or
            (
              node_memory_Buffers_bytes{job="node-exporter"}
              +
              node_memory_Cached_bytes{job="node-exporter"}
              +
              node_memory_MemFree_bytes{job="node-exporter"}
              +
              node_memory_Slab_bytes{job="node-exporter"}
            )
          )
        /
          node_memory_MemTotal_bytes{job="node-exporter"}
        )
      record: instance:node_memory_utilisation:ratio
    - expr: |
        rate(node_vmstat_pgmajfault{job="node-exporter"}[5m])
      record: instance:node_vmstat_pgmajfault:rate5m
    - expr: |
        rate(node_disk_io_time_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}[5m])
      record: instance_device:node_disk_io_time_seconds:rate5m
    - expr: |
        rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+)"}[5m])
      record: instance_device:node_disk_io_time_weighted_seconds:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_receive_bytes_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_bytes_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_transmit_bytes_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_receive_drop_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_receive_drop_excluding_lo:rate5m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_drop_total{job="node-exporter", device!="lo"}[5m])
        )
      record: instance:node_network_transmit_drop_excluding_lo:rate5m

如要套用本機檔案中的設定變更,請執行下列指令:

kubectl apply -f FILE_NAME

您也可以使用 Terraform 管理設定。

如要進一步瞭解如何將規則套用至叢集,請參閱「受管理的規則評估和快訊」。

這項 Rules 設定是根據 kube-prometheus 存放區提供的規則和快訊所調整。

驗證設定

您可以使用 Metrics Explorer 驗證是否已正確設定 Node Exporter。Cloud Monitoring 可能需要一或兩分鐘的時間才能擷取指標。

如要確認已擷取指標,請執行下列操作:

  1. 前往 Google Cloud 控制台的 「Metrics Explorer」頁面:

    前往 Metrics Explorer

    如果您是使用搜尋列尋找這個頁面,請選取子標題為「Monitoring」的結果

  2. 在查詢建構工具窗格的工具列中,選取名稱為  MQL PromQL 的按鈕。
  3. 確認「Language」切換鈕中已選取「PromQL」。語言切換鈕位於可讓您設定查詢格式的工具列中。
  4. 輸入並執行以下查詢:
    up{job="node-exporter", cluster="CLUSTER_NAME", namespace="gmp-public"}
    

安裝資訊主頁

Cloud Monitoring 提供範例資訊主頁程式庫,可用於整合。範例程式庫包含「Prometheus」資訊主頁,您可以安裝該資訊主頁,在 Google Cloud 主控台中查看資料。

請注意,Kubernetes 叢集 Prometheus 總覽資訊主頁需要安裝 Kube State Metrics。如要使用 Kubernetes Pod Prometheus 總覽資訊主頁,必須先安裝 Kube State MetricsKubelet/cAdvisor

如要從樣本資料庫安裝資訊主頁,請按照下列步驟操作:

  1. 在 Google Cloud 控制台中,前往「Dashboards」(資訊主頁) 頁面:

    前往「Dashboards」(資訊主頁)

    如果您是使用搜尋列尋找這個頁面,請選取子標題為「Monitoring」的結果

  2. 選取「Sample Library」分頁標籤。
  3. 選擇「其他」類別。
  4. (選用) 如要查看資訊主頁的靜態預覽畫面,但不必安裝資訊主頁,請按一下「預覽」
  5. 選取要安裝的資訊主頁,然後按一下 「匯入」

如要進一步瞭解如何安裝資訊主頁,請參閱「 安裝範例資訊主頁」。

疑難排解

如要瞭解如何排解指標攝入問題,請參閱「 排解攝入端問題」一文中的「 從匯出工具收集資料時發生的問題」。