Google Distributed Cloud includes multiple options for cluster logging and monitoring, including cloud-based managed services, open source tools, and validated compatibility with third-party commercial solutions. This page explains these options and provides some basic guidance on selecting the proper solution for your environment.
Options for Google Distributed Cloud
You have several logging and monitoring options for your Google Distributed Cloud clusters:
- Cloud Logging and Cloud Monitoring, enabled by default on Bare Metal system components.
- Prometheus and Grafana, available from the Cloud Marketplace.
- Validated configurations with third-party solutions.
Cloud Logging and Cloud Monitoring
Google Cloud Observability is the built-in observability solution for Google Cloud. It offers a fully managed logging solution, metrics collection, monitoring, dashboarding, and alerting. Cloud Monitoring monitors Google Distributed Cloud clusters in much the same way as it monitors cloud-based GKE clusters.
The logging and monitoring agents can be configured at one of two levels:
- System components only (default).
- System components and applications.
Logging and Monitoring provide a single, easy-to-configure, powerful cloud-based observability solution. We highly recommend Logging and Monitoring when running workloads only on Google Distributed Cloud, or workloads on GKE and Google Distributed Cloud. For applications with components running on Google Distributed Cloud and traditional on-premises infrastructure, you might consider other solutions for an end-to-end view of those applications.
For details about architecture, configuration, and what data is replicated to your Google Cloud project by default, see How Logging and Monitoring for Google Distributed Cloud works.
For more information about Logging, see the Cloud Logging documentation.
For more information about Monitoring, see the Cloud Monitoring documentation.
Prometheus and Grafana
Prometheus and Grafana are two popular open source monitoring products available in the Cloud Marketplace:
- Prometheus collects application and system metrics.
- Alertmanager sends alerts through several different alerting mechanisms.
- Grafana is a dashboarding tool.
Prometheus and Grafana can be enabled on each admin cluster and user cluster. Prometheus and Grafana are recommended for application teams with prior experience with those products. These products are also recommended for operational teams who prefer to retain application metrics within the cluster and who need to troubleshoot issues when network connectivity is lost.
Third-party solutions
Google has worked with several third-party logging and monitoring solution providers to help their products work well with Google Distributed Cloud. These include Datadog, Elastic, and Splunk. Additional validated third parties will be added in the future.
The following solution guides are available for using third-party solutions with Google Distributed Cloud:
- Monitoring Google Distributed Cloud with the Elastic Stack
- Collect logs on Google Distributed Cloud with Splunk Connect
How Logging and Monitoring for Google Distributed Cloud works
Cloud Logging and Cloud Monitoring are installed and activated in each cluster when you create a new admin or user cluster.
The Stackdriver agents include several components on each cluster:
- Stackdriver Operator (stackdriver-operator-*). Manages the lifecycle for all other Stackdriver agents deployed onto the cluster.
- Stackdriver Custom Resource. A resource that is automatically created as part of the Google Distributed Cloud installation process.
- GKE Metrics Agent (gke-metrics-agent-*). An OpenTelemetry Collector based DaemonSet that scrapes metrics on each node and sends them to Cloud Monitoring. A node-exporter DaemonSet and a kube-state-metrics deployment are also included to provide more metrics about the cluster.
- Stackdriver Log Forwarder (stackdriver-log-forwarder-*). A Fluent Bit DaemonSet that forwards logs from each machine to Cloud Logging. The Log Forwarder buffers log entries locally on the node and re-sends them for up to 4 hours. If the buffer gets full or if the Log Forwarder can't reach the Cloud Logging API for more than 4 hours, logs are dropped.
- Anthos Metadata Agent (stackdriver-metadata-agent-). A deployment that sends metadata for Kubernetes resources such as pods, deployments, or nodes to the Config Monitoring for Ops API. This data enriches metric queries by letting you query by deployment name, node name, or even Kubernetes service name.
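The Log Forwarder's buffering behavior described above can be pictured with a toy model. This is only a sketch, not Fluent Bit's actual logic: the `RetryBuffer` class, its `capacity` parameter, and the choice to evict the oldest entry when the buffer is full are all assumptions made for illustration. Only the 4-hour retry window comes from the text.

```python
from collections import deque

MAX_AGE_SECONDS = 4 * 60 * 60  # retry window stated in the text above


class RetryBuffer:
    """Toy model of a bounded, age-limited retry buffer (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity  # hypothetical limit on buffered entries
        self.entries = deque()    # (timestamp_seconds, message), oldest first
        self.dropped = 0

    def add(self, now, message):
        # Assumption: when the buffer is full, the oldest entry is dropped.
        if len(self.entries) >= self.capacity:
            self.entries.popleft()
            self.dropped += 1
        self.entries.append((now, message))

    def expire(self, now):
        # Drop entries that have waited longer than the retry window.
        while self.entries and now - self.entries[0][0] > MAX_AGE_SECONDS:
            self.entries.popleft()
            self.dropped += 1

    def flush(self, now, api_reachable):
        """Expire old entries, then return the messages sent on this attempt."""
        self.expire(now)
        if not api_reachable:
            return []
        sent = [message for _, message in self.entries]
        self.entries.clear()
        return sent
```

In this model, an outage shorter than 4 hours loses nothing as long as the buffer does not fill, while a longer outage loses every entry older than the window.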
To list the Stackdriver agents installed on the cluster, run the following command:
kubectl -n kube-system get pods -l "managed-by=stackdriver"
The output of this command is similar to the following:
kube-system gke-metrics-agent-4th8r 1/1 Running 1 (40h ago) 40h
kube-system gke-metrics-agent-8lt4s 1/1 Running 1 (40h ago) 40h
kube-system gke-metrics-agent-dhxld 1/1 Running 1 (40h ago) 40h
kube-system gke-metrics-agent-lbkl2 1/1 Running 1 (40h ago) 40h
kube-system gke-metrics-agent-pblfk 1/1 Running 1 (40h ago) 40h
kube-system gke-metrics-agent-qfwft 1/1 Running 1 (40h ago) 40h
kube-system kube-state-metrics-9948b86dd-6chhh 1/1 Running 1 (40h ago) 40h
kube-system node-exporter-5s4pg 1/1 Running 1 (40h ago) 40h
kube-system node-exporter-d9gwv 1/1 Running 2 (40h ago) 40h
kube-system node-exporter-fhbql 1/1 Running 1 (40h ago) 40h
kube-system node-exporter-gzf8t 1/1 Running 1 (40h ago) 40h
kube-system node-exporter-tsrpp 1/1 Running 1 (40h ago) 40h
kube-system node-exporter-xzww7 1/1 Running 1 (40h ago) 40h
kube-system stackdriver-log-forwarder-8lwxh 1/1 Running 1 (40h ago) 40h
kube-system stackdriver-log-forwarder-f7cgf 1/1 Running 2 (40h ago) 40h
kube-system stackdriver-log-forwarder-fl5gf 1/1 Running 1 (40h ago) 40h
kube-system stackdriver-log-forwarder-q5lq8 1/1 Running 2 (40h ago) 40h
kube-system stackdriver-log-forwarder-www4b 1/1 Running 1 (40h ago) 40h
kube-system stackdriver-log-forwarder-xqgjc 1/1 Running 1 (40h ago) 40h
kube-system stackdriver-metadata-agent-cluster-level-5bb5b6d6bc-z9rx7 1/1 Running 1 (40h ago) 40h
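If you want to check this listing from a script rather than by eye, one option is to add `-o json` to the same kubectl command and inspect the result. The following is a minimal sketch under that assumption. The function name `unready_pods` is hypothetical, and it reads only the standard Pod fields `status.phase` and `status.containerStatuses[].ready`.

```python
import json


def unready_pods(pod_list_json):
    """Return names of pods that are not Running with every container ready.

    pod_list_json is the text produced by `kubectl get pods -o json`.
    """
    problems = []
    for pod in json.loads(pod_list_json).get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        # A pod with no reported container statuses is treated as not ready.
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if status.get("phase") != "Running" or not all_ready:
            problems.append(pod["metadata"]["name"])
    return problems
```

An empty list means that every agent pod reported as Running with all containers ready at the time of the query.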
Cloud Monitoring metrics
For a list of metrics collected by Cloud Monitoring, see View Google Distributed Cloud metrics.
Configuring Stackdriver agents for Google Distributed Cloud
The Stackdriver agents installed with Google Distributed Cloud collect data about system components for the purposes of maintaining and troubleshooting issues with your clusters. The following sections describe Stackdriver configuration and operating modes.
System Components Only (Default Mode)
Upon installation, Stackdriver agents are configured by default to collect logs and metrics, including performance details (for example, CPU and memory utilization), and similar metadata, for Google-provided system components. These include all workloads in the admin cluster, and for user clusters, workloads in the kube-system, gke-system, gke-connect, istio-system, and config-management-system namespaces.
System Components and Applications
To enable application logging and monitoring on top of the default mode, follow the steps in Enable application logging and monitoring.
Overriding the default CPU and memory requests and limits for a Stackdriver component
Clusters with high pod density introduce higher logging and monitoring overhead. In extreme cases, Stackdriver components might run close to their CPU and memory limits or might restart constantly because of those limits. In this case, to override the default values for CPU and memory requests and limits for a Stackdriver component, use the following steps:
Run the following command to open your Stackdriver custom resource in a command line editor:
kubectl -n kube-system edit stackdriver stackdriver
In the Stackdriver custom resource, add the resourceAttrOverride section under the spec field:
resourceAttrOverride:
  DAEMONSET_OR_DEPLOYMENT_NAME/CONTAINER_NAME:
    LIMITS_OR_REQUESTS:
      RESOURCE: RESOURCE_QUANTITY
Note that the resourceAttrOverride section overrides all existing default limits and requests for the component you specify. The following components are supported by resourceAttrOverride:
- gke-metrics-agent/gke-metrics-agent
- stackdriver-log-forwarder/stackdriver-log-forwarder
- stackdriver-metadata-agent-cluster-level/metadata-agent
- node-exporter/node-exporter
- kube-state-metrics/kube-state-metrics
An example file looks like the following:
apiVersion: addons.gke.io/v1alpha1
kind: Stackdriver
metadata:
  name: stackdriver
  namespace: kube-system
spec:
  anthosDistribution: baremetal
  projectID: my-project
  clusterName: my-cluster
  clusterLocation: us-west-1a
  resourceAttrOverride:
    gke-metrics-agent/gke-metrics-agent:
      requests:
        cpu: 110m
        memory: 240Mi
      limits:
        cpu: 200m
        memory: 4.5Gi
To save changes to the Stackdriver custom resource, save and quit your command line editor.
Check the health of your Pod:
kubectl -n kube-system get pods -l "managed-by=stackdriver"
A response for a healthy Pod looks like the following:
gke-metrics-agent-4th8r 1/1 Running 1 40h
Check the Pod spec of the component to make sure the resources are set correctly.
kubectl -n kube-system describe pod POD_NAME
Replace POD_NAME with the name of the Pod you just changed. For example, gke-metrics-agent-4th8r.
The response looks like the following:
Name: gke-metrics-agent-4th8r
Namespace: kube-system
...
Containers:
  gke-metrics-agent:
    Limits:
      cpu: 200m
      memory: 4.5Gi
    Requests:
      cpu: 110m
      memory: 240Mi
    ...
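The shape of the resourceAttrOverride section can also be expressed in code, which makes it easy to reject a component name that isn't in the supported list before you edit the resource. The following is a minimal sketch: the helper name `build_override_patch` is hypothetical, and the values come from the example file above.

```python
import json

# Component keys listed as supported by resourceAttrOverride in this document.
SUPPORTED_COMPONENTS = {
    "gke-metrics-agent/gke-metrics-agent",
    "stackdriver-log-forwarder/stackdriver-log-forwarder",
    "stackdriver-metadata-agent-cluster-level/metadata-agent",
    "node-exporter/node-exporter",
    "kube-state-metrics/kube-state-metrics",
}


def build_override_patch(component, requests, limits):
    """Build a spec fragment that sets requests and limits for one component."""
    if component not in SUPPORTED_COMPONENTS:
        raise ValueError(f"unsupported component: {component}")
    # Both blocks are written because the override replaces all defaults
    # for the named component.
    override = {"requests": dict(requests), "limits": dict(limits)}
    return {"spec": {"resourceAttrOverride": {component: override}}}


patch = build_override_patch(
    "gke-metrics-agent/gke-metrics-agent",
    requests={"cpu": "110m", "memory": "240Mi"},
    limits={"cpu": "200m", "memory": "4.5Gi"},
)
print(json.dumps(patch, indent=2))
```

The printed JSON mirrors the spec fragment in the example file. Applying it with `kubectl patch --type merge` instead of the interactive editor is a possible alternative, but that workflow is not described on this page and should be treated as an untested assumption.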
Metrics Server
Metrics Server is the source of the container resource metrics for various autoscaling pipelines. Metrics Server retrieves metrics from kubelets and exposes them through the Kubernetes Metrics API. HPA and VPA then use these metrics to determine when to trigger autoscaling. Metrics Server is scaled using addon-resizer.
In extreme cases where high pod density creates too much logging and monitoring overhead, Metrics Server might be stopped and restarted due to resource limitations. In this case, you can allocate more resources to Metrics Server by editing the metrics-server-config ConfigMap in the kube-system namespace and changing the values for cpuPerNode and memoryPerNode.
kubectl edit cm metrics-server-config -n kube-system
The example content of the ConfigMap is:
apiVersion: v1
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
    cpuPerNode: 3m
    memoryPerNode: 20Mi
kind: ConfigMap
After updating the ConfigMap, recreate the metrics-server pods with the following command:
kubectl delete pod -l k8s-app=metrics-server -n kube-system
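To get a feel for how cpuPerNode and memoryPerNode affect Metrics Server, it helps to know that addon-resizer sizes its target roughly linearly with node count: a base amount plus a per-node amount. The sketch below assumes that simplified formula and ignores addon-resizer's thresholds for when it actually applies a change. The function name and the base values in the usage example are hypothetical, because the base settings are not shown on this page.

```python
def estimated_resources(node_count, base_cpu_m, cpu_per_node_m,
                        base_memory_mi, memory_per_node_mi):
    """Estimate Metrics Server resources as base + per-node * node count.

    CPU values are in millicores and memory values in MiB. The base values
    are assumptions supplied by the caller.
    """
    cpu_m = base_cpu_m + cpu_per_node_m * node_count
    memory_mi = base_memory_mi + memory_per_node_mi * node_count
    return f"{cpu_m}m", f"{memory_mi}Mi"
```

For example, with the per-node values from the ConfigMap above (3m and 20Mi), a 50-node cluster, and assumed bases of 40m and 40Mi, `estimated_resources(50, 40, 3, 40, 20)` returns `("190m", "1040Mi")`.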
Configuration requirements for Logging and Monitoring
There are several configuration requirements to enable Cloud Logging and Cloud Monitoring with Google Distributed Cloud. These steps are included in Configuring a service account for use with Logging and Monitoring on the Enabling Google services page, and in the following list:
- A Cloud Monitoring Workspace must be created within the Google Cloud project. To create it, click Monitoring in the Google Cloud console and follow the workflow.
- You need to enable the following Stackdriver APIs:
- You need to assign the following IAM roles to the service account used by the Stackdriver agents:
  - logging.logWriter
  - monitoring.metricWriter
  - stackdriver.resourceMetadata.writer
  - monitoring.dashboardEditor
  - opsconfigmonitoring.resourceMetadata.writer
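One way to grant the roles listed above is a series of `gcloud projects add-iam-policy-binding` calls, one per role. The following sketch only builds the argument lists so that you can review them before running anything. The function name is hypothetical, the `roles/` prefix is added on the assumption that the list above gives role IDs without it, and the project ID and service account email are placeholders you supply.

```python
# Role IDs as listed above, without the "roles/" prefix.
REQUIRED_ROLES = [
    "logging.logWriter",
    "monitoring.metricWriter",
    "stackdriver.resourceMetadata.writer",
    "monitoring.dashboardEditor",
    "opsconfigmonitoring.resourceMetadata.writer",
]


def iam_binding_commands(project_id, service_account_email):
    """Return one gcloud argument list per required role."""
    return [
        [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member=serviceAccount:{service_account_email}",
            f"--role=roles/{role}",
        ]
        for role in REQUIRED_ROLES
    ]
```

Each returned list can be passed to `subprocess.run(args, check=True)` or joined with spaces and pasted into a shell.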
Pricing
There is no charge for Anthos system logs and metrics.
In a Google Distributed Cloud cluster, Anthos system logs and metrics include the following:
- Logs and metrics from all components in an admin cluster
- Logs and metrics from components in these namespaces in a user cluster: kube-system, gke-system, gke-connect, knative-serving, istio-system, monitoring-system, config-management-system, gatekeeper-system, cnrm-system
For more information, see Pricing for Google Cloud Observability.
To learn about credit for Cloud Logging metrics, contact sales.