Dataproc is a fully managed and highly scalable service for running open-source distributed processing platforms such as Apache Hadoop, Apache Spark, Apache Flink, and Trino. You can use the tools and files discussed in the following sections to investigate, troubleshoot, and monitor your Dataproc clusters and jobs.
AI-powered Investigations with Gemini Cloud Assist (Preview)
Overview
The Gemini Cloud Assist Investigations preview feature uses advanced Gemini capabilities to assist in creating and running Dataproc clusters and jobs. This feature analyzes failed clusters and failed and slow-running jobs to identify root causes and recommend fixes. It creates persistent analyses that you can review, save, and share with Google Cloud support to facilitate collaboration and accelerate issue resolution.
Features
Use this feature to create investigations from the Google Cloud console:
- Add a natural language context description to an issue before creating an investigation.
- Analyze failed clusters and slow and failed jobs.
- Get insights into issue root causes with recommended fixes.
- Create Google Cloud support cases with the full investigation context attached.
Before you begin
To get started with the Investigations feature, enable the Gemini Cloud Assist API in your Google Cloud project.
Create an investigation
To create an investigation, do the following:
1. In the Google Cloud console, go to the Cloud Assist Investigations page.
2. Click Create.
3. Describe the issue: Provide a description of the cluster or job issue.
4. Select time range: Provide a time range when the issue occurred (the default is 30 minutes).
5. Select resources:
   - Click Add resource.
   - In the Quick filters field, type "dataproc", and then select one or more of dataproc.Batch, dataproc.Job, or dataproc.Cluster as filters.
   - Select the listed batch, job, or cluster to investigate.
6. Click Create.
Interpret investigation results
Once an investigation is complete, the Investigation details page opens. This page contains the full Gemini analysis, which is organized into the following sections:
- Issue: A collapsed section containing auto-populated details of the job being investigated.
- Relevant Observations: A collapsed section that lists key data points and anomalies that Gemini found during its analysis of logs and metrics.
- Hypotheses: The primary section, which is expanded by default. It presents a list of potential root causes for the observed issue. Each hypothesis includes:
- Overview: A description of the possible cause, such as "High Shuffle Write Time and Potential Task Skew."
- Recommended Fixes: A list of actionable steps to address the potential issue.
Take action
After reviewing the hypotheses and recommendations:
Apply one or more of the suggested fixes to the job configuration or code, and then rerun the job.
Provide feedback on the helpfulness of the investigation by clicking the thumbs-up or thumbs-down icons at the top of the panel.
Review and escalate investigations
To review the results of a previously run investigation, click the investigation name on the Cloud Assist Investigations page to open the Investigation details page.
If you need further assistance, you can open a Google Cloud support case. This process provides the support engineer with the complete context of the previously performed investigation, including the observations and hypotheses generated by Gemini. This context sharing significantly reduces the back-and-forth communication required with the support team and leads to faster case resolution.
To create a support case from an investigation:
In the Investigation details page, click Request support.
Preview status and pricing
There is no charge for Gemini Cloud Assist investigations during public preview. Charges will apply to the feature when it becomes generally available (GA).
For more information about pricing after general availability, see Gemini Cloud Assist Pricing.
Open source web interfaces
Many Dataproc cluster open source components, such as Apache Hadoop and Apache Spark, provide web interfaces. These interfaces can be used to monitor cluster resources and job performance. For example, you can use the YARN Resource Manager UI to view YARN application resource allocation on a Dataproc cluster.
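As an illustration, one common way to make these web interfaces reachable from the Google Cloud console is to enable the Dataproc Component Gateway when you create a cluster. The following is a minimal sketch using the google-cloud-dataproc Python client; the project, region, and cluster name are hypothetical placeholders, and your cluster configuration will likely include additional settings.

```python
# Sketch: create a Dataproc cluster with Component Gateway enabled so that
# open source web UIs (such as the YARN Resource Manager UI) can be opened
# from the cluster's Web Interfaces tab in the Google Cloud console.
# Project, region, and cluster name below are placeholder values.
from google.cloud import dataproc_v1

project_id = "example-project"   # placeholder
region = "us-central1"           # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = dataproc_v1.Cluster(
    project_id=project_id,
    cluster_name="example-cluster",  # placeholder
    config=dataproc_v1.ClusterConfig(
        endpoint_config=dataproc_v1.EndpointConfig(enable_http_port_access=True),
    ),
)

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # blocks until the cluster is created
```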
Persistent History Server
Open source web interfaces running on a cluster are available when the cluster is running, but they terminate when you delete the cluster. To view cluster and job data after a cluster is deleted, you can create a Persistent History Server (PHS).
Example: You encounter a job error or slowdown that you want to analyze. You stop or delete the job cluster, then view and analyze job history data using your PHS.
After you create a PHS, you enable it on a Dataproc cluster or Google Cloud Serverless for Apache Spark batch workload when you create the cluster or submit the batch workload. A PHS can access history data for jobs run on multiple clusters, letting you monitor jobs across a project instead of monitoring separate UIs running on different clusters.
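As a hedged sketch of what pointing a job cluster at a PHS can look like, the following builds a SoftwareConfig whose Spark properties write event logs to the Cloud Storage location an existing PHS is assumed to read from. The bucket, paths, and property choices are assumptions for illustration; the properties you actually set depend on the components whose history you want to retain. You would pass this SoftwareConfig in the ClusterConfig when creating the job cluster, as in the cluster-creation sketch earlier.

```python
# Sketch: Spark history properties for a job cluster that feeds an existing
# Persistent History Server (PHS). Bucket and paths are hypothetical
# placeholders; include this SoftwareConfig in the cluster's ClusterConfig.
from google.cloud import dataproc_v1

phs_bucket = "gs://example-phs-bucket"  # placeholder PHS history bucket

software_config = dataproc_v1.SoftwareConfig(
    properties={
        # Write Spark event logs where the PHS is assumed to look for them.
        "spark:spark.eventLog.enabled": "true",
        "spark:spark.eventLog.dir": f"{phs_bucket}/example-job-cluster/spark-job-history",
        "spark:spark.history.fs.logDirectory": f"{phs_bucket}/*/spark-job-history",
    }
)
```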
Dataproc logs
Dataproc collects the logs generated by Apache Hadoop, Spark, Hive, ZooKeeper, and other open source systems running on your clusters, and sends them to Logging. These logs are grouped by log source, which lets you select and view the logs that interest you: for example, YARN NodeManager and Spark executor logs generated on a cluster are labeled separately. See Dataproc logs for more information about Dataproc log contents and options.
Cloud Logging
Logging is a fully managed, real-time log management system. It provides storage for logs ingested from Google Cloud services, and tools to search, filter, and analyze logs at scale. Dataproc clusters generate multiple logs, including Dataproc service agent logs, cluster startup logs, and OSS component logs, such as YARN NodeManager logs.
Logging is enabled by default on Dataproc clusters and Serverless for Apache Spark batch workloads. Logs are periodically exported to Logging, where they persist after the cluster is deleted or the workload is completed.
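As an illustration, the following sketch uses the google-cloud-logging Python client to read recent log entries for a single cluster. The project and cluster name are placeholders, and the log names available depend on the components running on your cluster.

```python
# Sketch: read recent Cloud Logging entries for one Dataproc cluster.
# Project and cluster name are placeholder values.
from google.cloud import logging

client = logging.Client(project="example-project")  # placeholder project

log_filter = (
    'resource.type="cloud_dataproc_cluster" '
    'AND resource.labels.cluster_name="example-cluster" '  # placeholder cluster
    'AND severity>=WARNING'
)

for entry in client.list_entries(filter_=log_filter, max_results=20):
    print(entry.timestamp, entry.log_name, entry.payload)
```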
Dataproc metrics
Dataproc cluster and job metrics, prefixed with dataproc.googleapis.com/, consist of time-series data that provide insights into the performance of a cluster, such as CPU utilization or job status. Dataproc custom metrics, prefixed with custom.googleapis.com/, include metrics emitted by open source systems running on the cluster, such as the YARN running applications metric. Gaining insight into Dataproc metrics can help you configure your clusters efficiently. Setting up metric-based alerts can help you recognize and respond to problems quickly.
Dataproc cluster and job metrics are collected by default at no charge. The collection of custom metrics is charged to customers; you can enable custom metric collection when you create a cluster. The collection of Spark metrics for Serverless for Apache Spark batch workloads is enabled by default.
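For example, the following sketch reads the last hour of one Dataproc cluster metric with the google-cloud-monitoring Python client. The project ID is a placeholder, and the metric type shown (YARN allocated memory percentage) is just one example of a dataproc.googleapis.com/ metric.

```python
# Sketch: read the last hour of a Dataproc cluster metric from Cloud Monitoring.
# The project ID is a placeholder; the metric type is one example of a
# dataproc.googleapis.com/ metric.
import time
from google.cloud import monitoring_v3

project_id = "example-project"  # placeholder
client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    end_time={"seconds": now},
    start_time={"seconds": now - 3600},  # last hour
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": (
            'metric.type = "dataproc.googleapis.com/cluster/yarn/allocated_memory_percentage"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    cluster_name = series.resource.labels.get("cluster_name", "")
    for point in series.points:
        print(cluster_name, point.interval.end_time, point.value.double_value)
```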
Cloud Monitoring
Monitoring uses cluster metadata and metrics, including HDFS, YARN, job, and operation metrics, to provide visibility into the health, performance, and availability of Dataproc clusters and jobs. You can use Monitoring to explore metrics, add charts, build dashboards, and create alerts.
Metrics Explorer
You can use the Metrics Explorer to view Dataproc metrics. Dataproc cluster, job, and Serverless for Apache Spark batch metrics are listed under the Cloud Dataproc Cluster, Cloud Dataproc Job, and Cloud Dataproc Batch resources. Dataproc custom metrics are listed under the VM Instances resource, in the Custom category.
Charts
You can use Metrics Explorer to create charts that visualize Dataproc metrics.
Example: You create a chart to see the number of active YARN applications running on your clusters, and then add a filter to select visualized metrics by cluster name or region.
Dashboards
You can build dashboards to monitor Dataproc clusters and jobs using metrics from multiple projects and different Google Cloud products. You can build a dashboard in the Google Cloud console from the Dashboards Overview page, or by creating a chart on the Metrics Explorer page and then saving it to a dashboard.
Alerts
You can create Dataproc metric alerts to receive timely notice of cluster or job issues.
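As a hedged sketch, the following uses the google-cloud-monitoring Python client to create an alert policy on one Dataproc cluster metric. The project ID, display names, metric, threshold, and duration are placeholder choices, and a production policy typically also attaches notification channels.

```python
# Sketch: create an alert policy that fires when YARN allocated memory stays
# above 90% for 15 minutes on any Dataproc cluster in the project.
# Project ID, names, metric, and threshold are placeholder choices.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

project_id = "example-project"  # placeholder
client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="YARN allocated memory above 90%",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'resource.type = "cloud_dataproc_cluster" AND '
            'metric.type = "dataproc.googleapis.com/cluster/yarn/allocated_memory_percentage"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.9,  # assumes the metric is reported as a fraction
        duration=duration_pb2.Duration(seconds=900),
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Dataproc YARN memory pressure",  # placeholder name
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
)

created = client.create_alert_policy(
    name=f"projects/{project_id}", alert_policy=policy
)
print(created.name)
```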
What's next
- Learn how to troubleshoot Dataproc error messages.
- Learn how to view Dataproc cluster diagnostic data.
- See Dataproc FAQ.