BigQuery Engine for Apache Flink metrics

You can view metrics charts on the Metrics tab of the deployment and job details pages for BigQuery Engine for Apache Flink in the Google Cloud console.

Deployment metrics:

CPU utilization
Memory utilization

Job metrics:

Number of records in per second
Number of records out per second
Input watermark
Output watermark
Autoscaling

Autoscaling metrics:

Current parallelism
Recommended parallelism
Maximum parallelism
Minimum parallelism

Support and limitations

To export custom metrics from your BigQuery Engine for Apache Flink job to Cloud Monitoring, the Managed Flink Default Workload Identity must have the Monitoring Metric Writer IAM role (roles/monitoring.metricWriter).

Access job metrics

Sign in to the Google Cloud console.
Select your Google Cloud project.
Open the navigation menu and select BigQuery Engine for Apache Flink.
Click Deployments or Jobs.
In the deployments or jobs list, click the name of the deployment or job that you want to view.
Click the Metrics tab.

To access additional information in the job metrics charts, click Explore data.

Use Cloud Monitoring

BigQuery Engine for Apache Flink is fully integrated with Cloud Monitoring. Use Metrics Explorer to build queries and adjust the timespan of the metrics.

For instructions about using Metrics Explorer, see Use Cloud Monitoring with BigQuery Engine for Apache Flink.
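If you prefer to query these metrics programmatically instead of in Metrics Explorer, the Cloud Monitoring API's time series list method takes a project name, a metric filter, and a time interval. The following is a minimal sketch that only builds such a request as a plain dictionary, in the shape accepted by `MetricServiceClient.list_time_series(request=...)` in the `google-cloud-monitoring` Python library. The metric type string and the project ID are hypothetical placeholders, because this page does not list the metric type names; copy the real one from Metrics Explorer.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical placeholder: replace with the metric type shown in Metrics Explorer.
METRIC_TYPE = "example.googleapis.com/flink/job/records_in_per_second"

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def _to_timestamp(ts: datetime) -> dict:
    """Convert a timezone-aware datetime to a {seconds, nanos} timestamp dict."""
    if ts.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    delta = ts - _EPOCH
    return {
        "seconds": delta.days * 86400 + delta.seconds,
        "nanos": delta.microseconds * 1000,
    }


def build_list_time_series_request(project_id: str, metric_type: str,
                                   start: datetime, end: datetime) -> dict:
    """Build a list-time-series request for one metric over [start, end]."""
    if end <= start:
        raise ValueError("end must be after start")
    return {
        "name": f"projects/{project_id}",
        "filter": f'metric.type = "{metric_type}"',
        "interval": {
            "start_time": _to_timestamp(start),
            "end_time": _to_timestamp(end),
        },
        # "view" is omitted; its default enum value (0) is FULL.
    }


if __name__ == "__main__":
    end = datetime.now(timezone.utc)
    # "my-project" is a hypothetical project ID.
    request = build_list_time_series_request(
        "my-project", METRIC_TYPE, end - timedelta(hours=1), end)
    print(request)
```

With the library installed and credentials configured, you could pass the dictionary to `list_time_series(request=request)` on a `monitoring_v3.MetricServiceClient`; that call needs network access, so it is left out of the sketch.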

View in Metrics Explorer

You can view the BigQuery Engine for Apache Flink metrics charts in Metrics Explorer, where you can build queries and adjust the timespan of the metrics.

To view a BigQuery Engine for Apache Flink chart in Metrics Explorer, go to the Metrics tab, open More chart options for that chart, and then click View in Metrics Explorer.

When you adjust the timespan of the metrics, you can select a predefined duration or select a custom time interval to analyze your job.
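The choice between a predefined duration and a custom interval can be expressed as a small helper that always yields a concrete `(start, end)` pair, for example to feed a Cloud Monitoring query. This is a hypothetical sketch: the preset names below are illustrative and are not necessarily the durations offered in the console.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical preset names, for illustration only.
PRESETS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "1d": timedelta(days=1),
    "7d": timedelta(days=7),
}


def resolve_timespan(preset: Optional[str] = None,
                     start: Optional[datetime] = None,
                     end: Optional[datetime] = None,
                     now: Optional[datetime] = None) -> tuple[datetime, datetime]:
    """Return (start, end) for either a predefined duration ending now,
    or a custom interval given by start and end."""
    if preset is not None:
        if start is not None or end is not None:
            raise ValueError("use either a preset or a custom interval, not both")
        if now is None:
            now = datetime.now(timezone.utc)
        return now - PRESETS[preset], now
    if start is None or end is None:
        raise ValueError("a custom interval needs both start and end")
    if end <= start:
        raise ValueError("end must be after start")
    return start, end
```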

Deployment metrics

The deployment details Metrics tab includes the following charts.

CPU utilization

CPU utilization is the amount of CPU used by the deployment at each point in time. Use this chart to track changes in CPU usage over time.

Memory utilization

Memory utilization is the amount of memory used by the deployment at each point in time. Use this chart to track changes in memory usage over time.

Job metrics

The job details Metrics tab includes the following charts. Use these metrics to monitor and debug your BigQuery Engine for Apache Flink jobs.

Number of records in per second

Number of records in per second is the number of records each operator in the job is receiving at each point in time. The data is split based on the operator, with each operator having a separate line on the graph.

This metric shows whether the job is running and processing records.

Refer to this chart when you don't see data in downstream systems or if you have a stale input watermark.
Use this metric to verify whether the job is ingesting records at the expected rate.
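To check ingestion against an expected rate outside the console, you could export the per-operator samples and compare each operator's average with a threshold. The sketch below assumes the samples are already available as a dictionary that maps an operator name to its records-in-per-second values; the operator names, the threshold, and the function name are hypothetical.

```python
from statistics import mean


def operators_below_rate(records_in: dict[str, list[float]],
                         expected_min_rate: float) -> list[str]:
    """Return the operators whose average records-in-per-second is below
    expected_min_rate. An operator with no samples counts as rate 0."""
    slow = []
    for operator, samples in records_in.items():
        avg = mean(samples) if samples else 0.0
        if avg < expected_min_rate:
            slow.append(operator)
    return sorted(slow)


if __name__ == "__main__":
    # Hypothetical samples for a three-operator job.
    samples = {
        "Source": [100.0, 120.0, 110.0],
        "Map": [0.0, 0.0, 0.0],
        "Sink": [],
    }
    print(operators_below_rate(samples, expected_min_rate=50.0))
    # ['Map', 'Sink']
```

A single threshold is a simplification: operators downstream of a filter or an aggregation naturally see lower rates, so in practice you might keep one expected rate per operator.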

Number of records out per second

Number of records out per second is the number of records each operator in the job is sending at each point in time. The data is split based on the operator, with each operator having a separate line on the graph.

This metric shows whether the job is outputting records.

Refer to this chart when you don't see data in downstream systems or if you have a stale output watermark.
Use this metric to verify whether the job is processing records at the expected rate.
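Comparing the two rate charts operator by operator can narrow down where records stop flowing. Under the same assumption about exported samples, this sketch flags operators whose latest records-in rate is above zero while their latest records-out rate is zero. Operators that write to external systems, such as sinks, or that legitimately drop every record, such as a strict filter, can also show zero output, so the sketch lets you exclude them by name. The function and operator names are hypothetical.

```python
def operators_not_emitting(records_in: dict[str, list[float]],
                           records_out: dict[str, list[float]],
                           exclude: frozenset[str] = frozenset()) -> list[str]:
    """Return operators that are receiving records (latest in-rate > 0)
    but emitting none (latest out-rate == 0), skipping excluded names."""
    suspects = []
    for operator, in_samples in records_in.items():
        if operator in exclude:
            continue
        out_samples = records_out.get(operator, [])
        latest_in = in_samples[-1] if in_samples else 0.0
        latest_out = out_samples[-1] if out_samples else 0.0
        if latest_in > 0 and latest_out == 0:
            suspects.append(operator)
    return sorted(suspects)


if __name__ == "__main__":
    # Hypothetical samples; the sink is excluded because it writes externally.
    rin = {"Source": [10.0, 12.0], "Filter": [8.0, 9.0], "Sink": [5.0, 5.0]}
    rout = {"Source": [10.0, 12.0], "Filter": [3.0, 0.0], "Sink": [0.0, 0.0]}
    print(operators_not_emitting(rin, rout, exclude=frozenset({"Sink"})))
    # ['Filter']
```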

Input watermark

Input watermark is the most recent watermark received by each operator, in milliseconds since the Unix epoch (00:00:00 UTC on January 1, 1970), ignoring leap seconds. The data is split based on the operator, with each operator having a separate line on the graph.
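Because the chart's values are raw millisecond counts, it can help to convert a watermark to a UTC timestamp and compare it with the current time to see how far the operator trails wall-clock time. A minimal sketch: returning `None` for negative values is an assumption borrowed from open source Apache Flink, where an operator that has not yet received a watermark reports a very large negative value, and values too large for Python's `datetime` (for example an end-of-input watermark) also return `None`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def watermark_to_utc(watermark_ms: int) -> Optional[datetime]:
    """Convert milliseconds since the Unix epoch to a UTC datetime.
    Negative values are treated as 'no watermark yet' (an assumption)."""
    if watermark_ms < 0:
        return None
    try:
        return EPOCH + timedelta(milliseconds=watermark_ms)
    except OverflowError:
        # Too large to represent as a datetime.
        return None


def watermark_lag(watermark_ms: int, now: datetime) -> Optional[timedelta]:
    """How far the watermark trails the given wall-clock time."""
    wm = watermark_to_utc(watermark_ms)
    return None if wm is None else now - wm
```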

This metric confirms whether the job is making progress. A healthy watermark increases with time.

If the input and output watermarks are stale, the job processing might be stuck.
This metric indicates when a job is stuck and where the job becomes stuck.
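To decide when a watermark counts as stale and to locate where a job becomes stuck, you could measure how long each operator's watermark has gone without advancing, then walk the operators in pipeline order and report the first stalled one. This is one simple heuristic, not the service's own logic. The sketch assumes per-operator samples of `(sample_time_ms, watermark_ms)` pairs and a caller-chosen stall limit; the data shape, function names, and operator names are hypothetical.

```python
from typing import Optional

Sample = tuple[int, int]  # (sample_time_ms, watermark_ms)


def watermark_is_stale(samples: list[Sample], max_stall_ms: int) -> bool:
    """True if the watermark has not advanced for at least max_stall_ms
    of wall-clock time, measured up to the latest sample."""
    if len(samples) < 2:
        return False  # Too little data to judge.
    last_advance = samples[0][0]
    for (_, prev_wm), (t, wm) in zip(samples, samples[1:]):
        if wm > prev_wm:
            last_advance = t
    return samples[-1][0] - last_advance >= max_stall_ms


def first_stale_operator(operators_in_order: list[str],
                         series: dict[str, list[Sample]],
                         max_stall_ms: int) -> Optional[str]:
    """Return the most upstream operator with a stale watermark, or None."""
    for op in operators_in_order:
        if watermark_is_stale(series.get(op, []), max_stall_ms):
            return op
    return None
```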

Output watermark

Output watermark is the most recent watermark emitted by each operator, in milliseconds since the Unix epoch (00:00:00 UTC on January 1, 1970), ignoring leap seconds. The data is split based on the operator, with each operator having a separate line on the graph.

This metric confirms whether the job is making progress. A healthy watermark increases with time.

This metric indicates when a job is stuck and where the job becomes stuck.
If the input and output watermarks are stale, the job processing might be stuck.
If the input watermark is progressing but the output watermark is stale, the job is ingesting data but not outputting data.
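The cases described above can be summarized as a small decision function over one operator's input and output watermarks. This sketch only checks whether each watermark rose between the first and last sample of the window you choose; the labels paraphrase the text above, and "unclassified" covers a combination this page doesn't describe. The function names are hypothetical.

```python
def is_advancing(watermarks: list[int]) -> bool:
    """True if the watermark rose between the first and last sample."""
    return len(watermarks) >= 2 and watermarks[-1] > watermarks[0]


def diagnose(input_watermarks: list[int], output_watermarks: list[int]) -> str:
    """Classify one operator from its input and output watermark samples."""
    inp = is_advancing(input_watermarks)
    out = is_advancing(output_watermarks)
    if inp and out:
        return "making progress"
    if not inp and not out:
        return "processing might be stuck"
    if inp:
        return "ingesting data but not outputting data"
    return "unclassified"
```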

Autoscaling metrics

On the job details Metrics tab, the autoscaling charts provide information about the autoscaling behavior of the job.

Current parallelism

The Current parallelism chart shows the number of task slots the job is using at any point in time. You can use this chart to understand whether the job is scaling up or down.
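If you export the Current parallelism values, listing each change makes scale-up and scale-down events easy to spot. A minimal sketch, assuming the values arrive as an ordered list of integer task-slot counts; the function name is hypothetical.

```python
def scaling_events(parallelism: list[int]) -> list[tuple[int, int, int, str]]:
    """Return (sample_index, old, new, direction) for every change in the series."""
    events = []
    for i in range(1, len(parallelism)):
        old, new = parallelism[i - 1], parallelism[i]
        if new != old:
            events.append((i, old, new, "up" if new > old else "down"))
    return events


if __name__ == "__main__":
    print(scaling_events([2, 2, 4, 4, 3]))
    # [(2, 2, 4, 'up'), (4, 4, 3, 'down')]
```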

Recommended parallelism

The Recommended parallelism chart shows the number of task slots that the autoscaler recommends. When autoscaling is enabled for a job, BigQuery Engine for Apache Flink tries to allocate a number of task slots equal to the recommended parallelism. The number of task slots actually in use is shown by the Current parallelism chart, and that value might lag behind the recommended parallelism.

Recommended parallelism is always greater than or equal to the minimum parallelism, and always less than or equal to the maximum parallelism.
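Given exported samples of the four parallelism series, you could check the documented bounds and find the points where current parallelism hasn't yet caught up with the recommendation. The sketch assumes the series are aligned sample for sample; the function name is hypothetical.

```python
def lagging_samples(current: list[int], recommended: list[int],
                    minimum: list[int], maximum: list[int]) -> list[int]:
    """Return indices where current parallelism differs from recommended.
    Raises ValueError if the series are misaligned or a recommendation
    falls outside [minimum, maximum]."""
    if not (len(current) == len(recommended) == len(minimum) == len(maximum)):
        raise ValueError("series must have the same length")
    lagging = []
    for i, (cur, rec, lo, hi) in enumerate(zip(current, recommended, minimum, maximum)):
        if not lo <= rec <= hi:
            raise ValueError(f"sample {i}: recommended {rec} outside [{lo}, {hi}]")
        if cur != rec:
            lagging.append(i)
    return lagging
```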

If the recommended parallelism consistently stays close to the maximum, consider updating the job with a higher maximum parallelism. The autoscaler might then raise the recommended parallelism to take advantage of the additional slots. For more information, see Update autoscaling.
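"Consistently close to the maximum" has no fixed definition on this page. One way to make it concrete is to count the share of samples where the recommendation is within some percentage of the maximum. In the sketch below, the 90% closeness ratio and the 80% share of samples are arbitrary, hypothetical defaults that you would tune for your job.

```python
def consistently_near_max(recommended: list[int], maximum: list[int],
                          closeness: float = 0.9, min_share: float = 0.8) -> bool:
    """True if at least min_share of samples have
    recommended >= closeness * maximum."""
    if len(recommended) != len(maximum):
        raise ValueError("series must have the same length")
    if not recommended:
        return False
    near = sum(1 for rec, mx in zip(recommended, maximum) if rec >= closeness * mx)
    return near / len(recommended) >= min_share
```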

Maximum parallelism

The Maximum parallelism chart shows the maximum number of task slots available to the job at any point in time.

Minimum parallelism

The Minimum parallelism chart shows the minimum number of task slots available to the job at any point in time.