BigQuery Engine for Apache Flink metrics

You can view charts in the Metrics tabs of the BigQuery Engine for Apache Flink deployments and jobs pages in the Google Cloud console.

The charts are grouped into deployment metrics, job metrics, and autoscaling metrics.

Support and limitations

To export custom metrics from your BigQuery Engine for Apache Flink job to Cloud Monitoring, the Managed Flink Default Workload Identity must have the IAM role roles/monitoring.metricWriter.

Access job metrics

  1. Sign in to the Google Cloud console.
  2. Select your Google Cloud project.
  3. Open the navigation menu and select BigQuery Engine for Apache Flink.
  4. Click Deployments or Jobs.
  5. In the deployments or jobs list, click the name of the deployment or job.
  6. Click the Metrics tab.

To access additional information in the job metrics charts, click Explore data.

Use Cloud Monitoring

BigQuery Engine for Apache Flink is fully integrated with Cloud Monitoring. Use Metrics Explorer to build queries and adjust the timespan of the metrics.

For instructions about using Metrics Explorer, see Use Cloud Monitoring with BigQuery Engine for Apache Flink.

View in Metrics Explorer

You can open any BigQuery Engine for Apache Flink metrics chart directly in Metrics Explorer.

To view the BigQuery Engine for Apache Flink charts in Metrics Explorer, in the Metrics view, open More chart options, and then click View in Metrics Explorer.

When you adjust the timespan of the metrics, you can select a predefined duration or select a custom time interval to analyze your job.

Deployment metrics

The deployment details Metrics tab includes the following charts.

CPU utilization

CPU utilization is the amount of CPU used by the deployment at each point in time. Use this chart to track changes in CPU usage over time.

Memory utilization

Memory utilization is the amount of memory used by the deployment at each point in time. Use this chart to track changes in memory usage over time.

Job metrics

The job details Metrics tab includes the following charts. Use these metrics to monitor and debug your BigQuery Engine for Apache Flink jobs.

Number of records in per second

Number of records in per second is the rate at which each operator in the job receives records at each point in time. The data is split by operator, with each operator shown as a separate line on the graph.

This metric shows whether the job is running and processing records.

  • Refer to this chart when you don't see data in downstream systems or if you have a stale input watermark.
  • Use this metric to verify whether the job is ingesting records at the expected rate.

Number of records out per second

Number of records out per second is the rate at which each operator in the job sends records at each point in time. The data is split by operator, with each operator shown as a separate line on the graph.

This metric shows whether the job is outputting records.

  • Refer to this chart when you don't see data in downstream systems or if you have a stale output watermark.
  • Use this metric to verify whether the job is processing records at the expected rate.

Input watermark

Input watermark is the most recent watermark received by each operator, in milliseconds since the Unix epoch (00:00:00 UTC on January 1, 1970), ignoring leap seconds. The data is split based on the operator, with each operator having a separate line on the graph.

This metric confirms whether the job is making progress. A healthy watermark increases with time.

  • If the input and output watermarks are stale, the job processing might be stuck.
  • This metric indicates when a job is stuck and where the job becomes stuck.
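Because the watermark is reported as milliseconds since the Unix epoch, it can help to convert a value read from the chart into a UTC timestamp and to compute how far it trails a reference time. The following is a minimal sketch using only the Python standard library. The function names and the sample value are hypothetical and are not part of BigQuery Engine for Apache Flink.

```python
from datetime import datetime, timezone


def watermark_to_utc(watermark_ms: int) -> datetime:
    """Convert a watermark in milliseconds since the Unix epoch to an aware UTC datetime."""
    return datetime.fromtimestamp(watermark_ms / 1000, tz=timezone.utc)


def watermark_lag_seconds(watermark_ms: int, now_ms: int) -> float:
    """Return how far the watermark trails the reference time, in seconds."""
    return (now_ms - watermark_ms) / 1000


# Hypothetical value read from the Input watermark chart.
sample_ms = 1_700_000_000_000
print(watermark_to_utc(sample_ms).isoformat())
# Hypothetical reference time 90 seconds after the watermark.
print(watermark_lag_seconds(sample_ms, now_ms=sample_ms + 90_000))
```

A lag that keeps growing between readings suggests that the watermark is not keeping up with the reference time.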

Output watermark

Output watermark is the most recent watermark emitted by each operator, in milliseconds since the Unix epoch (00:00:00 UTC on January 1, 1970), ignoring leap seconds. The data is split based on the operator, with each operator having a separate line on the graph.

This metric confirms whether the job is making progress. A healthy watermark increases with time.

  • This metric indicates when a job is stuck and where the job becomes stuck.
  • If the input and output watermarks are stale, the job processing might be stuck.
  • If the input watermark is progressing but the output watermark is stale, the job is ingesting data but not outputting data.
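The interpretation rules above can be sketched as a small classifier over watermark samples for one operator. This is an illustration only: the function names, the return strings, and the definition of "stale" (the newest sample is not ahead of the oldest) are assumptions, not behavior defined by BigQuery Engine for Apache Flink.

```python
from typing import Sequence


def is_stale(samples: Sequence[int]) -> bool:
    """Treat a watermark as stale if its newest sample is not ahead of its oldest.

    Assumption: fewer than two samples gives no evidence of staleness.
    """
    return len(samples) >= 2 and samples[-1] <= samples[0]


def diagnose(input_wm: Sequence[int], output_wm: Sequence[int]) -> str:
    """Classify operator progress from input and output watermark samples (ms)."""
    input_stale = is_stale(input_wm)
    output_stale = is_stale(output_wm)
    if input_stale and output_stale:
        return "possibly stuck"
    if output_stale:
        return "ingesting but not outputting"
    return "making progress"


# Hypothetical samples: the input watermark advances, the output watermark does not.
print(diagnose([100, 200, 300], [90, 90, 90]))
```

Running the same check for each operator line in the charts can help narrow down where a job stops making progress.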

Vertex metrics

You can view metrics for individual vertices (nodes) in the job graph. To view vertex metrics:

  1. In the jobs list, click the job name. The Graph tab displays the job graph.
  2. In the job graph, click the vertex. Vertex metrics are shown in the Vertex info panel.

The following metrics are shown.

Current parallelism

The number of task slots assigned to this vertex.

Input watermark

The last watermark this operator received, in milliseconds since the Unix epoch, ignoring leap seconds.

Backlog elements

The number of elements in the operator's backlog. This metric is defined only for Apache Kafka sources.

State milliseconds per second

The number of milliseconds within the last second that this vertex was in each of the following states:

  • backpressured. The vertex is waiting for downstream vertices to finish.
  • busy. The vertex is processing data.
  • idle. The vertex has no work to perform.

Because a vertex can contain multiple subtasks, the values across all states might sum to more than 1000 milliseconds.
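To compare vertices that have different numbers of subtasks, one option is to normalize the state values to fractions of the available time. The sketch below assumes that the reported values are summed across subtasks, so that each subtask contributes up to 1000 milliseconds per second. That aggregation, the function name, and the sample numbers are assumptions for illustration.

```python
from typing import Dict


def state_fractions(state_ms: Dict[str, float], subtasks: int) -> Dict[str, float]:
    """Normalize per-state milliseconds per second to fractions of available time.

    Assumes each subtask contributes up to 1000 ms per second to the total.
    """
    if subtasks <= 0:
        raise ValueError("subtasks must be a positive integer")
    budget_ms = 1000.0 * subtasks
    return {state: ms / budget_ms for state, ms in state_ms.items()}


# Hypothetical readings for a vertex with 4 subtasks (total 4000 ms).
readings = {"backpressured": 600, "busy": 3000, "idle": 400}
print(state_fractions(readings, subtasks=4))
```

In this hypothetical reading, the vertex spends about three quarters of its available time busy.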

Input metrics

If the vertex has inputs, you can select from the following charts:

  • Records in. The total number of records ingested by this vertex.
  • Input bytes. The total number of bytes ingested by this vertex.

Output metrics

If the vertex has outputs, you can select from the following charts:

  • Records out. The total number of records output by this vertex.
  • Output bytes. The total number of bytes output by this vertex.
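Because the vertex input and output charts report totals, a per-second rate can be estimated from two samples of the same total. The following sketch shows one way to do that. The function name, the sample values, and the handling of a decreasing total (treated here as a counter reset, for example after a restart) are assumptions, not documented behavior.

```python
def rate_per_second(prev_total: int, prev_ts: float,
                    curr_total: int, curr_ts: float) -> float:
    """Estimate a per-second rate from two samples of a cumulative total.

    Timestamps are in seconds. Assumption: if the total decreased, the
    counter was reset, so only the current total is counted.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_total - prev_total
    if delta < 0:
        delta = curr_total
    return delta / elapsed


# Hypothetical Records in totals sampled 60 seconds apart.
print(rate_per_second(prev_total=120_000, prev_ts=0.0,
                      curr_total=150_000, curr_ts=60.0))
```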

Autoscaling metrics

On the job details Metrics tab, the autoscaling charts provide information about the autoscaling behavior of the job.

Current parallelism

The Current parallelism chart shows the number of task slots the job is using at any point in time. You can use this chart to understand whether the job is scaling up or down.

Recommended parallelism

The Recommended parallelism chart shows the number of task slots that the autoscaler recommends. When autoscaling is enabled for a job, BigQuery Engine for Apache Flink tries to allocate a number of task slots equal to the recommended parallelism. The number of task slots currently in use is shown by the Current parallelism chart, and it might lag behind the recommended parallelism.

Recommended parallelism is always greater than or equal to the minimum parallelism, and always less than or equal to the maximum parallelism.

If the recommended parallelism consistently stays close to the maximum, consider updating the job with a higher maximum parallelism. In response, the autoscaler might raise the recommended parallelism, to take advantage of the additional slots. For more information, see Update autoscaling.
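The bounds on recommended parallelism, and the "consistently close to the maximum" check, can be sketched as follows. This is not the autoscaler's implementation. The function names and the `threshold` and `share` defaults are hypothetical values chosen for illustration.

```python
from typing import Sequence


def clamp_parallelism(desired: int, minimum: int, maximum: int) -> int:
    """Bound a desired parallelism to the range [minimum, maximum]."""
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    return max(minimum, min(desired, maximum))


def near_maximum(recommended: Sequence[int], maximum: int,
                 threshold: float = 0.9, share: float = 0.8) -> bool:
    """Return True if at least `share` of the samples reach threshold * maximum.

    Both defaults are hypothetical; tune them to your own workload.
    """
    if not recommended:
        return False
    hits = sum(1 for value in recommended if value >= threshold * maximum)
    return hits / len(recommended) >= share


# Hypothetical values: a desired parallelism of 12 is bounded to the maximum of 8.
print(clamp_parallelism(12, minimum=2, maximum=8))
# Hypothetical samples from the Recommended parallelism chart.
print(near_maximum([8, 8, 8, 8], maximum=8))
```

If a check like `near_maximum` stays true over a long window, that matches the situation in which raising the maximum parallelism is worth considering.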

Maximum parallelism

The Maximum parallelism chart shows the maximum number of task slots available to the job at any point in time.

Minimum parallelism

The Minimum parallelism chart shows the minimum number of task slots available to the job at any point in time.