Monitor and optimize job resources by viewing metrics

This document describes how to monitor and optimize the resources for a Batch job by viewing metrics in Cloud Monitoring. To learn more about the resources that a job runs on, see Job resources.

For any job, Monitoring provides basic metrics such as CPU utilization and network traffic. However, some metrics, such as memory and process utilization, can only be collected if a job installs the Ops Agent. Metrics for a job's resources help you evaluate the performance and utilization of each resource. This information can help you identify improvements for any future iterations of the job. For example, you might remove unutilized resources to help optimize costs, or you might improve or increase strained resources to help enhance performance.
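
As a rough illustration of how utilization metrics can inform these decisions, the following Python sketch classifies a resource from a list of utilization samples. The function name `classify_utilization` and both thresholds are hypothetical and chosen only for illustration; they are not Batch or Monitoring defaults, so adjust them to your own workload.

```python
from statistics import mean


def classify_utilization(samples, low=0.2, high=0.9):
    """Classify a resource from utilization samples in the range 0.0 to 1.0.

    The `low` and `high` thresholds are illustrative assumptions,
    not values defined by Batch or Cloud Monitoring.
    """
    if not samples:
        raise ValueError("at least one utilization sample is required")
    peak = max(samples)
    average = mean(samples)
    if peak < low:
        # Even the busiest sample left most of the capacity idle.
        return "underutilized"
    if average > high:
        # The resource stayed close to its capacity on average.
        return "strained"
    return "ok"


cpu_samples = [0.05, 0.08, 0.12, 0.07]
print(classify_utilization(cpu_samples))  # underutilized
```

A resource classified as underutilized might be a candidate for removal or downsizing in the next iteration of the job, and a strained one might be a candidate for an increase.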

Before you begin

If you haven't used Batch before, review Get started with Batch and enable Batch by completing the prerequisites for projects and users.
Optional: To collect additional metrics for a job, create and run a job that automatically installs the Ops Agent.
If the Monitoring API isn't already enabled for your project, enable it. To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
To get the permissions that you need to view observability metrics, ask your administrator to grant you the Monitoring Metric Viewer (roles/monitoring.metricViewer) IAM role on the project. For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

View metrics for job resources

Observe and monitor VMs in the Compute Engine documentation provides relevant conceptual information about VM metrics; however, for Batch jobs, use different methods to view VM metrics than the ones that document describes. Specifically, the Compute Engine documentation explains how to view metrics by using the predefined Monitoring dashboards for Compute Engine or the Compute Engine pages in the Google Cloud console. Those methods don't display information about VMs that have been deleted, so use them only if you want to view metrics for Batch jobs while they are running.

To view metrics for both running and finished Batch jobs, use Metrics Explorer charts as explained in this section. Charts are temporary unless you save them to a custom dashboard.

To create a chart for viewing one or more metrics, do the following:

Optional: If you plan to save the chart, identify or create a custom dashboard for the chart.
Create a Metrics Explorer chart for one or more metrics.

Without filters, each VM metric in a chart includes data from all the VMs in your project. To limit the chart to metrics from all Batch jobs or from specific Batch jobs, add the following filter:
```
group=RESOURCE_GROUP_NAME
```
Replace RESOURCE_GROUP_NAME with the name of a resource group for Batch jobs. For more information, see Create resource groups to filter metrics in this document.
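
If you prefer to query the same data programmatically, a comparable filter can be passed to the Monitoring API when listing time series. The following sketch only builds the filter string. The helper name `build_timeseries_filter` and the group ID `12345` are hypothetical, and the sketch assumes that API filters select a group by its numeric ID through the `group.id` selector rather than by its display name; verify this against the Monitoring filters reference before relying on it.

```python
def build_timeseries_filter(metric_type, group_id=None):
    """Build a Monitoring filter string for one metric type.

    Assumption: the API selects a resource group with `group.id`.
    The caller supplies the group ID; it is not looked up here.
    """
    parts = [f'metric.type = "{metric_type}"']
    if group_id is not None:
        parts.append(f'group.id = "{group_id}"')
    return " AND ".join(parts)


# "12345" is a placeholder group ID used only for illustration.
cpu_filter = build_timeseries_filter(
    "compute.googleapis.com/instance/cpu/utilization", group_id="12345"
)
print(cpu_filter)
```

Omitting `group_id` yields a filter that, like an unfiltered chart, matches the metric for every VM in the project.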

Create resource groups to filter metrics

You can use resource groups as customizable filters for Metrics Explorer charts. To create a resource group for all or specific Batch jobs in your project, do the following:

Select a label to use as the membership criteria based on which jobs you want to include in the group:
- All Batch jobs: Use the predefined batch-node label, which is automatically applied to all the resources for all Batch jobs and has a null value.
- Specific Batch jobs: Use a label that is applied to the resources only for specific Batch jobs.
  
  For example, to create a group based on full or partial job names, use the predefined batch-job-id label with a specific value. The batch-job-id label is automatically applied to all the resources for all Batch jobs, and its value is the job name.
  
  Alternatively, if you use a custom label, you must apply it, when you create the jobs, to all the resources of the Batch jobs that you want to include in the group.
Ensure that your project has at least one job with your selected label and that this job is in the RUNNING state. Otherwise, this label won't appear as an option when you try to create the resource group.
Create a resource group. When you are specifying the membership criteria, do the following:
1. Set the Type to Tag.
2. Set the Tag field to the name of your selected label. Then, set the Operator field and, if applicable, the Value field based on the label values that you want the group to include.
  
  For example, if you want this group to include all Batch jobs, set Tag to batch-node, and set Operator to Exists. Alternatively, if you want this group to include only Batch jobs with names that start with test, set Tag to batch-job-id, set Operator to Starts with, and set Value to test.
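
To make the two example criteria concrete, the following Python sketch simulates locally how the Exists and Starts with operators would select resources by label. It does not call Monitoring; the function `matches_criteria`, the VM names, and the job names are all hypothetical, and the empty string stands in for the null value of the batch-node label.

```python
def matches_criteria(labels, tag, operator, value=None):
    """Return True if a resource's labels satisfy one membership criterion.

    Only the two operators used in the examples above are modeled.
    """
    if operator == "Exists":
        return tag in labels
    if operator == "Starts with":
        return tag in labels and labels[tag].startswith(value)
    raise ValueError(f"unsupported operator: {operator}")


# Hypothetical VMs and their labels, for illustration only.
vms = {
    "vm-a": {"batch-node": "", "batch-job-id": "test-render-01"},
    "vm-b": {"batch-node": "", "batch-job-id": "prod-etl-07"},
    "vm-c": {"env": "dev"},  # not created by Batch
}

all_batch = sorted(
    name for name, labels in vms.items()
    if matches_criteria(labels, "batch-node", "Exists")
)
test_jobs = sorted(
    name for name, labels in vms.items()
    if matches_criteria(labels, "batch-job-id", "Starts with", "test")
)
print(all_batch)  # ['vm-a', 'vm-b']
print(test_jobs)  # ['vm-a']
```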

What's next

Learn more about job resource metrics:
Learn about other methods to monitor and optimize Batch jobs:
- Monitor job status using Pub/Sub notifications and BigQuery.
- Colocate VMs to reduce latency.
- Learn about more job creation options.