You can view logs and metrics for Vertex AI and the pre-trained APIs installed with it. The installed pre-trained APIs are Optical Character Recognition (OCR), Speech-to-Text, and Translation.
You can monitor some of the Vertex AI metrics in the observability tools. You can also create queries to monitor specific Vertex AI metrics. For information about observability in Google Distributed Cloud (GDC) air-gapped, see Monitor metrics and logs.
Before you begin
To get the permissions you need to view logs and metrics for Vertex AI, ask your Organization IAM Admin to grant you the Organization Grafana Viewer (organization-grafana-viewer) cluster role in the platform-obs namespace.
The following topics help you monitor Vertex AI and troubleshoot issues using logs and metrics.
View Vertex AI logs and metrics
To view Vertex AI logs and metrics, you must enable the Vertex AI pre-trained APIs. For more information, see Get the statuses of the pre-trained APIs.
To view Vertex AI logs and metrics, do the following:
If you aren't signed in to the GDC console, sign in using the steps in Sign in.
In the navigation pane, expand Vertex AI, then click Pre-trained APIs.
On the Pre-trained APIs page, click Monitor services to open the monitoring dashboard.
In the monitoring dashboard, click Explore to access logs and metrics.
Use the monitoring dashboard to view Vertex AI logs and metrics
You can view Vertex AI metrics in the monitoring dashboard. For example, you can create a query to view how Vertex AI affects CPU usage.
To view metrics in the monitoring dashboard, do the following:
Open the monitoring dashboard from Vertex AI. For more information, see View Vertex AI logs and metrics.
In the monitoring dashboard, on the Explore tab, select one of the following data sources:
- For metrics, select the prometheus data source.
- For Vertex AI operational logs, select the Operational Logs data source.
- For Vertex AI audit logs, select the Audit Logs data source.
Click the plus symbol (+) to create a custom dashboard for your metric and log queries.
Create queries and run them in the custom dashboard. The custom dashboard preserves your queries so that you can access them later.
Sample Vertex AI platform queries for the monitoring dashboard
The following are sample queries to help you construct your own metric and log queries to monitor Vertex AI in your air-gapped environment.
Sample Vertex AI platform metric queries
For metric queries, the data source must be prometheus.
The following sample queries show the CPU usage of the Vertex AI operator containers:
- L1 operator CPU usage percentage:
rate(container_cpu_usage_seconds_total{namespace="ai-system",container="l1operator"}[30s]) * 100
- L2 operator CPU usage percentage:
rate(container_cpu_usage_seconds_total{namespace="ai-system",container="l2operator"}[30s]) * 100
The following sample queries show the memory usage of the Vertex AI operator containers:
- L1 operator memory usage in MB:
container_memory_usage_bytes{namespace="ai-system",container="l1operator"} * 1e-6
- L2 operator memory usage in MB:
container_memory_usage_bytes{namespace="ai-system",container="l2operator"} * 1e-6
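The CPU and memory queries above differ only in the container label, so they can be generated from a template. The following Python sketch shows one way to do that. The helper names cpu_usage_query and memory_usage_mb_query are hypothetical and are not part of GDC or Prometheus; the namespace, container names, and 30s window come from the samples above.

```python
def cpu_usage_query(namespace: str, container: str, window: str = "30s") -> str:
    """Build a PromQL query for a container's CPU usage as a percentage."""
    return (
        f'rate(container_cpu_usage_seconds_total{{namespace="{namespace}",'
        f'container="{container}"}}[{window}]) * 100'
    )


def memory_usage_mb_query(namespace: str, container: str) -> str:
    """Build a PromQL query for a container's memory usage in MB."""
    return (
        f'container_memory_usage_bytes{{namespace="{namespace}",'
        f'container="{container}"}} * 1e-6'
    )


if __name__ == "__main__":
    # Print the four operator queries so you can paste them into Explore.
    for container in ("l1operator", "l2operator"):
        print(cpu_usage_query("ai-system", container))
        print(memory_usage_mb_query("ai-system", container))
```

Paste the printed strings into the query editor with the prometheus data source selected.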
Sample Vertex AI platform log queries
For operational logs, the data source must be Operational Logs. For audit logs, the data source must be Audit Logs.
Sample Vertex AI platform operational log queries
The following are sample queries to view Vertex AI operational logs:
- L1 operator logs:
{service_name="vai-l1operator"}
- L2 operator logs:
{service_name="vai-l2operator"}
Sample Vertex AI platform audit log queries
The following are sample queries to view Vertex AI audit logs:
- Vertex AI platform frontend:
{namespace="istio-system",service_name="istio"} | json | resource_cluster_name="vai-web-plugin-frontend.ai-system"
- Vertex AI platform backend:
{namespace="istio-system",service_name="istio"} | json | resource_cluster_name="vai-web-plugin-backend.ai-system"
Sample Vertex AI service queries for the monitoring dashboard
The following are sample queries to help you construct your own metric and log queries to monitor the pre-trained APIs installed with Vertex AI on GDC. You can monitor metrics and logs for Optical Character Recognition (OCR), Speech-to-Text, and Translation.
Sample Vertex AI pre-trained API metric queries
For metric queries, the data source must be prometheus.
The following are sample queries that show the effect of a pre-trained API on CPU usage. There is one sample query for each pre-trained API.
- OCR CPU usage:
rate(container_cpu_usage_seconds_total{namespace="g-vai-ocr-sie",container="CONTAINER_NAME"}[30s]) * 100
Replace CONTAINER_NAME with one of the following values: vision-extractor, vision-frontend, or vision-vms-ocr.
- Speech-to-Text CPU usage:
rate(container_cpu_usage_seconds_total{namespace="g-vai-speech-sie",container="CONTAINER_NAME"}[30s]) * 100
- Translation CPU usage:
rate(container_cpu_usage_seconds_total{namespace="g-vai-translation-sie",container="CONTAINER_NAME"}[30s]) * 100
Replace CONTAINER_NAME with one of the following values: translation-aligner, translation-frontend, or translation-prediction.
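To see CPU usage for every listed container of one API in a single query, one option is a PromQL regular-expression matcher (=~) on the container label instead of a single CONTAINER_NAME value. The following Python sketch builds such a query. The function name cpu_usage_query_all is hypothetical, and the regex variant is an assumption layered on the sample queries rather than a query taken from the GDC documentation. Speech-to-Text is omitted because its container names aren't listed above.

```python
# Container names as listed in the sample queries above.
OCR_CONTAINERS = ["vision-extractor", "vision-frontend", "vision-vms-ocr"]
TRANSLATION_CONTAINERS = [
    "translation-aligner",
    "translation-frontend",
    "translation-prediction",
]


def cpu_usage_query_all(namespace: str, containers: list[str], window: str = "30s") -> str:
    """Build one PromQL query that matches several containers by regex."""
    # The listed names contain no regex metacharacters, so a plain
    # "|" join is enough to form the alternation.
    pattern = "|".join(containers)
    return (
        f'rate(container_cpu_usage_seconds_total{{namespace="{namespace}",'
        f'container=~"{pattern}"}}[{window}]) * 100'
    )


if __name__ == "__main__":
    print(cpu_usage_query_all("g-vai-ocr-sie", OCR_CONTAINERS))
    print(cpu_usage_query_all("g-vai-translation-sie", TRANSLATION_CONTAINERS))
```

The resulting query returns one series per matching container, which you can then compare side by side.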
The following are sample queries that use the destination_service filter label to get the error rate over the last 60 minutes:
- OCR error rate:
rate(istio_requests_total{destination_service=~".*g-vai-ocr-sie.svc.cluster.local",response_code=~"[4-5][0-9][0-9]"}[60m])
- Speech-to-Text error rate:
rate(istio_requests_total{destination_service=~".*g-vai-speech-sie.svc.cluster.local",response_code=~"[4-5][0-9][0-9]"}[60m])
- Translation error rate:
rate(istio_requests_total{destination_service=~".*g-vai-translation-sie.svc.cluster.local",response_code=~"[4-5][0-9][0-9]"}[60m])
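The three error-rate queries above differ only in the namespace embedded in destination_service. The following Python sketch generates them from one template and, as an optional extension, builds an error ratio (error requests divided by all requests). The function names are hypothetical, and the ratio query is an assumption based on common PromQL practice, not a query from the GDC documentation.

```python
API_NAMESPACES = {
    "OCR": "g-vai-ocr-sie",
    "Speech-to-Text": "g-vai-speech-sie",
    "Translation": "g-vai-translation-sie",
}

# Matches HTTP 4xx and 5xx response codes, as in the sample queries.
ERROR_CODES = "[4-5][0-9][0-9]"


def _service_selector(namespace: str) -> str:
    return f'destination_service=~".*{namespace}.svc.cluster.local"'


def error_rate_query(namespace: str, window: str = "60m") -> str:
    """Per-second rate of 4xx/5xx requests, as in the samples above."""
    selector = _service_selector(namespace)
    return (
        f'rate(istio_requests_total{{{selector},'
        f'response_code=~"{ERROR_CODES}"}}[{window}])'
    )


def error_ratio_query(namespace: str, window: str = "60m") -> str:
    """Assumed extension: fraction of requests that returned 4xx/5xx."""
    selector = _service_selector(namespace)
    errors = f"sum({error_rate_query(namespace, window)})"
    total = f"sum(rate(istio_requests_total{{{selector}}}[{window}]))"
    return f"{errors} / {total}"


if __name__ == "__main__":
    for api, namespace in API_NAMESPACES.items():
        print(f"{api}: {error_rate_query(namespace)}")
```

A ratio is often easier to alert on than a raw rate, because it stays comparable when traffic volume changes.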
Sample Vertex AI pre-trained API log queries
For operational logs, the data source must be Operational Logs. For audit logs, the data source must be Audit Logs.
Sample pre-trained API operational log queries
Operational log queries for the pre-trained APIs are constructed like the Vertex AI operational log queries. The primary difference is that the main filter is the namespace, which identifies the pre-trained API. The three namespaces are:
g-vai-translation-sie
g-vai-speech-sie
g-vai-ocr-sie
You can get more granular results by adding labels, such as service_name or pod, to your query. The following are operational log query examples for the pre-trained APIs:
- OCR:
{namespace="g-vai-ocr-sie"}
- Speech-to-Text:
{namespace="g-vai-speech-sie"}
- Translation:
{namespace="g-vai-translation-sie"}
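To narrow these queries with extra labels such as service_name or pod, you add more matchers to the same selector. The following Python sketch assembles such a selector. The function name log_selector and the pod value example-pod are hypothetical placeholders; substitute label values that exist in your environment.

```python
def log_selector(namespace: str, **labels: str) -> str:
    """Build a log stream selector from a namespace plus optional labels."""
    matchers = [f'namespace="{namespace}"']
    # Keyword arguments keep their call order, so the output is stable.
    matchers += [f'{name}="{value}"' for name, value in labels.items()]
    return "{" + ",".join(matchers) + "}"


if __name__ == "__main__":
    # Namespace-only selector, identical to the OCR sample above.
    print(log_selector("g-vai-ocr-sie"))
    # Narrowed to one pod; "example-pod" is a placeholder value.
    print(log_selector("g-vai-ocr-sie", pod="example-pod"))
```

Run the generated selector with the Operational Logs data source selected.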
Sample pre-trained API audit log queries
The following are sample queries to view audit logs for the pre-trained APIs:
- OCR:
{service_name="istio", cluster="g-org-1-shared-service"} |= "vision-frontend-server"
- Translation:
{service_name="istio", cluster="g-org-1-shared-service"} |= "translation-frontend-server"
- Speech-to-Text:
{service_name="istio", cluster="g-org-1-shared-service"} |= "speech-frontend-server"
- Chirp model of Speech-to-Text (Preview):
{service_name="istio", cluster="g-org-1-shared-service"} |= "speech-frontend-server"
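The audit log queries above share one selector and differ only in the text matched by the line filter (|=). The following Python sketch generates them from a mapping. The function name audit_log_query is hypothetical, and making the cluster value a parameter is an assumption for convenience; the default is the value shown in the samples.

```python
# Line-filter text for each API, copied from the sample queries above.
# Chirp uses the same frontend server string as Speech-to-Text.
FRONTEND_SERVERS = {
    "OCR": "vision-frontend-server",
    "Translation": "translation-frontend-server",
    "Speech-to-Text": "speech-frontend-server",
}


def audit_log_query(frontend_server: str, cluster: str = "g-org-1-shared-service") -> str:
    """Build an audit log query that filters istio logs by frontend server."""
    return f'{{service_name="istio", cluster="{cluster}"}} |= "{frontend_server}"'


if __name__ == "__main__":
    for api, server in FRONTEND_SERVERS.items():
        print(f"{api}: {audit_log_query(server)}")
```

Run the generated queries with the Audit Logs data source selected.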
Get the statuses of the pre-trained APIs
To view the statuses of the pre-trained APIs, see View service statuses and endpoints.