Troubleshoot the Monitoring API

This guide explains some of the issues that might arise when you use the Monitoring API.

The Monitoring API is one of the set of Cloud APIs. These APIs share a common set of error codes. For a list of the error codes defined by the Cloud APIs and general suggestions on handling the errors, see Handling errors.

Use APIs Explorer for debugging

APIs Explorer is a widget built into the reference pages for API methods. It lets you invoke the method by filling out fields; it doesn't require you to write code.

If you are having trouble with a method invocation, use the APIs Explorer (Try this API) widget on the reference page for that method to debug your problem. For more information, see APIs Explorer.

General API errors

Here are some of the Monitoring API errors and messages you might see from your API calls:

  • 404 NOT_FOUND with "The requested URL was not found on this server": Some part of the URL is incorrect. Compare the URL against the URL for the method shown on the method's reference page. This error might mean that there is a spelling error, such as "project" instead of "projects", or a capitalization error, such as "TimeSeries" instead of "timeSeries".

  • 401 UNAUTHENTICATED with "User is not authorized to access the project (or metric)": This error code typically indicates an authorization problem, but it can also mean that there is an error in the project ID or metric type name. Verify the spelling and capitalization.

    If you aren't using APIs Explorer, then try using it. When your API call works in APIs Explorer, there is probably an authorization issue in the environment where you're making the API call. Go to the API manager page to verify that the Monitoring API is enabled for your project.

  • 400 INVALID_ARGUMENT with "Field filter had an invalid value": Verify the spelling and formatting of the monitoring filter. For more information, see Monitoring Filters.

  • 400 INVALID_ARGUMENT with "Request was missing field interval.endTime": You see this message when the end time is missing, or when it is present but not properly formatted. If you are using APIs Explorer, then don't quote the value of the time field.

    Here are some examples of valid time specifications:
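
    As an illustration, the time fields take RFC 3339 timestamps, so values such as 2024-05-11T01:23:45Z (UTC) or 2024-05-11T01:23:45.678000+05:00 (explicit offset) are well formed. The following is a minimal Python sketch for producing such strings; the helper name to_rfc3339 and the sample dates are assumptions made for this example and aren't part of any Monitoring client library.

```python
from datetime import datetime, timedelta, timezone


def to_rfc3339(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 3339 string (hypothetical helper)."""
    if moment.tzinfo is None:
        raise ValueError("datetime must be timezone-aware")
    text = moment.isoformat()
    # isoformat() writes UTC as "+00:00"; "Z" is the equivalent short form.
    if text.endswith("+00:00"):
        text = text[:-6] + "Z"
    return text


utc_example = to_rfc3339(datetime(2024, 5, 11, 1, 23, 45, tzinfo=timezone.utc))
offset_example = to_rfc3339(
    datetime(2024, 5, 11, 1, 23, 45, 678000, tzinfo=timezone(timedelta(hours=5)))
)
print(utc_example)     # 2024-05-11T01:23:45Z
print(offset_example)  # 2024-05-11T01:23:45.678000+05:00
```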


Missing results

When an API call returns the status code 200 and an empty response, consider the following:

  • When the call uses a filter, the filter might not have matched anything. The filter match is case-sensitive. To resolve filter problems, start by specifying only one filter component, such as metric.type, and verify that you get results. Add the other filter components one by one to build up your request.
  • When working with a custom metric, verify that you have specified the project that defines the metric.
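
The one-component-at-a-time approach to debugging a filter can be sketched in Python as follows. This only builds the filter strings to try, in order; it doesn't call the API. The function name build_filters and the sample metric type and zone are illustrative assumptions.

```python
def build_filters(components: list[str]) -> list[str]:
    """Return filters that add one component at a time, joined with AND."""
    return [" AND ".join(components[: i + 1]) for i in range(len(components))]


# Illustrative components; substitute the ones from your own request.
components = [
    'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
    'resource.labels.zone = "us-central1-a"',
]

steps = build_filters(components)
for step in steps:
    # Try each filter in turn; the first one that returns no results
    # points at the component that fails to match.
    print(step)
```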

There are several reasons why data points might be missing when you use the timeSeries.list method:

  • The data might have aged out. For more information, see Data retention.

  • The data might not have propagated to Monitoring yet. For more information, see Latency of metric data.

  • The interval is invalid:

    • Verify that the end time is correct.
    • Verify that the start time is correct and that it is earlier than the end time. When the start time is missing or malformed, the API sets the start time to the end time. For GAUGE metrics, the resulting interval matches only points whose start and end times are exactly the interval's end time. For CUMULATIVE or DELTA metrics, which measure across time intervals, no points are matched. For more information, see Time intervals.
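
A client-side check can catch these interval problems before the request is sent. The following Python sketch assumes RFC 3339 timestamps with an explicit offset or "Z"; the names parse_rfc3339 and check_interval are hypothetical, and the messages only restate the behavior described above.

```python
from datetime import datetime


def parse_rfc3339(text: str) -> datetime:
    """Parse a timestamp such as 2024-05-11T01:23:45Z (hypothetical helper)."""
    # Older Python versions of fromisoformat() reject a trailing "Z".
    moment = datetime.fromisoformat(text.replace("Z", "+00:00"))
    if moment.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {text!r}")
    return moment


def check_interval(start_time: str, end_time: str, metric_kind: str) -> list[str]:
    """Return a list of problems found in the interval; empty if none."""
    try:
        start = parse_rfc3339(start_time)
        end = parse_rfc3339(end_time)
    except ValueError as err:
        return [f"malformed timestamp: {err}"]
    if start > end:
        return ["start time is later than end time"]
    if start == end:
        if metric_kind == "GAUGE":
            return ["zero-length interval: only points at exactly the end time match"]
        return ["zero-length interval: DELTA and CUMULATIVE metrics match no points"]
    return []


print(check_interval("2024-05-11T01:00:00Z", "2024-05-11T02:00:00Z", "GAUGE"))
print(check_interval("2024-05-11T02:00:00Z", "2024-05-11T02:00:00Z", "DELTA"))
```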

Retrying API errors

Two of the Cloud APIs error codes indicate circumstances in which it might be useful to retry the request:

  • 503 UNAVAILABLE: retries are useful when the problem is a short-lived or transient condition.
  • 429 RESOURCE_EXHAUSTED: retries are useful, after a delay, for long-running background jobs with time-based quota such as n calls per t seconds. Retries aren't useful when the problem is a short-lived or transient condition, or when you've exhausted a volume-based quota. For transient conditions, consider tolerating the failure. For quota-related issues, consider reducing your quota usage or requesting a quota increase.

When writing code that might retry requests, first ensure that the request is safe to retry.

Is the request safe to retry?

If your request is idempotent, then it is safe to retry. An idempotent action is one where any change in state does not depend on the current state. For example:

  • Reading x is idempotent; there is no change to the value.
  • Setting x to 10 is idempotent; this might change the state, if the value isn't already 10, but it doesn't matter what the current value is. And it doesn't matter how many times you attempt to set the value.
  • Incrementing x is not idempotent; the new value depends on the current value.
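
The three cases above can be checked with a small Python sketch that applies an action once, then again as a retry would, and compares the results. The function names are made up for this example, and the check covers only the single starting value it is given.

```python
def read(x: int) -> int:
    """Reading leaves the value unchanged."""
    return x


def set_to_ten(x: int) -> int:
    """Setting ignores the current value."""
    return 10


def increment(x: int) -> int:
    """Incrementing depends on the current value."""
    return x + 1


def same_after_retry(action, start: int) -> bool:
    """True if applying the action twice gives the same state as applying it once."""
    once = action(start)
    twice = action(once)
    return once == twice


print(same_after_retry(read, 3))        # True
print(same_after_retry(set_to_ten, 3))  # True
print(same_after_retry(increment, 3))   # False
```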

Retry with exponential backoff

When implementing code to retry requests, you don't want to rapidly issue new requests indefinitely. If a system is overloaded, this approach contributes to the problem.

Instead, use a truncated exponential backoff approach. When requests fail because of transient overloads rather than true unavailability, the solution is to reduce the load. A truncated exponential backoff follows this general pattern:

  • Establish how long you are willing to wait while retrying or how many attempts you are willing to make. When this limit is exceeded, consider the service unavailable and handle that condition appropriately for your application. This is what makes the backoff truncated; you stop retrying at some point.

  • Retry the request with increasingly long pauses to back off the frequency of retries. Retry until the request succeeds or your established limit is reached.

    The interval typically grows exponentially with the retry count, for example by raising a fixed base to the power of the retry count. This growth is what makes it an exponential backoff.

There are many ways to implement an exponential backoff. The following is an example that adds an increasing backoff delay to a minimum delay of 1000ms. The initial backoff delay is 2ms, and it increases to 2^retry_count ms with each attempt.

The following table shows the retry intervals using the initial values:

  • Minimum delay = 1s = 1000ms
  • Initial backoff = 2ms

Retry count   Additional delay (ms)   Retry after (ms)
0             2^0 = 1                 1001
1             2^1 = 2                 1002
2             2^2 = 4                 1004
3             2^3 = 8                 1008
4             2^4 = 16                1016
...           ...                     ...
n             2^n                     1000 + 2^n

You can truncate the retry cycle by stopping either after n attempts or when the time spent exceeds a reasonable value for your application.
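
The following Python sketch shows one way this pattern might look, using a 1000ms minimum delay plus 2^retry_count ms and truncating after a fixed number of retries. The names retry_delay_ms, call_with_backoff, and TransientError are hypothetical; in real code you would map TransientError to whatever your client raises for a retryable error, and you would retry only requests that are safe to retry.

```python
import time

MIN_DELAY_MS = 1000   # minimum delay before every retry
BACKOFF_BASE = 2      # additional delay is BACKOFF_BASE ** retry_count ms


class TransientError(Exception):
    """Placeholder for a retryable failure, such as 503 UNAVAILABLE."""


def retry_delay_ms(retry_count: int) -> int:
    """Return the delay in milliseconds before the given retry."""
    return MIN_DELAY_MS + BACKOFF_BASE ** retry_count


def call_with_backoff(request, max_retries: int = 5, sleep=time.sleep):
    """Call request() and retry on TransientError, up to max_retries times."""
    for retry_count in range(max_retries + 1):
        try:
            return request()
        except TransientError:
            if retry_count == max_retries:
                # Truncation point: stop retrying and let the caller
                # treat the service as unavailable.
                raise
            sleep(retry_delay_ms(retry_count) / 1000)


print([retry_delay_ms(n) for n in range(5)])  # [1001, 1002, 1004, 1008, 1016]
```

Passing the sleep function as a parameter is a design choice that lets tests substitute a recorder for time.sleep. A common refinement, not shown here, is to add random jitter to each delay so that many clients don't retry at the same moment.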

For more information, see the Wikipedia article Exponential backoff.