Troubleshoot issues

This page describes common error scenarios and provides guidance for resolving them.

Replication scenarios

This section explains replication issues that might occur with your instance.

How do you monitor replication lag?

Memorystore for Valkey has the /instance/replication/maximum_offset_diff metric. This metric monitors the maximum replication offset difference (in bytes) for a node in a primary instance.

By keeping the replication offset difference low, replicas can perform incremental sync operations more frequently and at a lower cost than full sync operations.

We recommend that you set a threshold for the maximum_offset_diff metric. If the threshold is exceeded, then Memorystore for Valkey can notify you with an alert.

Based on the node type for your instance, we recommend that you set the threshold as follows:

  • If the node type is shared-core-nano, standard-small, or highmem-medium, then set the threshold to be less than 64 MB.
  • If the node type is highmem-xlarge, then set the threshold to be less than 1 GB.
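The recommendations above can be sketched as a small lookup. This is a minimal illustration, not part of Memorystore for Valkey: the helper name pick_threshold and the fraction parameter are hypothetical, and the sketch assumes that 1 MB means 1024 * 1024 bytes.

```python
# Assumption: MB and GB are binary units (1 MB = 1024 * 1024 bytes).
MB = 1024 * 1024
GB = 1024 * MB

# Upper bounds for the maximum_offset_diff threshold, per node type,
# taken from the recommendations in this section.
THRESHOLD_CEILING_BYTES = {
    "shared-core-nano": 64 * MB,
    "standard-small": 64 * MB,
    "highmem-medium": 64 * MB,
    "highmem-xlarge": 1 * GB,
}


def pick_threshold(node_type: str, fraction: float = 0.75) -> int:
    """Return an alert threshold in bytes that stays below the ceiling.

    `fraction` is a hypothetical tuning knob: the share of the ceiling
    to use as the threshold. It must be strictly between 0 and 1.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1 (exclusive)")
    ceiling = THRESHOLD_CEILING_BYTES[node_type]  # KeyError for unknown types
    return int(ceiling * fraction)
```

For example, pick_threshold("highmem-xlarge", 0.5) yields half of the 1 GB ceiling, in bytes.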

What do you do if there's a replication lag between your primary instance and its replicas?

There might be a significant replication lag if the primary instance has too many write operations, and the replicas can't catch up to replicate these operations. To resolve this issue, we recommend that you scale the capacity of the instance by increasing the number of shards for the instance.

Memory management scenarios

This section explains memory management issues that your instance might encounter.

Which metric can you use to determine that your instance is under memory stress?

To monitor the memory usage for a Memorystore for Valkey instance, we recommend that you view the /instance/memory/maximum_utilization metric. If the memory usage of the instance approaches 80% and you expect data usage to grow, then scale up the size of the instance to improve performance and to make room for new data.

Monitoring scenarios

This section explains monitoring issues that your instance might encounter.

How do you set up alerts for Memorystore for Valkey?

You can use Cloud Monitoring to set alerts to notify you if any metrics exceed thresholds that you set for your instance. For more information about setting alerts in Cloud Monitoring, see Set a Monitoring alert for memory usage.
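As a rough illustration, the following sketch builds an alert policy document in the shape that the Cloud Monitoring API's AlertPolicy resource uses. The metric prefix and the monitored resource type are assumptions that you should verify in Metrics Explorer, the helper name build_alert_policy is hypothetical, and the 0.8 threshold assumes that the memory metric reports a fraction between 0.0 and 1.0.

```python
import json

# Assumptions: verify both values in Metrics Explorer before use.
METRIC_PREFIX = "memorystore.googleapis.com"
RESOURCE_TYPE = "memorystore.googleapis.com/Instance"


def build_alert_policy(metric_path, threshold, display_name, duration_s=300):
    """Build an AlertPolicy-shaped dict with one threshold condition.

    metric_path is a metric name from this page, such as
    "/instance/memory/maximum_utilization".
    """
    metric_filter = (
        f'metric.type = "{METRIC_PREFIX}{metric_path}" '
        f'AND resource.type = "{RESOURCE_TYPE}"'
    )
    condition = {
        "displayName": f"{metric_path} above {threshold}",
        "conditionThreshold": {
            "filter": metric_filter,
            "comparison": "COMPARISON_GT",
            "thresholdValue": threshold,
            # The condition must hold for this long before it triggers.
            "duration": f"{duration_s}s",
        },
    }
    return {
        "displayName": display_name,
        "combiner": "OR",
        "conditions": [condition],
    }


policy = build_alert_policy(
    "/instance/memory/maximum_utilization", 0.8, "Valkey memory above 80%"
)
policy_json = json.dumps(policy, indent=2)
```

The resulting JSON does not include notification channels. You would add those before you create the policy.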

Connection management scenarios

This section explains connection management issues that your instance might encounter.

If you reach your connection limit or receive a connection timeout, then what do you do?

When you reach your connection limit, your client fails to connect to your server. This is known as a connection rejection.

If this happens, then do the following:

Timeout scenarios

This section explains timeout issues that your instance might encounter.

If you receive an I/O timeout, then what do you do?

When a read or write operation in Memorystore for Valkey fails to complete within a specified time, then an I/O timeout occurs. This timeout might happen because of various reasons. For example, one or more nodes of your instance might be overloaded.

If you receive an I/O timeout, then do the following:

  • Use the /instance/cpu/maximum_utilization metric to determine the CPU utilization for a node in your instance, from 0.0 (0%) to 1.0 (100%). We recommend that all nodes have a CPU utilization of less than 80%. For more information, see CPU usage best practices.
  • When the client disconnects from the server because the server times out, retry with exponential backoff and jitter. This helps to prevent multiple clients from overloading the server simultaneously.
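The retry guidance above can be sketched as follows. This is a minimal, client-agnostic illustration of exponential backoff with full jitter. The function name, the default delays, and the exception types are assumptions, so substitute the timeout and connection exceptions that your Valkey client library raises.

```python
import random
import time


def retry_with_backoff(
    operation,
    max_attempts=5,
    base_delay=0.1,
    max_delay=5.0,
    retry_on=(TimeoutError, ConnectionError),
    sleep=time.sleep,
):
    """Call operation() and retry on the given exceptions.

    The delay ceiling doubles after each failed attempt, up to max_delay.
    The actual delay is drawn uniformly from [0, ceiling] (full jitter),
    so that clients that fail together don't retry together.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The sleep parameter exists only so that the delay can be replaced in tests. In an application, you would wrap a single read or write call in operation.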

Connectivity error scenarios

This section explains connectivity issues that your instance might encounter.

Connection error caused by firewall rules

Firewall rules can cause connection errors by blocking the ports used by Memorystore for Valkey. You should allow list all ports for both of your instance's Private Service Connect endpoints. For more information about the endpoints, see Reserved network addresses.
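To check whether an endpoint is reachable from a client VM, you can attempt a plain TCP connection. The following is a minimal sketch that uses only the Python standard library. The function name is hypothetical, and you supply the endpoint address and port for your own instance. A failed probe shows that the port is unreachable, but it doesn't prove that a firewall rule is the cause.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError on refusal, timeout,
        # or an unreachable network.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run the probe from the same network as your application, so that it's subject to the same firewall rules.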

Connection error caused by organization policies

You can have an organization policy that blocks your Private Service Connect connections to your Memorystore for Valkey instance.

If your organization policy uses the compute.restrictPrivateServiceConnectProducer policy, then allow list the 961333125034 folder number, which is a folder specifically for Memorystore for Valkey. For example:

name: organizations/Consumer-org-1/policies/compute.restrictPrivateServiceConnectProducer
spec:
    rules:
      - values:
          allowedValues:
          - under:folders/961333125034

If your organization policy uses the compute.disablePrivateServiceConnectCreationForConsumers policy, then allow list SERVICE_PRODUCERS. For example:

name: organizations/Consumer-org-1/policies/compute.disablePrivateServiceConnectCreationForConsumers
spec:
    rules:
      - values:
          allowedValues:
          - SERVICE_PRODUCERS

Handling errors for Cluster Mode Disabled instances

  • If the application connects to the read endpoint of an instance that has no read replicas, then the connection closes and the ERR no replicas found error message appears. In this case, either connect the application to the primary endpoint or add read replicas to the instance.

  • In the event of a failover, the existing connections from your application close and the ERR role change occurred error message appears. You also see this error message if your application connects to the read endpoint of an instance and all read replicas of the instance are failing. In this case, the application should retry the connection with exponential backoff.
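The two cases above can be sketched as a small dispatch on the error text. This is an illustration only: the function name and the returned action labels are hypothetical, and the sketch assumes that your client library exposes the server's error message as a string.

```python
def next_action(error_message: str) -> str:
    """Map an error message from this section to a suggested action."""
    if "ERR no replicas found" in error_message:
        # The read endpoint has no replicas behind it.
        return "connect_to_primary_or_add_replicas"
    if "ERR role change occurred" in error_message:
        # A failover happened, or all read replicas are failing.
        return "retry_with_exponential_backoff"
    # Not one of the errors covered in this section.
    return "unhandled"
```

An application might call next_action in its connection error handler and then reconnect or retry accordingly.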