This page explains various error scenarios and provides guidance for resolving them.
Replication scenarios
This section explains replication issues that might occur with your instance.
How do you monitor replication lags?
Memorystore for Valkey has the /instance/replication/maximum_offset_diff metric. This metric monitors the maximum replication offset difference (in bytes) for a node in a primary instance.
By keeping the replication offset difference low, replicas can perform incremental sync operations more frequently and at a lower cost than full sync operations.
We recommend that you set a threshold for the maximum_offset_diff metric. If the threshold is exceeded, then Memorystore for Valkey can notify you with an alert.
Based on the node type for your instance, we recommend that you set the threshold as follows:
- If the node type is shared-core-nano, standard-small, or highmem-medium, then set the threshold to be less than 64 MB.
- If the node type is highmem-xlarge, then set the threshold to be less than 1 GB.
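The recommendations above can be captured in a small lookup, which may be handy when you generate alert thresholds for several instances. This is a minimal sketch, not an official API: the function names are hypothetical, and it assumes that "MB" and "GB" are read as binary units (MiB and GiB). Adjust the constants if you interpret them as decimal units.

```python
# Recommended upper bounds for the maximum_offset_diff alert threshold, in bytes.
# Assumption: "MB"/"GB" are treated as binary units (MiB/GiB).
MIB = 1024 * 1024
GIB = 1024 * MIB

THRESHOLD_UPPER_BOUND_BYTES = {
    "shared-core-nano": 64 * MIB,
    "standard-small": 64 * MIB,
    "highmem-medium": 64 * MIB,
    "highmem-xlarge": 1 * GIB,
}


def threshold_upper_bound(node_type: str) -> int:
    """Return the value that the alert threshold should stay below."""
    try:
        return THRESHOLD_UPPER_BOUND_BYTES[node_type]
    except KeyError:
        raise ValueError(f"no recommendation for node type {node_type!r}") from None


def is_valid_threshold(node_type: str, threshold_bytes: int) -> bool:
    """Check that a chosen threshold is positive and below the upper bound."""
    return 0 < threshold_bytes < threshold_upper_bound(node_type)
```

For example, `is_valid_threshold("standard-small", 32 * MIB)` accepts a 32 MiB threshold, while the same threshold at exactly 64 MiB is rejected because the recommendation is "less than".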
What do you do if there's a replication lag between your primary instance and its replicas?
There might be a significant replication lag if the primary instance has too many write operations, and the replicas can't catch up to replicate these operations. To resolve this issue, we recommend that you scale the capacity of the instance by increasing the number of shards for the instance.
Memory management scenarios
This section explains memory management issues that your instance might encounter.
Which metric can you use to determine that your instance is under memory stress?
To monitor the memory usage for a Memorystore for Valkey
instance, we recommend that you view the /instance/memory/maximum_utilization
metric. If the memory usage of the instance approaches 80% and you expect data
usage to grow, then scale up the size of the instance to improve performance and to make
room for new data.
Monitoring scenarios
This section explains monitoring issues that your instance might encounter.
How do you set up alerts for Memorystore for Valkey?
You can use Cloud Monitoring to set alerts to notify you if any metrics exceed thresholds that you set for your instance. For more information about setting alerts in Cloud Monitoring, see Set a Monitoring alert for memory usage.
Connection management scenarios
This section explains connection management issues that your instance might encounter.
If you reach your connection limit or receive a connection timeout, then what do you do?
When you reach your connection limit, your client fails to connect to your server. This is known as a connection rejection.
If this happens, then do the following:
- Use the /instance/node/stats/rejected_connections_count metric to determine the number of connections that Memorystore for Valkey rejects because the instance node reaches the maximum clients limit.
- Use the /instance/node/clients/connected_clients metric to determine the number of clients connected to the instance node. This way, you can see if all of the nodes in the instance are under the limit.
- Stop any leaked or undesired connections by using the client kill command.
- Reduce the connection count or pool size in the client application. For more information, see the documentation associated with the client application.
- Adjust the maximum clients limit. For more information, see Configure an instance.
- Scale your instance up to a larger node type so that your instance has a higher connection limit.
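To decide which connections to stop, one option is to inspect the output of the Valkey CLIENT LIST command and pick out connections that have been idle for a long time. The following is a rough sketch under stated assumptions: it assumes that CLIENT LIST is available on your instance, that you have already captured its raw text output, and that an idle time above a cutoff you choose indicates a leaked connection. The helper names are hypothetical.

```python
def parse_client_list(raw: str) -> list[dict[str, str]]:
    """Parse CLIENT LIST output: one client per line, space-separated key=value pairs."""
    clients = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = {}
        for token in line.split(" "):
            key, sep, value = token.partition("=")
            if sep:  # skip tokens that are not key=value pairs
                fields[key] = value
        clients.append(fields)
    return clients


def idle_client_ids(raw: str, min_idle_seconds: int) -> list[int]:
    """Return the ids of clients whose idle time (seconds) meets the cutoff."""
    ids = []
    for client in parse_client_list(raw):
        if "id" in client and int(client.get("idle", "0")) >= min_idle_seconds:
            ids.append(int(client["id"]))
    return ids
```

Each returned id could then be passed to `CLIENT KILL ID <id>`. Review the candidates before you kill them, because long idle times are normal for some pooled or pub/sub connections.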
Timeout scenarios
This section explains timeout issues that your instance might encounter.
If you receive an I/O timeout, then what do you do?
When a read or write operation in Memorystore for Valkey fails to complete within a specified time, an I/O timeout occurs. This timeout can happen for various reasons. For example, one or more nodes of your instance might be overloaded.
If you receive an I/O timeout, then do the following:
- Use the instance/cpu/maximum_utilization metric to determine the CPU utilization for a node in your instance, from 0.0 (0%) to 1.0 (100%). We recommend that all nodes have a CPU utilization percentage of less than 80%. For more information, see CPU usage best practices.
- When the client disconnects from the server because the server times out, retry with exponential backoff and jitter. This helps to avoid multiple clients overloading the server simultaneously.
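Retrying with exponential backoff and jitter can be sketched as follows. This is an illustrative example, not the behavior of any particular client library: the function names, the default delays, and the choice of "full jitter" (a random delay between zero and the exponential ceiling) are assumptions, and the exception types to retry on depend on the client that you use.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def retry_with_backoff(operation, max_attempts=5, base=0.1, cap=5.0,
                       retry_on=(TimeoutError, ConnectionError),
                       sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Randomized wait so that many clients do not retry in lockstep.
            sleep(backoff_delay(attempt, base, cap))
```

The `sleep` parameter exists so that the waiting behavior can be replaced in tests. In an application, you would pass a function that performs the read or write and list your client library's timeout and connection exception types in `retry_on`.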
Connectivity error scenarios
This section explains connectivity issues your instance might encounter.
Connection error caused by firewall rules
Firewall rules can cause connection errors by blocking the ports used by Memorystore for Valkey. You should allowlist all ports for both of your instance's Private Service Connect endpoints. For more information about the endpoints, see Reserved network addresses.
Connection error caused by organization policies
You can have an organization policy that blocks your Private Service Connect connections to your Memorystore for Valkey instance.
If your organization policy uses the compute.restrictPrivateServiceConnectProducer policy, then allowlist the 961333125034 folder number, which is a folder specifically for Memorystore for Valkey. For example:

    name: organizations/Consumer-org-1/policies/compute.restrictPrivateServiceConnectProducer
    spec:
      rules:
      - values:
          allowedValues:
          - under:folders/961333125034

If your organization policy uses the compute.disablePrivateServiceConnectCreationForConsumers policy, then allowlist SERVICE_PRODUCERS. For example:

    name: organizations/Consumer-org-1/policies/compute.disablePrivateServiceConnectCreationForConsumers
    spec:
      rules:
      - values:
          allowedValues:
          - SERVICE_PRODUCERS

Handling errors for Cluster Mode Disabled instances
If the application connects to the read endpoint of an instance that has no read replicas, then the connection closes and the ERR no replicas found error message appears. In this case, either connect the application to the primary endpoint or add read replicas to the instance.
In the event of a failover, the existing connections from your application close and the ERR role change occurred error message appears. You also see this error message if your application connects to the read endpoint of an instance and all read replicas of the instance are failing. In this case, the application should retry the connection with exponential backoff.
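An application can map these two error messages to the recommended responses. The following is a minimal sketch: the enum and function names are hypothetical, and it assumes that your client library surfaces the server's error text as a string (possibly wrapped in a longer message), which you should confirm for the client that you use.

```python
from enum import Enum


class Action(Enum):
    USE_PRIMARY_OR_ADD_REPLICAS = "connect to the primary endpoint or add read replicas"
    RETRY_WITH_BACKOFF = "retry the connection with exponential backoff"
    UNKNOWN = "no specific guidance"


def action_for_error(message: str) -> Action:
    """Map a server error message to the response recommended above."""
    text = message.lower()
    if "err no replicas found" in text:
        return Action.USE_PRIMARY_OR_ADD_REPLICAS
    if "err role change occurred" in text:
        return Action.RETRY_WITH_BACKOFF
    return Action.UNKNOWN
```

Substring matching is used here instead of an exact comparison so that messages wrapped by a client library are still recognized.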