General best practices

This page provides guidance on optimally using Memorystore for Redis Cluster. This page also points out potential issues to avoid.

Memory management best practices

This section describes strategies for managing instance memory so Memorystore for Redis Cluster works efficiently for your application.

Memory management concepts

  • Write load - The volume and speed at which you add or update keys on your Redis cluster. Your write load can range from normal to very high depending on your Redis use case and application usage patterns.

  • Eviction policy - Memorystore for Redis Cluster uses the volatile-lru eviction policy. Under this policy, only keys that have an expiration set are candidates for eviction. You can use commands such as EXPIRE to set an expiration on keys.

Monitoring a cluster that has a normal write load

View the /cluster/memory/maximum_utilization metric. If /cluster/memory/maximum_utilization is at 100% or lower, your Redis cluster performs well when you use a normal write load.

However, if your memory usage approaches 100% and you expect data usage to grow, you should scale up your cluster size to make room for new data.

Monitoring a cluster that has a high write load

View the /cluster/memory/maximum_utilization metric. Depending on the severity of your high write load, your cluster can experience performance issues at the following thresholds:

  • Very high write loads can experience issues if /cluster/memory/maximum_utilization reaches 65% or higher.

  • Moderately high write loads can experience issues if /cluster/memory/maximum_utilization reaches 85% or higher.

In these scenarios you should scale up your cluster size to improve performance.

If you run into issues, or are concerned your instance has a high write load, reach out to Google Cloud Support.
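As a rough illustration, the thresholds above can be encoded in a small check. This is a minimal sketch: the function name, the write load labels, and the assumption that utilization is passed as a fraction between 0.0 and 1.0 are all illustrative and not part of any Memorystore API.

```python
# Thresholds from the guidance above. The dictionary keys are
# hypothetical labels chosen for this sketch.
HIGH_WRITE_LOAD_THRESHOLDS = {
    "moderately_high": 0.85,
    "very_high": 0.65,
}


def should_scale_up(max_utilization: float, write_load: str) -> bool:
    """Return True when utilization has reached the threshold for the load.

    max_utilization is assumed to be a fraction (0.70 means 70%).
    """
    return max_utilization >= HIGH_WRITE_LOAD_THRESHOLDS[write_load]
```

For example, a cluster at 70% utilization would be flagged under a very high write load but not under a moderately high one.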

Scaling shards

When you scale the number of shards in an instance, you should scale during periods of low writes. Scaling during periods of high write load can put memory pressure on your instance due to memory overhead caused by replication or slot migration.

If your Redis use case uses key evictions, scaling to a smaller cluster size can reduce your cache hit ratio. In this circumstance, however, you don't need to worry about losing data, since key eviction is expected.

For Redis use cases where you don't want to lose keys, you should only scale down to a smaller cluster that still has enough room for your data. Your new target shard count should allow for at least 1.5 times the memory used by data. In other words, you should provision enough shards for 1.5 times the amount of data currently in your cluster. You can use the /cluster/memory/total_used_memory metric to see how much data is stored in your instance.
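The 1.5 times rule can be worked through with simple arithmetic. The sketch below assumes you know the usable memory per shard for your node type; that value, along with the function and parameter names, is hypothetical and should be replaced with the figures for your own cluster.

```python
import math


def min_shards_for_data(used_bytes: int, shard_capacity_bytes: int,
                        headroom: float = 1.5) -> int:
    """Smallest shard count that holds `headroom` times the current data.

    used_bytes would come from /cluster/memory/total_used_memory.
    shard_capacity_bytes is an assumed per-shard capacity, not a value
    taken from the documentation.
    """
    if shard_capacity_bytes <= 0:
        raise ValueError("shard_capacity_bytes must be positive")
    required_bytes = used_bytes * headroom
    return max(1, math.ceil(required_bytes / shard_capacity_bytes))


GIB = 2 ** 30
# 40 GiB of data needs room for 60 GiB, which is 6 shards at an
# assumed 10 GiB per shard.
target_shards = min_shards_for_data(40 * GIB, 10 * GIB)
```

If the current shard count is already at or below this value, scaling down further risks losing keys.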

CPU usage best practices

An unexpected zonal outage reduces the CPU resources available to your cluster, because capacity from nodes in the unavailable zone is lost. We recommend using highly available clusters. Using two replicas per shard (as opposed to one replica per shard) provides additional CPU resources during an outage. Additionally, we recommend managing node CPU usage so nodes have enough CPU headroom to handle additional traffic from lost capacity if an unexpected zonal outage happens. You should monitor CPU usage for primaries and replicas using the Main Thread CPU Seconds /cluster/cpu/maximum_utilization metric.

Depending on how many replicas you provision per node, we recommend the following /cluster/cpu/maximum_utilization CPU usage targets:

  • For instances with 1 replica per node, you should target a /cluster/cpu/maximum_utilization value of 0.5 seconds for the primary and the replica.
  • For instances with 2 replicas per node, you should target a /cluster/cpu/maximum_utilization value of 0.9 seconds for the primary and 0.5 seconds for the replicas.

If values for the metric exceed these recommendations, we recommend scaling up the number of shards or replicas in your instance.
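The targets above can be expressed as a lookup table for use in an alerting script. This is only a sketch: the table layout, role names, and function name are assumptions made for illustration, and the values are the ones listed above.

```python
# CPU targets in seconds, keyed by replica count and then by node role.
CPU_TARGETS = {
    1: {"primary": 0.5, "replica": 0.5},
    2: {"primary": 0.9, "replica": 0.5},
}


def exceeds_cpu_target(replica_count: int, role: str,
                       cpu_seconds: float) -> bool:
    """Return True when the observed value is above the recommended target."""
    return cpu_seconds > CPU_TARGETS[replica_count][role]
```

A primary reporting 0.8 seconds is within the target when there are two replicas, but above it when there is only one.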

CPU expensive Redis commands

You should avoid using the following expensive Redis commands, especially for clusters with large key counts and collection sizes:

The main risks of using expensive commands are:

  • High latency and client timeouts.
  • Memory pressure caused by commands that increase memory usage.
  • Data loss during node replication and synchronization, due to blocking the Redis main thread.
  • Starved health checks, observability, and replication.

Redis client best practices

Your application must use a cluster-aware Redis client when connecting to a Memorystore for Redis Cluster instance. For examples of cluster-aware clients and sample configurations, see Client library code samples. Your client must maintain a map of hash slots to the corresponding nodes in the cluster in order to send requests to the right nodes and avoid the performance overhead caused by cluster redirections.

Client mapping

Clients must obtain a complete list of slots and the mapped nodes in the following situations:

  • When the client is initialized, it must populate the initial slot-to-node mapping.

  • When a MOVED redirection is received from the server, such as in the situation of a failover when all slots served by the former primary node are taken over by the replica, or re-sharding when slots are being moved from the source primary to the target primary node.

  • When a CLUSTERDOWN error is received from the server or connections to a particular server run into timeouts persistently.

  • When a READONLY error is received from the server. This can happen when a primary is demoted to replica.

  • Additionally, clients should periodically refresh the topology so that they stay current and learn about changes that may not result in redirections or errors from the server, such as when new replica nodes are added. Any stale connections should also be closed as part of the topology refresh to reduce the need to handle failed connections while commands are running.
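Most cluster-aware client libraries handle these cases internally. For readers writing their own handling, the error-driven triggers listed above can be sketched as a check on the first word of the server's error reply. The function name is hypothetical, and the sketch does not cover persistent timeouts or periodic refreshes, which need their own tracking.

```python
# Error reply prefixes that, per the list above, should trigger a full
# refresh of the slot-to-node mapping.
REFRESH_ERROR_PREFIXES = ("MOVED", "CLUSTERDOWN", "READONLY")


def needs_topology_refresh(error_message: str) -> bool:
    """Return True if the error reply's first word is a refresh trigger."""
    parts = error_message.split(None, 1)
    return bool(parts) and parts[0] in REFRESH_ERROR_PREFIXES
```

An ASK redirection is deliberately excluded here, because it applies to a single request during slot migration rather than signalling a completed topology change.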

Client discovery

Client discovery is usually done by issuing a CLUSTER SLOTS, CLUSTER NODES, or CLUSTER SHARDS command to the Redis server. We recommend using the CLUSTER SHARDS command. CLUSTER SHARDS replaces the deprecated CLUSTER SLOTS command by providing a more efficient and extensible representation of the cluster.

The size of the response for the cluster client discovery commands can vary based on the cluster size and topology. Larger clusters with more nodes produce a larger response. As a result, it's important to ensure that the number of clients doing the cluster topology discovery doesn't grow unbounded.

These topology refreshes are expensive on the Redis server but are also important for application availability. Therefore, it is important to ensure that each client makes only a single discovery request at any given time (and caches the result in memory), and that the number of clients making requests is kept bounded to avoid overloading the server.
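One way to keep a client to a single in-flight discovery request with an in-memory cache is to guard the refresh with a lock. The following is a minimal sketch, not the behavior of any particular client library: the class name, the discover callback, and the min_interval_s parameter are all hypothetical.

```python
import threading
import time
from typing import Callable, Optional


class TopologyCache:
    """Caches the slot map and allows one discovery request at a time."""

    def __init__(self, discover: Callable[[], dict],
                 min_interval_s: float = 1.0):
        self._discover = discover          # e.g. wraps CLUSTER SHARDS
        self._min_interval_s = min_interval_s
        self._lock = threading.Lock()
        self._slots: Optional[dict] = None
        self._fetched_at = 0.0

    def refresh(self) -> dict:
        # Callers that arrive while a discovery is running wait on the
        # lock, then reuse the fresh result instead of issuing their own.
        with self._lock:
            age = time.monotonic() - self._fetched_at
            if self._slots is None or age >= self._min_interval_s:
                self._slots = self._discover()
                self._fetched_at = time.monotonic()
            return self._slots
```

With this shape, a burst of MOVED errors across many threads results in one discovery call rather than one per thread.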

For example, when the client application starts up or loses its connection to the server and must perform cluster discovery, a common mistake is to make several reconnection and discovery requests without exponential backoff between retries. This can cause very high CPU utilization and render the Redis server unresponsive for a prolonged period of time.

Avoid discovery overload on Redis

To mitigate the impact caused by a sudden influx of connection and discovery requests, we recommend the following:

  • Implement a client connection pool with a finite and small size to bound the number of concurrent incoming connections from the client application.

  • When the client disconnects from the server due to timeout, retry with exponential backoff with jitter. This helps to avoid multiple clients overwhelming the server at the same time.

  • Use the Memorystore for Redis Cluster discovery endpoint to perform cluster discovery. The discovery endpoint is highly available and is load balanced across all the nodes in the cluster. Moreover, the discovery endpoint attempts to route the cluster discovery requests to nodes with the most up-to-date topology view.
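The retry recommendation above can be sketched as follows. This uses the "full jitter" variant of exponential backoff, which is one common choice among several. The function names, the base and cap values, and the connect callback are assumptions for illustration only.

```python
import random
import time


def backoff_delay(attempt: int, base_s: float = 0.1,
                  cap_s: float = 10.0) -> float:
    """Full jitter: a random delay in [0, min(cap_s, base_s * 2**attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def connect_with_backoff(connect, max_attempts: int = 5, sleep=time.sleep):
    """Call connect() until it succeeds, sleeping between failed attempts.

    connect is a hypothetical callable that raises ConnectionError
    (which includes timeouts surfaced as such) on failure.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(backoff_delay(attempt))
```

Because each client draws its own random delay, clients that disconnected at the same moment are unlikely to reconnect at the same moment.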

Persistence best practices

This section explains best practices for persistence.

RDB persistence

For best results when backing up your instance with RDB snapshots, follow these best practices:

Memory management

RDB snapshots use a process fork and 'copy-on-write' mechanism to take a snapshot of node data. Depending on the pattern of writes to nodes, the used memory of the nodes grows as pages touched by the writes are copied. In the worst case, the memory footprint can be double the size of the data in the node.

To ensure nodes have sufficient memory to complete the snapshot, you should keep or set maxmemory at 80% of the node capacity so that 20% is reserved for overhead. See Monitoring a cluster that has a high write load to learn more. This memory overhead, together with Monitoring snapshots, helps you manage your workload so that snapshots complete successfully.
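As a quick worked example of the 80% guideline, the calculation below uses integer arithmetic to avoid rounding surprises. The function name and the 10 GiB node capacity are illustrative assumptions, not values from the documentation.

```python
def maxmemory_target(node_capacity_bytes: int,
                     reserved_percent: int = 20) -> int:
    """Return the maxmemory value that leaves reserved_percent as overhead."""
    return node_capacity_bytes * (100 - reserved_percent) // 100


GIB = 2 ** 30
# An assumed 10 GiB node keeps 8 GiB for data and 2 GiB for overhead.
target = maxmemory_target(10 * GIB)
```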

Stale snapshots

Recovering nodes from a stale snapshot can cause performance issues for your application as it tries to reconcile a significant amount of stale keys or other changes to your database such as a schema change. If you are concerned about recovering from a stale snapshot, you can disable the RDB persistence feature. Once you re-enable persistence, a snapshot is taken at the next scheduled snapshot interval.

Performance impact of RDB snapshots

Depending on your workload pattern, RDB snapshots can impact the performance of the instance and increase latency for your applications. If you are comfortable with less frequent snapshots, you can minimize the performance impact by scheduling them to run during periods of low instance traffic.

For example, if your instance has low traffic from 1 AM to 4 AM, you can set the start time to 3 AM and set the interval to 24 hours.

If your system has a constant load and requires frequent snapshots, you should carefully evaluate the performance impact, and weigh the benefits of using RDB snapshots for the workload.

Choose a single-zone instance if your instance doesn't use replicas

When configuring an instance without replicas, we recommend a single-zone architecture for improved reliability. Here's why:

Minimize outage impact

Zonal outages are less likely to impact your instance. By placing all nodes within a single zone, the chance that a given zonal outage affects your instance drops from 100% to 33%. Assuming nodes would otherwise be spread across three zones, an outage in any one of those zones is certain to take some nodes offline. With a single-zone instance, only an outage in the instance's own zone affects it, which is a one-in-three chance.

Rapid recovery

Should a zonal outage occur, recovery is streamlined. You can respond by quickly provisioning a new instance in a functioning zone and redirecting your application for minimally interrupted operations.