Debug hotspots in your database

This page describes how to detect and debug hotspots in your database. You can access statistics about hotspots in splits with both GoogleSQL and PostgreSQL.

Spanner stores your data as a contiguous key space, ordered by the primary keys of your tables and indexes. A split is a range of rows from a set of tables or an index. The split's start is called the split start. The split limit sets the end of the split. The split includes the split start, but not the split limit.

In Spanner, hotspots are situations where too many requests are sent to the same server which saturates the resources of the server and potentially causes high latencies. The splits affected by hotspots are known as hot or warm splits.

A split's hotspot statistic (identified in the system as CPU_USAGE_SCORE) is a measurement of the load on a split that's constrained by the resources available on the server. This measurement is given as a percentage. If more than 50% of the load on a split is constrained by the available resources, then the split is considered warm. If 100% of the load on a split is constrained, then the split is considered hot.

Spanner uses load-based splitting to evenly distribute the data load across the instance's servers. The warm and hot splits can be moved across servers for load balancing or can be broken into smaller splits. However, Spanner might not be able to balance the load, even after multiple attempts at splitting, due to anti-patterns in the application. Hence, persistent hotspots that last at least 10 minutes might need further troubleshooting and potential application changes.

The Spanner hot split statistics help you identify the splits where hotspots occur. You can then make changes to your application or schema, as needed. You can retrieve these statistics from the SPANNER_SYS.SPLIT_STATS_TOP_MINUTE system tables using SQL statements.

Availability of hot split statistics

Spanner provides the hot split statistics in the SPANNER_SYS schema. SPANNER_SYS data is available only through GoogleSQL and PostgreSQL interfaces. You can use the following ways to access this data:

Spanner single read APIs don't support SPANNER_SYS.

Hot split statistics

You use the following table to track hot splits:

  • SPANNER_SYS.SPLIT_STATS_TOP_MINUTE: shows splits that are hot during 1-minute intervals.

These tables have the following properties:

  • Each table contains data for non-overlapping time intervals of the duration the table name specifies.
  • Intervals are based on clock times:

    • 1-minute intervals end on the minute.
  • After each interval, Spanner collects data from all servers and then makes the data available in the SPANNER_SYS tables shortly thereafter.

    For example, at 11:59:30 AM, the most recent intervals available to SQL queries are:

    • 1 minute: 11:58:00-11:58:59 AM
  • Spanner groups the statistics by splits.

  • Each row contains a percentage that indicates how hot or warm a split is, for each split that Spanner captures statistics for during the specified interval.

  • If less than 50% of the load on a split is constrained by the available resources, then Spanner doesn't capture the statistic. If Spanner is unable to store all the hot splits during the interval, the system prioritizes the splits with the highest CPU_USAGE_SCORE percentage during the specified interval. If there are no splits returned, it's an indication of the absence of any hotspots.

Table schema

The following table shows the table schema for the following stats:

  • SPANNER_SYS.SPLIT_STATS_TOP_MINUTE
Column name Type Description
INTERVAL_END TIMESTAMP End of the time interval during which the split was hot
SPLIT_START STRING The starting key of the range of rows in the split. The split start might also be <begin>, indicating the beginning of the key space
SPLIT_LIMIT STRING The limit key for the range of rows in the split. The limit: key might also be <end>, indicating the end of the key space|
CPU_USAGE_SCORE INT64 The CPU_USAGE_SCORE percentage of the splits. A CPU_USAGE_SCORE percentage of 50% indicates the presence of warm or hot | splits |
AFFECTED_TABLES STRING ARRAY The tables whose rows might be in the split

What are split start and split limit keys?

A split is a contiguous row range of a database, and is defined by its start and limit keys. A split can be a single row, a narrow row range, or a wide row range, and the split can include multiple tables or indexes.

The SPLIT_START and SPLIT_LIMIT columns identify the primary keys of a warm or hot split.

Example schema

The following schema is an example table for the topics in this page.

GoogleSQL

CREATE TABLE Users (
  UserId INT64 NOT NULL,
  FirstName STRING(MAX),
  LastName STRING(MAX),
) PRIMARY KEY(UserId);

CREATE INDEX UsersByFirstName ON Users(FirstName DESC);

CREATE TABLE Threads (
  UserId INT64 NOT NULL,
  ThreadId INT64 NOT NULL,
  Starred BOOL,
) PRIMARY KEY(UserId, ThreadId),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;

CREATE TABLE Messages (
  UserId INT64 NOT NULL,
  ThreadId INT64 NOT NULL,
  MessageId INT64 NOT NULL,
  Subject STRING(MAX),
  Body STRING(MAX),
) PRIMARY KEY(UserId, ThreadId, MessageId),
  INTERLEAVE IN PARENT Threads ON DELETE CASCADE;

CREATE INDEX MessagesIdx ON Messages(UserId, ThreadId, Subject),
INTERLEAVE IN Threads;

PostgreSQL

CREATE TABLE users
(
   userid    BIGINT NOT NULL PRIMARY KEY,-- INT64 to BIGINT
   firstname VARCHAR(max),-- STRING(MAX) to VARCHAR(MAX)
   lastname  VARCHAR(max)
);

CREATE INDEX usersbyfirstname
  ON users(firstname DESC);

CREATE TABLE threads
  (
    userid   BIGINT NOT NULL,
    threadid BIGINT NOT NULL,
    starred  BOOLEAN, -- BOOL to BOOLEAN
    PRIMARY KEY (userid, threadid),
    CONSTRAINT fk_threads_user FOREIGN KEY (userid) REFERENCES users(userid) ON
    DELETE CASCADE -- Interleave to Foreign Key constraint
  );

CREATE TABLE messages
  (
    userid    BIGINT NOT NULL,
    threadid  BIGINT NOT NULL,
    messageid BIGINT NOT NULL PRIMARY KEY,
    subject   VARCHAR(max),
    body      VARCHAR(max),
    CONSTRAINT fk_messages_thread FOREIGN KEY (userid, threadid) REFERENCES
    threads(userid, threadid) ON DELETE CASCADE
  -- Interleave to Foreign Key constraint
  );

CREATE INDEX messagesidx ON messages(userid, threadid, subject), REFERENCES
threads(userid, threadid);

Imagine your key space looks like this:

PRIMARY KEY
<begin>
Users()
Threads()
Users(2)
Users(3)
Threads(3)
Threads(3,"a")
Messages(3,"a",1)
Messages(3,"a",2)
Threads(3, "aa")
Users(9)
Users(10)
Threads(10)
UsersByFirstName("abc")
UsersByFirstName("abcd")
<end>

Example of splits

The following shows some example splits to help you understand what splits look like.

The SPLIT_START and SPLIT_LIMIT might indicate the row of a table or index, or they can be <begin> and <end>, representing the boundaries of the key space of the database. The SPLIT_START and SPLIT_LIMIT might also contain truncated keys, which are keys preceding any full key in the table. For example, Threads(10) is a prefix for any Threads row interleaved in Users(10).

SPLIT_START SPLIT_LIMIT AFFECTED_TABLES EXPLANATION
Users(3) Users(10) UsersByFirstName, Users, Threads, Messages, MessagesIdx Split starts at row with UserId=3 and ends at the row before the row with UserId = 10. The split contains the Users table rows and all its interleaved tables rows for UserId=3 to 10.
Messages(3,"a",1) Threads(3,"aa") Threads, Messages, MessagesIdx The split starts at the row with UserId=3, ThreadId="a" and MessageId=1 and ends at the row preceding the row with the key of UserId=3 and ThreadsId = "aa". The split contains all the tables between Messages(3,"a",1) and Threads(3,"aa"). As the split_start and split_limit are interleaved in the same top-level table row, the split contains the interleaved tables rows between the start and limit. See schemas-overview to understand how interleaved tables are co-located.
Messages(3,"a",1) <end> UsersByFirstName, Users, Threads, Messages, MessagesIdx The split starts in the messages table at the row with key UserId=3, ThreadId="a" and MessageId=1. The split hosts all the rows from the split_start to <end>, the end of the key space of the database. All the rows of the tables following the split_start, like Users(4) are included in the split.
<begin> Users(9) UsersByFirstName, Users, Threads, Messages, MessagesIdx The split starts at <begin>, the beginning of the key space of the database and ends at the row preceding the Users row with UserId=9. So the split has all the table rows preceding Users and all the rows of Users table preceding UserId=9 and the rows of its interleaved tables.
Messages(3,"a",1) Threads(10) UsersByFirstName, Users, Threads, Messages, MessagesIdx Split starts at Messages(3,"a", 1) interleaved in Users(3) and ends at the row preceding Threads(10). Threads(10) is a truncated split key that is a prefix of any key of the Threads table interleaved in Users(10).
Users() <end> UsersByFirstName, Users, Threads, Messages, MessagesIdx The split starts at the truncated split key of Users() which precedes any full key of the Users table. The split extends until the end of the possible key space in the database. The affected_tables hence cover the Users table, its interleaved tables and indexes and all the tables that might appear after users.
Threads(10) UsersByFirstName("abc") UsersByFirstName, Users, Threads, Messages, MessagesIdx The split starts at the Threads row with UserId = 10 and ends at the index, UsersByFirstName at the key preceding "abc".

Example queries to find hot splits

The following example shows a SQL statement that you can use to retrieve the hot split statistics. You can run these SQL statements using the client libraries, gcloud, or the Google Cloud console.

GoogleSQL

SELECT t.split_start,
       t.split_limit,
       t.cpu_usage_score,
       t.affected_tables,
FROM   SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE  t.interval_end =
  (SELECT MAX(interval_end)
  FROM    SPANNER_SYS.SPLIT_STATS_TOP_MINUTE)
ORDER BY  t.cpu_usage_score DESC;

PostgreSQL

SELECT t.split_start,
       t.split_limit,
       t.cpu_usage_score,
       t.affected_tables
FROM   SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE  t.interval_end = (
  SELECT MAX(interval_end)
  FROM   SPANNER_SYS.SPLIT_STATS_TOP_MINUTE
)
ORDER BY t.cpu_usage_score DESC;

The query output looks like the following:

SPLIT_START SPLIT_LIMIT CPU_USAGE_SCORE AFFECTED_TABLES
Users(13) Users(76) 82 Messages,Users,Threads
Users(101) Users(102) 90 Messages,Users,Threads
Threads(10, "a") Threads(10, "aa") 100 Messages,Threads
Messages(631, "abc", 1) Messages(631, "abc", 3) 100 Messages
Threads(12, "zebra") Users(14) 76 Messages,Users,Threads
Users(620) <end> 100 Messages,Users,Threads

Data retention for the hot split statistics

At a minimum, Spanner keeps data for each table for the following time period:

  • SPANNER_SYS.SPLIT_STATS_TOP_MINUTE: Intervals that cover the previous 6 hours.

Troubleshoot hotspots using hot split statistics

This section describes how to detect and troubleshoot hotspots.

Select a time period to investigate

Check the latency metrics for your Spanner database to find the time period when your application experienced high latency and CPU usage. For example, it might show you that an issue started around 10:50 PM on May 18, 2024.

Find persistent hotspotting

As Spanner balances your load with load-based splitting, we recommend that you investigate if hotspotting has continued for more than 10 minutes. You can do so by querying the SPANNER_SYS.SPLIT_STATS_TOP_MINUTE table, as shown in the following example:

GoogleSQL

SELECT Count(DISTINCT t.interval_end)
FROM   SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE  t.utilization >= 50
  AND  t.interval_end >= "interval_end_date_time"
  AND  t.interval_end <= "interval_end_date_time";

Replace interval_end_date_time with the date and time for the interval, using the format 2024-05-18T17:40:00Z.

PostgreSQL

SELECT COUNT(DISTINCT t.interval_end)
FROM   SPLIT_STATS_TOP_MINUTE t
WHERE  t.utilization >= 50
  AND  t.interval_end >= 'interval_end_date_time'::timestamptz
  AND  t.interval_end <= 'interval_end_date_time'::timestamptz;

Replace interval_end_date_time with the date and time for the interval, using the format 2024-05-18T17:40:00Z.

If the previous query result is equal to 10, it means that your database is experiencing hotspotting that might need further debugging.

Find the splits with the highest CPU_USAGE_SCORE level

For this example, we run the following SQL to find the row ranges with the highest CPU_USAGE_SCORE level:

GoogleSQL

SELECT t.split_start,
       t.split_limit,
       t.affected_tables,
       t.cpu_usage_score
FROM   SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE  t.cpu_usage_score >= 50
  AND  t.interval_end = "interval_end_date_time";

Replace interval_end_date_time with the date and time for the interval, using the format 2024-05-18T17:40:00Z.

PostgreSQL

SELECT t.split_start,
       t.split_limit,
       t.affected_tables,
       t.cpu_usage_score
FROM   SPLIT_STATS_TOP_MINUTE t
WHERE  t.cpu_usage_score = 100
  AND  t.interval_end = 'interval_end_date_time'::timestamptz;

Replace interval_end_date_time with the date and time for the interval, using the format 2024-05-18T17:40:00Z.

The previous SQL outputs the following:

SPLIT_START SPLIT_LIMIT CPU_USAGE_SCORE AFFECTED_TABLES
Users(180) <end> 85 Messages,Users,Threads
Users(24) Users(76) 76 Messages,Users,Threads

From this table of results, we can see that hotspots occurred on two splits. Spanner load-based splitting might try to resolve hotspots on these splits. However, it might not be able to do so if there are problematic patterns in the schema or workload. To detect if there are splits that need your intervention, we recommend tracking the splits for at least 10 minutes. For example, the following SQL tracks the first split over the last ten minutes.

GoogleSQL

SELECT t.interval_end,
       t.split_start,
       t.split_limit,
       t.cpu_usage_score
FROM   SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE  t.split_start = "users(180)"
  AND  t.split_limit = "<end>"
  AND  t.interval_end >= "interval_end_date_time"
  AND  t.interval_end <= "interval_end_date_time";

Replace interval_end_date_time with the date and time for the interval, using the format 2024-05-18T17:40:00Z.

PostgreSQL

SELECT t.interval_end,
       t.split_start,
       t.split_limit,
       t.cpu_usage_score
FROM   SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE  t.split_start = 'users(180)'
  AND  t.split_limit = ''
  AND  t.interval_end >= 'interval_end_date_time'::timestamptz
  AND  t.interval_end <= 'interval_end_date_time'::timestamptz;

Replace interval_end_date_time with the date and time for the interval, using the format 2024-05-18T17:40:00Z.

The previous SQL outputs the following:

INTERVAL_END SPLIT_START SPLIT_LIMIT CPU_USAGE_SCORE
2024-05-18T17:46:00Z Users(180) <end> 85
2024-05-18T17:47:00Z Users(180) <end> 85
2024-05-18T17:48:00Z Users(180) <end> 85
2024-05-18T17:49:00Z Users(180) <end> 85
2024-05-18T17:50:00Z Users(180) <end> 85

The split seems to have been hot for the past few minutes. You might observe the split for longer to determine that the Spanner load-based splitting mitigates the hotspot. There might be cases wherein Spanner can't load balance any further.

For example, query the SPANNER_SYS.SPLIT_STATS_TOP_MINUTE table. See the following example scenarios.

GoogleSQL

SELECT t.interval_end,
      t.split_start,
      t.split_limit,
      t.cpu_usage_score
FROM  SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE t.interval_end >= "interval_end_date_time"
      AND t.interval_end <= "interval_end_date_time";

Replace interval_end_date_time with the date and time for the interval, using the format 2024-05-18T17:40:00Z.

PostgreSQL

SELECT t.interval_end,
       t.split_start,
       t.split_limit,
       t._cpu_usage
FROM   SPANNER_SYS.SPLIT_STATS_TOP_MINUTE t
WHERE  t.interval_end >= 'interval_end_date_time'::timestamptz
  AND  t.interval_end <= 'interval_end_date_time'::timestamptz;

Replace interval_end_date_time with the date and time for the interval, using the format 2024-05-18T17:40:00Z.

Single hot row

In the following example, it looks like Threads(10,"spanner") is in a single row split which remained hot for over 10 minutes. This could happen when there's a persistent load on a popular row.

INTERVAL_END SPLIT_START SPLIT_LIMIT CPU_USAGE_SCORE
2024-05-16T20:40:00Z Threads(10,"spanner") Threads(10,"spanner1") 62
2024-05-16T20:41:00Z Threads(10,"spanner") Threads(10,"spanner1") 62
2024-05-16T20:42:00Z Threads(10,"spanner") Threads(10,"spanner1") 62
2024-05-16T20:43:00Z Threads(10,"spanner") Threads(10,"spanner1") 62
2024-05-16T20:44:00Z Threads(10,"spanner") Threads(10,"spanner1") 62
2024-05-16T20:45:00Z Threads(10,"spanner") Threads(10,"spanner1") 62
2024-05-16T20:46:00Z Threads(10,"spanner") Threads(10,"spanner1") 80
2024-05-16T20:47:00Z Threads(10,"spanner") Threads(10,"spanner1") 80
2024-05-16T20:48:00Z Threads(10,"spanner") Threads(10,"spanner1") 80
2024-05-16T20:49:00Z Threads(10,"spanner") Threads(10,"spanner1") 100
2024-05-16T20:50:00Z Threads(10,"spanner") Threads(10,"spanner1") 100

Spanner can't balance the load for this single key as it can't be split further.

Moving hotspot

In the following example, the load moves through contiguous splits over time, moving to a new split across time intervals.

INTERVAL_END SPLIT_START SPLIT_LIMIT CPU_USAGE_SCORE
2024-05-16T20:40:00Z Threads(1,"a") Threads(1,"aa") 100
2024-05-16T20:41:00Z Threads(1,"aa") Threads(1,"ab") 100
2024-05-16T20:42:00Z Threads(1,"ab") Threads(1,"c") 100
2024-05-16T20:43:00Z Threads(1,"c") Threads(1,"ca") 100

This could occur, for example, due to a workload that reads or writes keys in monotonically increasing order. Spanner can't balance the load to mitigate the effects of this application behavior.

Normal load balancing

Spanner tries to balance the load by adding more splits or moving splits around. The following example shows what that might look like.

INTERVAL_END SPLIT_START SPLIT_LIMIT CPU_USAGE_SCORE
2024-05-16T20:40:00Z Threads(1000,"zebra") <end> 82
2024-05-16T20:41:00Z Threads(1000,"zebra") <end> 90
2024-05-16T20:42:00Z Threads(1000,"zebra") <end> 100
2024-05-16T20:43:00Z Threads(1000,"zebra") Threads(2000,"spanner") 100
2024-05-16T20:44:00Z Threads(1200,"c") Threads(2000) 92
2024-05-16T20:45:00Z Threads(1500,"c") Threads(1700,"zach") 76
2024-05-16T20:46:00Z Threads(1700) Threads(1700,"c") 76
2024-05-16T20:47:00Z Threads(1700) Threads(1700,"c") 50
2024-05-16T20:48:00Z Threads(1700) Threads(1700,"c") 39

Here, the larger split at 2024-05-16T17:40:00Z was split further into a smaller split and as a result, the CPU_USAGE_SCORE statistic decreased. Spanner might not create splits into individual rows. The splits mirror the workload causing the high CPU_USAGE_SCORE statistic.

If you have observed a persistent hot split for over 10 minutes, see Best practices to mitigate hotspots.

Best practices to mitigate hotspots

If load-balancing doesn't decrease latency, the next step is to identify the cause of the hotspots. After that, options are to either reduce the hotspotting workload, or optimize the application schema and logic to avoid hotspots.

Identify the cause

  • Use Lock & Transaction Insights to look for transactions that have high lock wait time where the row range start key is within the hot split.

  • Use Query Insights to look for queries that read from the table that contains the hot split, and have recently increased latency, or a higher ratio of latency to CPU.

  • Use Oldest Active Queries to look for queries that read from the table that contains the hot split, and that have higher than expected latency.

Some special cases to watch for:

  • Check to see if time to live (TTL) was enabled recently. If there are a lot of splits from old data, then TTL can raise CPU_USAGE_SCORE levels during mass deletes. In this case, the issue should self-resolve once the initial deletions complete.

Optimize the workload

  • Follow SQL best practices. Consider stale reads, writes that don't perform reads first, or adding indexes.
  • Follow Schema best practices. Ensure your schema is designed to handle load balancing and avoid hotspots.

What's next