Best practices for designing a Spanner Graph schema

This document describes how to create efficient queries by using best practices for designing Spanner Graph schemas. Schema design can be iterative, so we recommend that you first identify critical query patterns to guide your schema design.

For general information about Spanner schema design best practices, see Schema design best practices.

Optimize edge traversal

Edge traversal is the process of navigating through a graph by following its edges, starting at a particular node and moving along connected edges to reach other nodes. Edge traversal is a fundamental operation in Spanner Graph, so improving edge traversal efficiency is key to your application performance.

You can traverse an edge in two directions: traversing from the source node to the destination node is called forward edge traversal, while traversing from the destination node to the source node is called reverse edge traversal.

Optimize forward edge traversal using interleaving

To improve forward edge traversal performance, interleave the edge input table into the source node input table to colocate edges with source nodes. Interleaving is a storage optimization technique in Spanner that physically colocates child table rows with their corresponding parent rows in storage. For more information about interleaving, see Schemas overview.

The following example demonstrates these best practices:

CREATE TABLE Person (
  id               INT64 NOT NULL,
  name             STRING(MAX),
) PRIMARY KEY (id);

CREATE TABLE PersonOwnAccount (
  id               INT64 NOT NULL,
  account_id       INT64 NOT NULL,
  create_time      TIMESTAMP,
) PRIMARY KEY (id, account_id),
  INTERLEAVE IN PARENT Person ON DELETE CASCADE;

Optimize reverse edge traversal using foreign key

To efficiently traverse reverse edges, create a foreign key constraint between the edge and the destination node. This foreign key automatically creates a secondary index on the edge keyed by the destination node keys. The secondary index is automatically used during query execution.

The following example demonstrates these best practices:

CREATE TABLE Person (
  id               INT64 NOT NULL,
  name             STRING(MAX),
) PRIMARY KEY (id);

CREATE TABLE Account (
  id               INT64 NOT NULL,
  create_time      TIMESTAMP,
) PRIMARY KEY (id);

CREATE TABLE PersonOwnAccount (
  id               INT64 NOT NULL,
  account_id       INT64 NOT NULL,
  create_time      TIMESTAMP,
  CONSTRAINT FK_Account FOREIGN KEY (account_id) REFERENCES Account (id),
) PRIMARY KEY (id, account_id),
  INTERLEAVE IN PARENT Person ON DELETE CASCADE;

Optimize reverse edge traversal using secondary index

If you don't want to create a foreign key on the edge, for example, due to the strict data integrity it enforces, you can directly create a secondary index on the edge input table, as shown in the following example:

CREATE TABLE PersonOwnAccount (
  id               INT64 NOT NULL,
  account_id       INT64 NOT NULL,
  create_time      TIMESTAMP,
) PRIMARY KEY (id, account_id),
  INTERLEAVE IN PARENT Person ON DELETE CASCADE;

CREATE INDEX Reverse_PersonOwnAccount
ON PersonOwnAccount (account_id);

Disallow dangling edges

A dangling edge is an edge that connects fewer than two nodes. A dangling edge can occur when a node is deleted without removing its associated edges, or when an edge is created without properly linking it to its nodes.

Disallowing dangling edges provides the following benefits:

  • Enforces graph structure integrity.
  • Improves query performance by avoiding the extra work to filter out edges where endpoints don't exist.

Disallow dangling edges using referential constraints

To disallow dangling edges, specify constraints on both endpoints:

  • Interleave the edge input table into the source node input table. This approach ensures that the source node of an edge always exists.
  • Create a foreign key constraint on edges to ensure that the destination node of an edge always exists.

The following example uses interleaving and a foreign key to enforce referential integrity:

CREATE TABLE PersonOwnAccount (
  id               INT64 NOT NULL,
  account_id       INT64 NOT NULL,
  create_time      TIMESTAMP,
  CONSTRAINT FK_Account FOREIGN KEY (account_id) REFERENCES Account (id) ON DELETE CASCADE,
) PRIMARY KEY (id, account_id),
  INTERLEAVE IN PARENT Person ON DELETE CASCADE;

Use ON DELETE CASCADE to automatically remove edges when deleting a node

When you use interleaving or a foreign key to disallow dangling edges, use the ON DELETE clause to control the behavior when you want to delete a node with edges that are still attached. For more information, see delete cascading for interleaved tables and delete cascading with foreign keys.

You can use ON DELETE in the following ways:

  • ON DELETE NO ACTION (or omitting the ON DELETE clause): Deleting a node with edges will fail.
  • ON DELETE CASCADE: Deleting a node automatically removes the associated edges in the same transaction.

Delete cascade for edges connecting different types of nodes

  • Delete edges when the source node is deleted. For example,INTERLEAVE IN PARENT Person ON DELETE CASCADE deletes all outgoing PersonOwnAccount edges from the Person node being deleted. For more information, see Create interleaved tables.

  • Delete edges when the destination node is deleted. For example, CONSTRAINT FK_Account FOREIGN KEY(account_id) REFERENCES Account(id) ON DELETE CASCADE deletes all incoming PersonOwnAccount edges into the Account node being deleted. For more information, see Foreign keys.

Delete cascade for edges connecting the same type of nodes

When the source and destination nodes of an edge have the same type and the edge is interleaved into the source node, you can define ON DELETE CASCADE only for the source node or destination node (but not both nodes).

To automatically remove dangling edges in both cases, create a foreign key on the edge source node reference instead of interleaving the edge input table into the source node input table.

We recommend interleaving for optimizing forward edge traversal. Be sure to verify the impact on your workloads before you proceed. See the following example, which uses AccountTransferAccount as the edge input table:

--Define two Foreign Keys, each on one end Node of Transfer Edge, both with ON DELETE CASCADE action:
CREATE TABLE AccountTransferAccount (
  id               INT64 NOT NULL,
  to_id            INT64 NOT NULL,
  amount           FLOAT64,
  create_time      TIMESTAMP NOT NULL,
  order_number     STRING(MAX),
  CONSTRAINT FK_FromAccount FOREIGN KEY (id) REFERENCES Account (id) ON DELETE CASCADE,
  CONSTRAINT FK_ToAccount FOREIGN KEY (to_id) REFERENCES Account (id) ON DELETE CASCADE,
) PRIMARY KEY (id, to_id);

Filter by node or edge properties with secondary indexes

Secondary indexes are essential for efficient query processing. They support quick lookups of nodes and edges based on specific property values, without having to traverse the entire graph structure. This is important when you're working with large graphs, because traversing all nodes and edges can be very inefficient.

Speed up filtering nodes by property

To speed up filtering by node properties, create secondary indexes on properties. For example, the following query finds accounts for a given nickname. Without a secondary index, all Account nodes are scanned to match the filtering criteria.

GRAPH FinGraph
MATCH (acct:Account)
WHERE acct.nick_name = "abcd"
RETURN acct.id;

To speed up the query, create a secondary index on the filtered property, as shown in the following example:

CREATE TABLE Account (
  id               INT64 NOT NULL,
  create_time      TIMESTAMP,
  is_blocked       BOOL,
  nick_name        STRING(MAX),
) PRIMARY KEY (id);

CREATE INDEX AccountByEmail
ON Account (nick_name);

Tip: Use NULL-filtered indexes for sparse properties. For more information, see Disable indexing of NULL values.

Speed up forward edge traversal with filtering on edge properties

When you traverse an edge while filtering on its properties, you can speed up the query by creating a secondary index on the edge properties and interleaving the index into the source node.

For example, the following query finds accounts owned by a given person after a certain time:

GRAPH FinGraph
MATCH (person:Person)-[owns:Owns]->(acct:Account)
WHERE person.id = 1
  AND owns.create_time >= PARSE_TIMESTAMP("%c", "Thu Dec 25 07:30:00 2008")
RETURN acct.id;

By default, this query reads all edges of the specified person and then filters those edges that satisfy the condition on create_time.

The following example shows you how to improve query efficiency by creating a secondary index on the edge source node reference (id) and the edge property (create_time). Interleave the index under the source node input table to colocate the index with the source node.

CREATE TABLE PersonOwnAccount (
  id               INT64 NOT NULL,
  account_id       INT64 NOT NULL,
  create_time      TIMESTAMP,
) PRIMARY KEY (id, account_id),
  INTERLEAVE IN PARENT Person ON DELETE CASCADE;

CREATE INDEX PersonOwnAccountByCreateTime
ON PersonOwnAccount (id, create_time)
INTERLEAVE IN Person;

Using this approach, the query can efficiently find all edges satisfying the condition on create_time.

Speed up reverse edge traversal with filtering on edge properties

When you traverse a reverse edge while filtering on its properties, you can speed up the query by creating a secondary index using the destination node and the edge properties for filtering.

The following example query performs reverse edge traversal with filtering on edge properties:

GRAPH FinGraph
MATCH (acct:Account)<-[owns:Owns]-(person:Person)
WHERE acct.id = 1
  AND owns.create_time >= PARSE_TIMESTAMP("%c", "Thu Dec 25 07:30:00 2008") 
RETURN person.id;

To speed up this query using a secondary index, use one of the following options:

  • Create a secondary index on the edge destination node reference (account_id) and the edge property (create_time), as shown in the following example:

    CREATE TABLE PersonOwnAccount (
      id               INT64 NOT NULL,
      account_id       INT64 NOT NULL,
      create_time      TIMESTAMP,
    ) PRIMARY KEY (id, account_id),
      INTERLEAVE IN PARENT Person ON DELETE CASCADE;
    
    CREATE INDEX PersonOwnAccountByCreateTime
    ON PersonOwnAccount (account_id, create_time);
    

    This approach provides better performance because the reverse edges are sorted by account_id and create_time, which lets the query engine efficiently find edges for account_id satisfying the condition on create_time. However, if different query patterns filter on different properties, then each property might require a separate index, which can add overhead.

  • Create a secondary index on the edge destination node reference (account_id) and store the edge property (create_time) in a storing column, as shown in the following example:

    CREATE TABLE PersonOwnAccount (
      id               INT64 NOT NULL,
      account_id       INT64 NOT NULL,
      create_time      TIMESTAMP,
    ) PRIMARY KEY (id, account_id),
      INTERLEAVE IN PARENT Person ON DELETE CASCADE;
    
    CREATE INDEX PersonOwnAccountByCreateTime
    ON PersonOwnAccount (account_id) STORING (create_time);
    

    This approach can store multiple properties; however, the query must read all edges of the destination node and then filter on edge properties.

You can combine these approaches by following these guidelines:

  • Use edge properties in index columns if they are used in performance-critical queries.
  • For properties used in less performance-sensitive queries, add them in the storing columns.

Model node and edge types with labels and properties

Node and edge types are commonly modeled with labels. However, you can also use properties to model types. Consider an example in which there are many different types of accounts, like BankAccount, InvestmentAccount, and RetirementAccount. You can store the accounts in separate input tables and model them as separate labels, or you can store the accounts in a single input table and use a property to differentiate between the types.

Start the modeling process by modeling the types with labels. Consider using properties in the following scenarios.

Improve schema management

If your graph has many different node and edge types, managing a separate input table for each can become difficult. To make schema management easier, model the type as a property.

Model types in a property to manage frequently changing types

When you model types as labels, adding or removing types requires changes to the schema. If you perform too many schema updates in a short period of time, Spanner might throttle the processing of queued schema updates. For more information, see Limit the frequency of schema updates.

If you need to change the schema frequently, we recommend that you model the type in a property to work around limitations on the frequency of schema updates.

Speed up queries

Modeling types with properties might speed up queries when the node or edge pattern references multiple labels. The following example query finds all instances of SavingsAccount and InvestmentAccount owned by a Person, assuming account types are modeled with labels:

GRAPH FinGraph
MATCH (:Person {id: 1})-[:Owns]->(acct:SavingsAccount|InvestmentAccount)
RETURN acct.id;

The acct node pattern references two labels. If this is a performance-critical query, consider modeling Account using a property. This approach might provide better query performance, as shown in the following query example. We recommend that you benchmark both queries.

GRAPH FinGraph
MATCH (:Person {id: 1})-[:Owns]->(acct:Account)
WHERE acct.type IN ("Savings", "Investment")
RETURN acct.id;

Store type in the node element key to speed up queries

To speed up queries with filtering on the node type when a node type is modeled with a property and the type doesn't change across the node lifetime, follow these steps:

  1. Include the property as part of the node element key.
  2. Add the node type in the edge input table.
  3. Include the node type in the edge referencing keys.

The following example applies this optimization to the Account node and the AccountTransferAccount edge.

CREATE TABLE Account (
  type             STRING(MAX) NOT NULL,
  id               INT64 NOT NULL,
  create_time      TIMESTAMP,
) PRIMARY KEY (type, id);

CREATE TABLE AccountTransferAccount (
  type             STRING(MAX) NOT NULL,
  id               INT64 NOT NULL,
  to_type          STRING(MAX) NOT NULL,
  to_id            INT64 NOT NULL,
  amount           FLOAT64,
  create_time      TIMESTAMP NOT NULL,
  order_number     STRING(MAX),
) PRIMARY KEY (type, id, to_type, to_id),
  INTERLEAVE IN PARENT Account ON DELETE CASCADE;

CREATE PROPERTY GRAPH FinGraph
  NODE TABLES (
    Account
  )
  EDGE TABLES (
    AccountTransferAccount
      SOURCE KEY (type, id) REFERENCES Account
      DESTINATION KEY (to_type, to_id) REFERENCES Account
  );

Configure TTL on nodes and edges

Spanner Time to live (TTL) is a mechanism that supports automatic expiration and removal of data after a specified period. This is often used for data that has a limited lifespan or relevance, like session information, temporary caches, or event logs. In these cases, TTL helps to maintain database size and performance.

The following example uses TTL to delete accounts 90 days after their closure:

CREATE TABLE Account (
  id               INT64 NOT NULL,
  create_time      TIMESTAMP,
  close_time       TIMESTAMP,
) PRIMARY KEY (id),
  ROW DELETION POLICY (OLDER_THAN(close_time, INTERVAL 90 DAY));

If the node table has a TTL and an edge table interleaved within, the interleave must be defined with ON DELETE CASCADE. Similarly, if the node table has a TTL and is referenced by an edge table through a foreign key, the foreign key must be defined with ON DELETE CASCADE.

In the following example, AccountTransferAccount is stored for up to ten years while an account remains active. When an account is deleted, the transfer history is also deleted.

CREATE TABLE AccountTransferAccount (
  id               INT64 NOT NULL,
  to_id            INT64 NOT NULL,
  amount           FLOAT64,
  create_time      TIMESTAMP NOT NULL,
  order_number     STRING(MAX),
) PRIMARY KEY (id, to_id),
  INTERLEAVE IN PARENT Account ON DELETE CASCADE,
  ROW DELETION POLICY (OLDER_THAN(create_time, INTERVAL 3650 DAY));

Merge node and edge input tables

You can use the same input table to define more than one node and edge in your schema.

In the following example tables, the Account nodes have a composite key (owner_id, account_id). There is an implicit edge definition, Person node with key (id) owns the Account node with composite key (owner_id, account_id) when id equals owner_id.

CREATE TABLE Person (
  id INT64 NOT NULL,
) PRIMARY KEY (id);

-- Assume each account has exactly one owner.
CREATE TABLE Account (
  owner_id INT64 NOT NULL,   
  account_id INT64 NOT NULL,        
) PRIMARY KEY (owner_id, account_id);

In this case, you can use the Account input table to define the Account node and the PersonOwnAccount edge, as shown in the following schema example. To ensure that all element table names are unique, the example gives the edge table definition the alias Owns.

CREATE PROPERTY GRAPH FinGraph
  NODE TABLES (
    Person,
    Account
  )
  EDGE TABLES (
    Account AS Owns
      SOURCE KEY (owner_id) REFERENCES Person
      DESTINATION KEY (owner_id, account_id) REFERENCES Account
  );

What's next