Introduction to connectors and data stores

Connectors get data from Google and third-party data sources into Agentspace, storing it in dedicated data stores. This document provides an overview of these connectors. Centralizing your data in Agentspace enhances data accessibility, search functionality, and analytical capabilities.

Figure: Connector overview. Shows how the connectors get data into Agentspace Enterprise.

Connector and data store concepts

Data stores

Each data source supports a set of entity types. For example, Jira Cloud has entities such as issues, attachments, comments, and worklogs, which are unique to the data source. Agentspace creates a separate data store for each entity. Therefore, when you create a data store using the Google Cloud console, you get a collection of data stores representing these ingested data entities.

Data federation versus ingestion (indexing)

Data federation directly retrieves information from the specified data source. Because data isn't copied into the Vertex AI Search index, you don't need to worry about data storage. However, because the data is not indexed, search quality might be lower.

Data ingestion (indexing) copies data into the Vertex AI Search index. This can result in improved search quality. However, this process consumes more storage and time.

Unstructured data

The supported data format is specific to the data source and the entity type. If the content in an entity is stored in an unstructured format, such as HTML, PDF, TXT, PPTX, or DOCX, Vertex AI Search creates an unstructured data store. For more information and supported file types, see Unstructured search.

Structured data

The supported data format is specific to the data source and the entity type. If the content in an entity is stored in a structured format, a structured data store is created by Vertex AI Search. For more information, see Structured search.

Data schemas

The data schema defines the data structure. When you import structured data using Agentspace, the system auto-detects the schema. You can use the auto-detected schema or define the schema using the API. For more information, see Provide or auto-detect a schema.

Data store regions

When ingesting data, you need to select the region where you want to store the data, such as global, the US, or the EU. For more information, see Agentspace Locations. Data stored in the US or EU regions requires data encryption. By default, data is encrypted with Google-owned and Google-managed encryption keys. Alternatively, you can use customer-managed encryption keys.

Data syncs

A data sync pulls and updates identity data (such as roles, permissions, and users) and entity data (such as data related to a specific data source) from the original data source. For example, a data store for a system like Jira can sync issues, worklogs, comments, and attachments.

Here are the sync types:

A full sync captures the entire state of the third-party app, including additions, updates, and deletions, and replaces the existing contents of the data store.
An incremental sync periodically captures entity data that has been added or updated since the last sync. It does not sync identity data or deletions of entity data.
An identity sync captures data about emails associated with an ACL group.
For supported data sources, a real-time sync uses webhooks to receive notifications of data creation, update, and deletion events directly from the third-party data source. Notifications typically arrive within minutes of the event.
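
As a rough illustration of how full and incremental syncs differ, the following Python sketch models a data store as a dictionary keyed by record ID. The function names and the `changed_ids` parameter are hypothetical and are not part of any Agentspace API. The sketch only mirrors the behavior described above: a full sync replaces the store contents, while an incremental sync adds or updates records and never removes them.

```python
def full_sync(source: dict) -> dict:
    """Replace the store with the current state of the source (hypothetical model)."""
    # Records deleted at the source disappear because the store is rebuilt.
    return dict(source)


def incremental_sync(store: dict, source: dict, changed_ids: set) -> dict:
    """Upsert records added or updated since the last sync; never delete."""
    updated = dict(store)
    for record_id in changed_ids:
        if record_id in source:
            updated[record_id] = source[record_id]
    return updated


# ISSUE-2 was deleted at the source, ISSUE-1 was updated, ISSUE-3 was added.
store = {"ISSUE-1": "v1", "ISSUE-2": "v1"}
source = {"ISSUE-1": "v2", "ISSUE-3": "v1"}

after_incremental = incremental_sync(store, source, {"ISSUE-1", "ISSUE-3"})
after_full = full_sync(source)
```

In this model, `after_incremental` still contains the stale `ISSUE-2` record, and only the full sync removes it. This is one reason to keep full syncs scheduled alongside incremental syncs.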

Data sync schedules

For each data store, you can select a frequency for different sync types:

Full syncs of all identity data and entity data can be scheduled together to run every 3 hours, 6 hours, 12 hours, 1 day, or 3 days.
Independent full syncs of all identity data, and independent full syncs of all entity data, can be scheduled separately using any of the following custom sync frequencies:
- Entity data: Every 3 hours, 6 hours, 12 hours, 1 day, 3 days, 5 days, or 7 days.
- Identity data: Every 30 minutes, 1 hour, 3 hours, 6 hours, 12 hours, 1 day, 3 days, 5 days, or 7 days.
Incremental syncs of updated or added entity data can be scheduled to run every 3 hours, 6 hours, 12 hours, 1 day, 3 days, 5 days, or 7 days. By default, an incremental sync is performed every 3 hours.

Note: Incremental syncs must be scheduled to occur more frequently than full syncs. For example, if you have scheduled full syncs for every 12 hours, you can only schedule incremental syncs for every 3 hours or every 6 hours.

If you schedule incremental syncs and full syncs at the same frequency, the incremental sync is automatically paused and must be resumed manually.
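
The scheduling rule in the note can be sketched as a small check. This is a minimal illustration, not an Agentspace API: the constant and function names are hypothetical, and the list of frequencies is copied from the incremental sync options listed earlier in this section.

```python
from datetime import timedelta

# Incremental sync intervals listed in this section.
INCREMENTAL_FREQUENCIES = (
    [timedelta(hours=h) for h in (3, 6, 12)]
    + [timedelta(days=d) for d in (1, 3, 5, 7)]
)


def allowed_incremental_frequencies(full_sync_every: timedelta) -> list:
    """Return the incremental intervals that are strictly shorter than the full sync interval."""
    return [f for f in INCREMENTAL_FREQUENCIES if f < full_sync_every]


# With a full sync every 12 hours, only the 3-hour and 6-hour options remain.
options = allowed_incremental_frequencies(timedelta(hours=12))
```

Under this rule, a full sync that runs every 3 hours leaves no valid incremental frequency, because no listed incremental interval is shorter than 3 hours.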

You can pause and resume both full syncs and incremental syncs:

When you pause a sync type, the data store cancels ongoing syncs of that type and stops scheduling new syncs of that type.
When you resume a sync type, the data store schedules the new sync based on the last scheduled sync time, but does not continue the previously interrupted sync.

For example, if you pause full sync while a full sync is occurring, the data store cancels that sync. If you later resume full sync, the data store automatically schedules a new full sync according to the full sync schedule.

Frequency recommendations

Choose a data sync frequency that aligns with the volume of records fetched and the recommended queries per second (QPS).

The following table shows the typical number of records retrieved for one-, three-, five-, and seven-day syncs. The actual number of records might vary depending on the data source and its configuration.

QPS	Record volume for 1-day sync	Record volume for 3-day sync	Record volume for 5-day sync	Record volume for 7-day sync
5	432k	1.296M	2.16M	3M
10	864k	2.592M	4.32M	6M
20	1.7M	5.1M	8.5M	11.9M
50	4.3M	12.9M	21.5M	30.1M
100	8.6M	25.8M	43M	60.2M
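
The figures in the table appear to be consistent with a sustained request rate multiplied by the number of seconds in the sync window, with some entries rounded or truncated. The following sketch reproduces that arithmetic. Treat the formula as an inference from the table and as an upper bound, assuming one record per query, rather than a documented sizing rule. The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def max_records(qps: int, days: int) -> int:
    """Upper bound on records fetched at a sustained QPS, assuming one record per query."""
    return qps * SECONDS_PER_DAY * days


one_day_at_5_qps = max_records(5, 1)      # 432,000 (432k in the table)
three_days_at_5_qps = max_records(5, 3)   # 1,296,000 (1.296M in the table)
five_days_at_10_qps = max_records(10, 5)  # 4,320,000 (4.32M in the table)
```

If the number of records you expect to fetch in a window exceeds this bound for your data source's recommended QPS, a longer sync interval is likely the safer choice.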

Identity sync example

The following example illustrates how identity sync works alongside full sync.

In this example, two pages, page_1 and page_2, are each linked to a group: group_1 and group_2, respectively. When the initial identity sync runs, it retrieves information about group_1 and group_2. Assume that group_1 contains user_1, and group_2 contains user_2. This establishes the following mapping: user_1 maps to group_1, and user_2 maps to group_2.

Suppose that a full sync runs alongside the identity sync. This sync fetches both page_1 and page_2. The full sync infers the following mapping: user_1 has access to page_1 (through group_1), and user_2 has access to page_2 (through group_2).
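
The mapping in this example can be written out as a short Python sketch. The dictionaries and the `pages_for_user` function are hypothetical and only model the example. They don't reflect how Agentspace stores identity or ACL data internally.

```python
# Retrieved by the identity sync: which users belong to which group.
group_members = {
    "group_1": {"user_1"},
    "group_2": {"user_2"},
}

# Retrieved by the full sync: which groups each page is linked to.
page_groups = {
    "page_1": {"group_1"},
    "page_2": {"group_2"},
}


def pages_for_user(user: str) -> set:
    """Return the pages a user can access through group membership."""
    return {
        page
        for page, groups in page_groups.items()
        if any(user in group_members.get(group, set()) for group in groups)
    }


user_1_pages = pages_for_user("user_1")
user_2_pages = pages_for_user("user_2")
```

Neither sync alone is enough to resolve access: the identity sync supplies group membership, and the full sync supplies the link between pages and groups.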

Third-party data sources

Third-party data stores ingest third-party application data into Agentspace.

Checklist for third-party data sources

Before connecting a third-party data source to Agentspace, go through the following checklist:

Specific scopes and permissions must be configured for certain data sources. An administrator of the third-party application must review the required credentials to connect a data source and set up authentication and permissions. For information about the specific scopes and permissions, see the respective third-party data source documentation.
Set up access control for your data store. For more information, see Identity and permissions.
Decide whether data should be federated or ingested (indexed).
If data is ingested, make sure that resources aren't restricted for the user credential that you use to ingest data into the data store.
Decide how often the data should be synced.
If you are using customer-managed encryption keys (CMEK), create multi-region and single-region keys. For more information, see About single-region keys for third-party data stores.
If you have personally identifiable information (PII) and intend to use autocomplete for query suggestions, see Protect against PII leaks.

Supported third-party data sources

Microsoft Entra ID	Microsoft OneDrive	Microsoft Outlook	Microsoft SharePoint

Jira Cloud	Confluence Cloud	ServiceNow

Google data sources

You can connect to Google data sources, such as BigQuery, Spanner, and Google Drive.

Checklist for Google data sources

Before sending data to Agentspace, go through the following checklist:

Set up access control for your data source. For more information, see Identity and permissions.
Decide whether data should be federated or ingested (indexed).
Decide how often the data should be synced.
If you are using customer-managed encryption keys (CMEK), create multi-region keys. For more information, see About single-region keys for third-party data sources.
If you have personally identifiable information (PII) and intend to use autocomplete for query suggestions, see Protect against PII leaks.

Supported Google data sources

Figure: Google data sources.