Data stores

Data stores are used by data store tools to find answers to end-user questions from your data. A data store is a collection of websites, documents, or data in third-party systems, each of which references your data.

When an end-user asks the agent a question, the agent searches for an answer from the given source content and summarizes the findings into a coherent agent response. It also provides supporting links to the sources of the response for the end-user to learn more. The agent can provide up to five answer snippets for a given question.

Data store sources

There are different sources that you can supply for your data:

  • Website URLs: Automatically crawl website content from a list of domains or web pages.
  • BigQuery: Import data from your BigQuery table.
  • Cloud Storage: Import data from your Cloud Storage bucket.

Restricted access data store sources

Google offers many additional first- and third-party data store sources as a restricted access feature. To see a list of available sources and to request access, see the additional data store sources section on this page.

Website content

When adding website content as a source, you can include and exclude multiple sites. When you specify a site, you can list individual pages or use * as a wildcard in a pattern. All HTML and PDF content is processed.

You must verify your domain when using website content as a source.

Limitations:

  • Files from public URLs must have been crawled by the Google Search indexer, so that they exist in the search index. You can check this with the Google Search Console.
  • A maximum of 200,000 pages are indexed. If the data store contains more pages, indexing will fail at that point. Any content already indexed will remain.
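As a rough illustration of how include and exclude URL patterns interact, the following Python sketch uses fnmatch-style wildcard matching. The patterns and URLs are illustrative placeholders, and the crawler's actual matching rules may differ:

```python
from fnmatch import fnmatch

# Illustrative include/exclude patterns, with "*" as a wildcard.
# These domains and paths are placeholders, not real configuration.
include = ["example.com/docs/*", "example.com/faq"]
exclude = ["example.com/docs/internal/*"]

def is_crawled(url: str) -> bool:
    """A page is crawled if it matches an include pattern
    and no exclude pattern. Excludes take precedence."""
    if any(fnmatch(url, pattern) for pattern in exclude):
        return False
    return any(fnmatch(url, pattern) for pattern in include)

print(is_crawled("example.com/docs/getting-started"))   # included
print(is_crawled("example.com/docs/internal/secrets"))  # excluded
print(is_crawled("example.com/blog/post"))              # never included
```

Exclude patterns are checked first here, which mirrors the common convention that an exclusion overrides a matching inclusion.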

Import data

You can import your data from either BigQuery or Cloud Storage. The data can be in FAQ form or unstructured, and it can include metadata or not.

The following data import options are available:

  • Add/Update Data: The provided documents are added to the data store. If a new document has the same ID as an old document, the new document replaces the old document.
  • Override Existing Data: All old data is deleted, then new data is uploaded. This is irreversible.
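The difference between the two import options can be sketched by modeling the data store as a mapping from document IDs to documents. This is an illustration of the semantics only, not an API call:

```python
# Sketch of the two data import modes, with the data store modeled
# as a dict keyed by document ID (illustrative, not the real API).

def add_update(store: dict, new_docs: dict) -> dict:
    """Add/Update Data: new documents are added; a new document
    with the same ID as an old document replaces the old one."""
    merged = dict(store)
    merged.update(new_docs)
    return merged

def override(store: dict, new_docs: dict) -> dict:
    """Override Existing Data: all old data is deleted first,
    then the new data is uploaded. This is irreversible."""
    return dict(new_docs)

store = {"d001": "old contents", "d099": "legacy doc"}
incoming = {"d001": "new contents", "d002": "another doc"}

print(add_update(store, incoming))  # d099 survives, d001 replaced
print(override(store, incoming))    # only the incoming documents remain
```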

FAQ data store

FAQ data stores hold answers to frequently asked questions (FAQs). When a user question matches an uploaded question with high confidence, the agent returns the answer to that question without modification. For each question and answer pair, you can provide a title and a URL, which the agent displays with the answer.

Data must be uploaded to the data store in CSV format. Each file must have a header row describing the columns.

For example:

"question","answer","title","url"
"Why is the sky blue?","The sky is blue because of Rayleigh scattering.","Rayleigh scattering","https://en.wikipedia.org/wiki/Rayleigh_scattering"
"What is the meaning of life?","42","",""

The title and url columns are optional and can be omitted:

"answer","question"
"42","What is the meaning of life?"

During the upload process, you can select a folder, in which case each file in the folder is treated as a CSV file regardless of its extension.

Limitations:

  • An extra space character after a comma separator causes an error.
  • Blank lines (even at the end of the file) cause an error.
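The limitations above can be checked before upload. This is a minimal Python sketch, not the importer's actual validator; the column rules and error messages are simplified assumptions:

```python
import csv
import io

REQUIRED = {"question", "answer"}
OPTIONAL = {"title", "url"}

def validate_faq_csv(text: str) -> list:
    """Check an FAQ CSV string against the documented limitations.
    Returns a list of error descriptions (empty if the file looks valid)."""
    errors = []
    lines = text.split("\n")
    # Blank lines anywhere -- including at the end of the file -- cause an error.
    if any(line.strip() == "" for line in lines):
        errors.append("blank line found (remove trailing newlines too)")
    # An extra space after a comma separator causes an error.
    if any('", "' in line for line in lines):
        errors.append("space after comma between fields")
    header = set(csv.DictReader(io.StringIO(text)).fieldnames or [])
    if not REQUIRED <= header:
        errors.append("missing required columns: %s" % (REQUIRED - header))
    if header - REQUIRED - OPTIONAL:
        errors.append("unknown columns: %s" % (header - REQUIRED - OPTIONAL))
    return errors

good = '"question","answer"\n"What is the meaning of life?","42"'
print(validate_faq_csv(good))  # no errors

bad = '"question","answer"\n"Q?","A"\n\n'  # trailing blank line
print(validate_faq_csv(bad))
```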

Unstructured data store

Unstructured data stores can contain content in the following formats:

  • HTML
  • PDF
  • TXT
  • CSV

It's possible (but uncommon) to import files from another project's Cloud Storage bucket. To do this, you need to grant explicit access to the import process. Follow the instructions in the error message, which will contain the name of the user that needs read access to the bucket in order to perform the import.

Limitations:

  • The maximum file size is 2.5 MB for text-based formats and 100 MB for other formats.

Data store with metadata

A title and URL can be provided as metadata. When the agent is in a conversation with a user, the agent can provide this information to the user. This can help users to quickly link to internal web pages not accessible by the Google Search indexer.

To import content with metadata, you must provide one or more JSON Lines files. Each line of this file describes one document. You do not directly upload the actual documents; URIs that link to the Cloud Storage paths are provided in the JSON Lines file.

In order to provide your JSON Lines files, you must provide a Cloud Storage folder that contains these files. Do not put any other files in this folder.

Field descriptions:

  • id (string): Unique identifier for the document.
  • content.mimeType (string): MIME type of the document. "application/pdf" and "text/html" are supported.
  • content.uri (string): URI for the document in Cloud Storage.
  • structData (string): Single-line JSON object with optional title and url fields.

For example:

{ "id": "d001", "content": {"mimeType": "application/pdf", "uri": "gs://example-import/unstructured/first_doc.pdf"}, "structData": {"title": "First Document", "url": "https://internal.example.com/documents/first_doc.pdf"} }
{ "id": "d002", "content": {"mimeType": "application/pdf", "uri": "gs://example-import/unstructured/second_doc.pdf"}, "structData": {"title": "Second Document", "url": "https://internal.example.com/documents/second_doc.pdf"} }
{ "id": "d003", "content": {"mimeType": "text/html", "uri": "gs://example-import/unstructured/mypage.html"}, "structData": {"title": "My Page", "url": "https://internal.example.com/mypage.html"} }
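A JSON Lines file like the example above can be generated programmatically. In this Python sketch, the bucket path, titles, and internal URLs are placeholders; substitute your own Cloud Storage URIs:

```python
import json

def jsonl_line(doc_id, gcs_uri, mime_type, title=None, url=None):
    """Build one JSON Lines entry describing a document.
    The document itself is not uploaded here; the entry only
    points at its Cloud Storage URI."""
    entry = {"id": doc_id, "content": {"mimeType": mime_type, "uri": gcs_uri}}
    struct = {}
    if title:
        struct["title"] = title
    if url:
        struct["url"] = url
    if struct:
        entry["structData"] = struct
    # json.dumps produces a single line, as the format requires.
    return json.dumps(entry)

line = jsonl_line(
    "d001",
    "gs://example-import/unstructured/first_doc.pdf",  # placeholder bucket
    "application/pdf",
    title="First Document",
    url="https://internal.example.com/documents/first_doc.pdf",
)
print(line)
```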

Data store without metadata

This type of content has no metadata. Instead, you provide URI links to the individual documents. The content type is determined by the file extension.
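Extension-based type detection works along the lines of Python's mimetypes module, shown here as an illustration; the importer's own extension-to-type mapping may differ:

```python
import mimetypes

# Sketch: for imports without metadata, the content type is inferred
# from the file extension, much like mimetypes.guess_type does here.
for name in ["report.pdf", "page.html", "notes.txt", "faq.csv"]:
    print(name, "->", mimetypes.guess_type(name)[0])
```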

Parse and chunk configuration

Depending on the data source, you might be able to configure parse and chunk settings as defined by Vertex AI Search.

Use Cloud Storage for a data store document

If your content is not public, storing your content in Cloud Storage is the recommended option. When you create data store documents, you provide the URLs for your Cloud Storage objects in the form: gs://bucket-name/folder-name. Each document within the folder is added to the data store.

To create the Cloud Storage bucket and upload your files, follow the Cloud Storage quickstart instructions.

Languages

For supported languages, see the data store column in the language reference.

For best performance, create each data store in a single language.

After creating a data store, you can optionally specify the data store language. If you set the data store language, you can connect the data store to an agent that is configured for a different language. For example, you can create a French data store that is connected to an English agent.

Supported regions

For supported regions, see the region reference.

(Restricted access) Additional data store sources

Additional data store types are listed in the following table. They are available as restricted access features. You can fill out the allowlist form to request access. Once approved, you will be able to see these options when you create a new data store in Agent Builder.

Google data store sources

  • Cloud SQL: Import data from a Cloud SQL table.
  • (Preview) Spanner: Import data from a Spanner table.
  • (Preview) Bigtable: Import data from a Bigtable table.
  • Firestore: Import data from your Firestore collection.
  • (Preview) AlloyDB: Import data from your AlloyDB cluster.
  • Google Drive: Link to your organization's drive.
  • (Preview) Google Gmail: Link to your organization's Gmail.
  • (Preview) Google Sites: Link to your organization's Sites.
  • (Preview) Google Calendar: Link to your organization's Calendar.
  • (Preview) Google Groups: Link to your organization's Groups.

Third-party data store sources

  • (Preview) AODocs: Import data from your AODocs document management system.
  • Box: Import data from your organization's Box site.
  • Confluence Cloud: Import data from your Confluence Cloud workspace.
  • (Preview) Confluence Data Center: Import data from your Confluence Data Center workspace.
  • Dropbox: Import data from your Dropbox storage.
  • (Preview) EntraID: Import data from your organization's EntraID system.
  • (Preview) HubSpot: Import data from your organization's HubSpot site.
  • Jira Cloud: Import data from your Jira task management system.
  • (Preview) Jira Data Center: Import data from your Jira Data Center site.
  • (Preview) Marketo: Import data from your organization's Marketo marketing system.
  • (Preview) Notion: Import data from your organization's Notion workspace.
  • OneDrive: Import data from your organization's OneDrive storage.
  • Microsoft Outlook: Import data from Microsoft Outlook.
  • Salesforce: Import data from Salesforce.
  • ServiceNow: Import data from ServiceNow.
  • SharePoint: Import data from your organization's SharePoint system.
  • (Preview) Shopify: Import data from your organization's Shopify system.
  • Slack: Import data from Slack.
  • Microsoft Teams: Import data from Microsoft Teams.
  • (Preview) WordPress: Import data from your organization's WordPress site.

Set up a third-party data store using a connector

This section outlines the process of setting up a data store using third-party data. Instructions specific to each third-party data source can be found in the generative AI documentation.

Identity providers

Identity providers allow you to manage users, groups, and authentication. When you set up a third-party data store, you have the option of using either a Google identity provider or a third-party identity provider.

Google identity provider:

  • Any users of the agent need to sign in using their Google credentials. This can be any @gmail.com email address or any account that uses Google as the identity provider (for example, Google Workspace). This step is skipped if users talk to the agent using Google Cloud directly, because Google identity is automatically built into the system.
  • You can assign access to Google accounts using IAM.

Third-party identity provider:

  • Users of the agent sign in using non-Google credentials, for example a Microsoft email address.
  • You must create a Workforce Pool using Google Cloud containing the non-Google identity providers. You can then use IAM to grant access to either the entire pool or individual users within that pool.
  • This method can't be used with any Google Cloud projects set up under the @google.com organization.

Connectors

Third-party data stores are implemented using a connector. Each connector can contain multiple data stores, which are stored as entities in the Conversational Agents (Dialogflow CX) system.

  • Before you create a data store, you must set up each region with a single identity provider in Google Cloud -> Agent Builder -> Settings. All data stores in that region will use the same identity provider. You can choose either a Google identity or a third-party identity in a workforce pool. The same Google credential is considered to be a different identity if it's in a workforce pool. For example, test@gmail.com is considered to be a different identity than workforcePools/test-pool/subject/test@gmail.com.

    • Create a workforce pool (if needed).
    • Go to Agent Builder Settings and select either Google Identity or 3rd Party Identity. Click SAVE to save the identity to the region.
    • You are now able to create a data store in the region.
  • Each data store saves Access Control List (ACL) data with each document. The ACL records which users or groups have read access to which entities. At runtime, a user or group member receives responses only from entities that they have read access to. If a user has read access to no entities in the data store, the agent returns an empty response.

  • Because the data in the data store is a copy of the third-party instance, it must be refreshed periodically. You can configure the refresh interval in hours or days.

  • After you configure your data store and click Create, it can take up to an hour for the data store to appear in your data stores list.
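The ACL behavior described above can be sketched as a filter over per-document reader sets. This is illustrative only; the real system resolves users and groups through the configured identity provider:

```python
# Sketch of ACL-filtered retrieval: each document records which users
# or groups hold read access, and the agent only answers from documents
# the current user can read. IDs and identities are placeholders.
docs = [
    {"id": "d1", "readers": {"alice", "eng-group"}, "text": "..."},
    {"id": "d2", "readers": {"bob"}, "text": "..."},
]

def readable(user_identities: set) -> list:
    """Return the documents visible to a user, given the set of
    identities (user plus group memberships) they hold."""
    return [d for d in docs if d["readers"] & user_identities]

print([d["id"] for d in readable({"alice"})])             # direct access
print([d["id"] for d in readable({"carol", "eng-group"})]) # via group
print(readable({"carol"}))  # no access -> empty, so the agent has no sources
```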

What's next

For instructions about creating a data store and using it with an agent, see the data store tools documentation.