This page explains how to perform data ingestion using a supported data source, such as Cloud Storage, Google Drive, Slack, Jira, or SharePoint, and how to use that data with Vertex AI RAG Engine. The Import RagFiles API provides data connectors to these data sources.

Data sources supported for RAG

The following data sources are supported:

- Local files: A single-file upload using upload_file (up to 25 MB), which is a synchronous call.
- Cloud Storage: Import one or more files from Cloud Storage.
- Google Drive: Import a directory from Google Drive. The service account must be granted the correct permissions to import files. Otherwise, no files are imported and no error message displays. For more information on file size limits, see Supported document types. To authenticate and grant permissions, grant the Viewer permission to the service account on your Google Drive folder or file. The Google Drive resource ID can be found in the web URL.
- Slack: Import files from Slack by using a data connector.
- Jira: Import files from Jira by using a data connector.
- SharePoint: Import files from your SharePoint site by using a data connector.

For more information, see the RAG API reference.

Data deduplication

If the same file is imported multiple times with no changes, the file is skipped because it already exists. Therefore, response.skipped_rag_files_count refers to the number of files that were skipped during the import process. A file is skipped when all of the following conditions are met:

- The file has been imported before.
- The file's content hasn't changed.
- The file's path and filename haven't changed.

Understand import failures

To understand import failures, this section explains the metadata in a response to an import request and a data sink, which is the destination for the data that you're importing.

Response metadata

You can use response.metadata (a response object in the SDK) to view the import results, the request time, and the response time.

Import result sink

In the SDK, import_result_sink is an optional function parameter that can be set to a valid string value.

If the import_result_sink is provided, the successful and failed file results are written to the sink. Having all results written to the sink makes it easier to understand why some files failed to import and which files weren't imported.

The import_result_sink must be a Cloud Storage path or a BigQuery table:

- If the import_result_sink is a Cloud Storage path, it must use the format gs://my-bucket/my/object.ndjson, and the object must not exist. After the import job completes, each line of the Cloud Storage object contains a JSON object, which has an operation ID, a create timestamp, a filename, a status, and a file ID.
- If the import_result_sink is a BigQuery table, it must use the format bq://my-project.my-dataset.my-table. The table doesn't have to exist: if it doesn't exist, it is created, and if it does exist, its schema is verified. The first time that you provide the BigQuery import result sink, specify a non-existent table; afterward, you can reuse the existing table.

Import files from Cloud Storage or Google Drive

To import files from Cloud Storage or Google Drive into your corpus, do the following:

1. Create a corpus by following the instructions at Create a RAG corpus.
2. Import your files from Cloud Storage or Google Drive by using the template.

The system automatically checks your file's path, filename, and version_id. The version_id is a file hash that's calculated from the file's content, which prevents an unchanged file from being reindexed. If a file with the same filename and path has a content update, the file is reindexed.

Import files from Slack

To import files from Slack into your corpus, use a data connector with a Slack API token. The token must have the following scopes:

- channels:history
- groups:history
- im:history
- mpim:history

The following curl and Python code samples demonstrate how to import files from your Slack resources. If you want to get messages from a specific channel, change the CHANNEL_ID. If you want to get messages for a given range of time or from a specific channel, change any of the following fields in the code samples: the channel ID, the API key secret version, or the start and end times.

curl
API_KEY_SECRET_VERSION=SLACK_API_KEY_SECRET_VERSION
CHANNEL_ID=SLACK_CHANNEL_ID
PROJECT_ID=PROJECT_ID
REGION=us-central1
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ ENDPOINT }/v1beta1/projects/${ PROJECT_ID }/locations/${ REGION }/ragCorpora/${ RAG_CORPUS_ID }/ragFiles:import \
-d '{
"import_rag_files_config": {
"slack_source": {
"channels": [
{
"apiKeyConfig": {
"apiKeySecretVersion": "'"${ API_KEY_SECRET_VERSION }"'"
},
"channels": [
{
"channel_id": "'"${ CHANNEL_ID }"'"
}
]
}
]
}
}
}'
Python
# Slack example
from google.protobuf import timestamp_pb2
from vertexai.preview import rag

start_time = timestamp_pb2.Timestamp()
start_time.GetCurrentTime()
end_time = timestamp_pb2.Timestamp()
end_time.GetCurrentTime()
source = rag.SlackChannelsSource(
    channels=[
        rag.SlackChannel("CHANNEL1", "api_key1"),
        rag.SlackChannel("CHANNEL2", "api_key2", start_time, end_time),
    ],
)
response = rag.import_files(
corpus_name="projects/my-project/locations/us-central1/ragCorpora/my-corpus-1",
source=source,
chunk_size=512,
chunk_overlap=100,
)
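The deduplication and version_id behavior described earlier can be sketched locally. This is an illustrative model only — the hash function and the in-memory index are assumptions for the sketch, not the service's actual implementation:

```python
import hashlib

def version_id(content: bytes) -> str:
    # Stand-in for the service-side hash calculated from the file's content.
    return hashlib.sha256(content).hexdigest()

def should_reindex(index: dict, path: str, content: bytes) -> bool:
    """Reindex only when the path is new or the content hash has changed;
    an unchanged file at the same path is skipped."""
    vid = version_id(content)
    if index.get(path) == vid:
        return False  # same path, same content: skipped
    index[path] = vid  # new file or content update: (re)indexed
    return True

index = {}
first = should_reindex(index, "docs/a.txt", b"hello")   # True: new file
second = should_reindex(index, "docs/a.txt", b"hello")  # False: unchanged, skipped
```

Importing the same bytes at the same path twice skips the second import, while changed bytes at the same path would trigger a reindex.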
Import files from Jira

To import files from Jira into your corpus, include projects or customQueries with your request. To learn more about custom queries, see Use advanced search with Jira Query Language (JQL).

When you import projects, each project is expanded into the corresponding queries to get the entire project. For example, MyProject is expanded to project = MyProject.

curl
EMAIL=JIRA_EMAIL
API_KEY_SECRET_VERSION=JIRA_API_KEY_SECRET_VERSION
SERVER_URI=JIRA_SERVER_URI
CUSTOM_QUERY=JIRA_CUSTOM_QUERY
JIRA_PROJECT=JIRA_PROJECT
REGION="us-central1"
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ ENDPOINT }/v1beta1/projects/${ PROJECT_ID }/locations/${ REGION }/ragCorpora/${ RAG_CORPUS_ID }/ragFiles:import \
-d '{
"import_rag_files_config": {
"jiraSource": {
"jiraQueries": [{
"projects": ["'"${ JIRA_PROJECT }"'"],
"customQueries": ["'"${ CUSTOM_QUERY }"'"],
"email": "'"${ EMAIL }"'",
"serverUri": "'"${ SERVER_URI }"'",
"apiKeyConfig": {
"apiKeySecretVersion": "'"${ API_KEY_SECRET_VERSION }"'"
}
}]
}
}
}'
Python
# Jira Example
jira_query = rag.JiraQuery(
email="xxx@yyy.com",
jira_projects=["project1", "project2"],
custom_queries=["query1", "query2"],
api_key="api_key",
server_uri="server.atlassian.net"
)
source = rag.JiraSource(
queries=[jira_query],
)
response = rag.import_files(
corpus_name="projects/my-project/locations/REGION/ragCorpora/my-corpus-1",
source=source,
chunk_size=512,
chunk_overlap=100,
)
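The projects expansion described above (MyProject becomes project = MyProject) can be mimicked with a small helper. This function is purely illustrative — it is not part of the SDK:

```python
def expand_jira_queries(projects, custom_queries=()):
    """Mimic the documented expansion: each project name becomes a
    'project = <name>' JQL query; custom queries pass through unchanged."""
    return [f"project = {p}" for p in projects] + list(custom_queries)

# expand_jira_queries(["MyProject"]) -> ["project = MyProject"]
queries = expand_jira_queries(["MyProject"], ["assignee = currentUser()"])
```

Passing a project name therefore fetches the entire project, while a custom query narrows the import to whatever the JQL matches.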
Import files from SharePoint

To import files from your SharePoint site into your corpus, the application that you use to access the site must be granted the Sites.Read.All permission, along with the Files.Read.All and Browser SiteLists.Read.All permissions.
curl
CLIENT_ID=SHAREPOINT_CLIENT_ID
API_KEY_SECRET_VERSION=SHAREPOINT_API_KEY_SECRET_VERSION
TENANT_ID=SHAREPOINT_TENANT_ID
SITE_NAME=SHAREPOINT_SITE_NAME
FOLDER_PATH=SHAREPOINT_FOLDER_PATH
DRIVE_NAME=SHAREPOINT_DRIVE_NAME
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ ENDPOINT }/v1beta1/projects/${ PROJECT_ID }/locations/${ REGION }/ragCorpora/${ RAG_CORPUS_ID }/ragFiles:import \
-d '{
"import_rag_files_config": {
"sharePointSources": {
"sharePointSource": [{
"clientId": "'"${ CLIENT_ID }"'",
"apiKeyConfig": {
"apiKeySecretVersion": "'"${ API_KEY_SECRET_VERSION }"'"
},
"tenantId": "'"${ TENANT_ID }"'",
"sharepointSiteName": "'"${ SITE_NAME }"'",
"sharepointFolderPath": "'"${ FOLDER_PATH }"'",
"driveName": "'"${ DRIVE_NAME }"'"
}]
}
}
}'
Python
from vertexai.preview import rag
from vertexai.preview.rag.utils import resources
CLIENT_ID="SHAREPOINT_CLIENT_ID"
API_KEY_SECRET_VERSION="SHAREPOINT_API_KEY_SECRET_VERSION"
TENANT_ID="SHAREPOINT_TENANT_ID"
SITE_NAME="SHAREPOINT_SITE_NAME"
FOLDER_PATH="SHAREPOINT_FOLDER_PATH"
DRIVE_NAME="SHAREPOINT_DRIVE_NAME"
# SharePoint Example.
source = resources.SharePointSources(
share_point_sources=[
resources.SharePointSource(
client_id=CLIENT_ID,
client_secret=API_KEY_SECRET_VERSION,
tenant_id=TENANT_ID,
sharepoint_site_name=SITE_NAME,
sharepoint_folder_path=FOLDER_PATH,
drive_name=DRIVE_NAME,
)
]
)
response = rag.import_files(
corpus_name="projects/my-project/locations/REGION/ragCorpora/my-corpus-1",
source=source,
chunk_size=512,
chunk_overlap=100,
)
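The import result sink section above describes a Cloud Storage .ndjson object in which each line is a JSON object with an operation ID, a create timestamp, a filename, a status, and a file ID. A minimal sketch of reading such an object after downloading it — the exact field names in the sample records are assumptions for illustration:

```python
import json

def parse_import_results(ndjson_text: str):
    """Parse newline-delimited JSON: one import result record per line."""
    return [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

# Hypothetical records shaped like the description above.
sample = (
    '{"file_name": "a.txt", "status": "success"}\n'
    '{"file_name": "b.txt", "status": "failed"}\n'
)
failed = [r["file_name"] for r in parse_import_results(sample)
          if r["status"] != "success"]  # -> ["b.txt"]
```

Filtering on the status field is one way to see which files didn't import without re-running the job.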
What's next
Use data ingestion with Vertex AI RAG Engine
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2025-08-23 UTC.