Method: projects.locations.ragCorpora.ragFiles.import

Import files from Google Cloud Storage or Google Drive into a RagCorpus.

Endpoint

POST https://aiplatform.googleapis.com/v1beta1/{parent}/ragFiles:import

Path parameters

parent string

Required. The name of the RagCorpus resource into which to import files. Format: projects/{project}/locations/{location}/ragCorpora/{ragCorpus}

Request body

The request body contains data with the following structure:

Fields
importRagFilesConfig object (ImportRagFilesConfig)

Required. The config for the RagFiles to be synced and imported into the RagCorpus. See VertexRagDataService.ImportRagFiles.

Response body

If successful, the response body contains an instance of Operation.
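As a minimal sketch of how the endpoint, the parent path parameter, and the request body fit together, the Python below only assembles the URL and JSON body and sends nothing. The function name build_import_request, the project, location, and corpus values, and the "uris" field inside gcsSource are illustrative assumptions that this section does not define; authentication and the HTTP call itself are left out.

```python
import json

API_ROOT = "https://aiplatform.googleapis.com/v1beta1"


def build_import_request(project, location, rag_corpus, import_config):
    """Return (url, body) for a ragFiles:import call; nothing is sent."""
    # parent follows the documented format:
    # projects/{project}/locations/{location}/ragCorpora/{ragCorpus}
    parent = f"projects/{project}/locations/{location}/ragCorpora/{rag_corpus}"
    url = f"{API_ROOT}/{parent}/ragFiles:import"
    # The request body has a single required field, importRagFilesConfig.
    body = json.dumps({"importRagFilesConfig": import_config})
    return url, body


# Hypothetical values; the "uris" key of GcsSource is an assumption.
url, body = build_import_request(
    "my-project",
    "us-central1",
    "1234567890",
    {
        "gcsSource": {"uris": ["gs://bucketName/my_directory"]},
        "maxEmbeddingRequestsPerMin": 500,
    },
)
print(url)
print(body)
```

The resulting pair would be sent as an authenticated POST; on success the response is an Operation, as noted above.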

ImportRagFilesConfig

Config for importing RagFiles.

Fields
ragFileChunkingConfig
(deprecated)
object (RagFileChunkingConfig)

Specifies the size and overlap of chunks after importing RagFiles.

ragFileTransformationConfig object (RagFileTransformationConfig)

Specifies the transformation config for RagFiles.

ragFileParsingConfig object (RagFileParsingConfig)

Optional. Specifies the parsing config for RagFiles. RAG will use the default parser if this field is not set.

maxEmbeddingRequestsPerMin integer

Optional. The maximum number of queries per minute (QPM) that this job is allowed to make to the embedding model specified on the corpus. This value is specific to this job and is not shared across other import jobs. Consult the Quotas page for the project to set an appropriate value. If unspecified, a default value of 1,000 QPM is used.

import_source Union type
The source of the import. import_source can be only one of the following:
gcsSource object (GcsSource)

Google Cloud Storage location. Supports importing individual files as well as entire Google Cloud Storage directories. Sample formats:

  • gs://bucketName/my_directory/objectName/my_file.txt
  • gs://bucketName/my_directory

googleDriveSource object (GoogleDriveSource)

Google Drive location. Supports importing individual files as well as Google Drive folders.

slackSource object (SlackSource)

Slack channels with their corresponding access tokens.

jiraSource object (JiraSource)

Jira queries with their corresponding authentication.

sharePointSources object (SharePointSources)

SharePoint sources.

partial_failure_sink Union type
Optional. If provided, all partial failures are written to the sink. Deprecated; use import_result_sink instead. partial_failure_sink can be only one of the following:
partialFailureGcsSink
(deprecated)
object (GcsDestination)

The Cloud Storage path to write partial failures to. Deprecated. Prefer to use importResultGcsSink.

partialFailureBigquerySink
(deprecated)
object (BigQueryDestination)

The BigQuery destination to write partial failures to. It should be a BigQuery table resource name (e.g. "bq://projectId.bqDatasetId.bqTableId"). The dataset must exist. If the table does not exist, it is created with the expected schema. If the table exists, the schema is validated and data is added to the existing table. Deprecated; use import_result_bq_sink instead.

JSON representation
{
  "ragFileChunkingConfig": {
    object (RagFileChunkingConfig)
  },
  "ragFileTransformationConfig": {
    object (RagFileTransformationConfig)
  },
  "ragFileParsingConfig": {
    object (RagFileParsingConfig)
  },
  "maxEmbeddingRequestsPerMin": integer,

  // import_source
  "gcsSource": {
    object (GcsSource)
  },
  "googleDriveSource": {
    object (GoogleDriveSource)
  },
  "slackSource": {
    object (SlackSource)
  },
  "jiraSource": {
    object (JiraSource)
  },
  "sharePointSources": {
    object (SharePointSources)
  }
  // Union type

  // partial_failure_sink
  "partialFailureGcsSink": {
    object (GcsDestination)
  },
  "partialFailureBigquerySink": {
    object (BigQueryDestination)
  }
  // Union type
}
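The two union types in this config can be checked on the client before a request is sent. The sketch below is one way to do that in Python; check_unions is a hypothetical helper, not part of any Google library, and treating a config with no import source as a problem is an assumption, since the text above only states that at most one member of each union may be set.

```python
IMPORT_SOURCE_FIELDS = (
    "gcsSource",
    "googleDriveSource",
    "slackSource",
    "jiraSource",
    "sharePointSources",
)
PARTIAL_FAILURE_SINK_FIELDS = (
    "partialFailureGcsSink",
    "partialFailureBigquerySink",
)


def check_unions(config):
    """Return a list of problems found in the union fields of a config dict."""
    problems = []
    sources = [name for name in IMPORT_SOURCE_FIELDS if name in config]
    if len(sources) > 1:
        problems.append(f"import_source: only one allowed, got {sources}")
    elif not sources:
        # Assumption: an import job needs a source to read from.
        problems.append("import_source: no source set")
    sinks = [name for name in PARTIAL_FAILURE_SINK_FIELDS if name in config]
    if len(sinks) > 1:
        problems.append(f"partial_failure_sink: only one allowed, got {sinks}")
    return problems


# The inner objects are left empty here; only the union keys matter.
print(check_unions({"gcsSource": {}}))
print(check_unions({"gcsSource": {}, "jiraSource": {}}))
```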

BigQueryDestination

The BigQuery location for the output content.

Fields
outputUri string

Required. BigQuery URI to a project or table, up to 2000 characters long.

When only the project is specified, the Dataset and Table are created. When the full table reference is specified, the Dataset must exist and the table must not exist.

Accepted forms:

  • BigQuery path. For example: bq://projectId or bq://projectId.bqDatasetId or bq://projectId.bqDatasetId.bqTableId.
JSON representation
{
  "outputUri": string
}
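The accepted outputUri forms can be illustrated with a small parser. This is a sketch under simple assumptions: parse_bq_uri is a hypothetical helper, it splits the path on "." and so does not handle project IDs that themselves contain dots, and it checks only the shape and the 2000-character limit stated above, not whether the dataset or table exists.

```python
def parse_bq_uri(output_uri):
    """Split a bq:// URI into its project, dataset, and table parts."""
    if len(output_uri) > 2000:
        raise ValueError("outputUri is longer than 2000 characters")
    prefix = "bq://"
    if not output_uri.startswith(prefix):
        raise ValueError("outputUri must start with bq://")
    parts = output_uri[len(prefix):].split(".")
    # One to three non-empty parts: project[.dataset[.table]]
    if not 1 <= len(parts) <= 3 or any(part == "" for part in parts):
        raise ValueError("expected bq://project[.dataset[.table]]")
    # zip stops at the shorter sequence, so absent parts are simply omitted.
    return dict(zip(("project", "dataset", "table"), parts))


print(parse_bq_uri("bq://projectId"))
print(parse_bq_uri("bq://projectId.bqDatasetId.bqTableId"))
```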

RagFileParsingConfig

Specifies the parsing config for RagFiles.

Fields
useAdvancedPdfParsing
(deprecated)
boolean

Whether to use advanced PDF parsing.

parser Union type
The parser to use for RagFiles. parser can be only one of the following:
advancedParser object (AdvancedParser)

The Advanced Parser to use for RagFiles.

layoutParser object (LayoutParser)

The Layout Parser to use for RagFiles.

llmParser object (LlmParser)

The LLM Parser to use for RagFiles.

JSON representation
{
  "useAdvancedPdfParsing": boolean,

  // parser
  "advancedParser": {
    object (AdvancedParser)
  },
  "layoutParser": {
    object (LayoutParser)
  },
  "llmParser": {
    object (LlmParser)
  }
  // Union type
}

AdvancedParser

Specifies the advanced parsing for RagFiles.

Fields
useAdvancedPdfParsing boolean

Whether to use advanced PDF parsing.

JSON representation
{
  "useAdvancedPdfParsing": boolean
}

LayoutParser

Document AI Layout Parser config.

Fields
processorName string

The full resource name of a Document AI processor or processor version. The processor must have type LAYOUT_PARSER_PROCESSOR. If specified, the additionalConfig.parse_as_scanned_pdf field must be false. Format:

  • projects/{projectId}/locations/{location}/processors/{processorId}
  • projects/{projectId}/locations/{location}/processors/{processorId}/processorVersions/{processor_version_id}

maxParsingRequestsPerMin integer

The maximum number of requests the job is allowed to make to the Document AI processor per minute. Consult https://cloud.google.com/document-ai/quotas and the Quota page for your project to set an appropriate value here. If unspecified, a default value of 120 QPM is used.

JSON representation
{
  "processorName": string,
  "maxParsingRequestsPerMin": integer
}
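The two processorName formats lend themselves to a quick shape check before building a LayoutParser object. The sketch below assumes each path segment is any non-empty string without "/", which is looser than whatever rules Document AI actually enforces; layout_parser is a hypothetical helper and does not verify that the processor exists or has type LAYOUT_PARSER_PROCESSOR.

```python
import re

# Matches a processor name, optionally followed by a processor version.
_PROCESSOR_NAME = re.compile(
    r"^projects/[^/]+/locations/[^/]+/processors/[^/]+"
    r"(?:/processorVersions/[^/]+)?$"
)


def layout_parser(processor_name, max_requests_per_min=None):
    """Build a LayoutParser dict after checking the processorName shape."""
    if not _PROCESSOR_NAME.match(processor_name):
        raise ValueError(f"unexpected processorName format: {processor_name!r}")
    parser = {"processorName": processor_name}
    # Leave the field out so the service default applies when no limit is given.
    if max_requests_per_min is not None:
        parser["maxParsingRequestsPerMin"] = max_requests_per_min
    return parser


# Hypothetical resource names for illustration.
print(layout_parser("projects/my-project/locations/us/processors/abc123"))
```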

LlmParser

Specifies the LLM parsing for RagFiles.

Fields
modelName string

The name of an LLM model used for parsing. Format: gemini-1.5-pro-002

maxParsingRequestsPerMin integer

The maximum number of requests the job is allowed to make to the LLM model per minute. Consult https://cloud.google.com/vertex-ai/generative-ai/docs/quotas and your document size to set an appropriate value here. If unspecified, a default value of 5000 QPM is used.

customParsingPrompt string

The prompt to use for parsing. If not specified, a default prompt will be used.

JSON representation
{
  "modelName": string,
  "maxParsingRequestsPerMin": integer,
  "customParsingPrompt": string
}
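To show how an LlmParser sits inside the parser union of RagFileParsingConfig, the sketch below builds a parsing config with only llmParser set. llm_parsing_config is a hypothetical helper, and the prompt text is a made-up example; optional fields are omitted when not supplied so that the documented defaults apply.

```python
import json


def llm_parsing_config(model_name, max_requests_per_min=None, custom_prompt=None):
    """Build a RagFileParsingConfig dict that selects the LLM parser."""
    parser = {"modelName": model_name}
    if max_requests_per_min is not None:
        parser["maxParsingRequestsPerMin"] = max_requests_per_min
    if custom_prompt is not None:
        parser["customParsingPrompt"] = custom_prompt
    # Only one member of the parser union is set.
    return {"llmParser": parser}


config = llm_parsing_config(
    "gemini-1.5-pro-002",
    max_requests_per_min=1000,
    custom_prompt="Extract the text and describe each table.",  # illustrative
)
print(json.dumps(config, indent=2))
```

The returned dict would be passed as the ragFileParsingConfig field of an ImportRagFilesConfig.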