
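Refresh structured and unstructured data

This page describes how to refresh structured and unstructured data.

To refresh your website apps, see "Refresh your web page".

Refresh structured data

You can refresh the data in a structured data store as long as you use a schema that is the same as, or backward compatible with, the schema in the data store. For example, adding only new fields to an existing schema is backward compatible.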
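As a rough illustration of the backward-compatibility rule, the following sketch models a schema as a plain mapping from field names to type names and treats a new schema as compatible when every existing field is still present with the same type. The function name, the dict representation, and the rule that a removed or retyped field breaks compatibility are assumptions made for this example; the service performs its own validation.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Return True if new_schema keeps every old field with an unchanged type."""
    return all(
        name in new_schema and new_schema[name] == field_type
        for name, field_type in old_schema.items()
    )


# Hypothetical schemas, used only to exercise the check.
existing = {"title": "string", "price": "number"}
added_field = {"title": "string", "price": "number", "brand": "string"}
changed_type = {"title": "string", "price": "string"}

print(is_backward_compatible(existing, added_field))   # True
print(is_backward_compatible(existing, changed_type))  # False
```

Running a check like this before an import can catch an accidental field removal early, before the import request is sent.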

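You can refresh structured data in the Google Cloud console or by using the API.

Console

To use the Google Cloud console to refresh structured data from a branch of a data store, follow these steps:

In the Google Cloud console, go to the "AI Applications" page.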

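In the navigation menu, click "Data Stores".
In the "Name" column, click the data store that you want to edit.
On the "Documents" tab, click "Import data".
To refresh from Cloud Storage:
1. In the "Select a data source" pane, select "Cloud Storage".
2. In the "Import data from Cloud Storage" pane, click "Browse", select the bucket that contains your refreshed data, and then click "Select". Alternatively, enter the bucket location directly in the "gs://" field.
3. Under "Data import options", select an import option.
4. Click "Import".
To refresh from BigQuery:
1. In the "Select a data source" pane, select "BigQuery".
2. In the "Import data from BigQuery" pane, click "Browse", select the table that contains your refreshed data, and then click "Select". Alternatively, enter the table location directly in the "BigQuery path" field.
3. Under "Data import options", select an import option.
4. Click "Import".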

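REST

Use the documents.import method to refresh your data, specifying the appropriate reconciliationMode value.

To refresh structured data from BigQuery or Cloud Storage using the command line, follow these steps:

Find your data store ID. If you already have your data store ID, skip to the next step.
1. In the Google Cloud console, go to the "AI Applications" page, and then click "Data Stores" in the navigation menu.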
  
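2. Click the name of your data store.
3. On the "Data" page for your data store, get the data store ID.
To import your structured data from BigQuery, call the following method. You can import from either BigQuery or Cloud Storage. To import from Cloud Storage instead, skip to the next step.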
```
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
-d '{
  "bigquerySource": {
    "projectId": "PROJECT_ID",
    "datasetId": "DATASET_ID",
    "tableId": "TABLE_ID",
    "dataSchema": "DATA_SCHEMA_BQ"
  },
  "reconciliationMode": "RECONCILIATION_MODE",
  "autoGenerateIds": AUTO_GENERATE_IDS,
  "idField": "ID_FIELD",
  "errorConfig": {
    "gcsPrefix": "ERROR_DIRECTORY"
  }
}'
```
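Replace the following:
- PROJECT_ID: the ID of your Google Cloud project.
- DATA_STORE_ID: the ID of the Vertex AI Search data store.
- DATASET_ID: the name of your BigQuery dataset.
- TABLE_ID: the name of your BigQuery table.
- DATA_SCHEMA_BQ: an optional field to specify the schema to use when parsing data from the BigQuery source. The values can be:
  - document: the default value. The BigQuery table that you use must conform to the default BigQuery schema that follows this list. You define the ID of each document yourself and wrap the entire data in the jsonData string.
  - custom: any BigQuery table schema is accepted, and AI Applications automatically generates the IDs for each document that is imported.
- ERROR_DIRECTORY: an optional field to specify a Cloud Storage directory for error information about the import, for example gs://<your-gcs-bucket>/directory/import_errors. Google recommends leaving this field empty so that AI Applications automatically creates a temporary directory.
- RECONCILIATION_MODE: an optional field to specify how the imported documents are reconciled with the documents that already exist in the destination data store. The values can be:
  - INCREMENTAL: the default value. Causes an incremental refresh of data from BigQuery to your data store. This performs an upsert operation, which adds new documents and replaces existing documents with updated documents that have the same ID.
  - FULL: causes a full rebase of the documents in your data store. New and updated documents are added to your data store, and documents that are not in BigQuery are removed from your data store. FULL mode is helpful if you want to automatically delete documents that you no longer need.
- AUTO_GENERATE_IDS: an optional field to specify whether to automatically generate document IDs. If set to true, document IDs are generated from a hash of the payload. Note that generated document IDs might not remain consistent over multiple imports. If you auto-generate IDs over multiple imports, Google highly recommends setting reconciliationMode to FULL to maintain consistent document IDs.
  
  Specify autoGenerateIds only when bigquerySource.dataSchema is set to custom. Otherwise, an INVALID_ARGUMENT error is returned. If you don't specify autoGenerateIds or you set it to false, you must specify idField. Otherwise, the documents fail to import.
- ID_FIELD: an optional field to specify which field holds the document IDs. For BigQuery source files, idField is the name of the column in the BigQuery table that contains the document IDs.
  
  Specify idField only when both of the following conditions are met. Otherwise, an INVALID_ARGUMENT error is returned:
  - bigquerySource.dataSchema is set to custom.
  - autoGenerateIds is set to false or is unspecified.
  In addition, the values in that BigQuery column must be of string type, must be between 1 and 63 characters long, and must conform to RFC-1034. Otherwise, the documents fail to import.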
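The interaction between dataSchema, autoGenerateIds, and idField is easy to get wrong, so a small pre-flight check can help. The following is a minimal sketch of the rules described above, not part of any client library: the function name and the wording of the messages are hypothetical, and it assumes that the requirement to supply idField when autoGenerateIds is off applies to the custom schema only.

```python
def check_bigquery_import_options(data_schema="document",
                                  auto_generate_ids=False,
                                  id_field=None):
    """Return a list of problems with the option combination (empty if none)."""
    problems = []
    if auto_generate_ids and data_schema != "custom":
        problems.append("autoGenerateIds requires dataSchema 'custom'")
    if id_field is not None and data_schema != "custom":
        problems.append("idField requires dataSchema 'custom'")
    if id_field is not None and auto_generate_ids:
        problems.append("idField cannot be combined with autoGenerateIds=true")
    if data_schema == "custom" and not auto_generate_ids and id_field is None:
        problems.append("dataSchema 'custom' needs autoGenerateIds=true or idField")
    return problems


# "sku" is a hypothetical column name.
print(check_bigquery_import_options("custom", auto_generate_ids=True))  # []
print(check_bigquery_import_options("custom", id_field="sku"))          # []
print(check_bigquery_import_options("document", id_field="sku"))
```

An empty list means the combination is consistent with the rules in this section; the service remains the final authority and can still reject a request for other reasons.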
The following is the default BigQuery schema. Your BigQuery table must conform to this schema when you set dataSchema to document.
```
[
 {
   "name": "id",
   "mode": "REQUIRED",
   "type": "STRING",
   "fields": []
 },
 {
   "name": "jsonData",
   "mode": "NULLABLE",
   "type": "STRING",
   "fields": []
 }
]
```
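To make the default schema concrete, the following sketch turns an arbitrary record into a row with the two columns shown above: a required string id and a jsonData column that holds the record serialized as a JSON string. The helper name and the sample record are hypothetical; loading the resulting rows into BigQuery is outside the scope of this example.

```python
import json


def to_default_schema_row(doc_id: str, record: dict) -> dict:
    """Wrap a record as a row with `id` and `jsonData` string columns."""
    if not doc_id:
        raise ValueError("id is REQUIRED in the default schema")
    return {"id": doc_id, "jsonData": json.dumps(record, ensure_ascii=False)}


# Hypothetical document used only for illustration.
row = to_default_schema_row("doc-001", {"title": "Blue kettle", "price": 24.5})
print(row["id"])
print(row["jsonData"])
```

Note that jsonData is a string that contains JSON, not a nested record, which is why the helper serializes the record with json.dumps.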
To import your structured data from Cloud Storage, call the following method. You can import from either BigQuery or Cloud Storage. To import from BigQuery instead, see the previous step.
```
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
"https://discoveryengine.googleapis.com/v1beta/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/branches/0/documents:import" \
-d '{
  "gcsSource": {
    "inputUris": ["GCS_PATHS"],
    "dataSchema": "DATA_SCHEMA_GCS"
  },
  "reconciliationMode": "RECONCILIATION_MODE",
  "idField": "ID_FIELD",
  "errorConfig": {
    "gcsPrefix": "ERROR_DIRECTORY"
  }
}'
```
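Replace the following:
- PROJECT_ID: the ID of your Google Cloud project.
- DATA_STORE_ID: the ID of the Vertex AI Search data store.
- GCS_PATHS: a list of comma-separated URIs for the Cloud Storage locations that you want to import from. Each URI can be up to 2,000 characters long. A URI can match the full path of a storage object or a pattern that matches one or more objects. For example, gs://bucket/directory/*.json is a valid path.
- DATA_SCHEMA_GCS: an optional field to specify the schema to use when parsing data from the Cloud Storage source. The values can be:
  - document: the default value. Each line of the files that you import must be one JSON document, in which you define the document ID yourself and wrap the data in the jsonData string.
  - custom: any JSON structure is accepted, one JSON object per line, as long as it conforms to the schema of the data store.
- ERROR_DIRECTORY: an optional field to specify a Cloud Storage directory for error information about the import, for example gs://<your-gcs-bucket>/directory/import_errors. Google recommends leaving this field empty so that AI Applications automatically creates a temporary directory.
- RECONCILIATION_MODE: an optional field to specify how the imported documents are reconciled with the documents that already exist in the destination data store. The values can be:
  - INCREMENTAL: the default value. Causes an incremental refresh of data from Cloud Storage to your data store. This performs an upsert operation, which adds new documents and replaces existing documents with updated documents that have the same ID.
  - FULL: causes a full rebase of the documents in your data store. New and updated documents are added to your data store, and documents that are not in Cloud Storage are removed from your data store. FULL mode is helpful if you want to automatically delete documents that you no longer need.
- ID_FIELD: an optional field to specify which field holds the document IDs. For Cloud Storage source files, idField is the name of the JSON field that contains the document ID.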
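The difference between the two reconciliation modes can be modeled locally with plain dictionaries keyed by document ID. This is only a conceptual sketch of the behavior described above (upsert versus full rebase); the function and the sample data are hypothetical, and the real reconciliation happens inside the service.

```python
def reconcile(existing: dict, incoming: dict, mode: str = "INCREMENTAL") -> dict:
    """Model the data store contents after an import in the given mode."""
    if mode == "INCREMENTAL":
        # Upsert: keep existing documents, add new ones, replace matching IDs.
        return {**existing, **incoming}
    if mode == "FULL":
        # Rebase: the data store ends up mirroring the source exactly.
        return dict(incoming)
    raise ValueError(f"unknown reconciliation mode: {mode}")


store = {"a": "v1", "b": "v1"}    # documents already in the data store
source = {"b": "v2", "c": "v1"}   # documents in the refreshed source

print(reconcile(store, source, "INCREMENTAL"))  # {'a': 'v1', 'b': 'v2', 'c': 'v1'}
print(reconcile(store, source, "FULL"))         # {'b': 'v2', 'c': 'v1'}
```

Document "a" survives an INCREMENTAL import but is removed by a FULL import, because it no longer exists in the source.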

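Python

For more information, see the AI Applications Python API reference documentation.

To authenticate to AI Applications, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".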


```
from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_LOCATION" # Values: "global"
# data_store_id = "YOUR_DATA_STORE_ID"
# bigquery_dataset = "YOUR_BIGQUERY_DATASET"
# bigquery_table = "YOUR_BIGQUERY_TABLE"

# For more information, refer to:
# https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
client_options = (
    ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
    if location != "global"
    else None
)

# Create a client
client = discoveryengine.DocumentServiceClient(client_options=client_options)

# The full resource name of the search engine branch.
# e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
parent = client.branch_path(
    project=project_id,
    location=location,
    data_store=data_store_id,
    branch="default_branch",
)

request = discoveryengine.ImportDocumentsRequest(
    parent=parent,
    bigquery_source=discoveryengine.BigQuerySource(
        project_id=project_id,
        dataset_id=bigquery_dataset,
        table_id=bigquery_table,
        data_schema="custom",
    ),
    # Options: `FULL`, `INCREMENTAL`
    reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
)

# Make the request
operation = client.import_documents(request=request)

print(f"Waiting for operation to complete: {operation.operation.name}")
response = operation.result()

# After the operation is complete,
# get information from operation metadata
metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

# Handle the response
print(response)
print(metadata)
```

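Refresh unstructured data

You can refresh unstructured data in the Google Cloud console or by using the API.

Console

To use the Google Cloud console to refresh unstructured data from a branch of a data store, follow these steps:

In the Google Cloud console, go to the "AI Applications" page.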

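In the navigation menu, click "Data Stores".
In the "Name" column, click the data store that you want to edit.
On the "Documents" tab, click "Import data".
To ingest from a Cloud Storage bucket (with or without metadata):
1. In the "Select a data source" pane, select "Cloud Storage".
2. In the "Import data from Cloud Storage" pane, click "Browse", select the bucket that contains your refreshed data, and then click "Select". Alternatively, enter the bucket location directly in the "gs://" field.
3. Under "Data import options", select an import option.
4. Click "Import".
To ingest from BigQuery:
1. In the "Select a data source" pane, select "BigQuery".
2. In the "Import data from BigQuery" pane, click "Browse", select the table that contains your refreshed data, and then click "Select". Alternatively, enter the table location directly in the "BigQuery path" field.
3. Under "Data import options", select an import option.
4. Click "Import".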

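REST

To refresh unstructured data using the API, reimport it with the documents.import method, specifying the appropriate reconciliationMode value. For more information about importing unstructured data, see "Unstructured data".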
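As a hedged sketch of what such a reimport request might look like, the following helper assembles the endpoint URL and JSON body for an unstructured refresh from Cloud Storage. It reuses the URL pattern from the structured REST samples on this page and the content schema value listed in the Python sample; the helper name and the project, data store, and bucket values are hypothetical, and the helper only builds the request without sending it.

```python
import json

API_ROOT = "https://discoveryengine.googleapis.com/v1beta"
MAX_URI_LENGTH = 2000  # per-URI limit stated for Cloud Storage input URIs


def build_import_request(project_id, data_store_id, gcs_uris,
                         reconciliation_mode="INCREMENTAL"):
    """Return (url, json_body) for a documents:import call on branch 0."""
    if reconciliation_mode not in ("INCREMENTAL", "FULL"):
        raise ValueError(f"unknown reconciliation mode: {reconciliation_mode}")
    for uri in gcs_uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"not a Cloud Storage URI: {uri}")
        if len(uri) > MAX_URI_LENGTH:
            raise ValueError("URI exceeds 2,000 characters")
    url = (
        f"{API_ROOT}/projects/{project_id}/locations/global"
        f"/collections/default_collection/dataStores/{data_store_id}"
        "/branches/0/documents:import"
    )
    body = {
        "gcsSource": {"inputUris": list(gcs_uris), "dataSchema": "content"},
        "reconciliationMode": reconciliation_mode,
    }
    return url, json.dumps(body, indent=2)


# Hypothetical identifiers, for illustration only.
url, body = build_import_request(
    "my-project", "my-data-store", ["gs://bucket/directory/*.pdf"], "FULL"
)
print(url)
print(body)
```

The printed body can be passed to curl with the -d flag, together with the same Authorization and Content-Type headers used in the structured data samples.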

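Python

For more information, see the AI Applications Python API reference documentation.

To authenticate to AI Applications, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".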

```
from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_LOCATION" # Values: "global"
# data_store_id = "YOUR_DATA_STORE_ID"

# Examples:
# - Unstructured documents
#   - `gs://bucket/directory/file.pdf`
#   - `gs://bucket/directory/*.pdf`
# - Unstructured documents with JSONL Metadata
#   - `gs://bucket/directory/file.json`
# - Unstructured documents with CSV Metadata
#   - `gs://bucket/directory/file.csv`
# gcs_uri = "YOUR_GCS_PATH"

# For more information, refer to:
# https://cloud.google.com/generative-ai-app-builder/docs/locations#specify_a_multi-region_for_your_data_store
client_options = (
    ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
    if location != "global"
    else None
)

# Create a client
client = discoveryengine.DocumentServiceClient(client_options=client_options)

# The full resource name of the search engine branch.
# e.g. projects/{project}/locations/{location}/dataStores/{data_store_id}/branches/{branch}
parent = client.branch_path(
    project=project_id,
    location=location,
    data_store=data_store_id,
    branch="default_branch",
)

request = discoveryengine.ImportDocumentsRequest(
    parent=parent,
    gcs_source=discoveryengine.GcsSource(
        # Multiple URIs are supported
        input_uris=[gcs_uri],
        # Options:
        # - `content` - Unstructured documents (PDF, HTML, DOC, TXT, PPTX)
        # - `custom` - Unstructured documents with custom JSONL metadata
        # - `document` - Structured documents in the discoveryengine.Document format.
        # - `csv` - Unstructured documents with CSV metadata
        data_schema="content",
    ),
    # Options: `FULL`, `INCREMENTAL`
    reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL,
)

# Make the request
operation = client.import_documents(request=request)

print(f"Waiting for operation to complete: {operation.operation.name}")
response = operation.result()

# After the operation is complete,
# get information from operation metadata
metadata = discoveryengine.ImportDocumentsMetadata(operation.metadata)

# Handle the response
print(response)
print(metadata)
```
