이 페이지는 Cloud Translation API를 통해 번역되었습니다.

처리 응답 처리

처리 요청에 대한 응답에는 Document AI가 추출할 수 있었던 모든 정형 정보를 포함하여 처리된 문서에 대해 알려진 모든 정보를 담고 있는 Document 객체가 포함됩니다.

이 페이지에서는 샘플 문서를 제공한 다음 OCR 결과의 측면을 Document 객체 JSON의 특정 요소에 매핑하여 Document 객체의 레이아웃을 설명합니다. 클라이언트 라이브러리 코드 샘플과 Document AI Toolbox SDK 코드 샘플도 제공합니다. 이 코드 샘플은 온라인 처리를 사용하지만 Document 객체 파싱은 일괄 처리에서도 동일하게 작동합니다.

handle-response-1

요소를 펼치거나 접히도록 설계된 JSON 뷰어 또는 편집 유틸리티를 사용합니다. 일반 텍스트 유틸리티에서 원시 JSON을 검토하는 것은 비효율적입니다.

텍스트, 레이아웃, 품질 점수

다음은 샘플 텍스트 문서입니다.

handle-response-2

다음은 Enterprise Document OCR 프로세서에서 반환한 전체 문서 객체입니다.

JSON 다운로드

OCR은 프로세서에서 실행되므로 이 OCR 출력은 항상 Document AI 프로세서 출력에 포함됩니다. 기존 OCR 데이터를 사용하므로 인라인 문서 옵션을 사용하여 Document AI 프로세서에 이러한 JSON 데이터를 입력할 수 있습니다.

  image=None, # all our samples pass this var
  mime_type="application/json",
  inline_document=document_response # pass OCR output to CDE input - undocumented

다음은 몇 가지 중요 필드입니다.

원시 텍스트

text 필드에는 Document AI에서 인식한 텍스트가 포함됩니다. 이 텍스트에는 공백, 탭, 줄바꿈 외의 레이아웃 구조가 포함되어 있지 않습니다. 이 필드는 문서의 텍스트 정보를 저장하고 문서 텍스트의 정보 소스로 사용되는 유일한 필드입니다. 다른 필드는 위치 (startIndex 및 endIndex)를 기준으로 텍스트 필드의 일부를 참조할 수 있습니다.

  {
    text: "Sample Document\nHeading 1\nLorem ipsum dolor sit amet, ..."
  }

페이지 크기 및 언어

문서 객체의 각 page는 샘플 문서의 실제 페이지에 해당합니다. 샘플 JSON 출력에는 단일 PNG 이미지이므로 페이지가 하나 포함됩니다.

  {
    "pages:" [
      {
        "pageNumber": 1,
        "dimension": {
          "width": 679.0,
          "height": 460.0,
          "unit": "pixels"
        },
      }
    ]
  }

pages[].detectedLanguages[] 필드에는 특정 페이지에서 발견된 언어와 신뢰도 점수가 포함됩니다.

{
  "pages": [
    {
      "detectedLanguages": [
        {
          "confidence": 0.98009938,
          "languageCode": "en"
        },
        {
          "confidence": 0.01990064,
          "languageCode": "und"
        }
      ]
    }
  ]
}

OCR 데이터

문서 AI OCR은 텍스트 블록, 단락, 토큰, 기호와 같이 페이지에서 다양한 세부사항 또는 구성으로 텍스트를 감지합니다 (기호 수준은 기호 수준 데이터를 출력하도록 구성된 경우 선택사항임). 다음은 모두 페이지 객체의 구성원입니다.

모든 요소에는 위치와 텍스트를 설명하는 상응하는 layout가 있습니다. 텍스트가 아닌 시각적 요소(예: 체크박스)도 페이지 수준입니다.

{
  "pages": [
    {
      "paragraphs": [
        {
          "layout": {
            "textAnchor": {
              "textSegments": [
                {
                  "endIndex": "16"
                }
              ]
            },
            "confidence": 0.9939527,
            "boundingPoly": {
              "vertices": [ ... ],
              "normalizedVertices": [ ... ]
            },
            "orientation": "PAGE_UP"
          }
        }
      ]
    }
  ]
}

원시 텍스트는 startIndex 및 endIndex로 기본 텍스트 문자열에 색인이 생성된 textAnchor 객체에서 참조됩니다.

텍스트 문자열이 기본 텍스트 문자열의 시작 부분에서 시작하는 경우

startIndex가 생략될 수 있습니다.

boundingPoly의 경우 페이지의 왼쪽 상단이 원점 (0,0)입니다. 양수 X 값은 오른쪽이고 양수 Y 값은 아래쪽입니다.
vertices 객체는 원본 이미지와 동일한 좌표를 사용하는 반면 normalizedVertices는 [0,1] 범위에 있습니다. 이미지의 정규화의 측정값 기울기 보정 및 기타 속성을 나타내는 변환 행렬이 있습니다.

주의: 0 좌표 값은 생략됩니다. API에서 좌표 ('x' 또는 'y') 값 0을 감지하는 경우 해당 좌표는 JSON 응답에서 생략됩니다. 예를 들어 전체 이미지를 둘러싸는 BoundingPoly가 있는 응답은 "normalizedVertices": [{}, {"x": 1}, {"x": 1,"y": 1}, {"y": 1}] 또는 "vertices": [{}, {"x": X_MAX}, {"x": X_MAX,"y": Y_MAX}, {"y": Y_MAX}]입니다.

boundingPoly를 그리려면 한 정점에서 다음 정점으로 선분을 그립니다. 그런 다음 마지막 꼭지점에서 첫 번째 꼭지점으로 선분을 그려 다각형을 닫습니다. 레이아웃의 orientation 요소는 텍스트가 페이지를 기준으로 회전되었는지 여부를 나타냅니다.

문서 구조를 시각화하는 데 도움이 되도록 다음 이미지에는 page.paragraphs, page.lines, page.tokens의 경계 다각형이 그려져 있습니다.

단락

handle-response-3

선

handle-response-4

토큰

handle-response-5

블록

handle-response-6

Enterprise Document OCR 프로세서는 가독성을 기반으로 문서의 품질을 평가할 수 있습니다.

API 응답에서 이 데이터를 가져오려면 필드 processOptions.ocrConfig.enableImageQualityScores를 true로 설정해야 합니다.

이 품질 평가는 [0, 1]의 품질평가점수이며 1은 완벽한 품질을 의미합니다. 품질 점수는 Page.imageQualityScores 필드에 반환됩니다. 감지된 모든 결함은 quality/defect_*로 표시되며 신뢰도 값을 기준으로 내림차순으로 정렬됩니다.

다음은 너무 어둡고 흐려서 읽기 불편한 PDF입니다.

PDF 다운로드

다음은 Enterprise Document OCR 프로세서에서 반환한 문서 품질 정보입니다.

  {
    "pages": [
      {
        "imageQualityScores": {
          "qualityScore": 0.7811847,
          "detectedDefects": [
            {
              "type": "quality/defect_document_cutoff",
              "confidence": 1.0
            },
            {
              "type": "quality/defect_glare",
              "confidence": 0.97849524
            },
            {
              "type": "quality/defect_text_cutoff",
              "confidence": 0.5
            }
          ]
        }
      }
    ]
  }

코드 샘플

다음 코드 샘플은 처리 요청을 전송한 후 필드를 읽고 터미널에 출력하는 방법을 보여줍니다.

Java

자세한 내용은 Document AI Java API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


import com.google.cloud.documentai.v1beta3.Document;
import com.google.cloud.documentai.v1beta3.DocumentProcessorServiceClient;
import com.google.cloud.documentai.v1beta3.DocumentProcessorServiceSettings;
import com.google.cloud.documentai.v1beta3.ProcessRequest;
import com.google.cloud.documentai.v1beta3.ProcessResponse;
import com.google.cloud.documentai.v1beta3.RawDocument;
import com.google.protobuf.ByteString;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

public class ProcessOcrDocument {
  public static void processOcrDocument()
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String location = "your-project-location"; // Format is "us" or "eu".
    String processerId = "your-processor-id";
    String filePath = "path/to/input/file.pdf";
    processOcrDocument(projectId, location, processerId, filePath);
  }

  public static void processOcrDocument(
      String projectId, String location, String processorId, String filePath)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs
    // to be created
    // once, and can be reused for multiple requests. After completing all of your
    // requests, call
    // the "close" method on the client to safely clean up any remaining background
    // resources.
    String endpoint = String.format("%s-documentai.googleapis.com:443", location);
    DocumentProcessorServiceSettings settings =
        DocumentProcessorServiceSettings.newBuilder().setEndpoint(endpoint).build();
    try (DocumentProcessorServiceClient client = DocumentProcessorServiceClient.create(settings)) {
      // The full resource name of the processor, e.g.:
      // projects/project-id/locations/location/processor/processor-id
      // You must create new processors in the Cloud Console first
      String name =
          String.format("projects/%s/locations/%s/processors/%s", projectId, location, processorId);

      // Read the file.
      byte[] imageFileData = Files.readAllBytes(Paths.get(filePath));

      // Convert the image data to a Buffer and base64 encode it.
      ByteString content = ByteString.copyFrom(imageFileData);

      RawDocument document =
          RawDocument.newBuilder().setContent(content).setMimeType("application/pdf").build();

      // Configure the process request.
      ProcessRequest request =
          ProcessRequest.newBuilder().setName(name).setRawDocument(document).build();

      // Recognizes text entities in the PDF document
      ProcessResponse result = client.processDocument(request);
      Document documentResponse = result.getDocument();

      System.out.println("Document processing complete.");

      // Read the text recognition output from the processor
      // For a full list of Document object attributes,
      // please reference this page:
      // https://googleapis.dev/java/google-cloud-document-ai/latest/index.html

      // Get all of the document text as one big string
      String text = documentResponse.getText();
      System.out.printf("Full document text: '%s'\n", escapeNewlines(text));

      // Read the text recognition output from the processor
      List<Document.Page> pages = documentResponse.getPagesList();
      System.out.printf("There are %s page(s) in this document.\n", pages.size());

      for (Document.Page page : pages) {
        System.out.printf("Page %d:\n", page.getPageNumber());
        printPageDimensions(page.getDimension());
        printDetectedLanguages(page.getDetectedLanguagesList());
        printParagraphs(page.getParagraphsList(), text);
        printBlocks(page.getBlocksList(), text);
        printLines(page.getLinesList(), text);
        printTokens(page.getTokensList(), text);
      }
    }
  }

  private static void printPageDimensions(Document.Page.Dimension dimension) {
    String unit = dimension.getUnit();
    System.out.printf("    Width: %.1f %s\n", dimension.getWidth(), unit);
    System.out.printf("    Height: %.1f %s\n", dimension.getHeight(), unit);
  }

  private static void printDetectedLanguages(
      List<Document.Page.DetectedLanguage> detectedLangauges) {
    System.out.println("    Detected languages:");
    for (Document.Page.DetectedLanguage detectedLanguage : detectedLangauges) {
      String languageCode = detectedLanguage.getLanguageCode();
      float confidence = detectedLanguage.getConfidence();
      System.out.printf("        %s (%.2f%%)\n", languageCode, confidence * 100.0);
    }
  }

  private static void printParagraphs(List<Document.Page.Paragraph> paragraphs, String text) {
    System.out.printf("    %d paragraphs detected:\n", paragraphs.size());
    Document.Page.Paragraph firstParagraph = paragraphs.get(0);
    String firstParagraphText = getLayoutText(firstParagraph.getLayout().getTextAnchor(), text);
    System.out.printf("        First paragraph text: %s\n", escapeNewlines(firstParagraphText));
    Document.Page.Paragraph lastParagraph = paragraphs.get(paragraphs.size() - 1);
    String lastParagraphText = getLayoutText(lastParagraph.getLayout().getTextAnchor(), text);
    System.out.printf("        Last paragraph text: %s\n", escapeNewlines(lastParagraphText));
  }

  private static void printBlocks(List<Document.Page.Block> blocks, String text) {
    System.out.printf("    %d blocks detected:\n", blocks.size());
    Document.Page.Block firstBlock = blocks.get(0);
    String firstBlockText = getLayoutText(firstBlock.getLayout().getTextAnchor(), text);
    System.out.printf("        First block text: %s\n", escapeNewlines(firstBlockText));
    Document.Page.Block lastBlock = blocks.get(blocks.size() - 1);
    String lastBlockText = getLayoutText(lastBlock.getLayout().getTextAnchor(), text);
    System.out.printf("        Last block text: %s\n", escapeNewlines(lastBlockText));
  }

  private static void printLines(List<Document.Page.Line> lines, String text) {
    System.out.printf("    %d lines detected:\n", lines.size());
    Document.Page.Line firstLine = lines.get(0);
    String firstLineText = getLayoutText(firstLine.getLayout().getTextAnchor(), text);
    System.out.printf("        First line text: %s\n", escapeNewlines(firstLineText));
    Document.Page.Line lastLine = lines.get(lines.size() - 1);
    String lastLineText = getLayoutText(lastLine.getLayout().getTextAnchor(), text);
    System.out.printf("        Last line text: %s\n", escapeNewlines(lastLineText));
  }

  private static void printTokens(List<Document.Page.Token> tokens, String text) {
    System.out.printf("    %d tokens detected:\n", tokens.size());
    Document.Page.Token firstToken = tokens.get(0);
    String firstTokenText = getLayoutText(firstToken.getLayout().getTextAnchor(), text);
    System.out.printf("        First token text: %s\n", escapeNewlines(firstTokenText));
    Document.Page.Token lastToken = tokens.get(tokens.size() - 1);
    String lastTokenText = getLayoutText(lastToken.getLayout().getTextAnchor(), text);
    System.out.printf("        Last token text: %s\n", escapeNewlines(lastTokenText));
  }

  // Extract shards from the text field
  private static String getLayoutText(Document.TextAnchor textAnchor, String text) {
    if (textAnchor.getTextSegmentsList().size() > 0) {
      int startIdx = (int) textAnchor.getTextSegments(0).getStartIndex();
      int endIdx = (int) textAnchor.getTextSegments(0).getEndIndex();
      return text.substring(startIdx, endIdx);
    }
    return "[NO TEXT]";
  }

  private static String escapeNewlines(String s) {
    return s.replace("\n", "\\n").replace("\r", "\\r");
  }
}

Node.js

자세한 내용은 Document AI Node.js API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION'; // Format is 'us' or 'eu'
// const processorId = 'YOUR_PROCESSOR_ID'; // Create processor in Cloud Console
// const filePath = '/path/to/local/pdf';

const {DocumentProcessorServiceClient} =
  require('@google-cloud/documentai').v1beta3;

// Instantiates a client
const client = new DocumentProcessorServiceClient();

async function processDocument() {
  // The full resource name of the processor, e.g.:
  // projects/project-id/locations/location/processor/processor-id
  // You must create new processors in the Cloud Console first
  const name = `projects/${projectId}/locations/${location}/processors/${processorId}`;

  // Read the file into memory.
  const fs = require('fs').promises;
  const imageFile = await fs.readFile(filePath);

  // Convert the image data to a Buffer and base64 encode it.
  const encodedImage = Buffer.from(imageFile).toString('base64');

  const request = {
    name,
    rawDocument: {
      content: encodedImage,
      mimeType: 'application/pdf',
    },
  };

  // Recognizes text entities in the PDF document
  const [result] = await client.processDocument(request);

  console.log('Document processing complete.');

  // Read the text recognition output from the processor
  // For a full list of Document object attributes,
  // please reference this page: https://googleapis.dev/nodejs/documentai/latest/index.html
  const {document} = result;
  const {text} = document;

  // Read the text recognition output from the processor
  console.log(`Full document text: ${JSON.stringify(text)}`);
  console.log(`There are ${document.pages.length} page(s) in this document.`);
  for (const page of document.pages) {
    console.log(`Page ${page.pageNumber}`);
    printPageDimensions(page.dimension);
    printDetectedLanguages(page.detectedLanguages);
    printParagraphs(page.paragraphs, text);
    printBlocks(page.blocks, text);
    printLines(page.lines, text);
    printTokens(page.tokens, text);
  }
}

const printPageDimensions = dimension => {
  console.log(`    Width: ${dimension.width}`);
  console.log(`    Height: ${dimension.height}`);
};

const printDetectedLanguages = detectedLanguages => {
  console.log('    Detected languages:');
  for (const lang of detectedLanguages) {
    const code = lang.languageCode;
    const confPercent = lang.confidence * 100;
    console.log(`        ${code} (${confPercent.toFixed(2)}% confidence)`);
  }
};

const printParagraphs = (paragraphs, text) => {
  console.log(`    ${paragraphs.length} paragraphs detected:`);
  const firstParagraphText = getText(paragraphs[0].layout.textAnchor, text);
  console.log(
    `        First paragraph text: ${JSON.stringify(firstParagraphText)}`
  );
  const lastParagraphText = getText(
    paragraphs[paragraphs.length - 1].layout.textAnchor,
    text
  );
  console.log(
    `        Last paragraph text: ${JSON.stringify(lastParagraphText)}`
  );
};

const printBlocks = (blocks, text) => {
  console.log(`    ${blocks.length} blocks detected:`);
  const firstBlockText = getText(blocks[0].layout.textAnchor, text);
  console.log(`        First block text: ${JSON.stringify(firstBlockText)}`);
  const lastBlockText = getText(
    blocks[blocks.length - 1].layout.textAnchor,
    text
  );
  console.log(`        Last block text: ${JSON.stringify(lastBlockText)}`);
};

const printLines = (lines, text) => {
  console.log(`    ${lines.length} lines detected:`);
  const firstLineText = getText(lines[0].layout.textAnchor, text);
  console.log(`        First line text: ${JSON.stringify(firstLineText)}`);
  const lastLineText = getText(
    lines[lines.length - 1].layout.textAnchor,
    text
  );
  console.log(`        Last line text: ${JSON.stringify(lastLineText)}`);
};

const printTokens = (tokens, text) => {
  console.log(`    ${tokens.length} tokens detected:`);
  const firstTokenText = getText(tokens[0].layout.textAnchor, text);
  console.log(`        First token text: ${JSON.stringify(firstTokenText)}`);
  const firstTokenBreakType = tokens[0].detectedBreak.type;
  console.log(`        First token break type: ${firstTokenBreakType}`);
  const lastTokenText = getText(
    tokens[tokens.length - 1].layout.textAnchor,
    text
  );
  console.log(`        Last token text: ${JSON.stringify(lastTokenText)}`);
  const lastTokenBreakType = tokens[tokens.length - 1].detectedBreak.type;
  console.log(`        Last token break type: ${lastTokenBreakType}`);
};

// Extract shards from the text field
const getText = (textAnchor, text) => {
  if (!textAnchor.textSegments || textAnchor.textSegments.length === 0) {
    return '';
  }

  // First shard in document doesn't have startIndex property
  const startIndex = textAnchor.textSegments[0].startIndex || 0;
  const endIndex = textAnchor.textSegments[0].endIndex;

  return text.substring(startIndex, endIndex);
};

Python

자세한 내용은 Document AI Python API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


from typing import Optional, Sequence

from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_PROCESSOR_LOCATION" # Format is "us" or "eu"
# processor_id = "YOUR_PROCESSOR_ID" # Create processor before running sample
# processor_version = "rc" # Refer to https://cloud.google.com/document-ai/docs/manage-processor-versions for more information
# file_path = "/path/to/local/pdf"
# mime_type = "application/pdf" # Refer to https://cloud.google.com/document-ai/docs/file-types for supported file types


def process_document_ocr_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
) -> None:
    # Optional: Additional configurations for Document OCR Processor.
    # For more information: https://cloud.google.com/document-ai/docs/enterprise-document-ocr
    process_options = documentai.ProcessOptions(
        ocr_config=documentai.OcrConfig(
            enable_native_pdf_parsing=True,
            enable_image_quality_scores=True,
            enable_symbol=True,
            # OCR Add Ons https://cloud.google.com/document-ai/docs/ocr-add-ons
            premium_features=documentai.OcrConfig.PremiumFeatures(
                compute_style_info=True,
                enable_math_ocr=False,  # Enable to use Math OCR Model
                enable_selection_mark_detection=True,
            ),
        )
    )
    # Online processing request to Document AI
    document = process_document(
        project_id,
        location,
        processor_id,
        processor_version,
        file_path,
        mime_type,
        process_options=process_options,
    )

    text = document.text
    print(f"Full document text: {text}\n")
    print(f"There are {len(document.pages)} page(s) in this document.\n")

    for page in document.pages:
        print(f"Page {page.page_number}:")
        print_page_dimensions(page.dimension)
        print_detected_languages(page.detected_languages)

        print_blocks(page.blocks, text)
        print_paragraphs(page.paragraphs, text)
        print_lines(page.lines, text)
        print_tokens(page.tokens, text)

        if page.symbols:
            print_symbols(page.symbols, text)

        if page.image_quality_scores:
            print_image_quality_scores(page.image_quality_scores)

        if page.visual_elements:
            print_visual_elements(page.visual_elements, text)


def print_page_dimensions(dimension: documentai.Document.Page.Dimension) -> None:
    print(f"    Width: {str(dimension.width)}")
    print(f"    Height: {str(dimension.height)}")


def print_detected_languages(
    detected_languages: Sequence[documentai.Document.Page.DetectedLanguage],
) -> None:
    print("    Detected languages:")
    for lang in detected_languages:
        print(f"        {lang.language_code} ({lang.confidence:.1%} confidence)")


def print_blocks(blocks: Sequence[documentai.Document.Page.Block], text: str) -> None:
    print(f"    {len(blocks)} blocks detected:")
    first_block_text = layout_to_text(blocks[0].layout, text)
    print(f"        First text block: {repr(first_block_text)}")
    last_block_text = layout_to_text(blocks[-1].layout, text)
    print(f"        Last text block: {repr(last_block_text)}")


def print_paragraphs(
    paragraphs: Sequence[documentai.Document.Page.Paragraph], text: str
) -> None:
    print(f"    {len(paragraphs)} paragraphs detected:")
    first_paragraph_text = layout_to_text(paragraphs[0].layout, text)
    print(f"        First paragraph text: {repr(first_paragraph_text)}")
    last_paragraph_text = layout_to_text(paragraphs[-1].layout, text)
    print(f"        Last paragraph text: {repr(last_paragraph_text)}")


def print_lines(lines: Sequence[documentai.Document.Page.Line], text: str) -> None:
    print(f"    {len(lines)} lines detected:")
    first_line_text = layout_to_text(lines[0].layout, text)
    print(f"        First line text: {repr(first_line_text)}")
    last_line_text = layout_to_text(lines[-1].layout, text)
    print(f"        Last line text: {repr(last_line_text)}")


def print_tokens(tokens: Sequence[documentai.Document.Page.Token], text: str) -> None:
    print(f"    {len(tokens)} tokens detected:")
    first_token_text = layout_to_text(tokens[0].layout, text)
    first_token_break_type = tokens[0].detected_break.type_.name
    print(f"        First token text: {repr(first_token_text)}")
    print(f"        First token break type: {repr(first_token_break_type)}")
    if tokens[0].style_info:
        print_style_info(tokens[0].style_info)

    last_token_text = layout_to_text(tokens[-1].layout, text)
    last_token_break_type = tokens[-1].detected_break.type_.name
    print(f"        Last token text: {repr(last_token_text)}")
    print(f"        Last token break type: {repr(last_token_break_type)}")
    if tokens[-1].style_info:
        print_style_info(tokens[-1].style_info)


def print_symbols(
    symbols: Sequence[documentai.Document.Page.Symbol], text: str
) -> None:
    print(f"    {len(symbols)} symbols detected:")
    first_symbol_text = layout_to_text(symbols[0].layout, text)
    print(f"        First symbol text: {repr(first_symbol_text)}")
    last_symbol_text = layout_to_text(symbols[-1].layout, text)
    print(f"        Last symbol text: {repr(last_symbol_text)}")


def print_image_quality_scores(
    image_quality_scores: documentai.Document.Page.ImageQualityScores,
) -> None:
    print(f"    Quality score: {image_quality_scores.quality_score:.1%}")
    print("    Detected defects:")

    for detected_defect in image_quality_scores.detected_defects:
        print(f"        {detected_defect.type_}: {detected_defect.confidence:.1%}")


def print_style_info(style_info: documentai.Document.Page.Token.StyleInfo) -> None:
    """
    Only supported in version `pretrained-ocr-v2.0-2023-06-02`
    """
    print(f"           Font Size: {style_info.font_size}pt")
    print(f"           Font Type: {style_info.font_type}")
    print(f"           Bold: {style_info.bold}")
    print(f"           Italic: {style_info.italic}")
    print(f"           Underlined: {style_info.underlined}")
    print(f"           Handwritten: {style_info.handwritten}")
    print(
        f"           Text Color (RGBa): {style_info.text_color.red}, {style_info.text_color.green}, {style_info.text_color.blue}, {style_info.text_color.alpha}"
    )


def print_visual_elements(
    visual_elements: Sequence[documentai.Document.Page.VisualElement], text: str
) -> None:
    """
    Only supported in version `pretrained-ocr-v2.0-2023-06-02`
    """
    checkboxes = [x for x in visual_elements if "checkbox" in x.type]
    math_symbols = [x for x in visual_elements if x.type == "math_formula"]

    if checkboxes:
        print(f"    {len(checkboxes)} checkboxes detected:")
        print(f"        First checkbox: {repr(checkboxes[0].type)}")
        print(f"        Last checkbox: {repr(checkboxes[-1].type)}")

    if math_symbols:
        print(f"    {len(math_symbols)} math symbols detected:")
        first_math_symbol_text = layout_to_text(math_symbols[0].layout, text)
        print(f"        First math symbol: {repr(first_math_symbol_text)}")




def process_document(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
    process_options: Optional[documentai.ProcessOptions] = None,
) -> documentai.Document:
    # You must set the `api_endpoint` if you use a location other than "us".
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    # The full resource name of the processor version, e.g.:
    # `projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}`
    # You must create a processor before running this sample.
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version
    )

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(content=image_content, mime_type=mime_type),
        # Only supported for Document OCR processor
        process_options=process_options,
    )

    result = client.process_document(request=request)

    # For a full list of `Document` object attributes, reference this page:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
    return result.document




def layout_to_text(layout: documentai.Document.Page.Layout, text: str) -> str:
    """
    Document AI identifies text in different parts of the document by their
    offsets in the entirety of the document"s text. This function converts
    offsets to a string.
    """
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    return "".join(
        text[int(segment.start_index) : int(segment.end_index)]
        for segment in layout.text_anchor.text_segments
    )

양식 및 표

다음은 샘플 양식입니다.

handle-response-7

다음은 양식 파서에서 반환한 전체 문서 객체입니다.

JSON 다운로드

다음은 몇 가지 중요 필드입니다.

양식 파서는 페이지에서 FormFields를 감지할 수 있습니다. 각 양식 필드에는 이름과 값이 있습니다. 키-값 쌍 (KVP)이라고도 합니다. KVP는 다른 추출기의 (스키마) 항목과 다릅니다.

항목 이름이 구성됩니다. KVP의 키는 문서의 키 텍스트입니다.

{
  "pages:" [
    {
      "formFields": [
        {
          "fieldName": { ... },
          "fieldValue": { ... }
        }
      ]
    }
  ]
}

Document AI는 페이지에서 Tables도 감지할 수 있습니다.

{
  "pages:" [
    {
      "tables": [
        {
          "layout": { ... },
          "headerRows": [
            {
              "cells": [
                {
                  "layout": { ... },
                  "rowSpan": 1,
                  "colSpan": 1
                },
                {
                  "layout": { ... },
                  "rowSpan": 1,
                  "colSpan": 1
                }
              ]
            }
          ],
          "bodyRows": [
            {
              "cells": [
                {
                  "layout": { ... },
                  "rowSpan": 1,
                  "colSpan": 1
                },
                {
                  "layout": { ... },
                  "rowSpan": 1,
                  "colSpan": 1
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

양식 파서 내의 표 추출은 행이나 열에 걸쳐 있는 셀이 없는 간단한 표만 인식합니다. 따라서 rowSpan 및 colSpan은 항상 1입니다.

프로세서 버전 pretrained-form-parser-v2.0-2022-11-10부터 양식 파서는 일반 항목도 인식할 수 있습니다. 자세한 내용은 양식 파서를 참고하세요.
문서 구조를 시각화하는 데 도움이 되도록 다음 이미지에는 page.formFields 및 page.tables의 경계 다각형이 그려져 있습니다.
표의 체크박스 양식 파서는 이미지와 PDF의 체크박스를 KVP로 디지털화할 수 있습니다. 체크박스 디지털화의 예를 키-값 쌍으로 제공합니다.

handle-response-8

테이블 외부에서 체크박스는 양식 파서 내에서 시각적 요소로 표시됩니다. UI의 체크표시가 있는 정사각형 상자와 JSON의 유니코드 ✓를 강조 표시합니다.

handle-response-9

"pages:" [
    {
      "tables": [
        {
          "layout": { ... },
          "headerRows": [
            {
              "cells": [
                {
                  "layout": { ... },
                  "rowSpan": 1,
                  "colSpan": 1
                },
                {
                  "layout": { ... },
                  "rowSpan": 1,
                  "colSpan": 1
                }
              ]
            }
          ],
          "bodyRows": [
            {
              "cells": [
                {
                  "layout": { ... },
                  "rowSpan": 1,
                  "colSpan": 1
                },
                {
                  "layout": { ... },
                  "rowSpan": 1,
                  "colSpan": 1
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

표에서 체크박스는 ✓ (선택됨) 또는 ☐ (선택 해제됨)과 같은 유니코드 문자로 표시됩니다.

선택된 체크박스의 값은 filled_checkbox: under pages > x > formFields > x > fieldValue > valueType.입니다. 선택 해제된 체크박스의 값은 unfilled_checkbox입니다.

handle-response-10

콘텐츠 필드에는 경로 pages>formFields>x>fieldValue>textAnchor>content에서 체크박스 콘텐츠 값이 강조 표시된 ✓로 표시됩니다.

문서 구조를 시각화하는 데 도움이 되도록 다음 이미지에는 page.formFields 및 page.tables의 경계 다각형이 그려져 있습니다.

양식 입력란

handle-response-11

테이블

handle-response-12

코드 샘플

다음 코드 샘플은 처리 요청을 전송한 후 필드를 읽고 터미널에 출력하는 방법을 보여줍니다.

Java

자세한 내용은 Document AI Java API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


import com.google.cloud.documentai.v1beta3.Document;
import com.google.cloud.documentai.v1beta3.DocumentProcessorServiceClient;
import com.google.cloud.documentai.v1beta3.DocumentProcessorServiceSettings;
import com.google.cloud.documentai.v1beta3.ProcessRequest;
import com.google.cloud.documentai.v1beta3.ProcessResponse;
import com.google.cloud.documentai.v1beta3.RawDocument;
import com.google.protobuf.ByteString;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

public class ProcessFormDocument {
  public static void processFormDocument()
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String location = "your-project-location"; // Format is "us" or "eu".
    String processerId = "your-processor-id";
    String filePath = "path/to/input/file.pdf";
    processFormDocument(projectId, location, processerId, filePath);
  }

  public static void processFormDocument(
      String projectId, String location, String processorId, String filePath)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs
    // to be created
    // once, and can be reused for multiple requests. After completing all of your
    // requests, call
    // the "close" method on the client to safely clean up any remaining background
    // resources.
    String endpoint = String.format("%s-documentai.googleapis.com:443", location);
    DocumentProcessorServiceSettings settings =
        DocumentProcessorServiceSettings.newBuilder().setEndpoint(endpoint).build();
    try (DocumentProcessorServiceClient client = DocumentProcessorServiceClient.create(settings)) {
      // The full resource name of the processor, e.g.:
      // projects/project-id/locations/location/processor/processor-id
      // You must create new processors in the Cloud Console first
      String name =
          String.format("projects/%s/locations/%s/processors/%s", projectId, location, processorId);

      // Read the file.
      byte[] imageFileData = Files.readAllBytes(Paths.get(filePath));

      // Convert the image data to a Buffer and base64 encode it.
      ByteString content = ByteString.copyFrom(imageFileData);

      RawDocument document =
          RawDocument.newBuilder().setContent(content).setMimeType("application/pdf").build();

      // Configure the process request.
      ProcessRequest request =
          ProcessRequest.newBuilder().setName(name).setRawDocument(document).build();

      // Recognizes text entities in the PDF document
      ProcessResponse result = client.processDocument(request);
      Document documentResponse = result.getDocument();

      System.out.println("Document processing complete.");

      // Read the text recognition output from the processor
      // For a full list of Document object attributes,
      // please reference this page:
      // https://googleapis.dev/java/google-cloud-document-ai/latest/index.html

      // Get all of the document text as one big string
      String text = documentResponse.getText();
      System.out.printf("Full document text: '%s'\n", removeNewlines(text));

      // Read the text recognition output from the processor
      List<Document.Page> pages = documentResponse.getPagesList();
      System.out.printf("There are %s page(s) in this document.\n", pages.size());

      for (Document.Page page : pages) {
        System.out.printf("\n\n**** Page %d ****\n", page.getPageNumber());

        List<Document.Page.Table> tables = page.getTablesList();
        System.out.printf("Found %d table(s):\n", tables.size());
        for (Document.Page.Table table : tables) {
          printTableInfo(table, text);
        }

        List<Document.Page.FormField> formFields = page.getFormFieldsList();
        System.out.printf("Found %d form fields:\n", formFields.size());
        for (Document.Page.FormField formField : formFields) {
          String fieldName = getLayoutText(formField.getFieldName().getTextAnchor(), text);
          String fieldValue = getLayoutText(formField.getFieldValue().getTextAnchor(), text);
          System.out.printf(
              "    * '%s': '%s'\n", removeNewlines(fieldName), removeNewlines(fieldValue));
        }
      }
    }
  }

  private static void printTableInfo(Document.Page.Table table, String text) {
    Document.Page.Table.TableRow firstBodyRow = table.getBodyRows(0);
    int columnCount = firstBodyRow.getCellsCount();
    System.out.printf(
        "    Table with %d columns and %d rows:\n", columnCount, table.getBodyRowsCount());

    Document.Page.Table.TableRow headerRow = table.getHeaderRows(0);
    StringBuilder headerRowText = new StringBuilder();
    for (Document.Page.Table.TableCell cell : headerRow.getCellsList()) {
      String columnName = getLayoutText(cell.getLayout().getTextAnchor(), text);
      headerRowText.append(String.format("%s | ", removeNewlines(columnName)));
    }
    headerRowText.setLength(headerRowText.length() - 3);
    System.out.printf("        Collumns: %s\n", headerRowText.toString());

    StringBuilder firstRowText = new StringBuilder();
    for (Document.Page.Table.TableCell cell : firstBodyRow.getCellsList()) {
      String cellText = getLayoutText(cell.getLayout().getTextAnchor(), text);
      firstRowText.append(String.format("%s | ", removeNewlines(cellText)));
    }
    firstRowText.setLength(firstRowText.length() - 3);
    System.out.printf("        First row data: %s\n", firstRowText.toString());
  }

  // Extract shards from the text field
  private static String getLayoutText(Document.TextAnchor textAnchor, String text) {
    if (textAnchor.getTextSegmentsList().size() > 0) {
      int startIdx = (int) textAnchor.getTextSegments(0).getStartIndex();
      int endIdx = (int) textAnchor.getTextSegments(0).getEndIndex();
      return text.substring(startIdx, endIdx);
    }
    return "[NO TEXT]";
  }

  private static String removeNewlines(String s) {
    return s.replace("\n", "").replace("\r", "");
  }
}

Node.js

자세한 내용은 Document AI Node.js API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION'; // Format is 'us' or 'eu'
// const processorId = 'YOUR_PROCESSOR_ID'; // Create processor in Cloud Console
// const filePath = '/path/to/local/pdf';

const {DocumentProcessorServiceClient} =
  require('@google-cloud/documentai').v1beta3;

// Instantiates a client
const client = new DocumentProcessorServiceClient();

async function processDocument() {
  // The full resource name of the processor, e.g.:
  // projects/project-id/locations/location/processor/processor-id
  // You must create new processors in the Cloud Console first
  const name = `projects/${projectId}/locations/${location}/processors/${processorId}`;

  // Read the file into memory.
  const fs = require('fs').promises;
  const imageFile = await fs.readFile(filePath);

  // Convert the image data to a Buffer and base64 encode it.
  const encodedImage = Buffer.from(imageFile).toString('base64');

  const request = {
    name,
    rawDocument: {
      content: encodedImage,
      mimeType: 'application/pdf',
    },
  };

  // Recognizes text entities in the PDF document
  const [result] = await client.processDocument(request);

  console.log('Document processing complete.');

  // Read the table and form fields output from the processor
  // The form processor also contains OCR data. For more information
  // on how to parse OCR data please see the OCR sample.
  // For a full list of Document object attributes,
  // please reference this page: https://googleapis.dev/nodejs/documentai/latest/index.html
  const {document} = result;
  const {text} = document;
  console.log(`Full document text: ${JSON.stringify(text)}`);
  console.log(`There are ${document.pages.length} page(s) in this document.`);

  for (const page of document.pages) {
    console.log(`\n\n**** Page ${page.pageNumber} ****`);

    console.log(`Found ${page.tables.length} table(s):`);
    for (const table of page.tables) {
      const numCollumns = table.headerRows[0].cells.length;
      const numRows = table.bodyRows.length;
      console.log(`Table with ${numCollumns} columns and ${numRows} rows:`);
      printTableInfo(table, text);
    }
    console.log(`Found ${page.formFields.length} form field(s):`);
    for (const field of page.formFields) {
      const fieldName = getText(field.fieldName.textAnchor, text);
      const fieldValue = getText(field.fieldValue.textAnchor, text);
      console.log(
        `\t* ${JSON.stringify(fieldName)}: ${JSON.stringify(fieldValue)}`
      );
    }
  }
}

const printTableInfo = (table, text) => {
  // Print header row
  let headerRowText = '';
  for (const headerCell of table.headerRows[0].cells) {
    const headerCellText = getText(headerCell.layout.textAnchor, text);
    headerRowText += `${JSON.stringify(headerCellText.trim())} | `;
  }
  console.log(
    `Collumns: ${headerRowText.substring(0, headerRowText.length - 3)}`
  );
  // Print first body row
  let bodyRowText = '';
  for (const bodyCell of table.bodyRows[0].cells) {
    const bodyCellText = getText(bodyCell.layout.textAnchor, text);
    bodyRowText += `${JSON.stringify(bodyCellText.trim())} | `;
  }
  console.log(
    `First row data: ${bodyRowText.substring(0, bodyRowText.length - 3)}`
  );
};

// Extract shards from the text field
const getText = (textAnchor, text) => {
  if (!textAnchor.textSegments || textAnchor.textSegments.length === 0) {
    return '';
  }

  // First shard in document doesn't have startIndex property
  const startIndex = textAnchor.textSegments[0].startIndex || 0;
  const endIndex = textAnchor.textSegments[0].endIndex;

  return text.substring(startIndex, endIndex);
};

Python

자세한 내용은 Document AI Python API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


from typing import Optional, Sequence

from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_PROCESSOR_LOCATION" # Format is "us" or "eu"
# processor_id = "YOUR_PROCESSOR_ID" # Create processor before running sample
# processor_version = "rc" # Refer to https://cloud.google.com/document-ai/docs/manage-processor-versions for more information
# file_path = "/path/to/local/pdf"
# mime_type = "application/pdf" # Refer to https://cloud.google.com/document-ai/docs/file-types for supported file types


def process_document_form_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
) -> documentai.Document:
    # Online processing request to Document AI
    document = process_document(
        project_id, location, processor_id, processor_version, file_path, mime_type
    )

    # Read the table and form fields output from the processor
    # The form processor also contains OCR data. For more information
    # on how to parse OCR data please see the OCR sample.

    text = document.text
    print(f"Full document text: {repr(text)}\n")
    print(f"There are {len(document.pages)} page(s) in this document.")

    # Read the form fields and tables output from the processor
    for page in document.pages:
        print(f"\n\n**** Page {page.page_number} ****")

        print(f"\nFound {len(page.tables)} table(s):")
        for table in page.tables:
            num_columns = len(table.header_rows[0].cells)
            num_rows = len(table.body_rows)
            print(f"Table with {num_columns} columns and {num_rows} rows:")

            # Print header rows
            print("Columns:")
            print_table_rows(table.header_rows, text)
            # Print body rows
            print("Table body data:")
            print_table_rows(table.body_rows, text)

        print(f"\nFound {len(page.form_fields)} form field(s):")
        for field in page.form_fields:
            name = layout_to_text(field.field_name, text)
            value = layout_to_text(field.field_value, text)
            print(f"    * {repr(name.strip())}: {repr(value.strip())}")

    # Supported in version `pretrained-form-parser-v2.0-2022-11-10` and later.
    # For more information: https://cloud.google.com/document-ai/docs/form-parser
    if document.entities:
        print(f"Found {len(document.entities)} generic entities:")
        for entity in document.entities:
            print_entity(entity)
            # Print Nested Entities
            for prop in entity.properties:
                print_entity(prop)

    return document


def print_table_rows(
    table_rows: Sequence[documentai.Document.Page.Table.TableRow], text: str
) -> None:
    for table_row in table_rows:
        row_text = ""
        for cell in table_row.cells:
            cell_text = layout_to_text(cell.layout, text)
            row_text += f"{repr(cell_text.strip())} | "
        print(row_text)




def print_entity(entity: documentai.Document.Entity) -> None:
    # Fields detected. For a full list of fields for each processor see
    # the processor documentation:
    # https://cloud.google.com/document-ai/docs/processors-list
    key = entity.type_

    # Some other value formats in addition to text are available
    # e.g. dates: `entity.normalized_value.date_value.year`
    text_value = entity.text_anchor.content or entity.mention_text
    confidence = entity.confidence
    normalized_value = entity.normalized_value.text
    print(f"    * {repr(key)}: {repr(text_value)} ({confidence:.1%} confident)")

    if normalized_value:
        print(f"    * Normalized Value: {repr(normalized_value)}")




def process_document(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
    process_options: Optional[documentai.ProcessOptions] = None,
) -> documentai.Document:
    # You must set the `api_endpoint` if you use a location other than "us".
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    # The full resource name of the processor version, e.g.:
    # `projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}`
    # You must create a processor before running this sample.
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version
    )

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(content=image_content, mime_type=mime_type),
        # Only supported for Document OCR processor
        process_options=process_options,
    )

    result = client.process_document(request=request)

    # For a full list of `Document` object attributes, reference this page:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
    return result.document




def layout_to_text(layout: documentai.Document.Page.Layout, text: str) -> str:
    """
    Document AI identifies text in different parts of the document by their
    offsets in the entirety of the document"s text. This function converts
    offsets to a string.
    """
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    return "".join(
        text[int(segment.start_index) : int(segment.end_index)]
        for segment in layout.text_anchor.text_segments
    )

항목, 중첩된 항목, 정규화된 값

많은 전문 프로세서는 잘 정의된 스키마에 기반한 구조화된 데이터를 추출합니다. 예를 들어 인보이스 파서는 invoice_date 및 supplier_name와 같은 특정 필드를 감지합니다. 인보이스 샘플은 다음과 같습니다.

handle-response-13

다음은 인보이스 파서에서 반환한 전체 문서 객체입니다.

JSON 다운로드

다음은 문서 객체의 몇 가지 중요한 부분입니다.

감지된 필드: Entities에는 프로세서가 감지할 수 있었던 필드(예: invoice_date)가 포함됩니다.

{
 "entities": [
    {
      "textAnchor": {
        "textSegments": [
          {
            "startIndex": "14",
            "endIndex": "24"
          }
        ],
        "content": "2020/01/01"
      },
      "type": "invoice_date",
      "confidence": 0.9938466,
      "pageAnchor": { ... },
      "id": "2",
      "normalizedValue": {
        "text": "2020-01-01",
        "dateValue": {
          "year": 2020,
          "month": 1,
          "day": 1
        }
      }
    }
  ]
}

특정 필드의 경우 프로세서는 값을 정규화합니다. 이 예시에서는 날짜가 2020/01/01에서 2020-01-01로 정규화되었습니다.

정규화: 프로세서는 지원되는 여러 특정 필드의 경우 값을 정규화하고 entity도 반환합니다. normalizedValue 필드는 각 항목의 textAnchor를 통해 얻은 원시 추출 필드에 추가됩니다. 따라서 리터럴 텍스트를 정규화하여 텍스트 값을 하위 필드로 분할하는 경우가 많습니다. 예를 들어 2024년 9월 1일과 같은 날짜는 다음과 같이 표시됩니다.

  normalizedValue": {
    "text": "2020-09-01",
    "dateValue": {
      "year": 2024,
      "month": 9,
      "day": 1
  }

이 예에서는 날짜가 2020/01/01에서 2020-01-01로 표준화되었습니다. 이는 후처리를 줄이고 선택한 형식으로 변환할 수 있도록 표준화된 형식입니다.

주소는 정규화되는 경우가 많으며, 이 경우 주소의 요소가 개별 필드로 분류됩니다. 숫자는 정수 또는 부동 소수점 숫자를 normalizedValue로 사용하여 정규화됩니다.

보강: 특정 프로세서와 필드는 보강도 지원합니다. 예를 들어 문서 Google Singapore의 원래 supplier_name가 Enterprise Knowledge Graph에 대해 Google Asia Pacific, Singapore로 정규화되었습니다. 또한 Enterprise Knowledge Graph에는 Google에 관한 정보가 포함되어 있으므로 Document AI는 샘플 문서에 없더라도 supplier_address를 추론합니다.

  {
    "entities": [
      {
        "textAnchor": {
          "textSegments": [ ... ],
          "content": "Google Singapore"
        },
        "type": "supplier_name",
        "confidence": 0.39170802,
        "pageAnchor": { ... },
        "id": "12",
        "normalizedValue": {
          "text": "Google Asia Pacific, Singapore"
        }
      },
      {
        "type": "supplier_address",
        "id": "17",
        "normalizedValue": {
          "text": "70 Pasir Panjang Rd #03-71 Mapletree Business City II Singapore 117371",
          "addressValue": {
            "regionCode": "SG",
            "languageCode": "en-US",
            "postalCode": "117371",
            "addressLines": [
              "70 Pasir Panjang Rd",
              "#03-71 Mapletree Business City II"
            ]
          }
        }
      }
    ]
  }

중첩된 필드: 중첩된 스키마 (필드)는 먼저 항목을 상위 요소로 선언한 다음 상위 요소 아래에 하위 항목을 만들어 만들 수 있습니다. 상위 요소의 파싱 응답에는 상위 필드의 properties 요소에 하위 필드가 포함됩니다. 다음 예에서 line_item는 하위 필드인 line_item/description 및 line_item/quantity를 두 개 갖는 상위 필드입니다.

{
  "entities": [
    {
      "textAnchor": { ... },
      "type": "line_item",
      "confidence": 1.0,
      "pageAnchor": { ... },
      "id": "19",
      "properties": [
        {
          "textAnchor": {
            "textSegments": [ ... ],
            "content": "Tool A"
          },
          "type": "line_item/description",
          "confidence": 0.3461604,
          "pageAnchor": { ... },
          "id": "20"
        },
        {
          "textAnchor": {
            "textSegments": [ ... ],
            "content": "500"
          },
          "type": "line_item/quantity",
          "confidence": 0.8077843,
          "pageAnchor": { ... },
          "id": "21",
          "normalizedValue": {
            "text": "500"
          }
        }
      ]
    }
  ]
}

다음 파서는 이를 따릅니다.

추출 (맞춤 추출기)
기존
- 은행 명세서 파서
- 지출 파서
- 인보이스 파서
- PaySlip 파서
- W2 파서

코드 샘플

다음 코드 샘플은 처리 요청을 전송한 다음 특수 프로세서에서 필드를 읽고 터미널에 출력하는 방법을 보여줍니다.

Java

자세한 내용은 Document AI Java API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


import com.google.cloud.documentai.v1beta3.Document;
import com.google.cloud.documentai.v1beta3.DocumentProcessorServiceClient;
import com.google.cloud.documentai.v1beta3.DocumentProcessorServiceSettings;
import com.google.cloud.documentai.v1beta3.ProcessRequest;
import com.google.cloud.documentai.v1beta3.ProcessResponse;
import com.google.cloud.documentai.v1beta3.RawDocument;
import com.google.protobuf.ByteString;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

public class ProcessSpecializedDocument {
  public static void processSpecializedDocument()
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String location = "your-project-location"; // Format is "us" or "eu".
    String processerId = "your-processor-id";
    String filePath = "path/to/input/file.pdf";
    processSpecializedDocument(projectId, location, processerId, filePath);
  }

  public static void processSpecializedDocument(
      String projectId, String location, String processorId, String filePath)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs
    // to be created
    // once, and can be reused for multiple requests. After completing all of your
    // requests, call
    // the "close" method on the client to safely clean up any remaining background
    // resources.
    String endpoint = String.format("%s-documentai.googleapis.com:443", location);
    DocumentProcessorServiceSettings settings =
        DocumentProcessorServiceSettings.newBuilder().setEndpoint(endpoint).build();
    try (DocumentProcessorServiceClient client = DocumentProcessorServiceClient.create(settings)) {
      // The full resource name of the processor, e.g.:
      // projects/project-id/locations/location/processor/processor-id
      // You must create new processors in the Cloud Console first
      String name =
          String.format("projects/%s/locations/%s/processors/%s", projectId, location, processorId);

      // Read the file.
      byte[] imageFileData = Files.readAllBytes(Paths.get(filePath));

      // Convert the image data to a Buffer and base64 encode it.
      ByteString content = ByteString.copyFrom(imageFileData);

      RawDocument document =
          RawDocument.newBuilder().setContent(content).setMimeType("application/pdf").build();

      // Configure the process request.
      ProcessRequest request =
          ProcessRequest.newBuilder().setName(name).setRawDocument(document).build();

      // Recognizes text entities in the PDF document
      ProcessResponse result = client.processDocument(request);
      Document documentResponse = result.getDocument();

      System.out.println("Document processing complete.");

      // Read fields specificly from the specalized US drivers license processor:
      // https://cloud.google.com/document-ai/docs/processors-list#processor_us-driver-license-parser
      // retriving data from other specalized processors follow a similar pattern.
      // For a complete list of processors see:
      // https://cloud.google.com/document-ai/docs/processors-list
      //
      // OCR and other data is also present in the quality processor's response.
      // Please see the OCR and other samples for how to parse other data in the
      // response.
      for (Document.Entity entity : documentResponse.getEntitiesList()) {
        // Fields detected. For a full list of fields for each processor see
        // the processor documentation:
        // https://cloud.google.com/document-ai/docs/processors-list
        String entityType = entity.getType();
        // some other value formats in addition to text are availible
        // e.g. dates: `entity.getNormalizedValue().getDateValue().getYear()`
        // check for normilized value with `entity.hasNormalizedValue()`
        String entityTextValue = escapeNewlines(entity.getTextAnchor().getContent());
        float entityConfidence = entity.getConfidence();
        System.out.printf(
            "    * %s: %s (%.2f%% confident)\n",
            entityType, entityTextValue, entityConfidence * 100.0);
      }
    }
  }

  private static String escapeNewlines(String s) {
    return s.replace("\n", "\\n").replace("\r", "\\r");
  }
}

Node.js

자세한 내용은 Document AI Node.js API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION'; // Format is 'us' or 'eu'
// const processorId = 'YOUR_PROCESSOR_ID'; // Create processor in Cloud Console
// const filePath = '/path/to/local/pdf';

const {DocumentProcessorServiceClient} =
  require('@google-cloud/documentai').v1beta3;

// Instantiates a client
const client = new DocumentProcessorServiceClient();

async function processDocument() {
  // The full resource name of the processor, e.g.:
  // projects/project-id/locations/location/processor/processor-id
  // You must create new processors in the Cloud Console first
  const name = `projects/${projectId}/locations/${location}/processors/${processorId}`;

  // Read the file into memory.
  const fs = require('fs').promises;
  const imageFile = await fs.readFile(filePath);

  // Convert the image data to a Buffer and base64 encode it.
  const encodedImage = Buffer.from(imageFile).toString('base64');

  const request = {
    name,
    rawDocument: {
      content: encodedImage,
      mimeType: 'application/pdf',
    },
  };

  // Recognizes text entities in the PDF document
  const [result] = await client.processDocument(request);

  console.log('Document processing complete.');

  // Read fields specificly from the specalized US drivers license processor:
  // https://cloud.google.com/document-ai/docs/processors-list#processor_us-driver-license-parser
  // retriving data from other specalized processors follow a similar pattern.
  // For a complete list of processors see:
  // https://cloud.google.com/document-ai/docs/processors-list
  //
  // OCR and other data is also present in the quality processor's response.
  // Please see the OCR and other samples for how to parse other data in the
  // response.
  const {document} = result;
  for (const entity of document.entities) {
    // Fields detected. For a full list of fields for each processor see
    // the processor documentation:
    // https://cloud.google.com/document-ai/docs/processors-list
    const key = entity.type;
    // some other value formats in addition to text are availible
    // e.g. dates: `entity.normalizedValue.dateValue.year`
    const textValue =
      entity.textAnchor !== null ? entity.textAnchor.content : '';
    const conf = entity.confidence * 100;
    console.log(
      `* ${JSON.stringify(key)}: ${JSON.stringify(textValue)}(${conf.toFixed(
        2
      )}% confident)`
    );
  }
}

Python

자세한 내용은 Document AI Python API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


from typing import Optional, Sequence

from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_PROCESSOR_LOCATION" # Format is "us" or "eu"
# processor_id = "YOUR_PROCESSOR_ID" # Create processor before running sample
# processor_version = "rc" # Refer to https://cloud.google.com/document-ai/docs/manage-processor-versions for more information
# file_path = "/path/to/local/pdf"
# mime_type = "application/pdf" # Refer to https://cloud.google.com/document-ai/docs/file-types for supported file types


def process_document_entity_extraction_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
) -> None:
    # Online processing request to Document AI
    document = process_document(
        project_id, location, processor_id, processor_version, file_path, mime_type
    )

    # Print extracted entities from entity extraction processor output.
    # For a complete list of processors see:
    # https://cloud.google.com/document-ai/docs/processors-list
    #
    # OCR and other data is also present in the processor's response.
    # Refer to the OCR samples for how to parse other data in the response.

    print(f"Found {len(document.entities)} entities:")
    for entity in document.entities:
        print_entity(entity)
        # Print Nested Entities (if any)
        for prop in entity.properties:
            print_entity(prop)




def print_entity(entity: documentai.Document.Entity) -> None:
    # Fields detected. For a full list of fields for each processor see
    # the processor documentation:
    # https://cloud.google.com/document-ai/docs/processors-list
    key = entity.type_

    # Some other value formats in addition to text are available
    # e.g. dates: `entity.normalized_value.date_value.year`
    text_value = entity.text_anchor.content or entity.mention_text
    confidence = entity.confidence
    normalized_value = entity.normalized_value.text
    print(f"    * {repr(key)}: {repr(text_value)} ({confidence:.1%} confident)")

    if normalized_value:
        print(f"    * Normalized Value: {repr(normalized_value)}")




def process_document(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
    process_options: Optional[documentai.ProcessOptions] = None,
) -> documentai.Document:
    # You must set the `api_endpoint` if you use a location other than "us".
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    # The full resource name of the processor version, e.g.:
    # `projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}`
    # You must create a processor before running this sample.
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version
    )

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(content=image_content, mime_type=mime_type),
        # Only supported for Document OCR processor
        process_options=process_options,
    )

    result = client.process_document(request=request)

    # For a full list of `Document` object attributes, reference this page:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
    return result.document

커스텀 문서 추출기

커스텀 문서 추출기 프로세서는 선행 학습 프로세서를 사용할 수 없는 문서에서 커스텀 항목을 추출할 수 있습니다. 이는 커스텀 모델을 학습하거나 생성형 AI 기반 모델을 사용하여 학습 없이 지명된 항목을 추출하는 방식으로 실행할 수 있습니다. 자세한 내용은 콘솔에서 맞춤 문서 추출기 만들기를 참고하세요.

맞춤 모델을 학습하는 경우 프로세서를 선행 학습된 항목 추출 프로세서와 정확히 동일한 방식으로 사용할 수 있습니다.
기반 모델을 사용하는 경우 프로세서 버전을 만들어 모든 요청에 대해 특정 항목을 추출하거나 요청별로 구성할 수 있습니다.

출력 구조에 관한 자세한 내용은 개체, 중첩된 개체, 정규화된 값을 참고하세요.

코드 샘플

맞춤 모델을 사용 중이거나 기반 모델을 사용하여 프로세서 버전을 만든 경우 항목 추출 코드 샘플을 사용하세요.

다음 코드 샘플은 요청별로 기반 모델 맞춤 문서 추출기에 특정 항목을 구성하고 추출된 항목을 출력하는 방법을 보여줍니다.

Python

자세한 내용은 Document AI Python API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


from typing import Optional, Sequence

from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_PROCESSOR_LOCATION" # Format is "us" or "eu"
# processor_id = "YOUR_PROCESSOR_ID" # Create processor before running sample
# processor_version = "rc" # Refer to https://cloud.google.com/document-ai/docs/manage-processor-versions for more information
# file_path = "/path/to/local/pdf"
# mime_type = "application/pdf" # Refer to https://cloud.google.com/document-ai/docs/file-types for supported file types




def process_document_custom_extractor_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
) -> None:
    # Entities to extract from Foundation Model CDE
    properties = [
        documentai.DocumentSchema.EntityType.Property(
            name="invoice_id",
            value_type="string",
            occurrence_type=documentai.DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_ONCE,
        ),
        documentai.DocumentSchema.EntityType.Property(
            name="notes",
            value_type="string",
            occurrence_type=documentai.DocumentSchema.EntityType.Property.OccurrenceType.OPTIONAL_MULTIPLE,
        ),
        documentai.DocumentSchema.EntityType.Property(
            name="terms",
            value_type="string",
            occurrence_type=documentai.DocumentSchema.EntityType.Property.OccurrenceType.OPTIONAL_MULTIPLE,
        ),
    ]
    # Optional: For Generative AI processors, request different fields than the
    # schema for a processor version
    process_options = documentai.ProcessOptions(
        schema_override=documentai.DocumentSchema(
            display_name="CDE Schema",
            description="Document Schema for the CDE Processor",
            entity_types=[
                documentai.DocumentSchema.EntityType(
                    name="custom_extraction_document_type",
                    base_types=["document"],
                    properties=properties,
                )
            ],
        )
    )

    # Online processing request to Document AI
    document = process_document(
        project_id,
        location,
        processor_id,
        processor_version,
        file_path,
        mime_type,
        process_options=process_options,
    )

    for entity in document.entities:
        print_entity(entity)
        # Print Nested Entities (if any)
        for prop in entity.properties:
            print_entity(prop)




def print_entity(entity: documentai.Document.Entity) -> None:
    # Fields detected. For a full list of fields for each processor see
    # the processor documentation:
    # https://cloud.google.com/document-ai/docs/processors-list
    key = entity.type_

    # Some other value formats in addition to text are available
    # e.g. dates: `entity.normalized_value.date_value.year`
    text_value = entity.text_anchor.content or entity.mention_text
    confidence = entity.confidence
    normalized_value = entity.normalized_value.text
    print(f"    * {repr(key)}: {repr(text_value)} ({confidence:.1%} confident)")

    if normalized_value:
        print(f"    * Normalized Value: {repr(normalized_value)}")




def process_document(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
    process_options: Optional[documentai.ProcessOptions] = None,
) -> documentai.Document:
    # You must set the `api_endpoint` if you use a location other than "us".
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    # The full resource name of the processor version, e.g.:
    # `projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}`
    # You must create a processor before running this sample.
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version
    )

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(content=image_content, mime_type=mime_type),
        # Only supported for Document OCR processor
        process_options=process_options,
    )

    result = client.process_document(request=request)

    # For a full list of `Document` object attributes, reference this page:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
    return result.document

요약

요약기 프로세서는 생성형 AI 기반 모델을 사용하여 문서에서 추출된 텍스트를 요약합니다. 응답의 길이와 형식은 다음과 같이 맞춤설정할 수 있습니다.

길이
- BRIEF: 한두 문장으로 간단히 요약
- MODERATE: 단락 길이의 요약
- COMPREHENSIVE: 사용 가능한 가장 긴 옵션
형식
- PARAGRAPH: 단락으로 출력
- BULLETS: 글머리 기호로 출력

특정 길이와 형식에 맞게 프로세서 버전을 만들거나 요청별로 구성할 수 있습니다.

요약된 텍스트가 Document.entities.normalizedValue.text에 표시됩니다. 전체 샘플 출력 JSON 파일은 샘플 프로세서 출력에서 확인할 수 있습니다.

자세한 내용은 콘솔에서 문서 요약 도구 빌드를 참고하세요.

코드 샘플

다음 코드 샘플은 처리 요청에서 특정 길이와 형식을 구성하고 요약된 텍스트를 출력하는 방법을 보여줍니다.

Python

자세한 내용은 Document AI Python API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

from typing import Optional

from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1beta3 as documentai


# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_PROCESSOR_LOCATION" # Format is "us" or "eu"
# processor_id = "YOUR_PROCESSOR_ID" # Create processor before running sample
# processor_version = "rc" # Refer to https://cloud.google.com/document-ai/docs/manage-processor-versions for more information
# file_path = "/path/to/local/pdf"
# mime_type = "application/pdf" # Refer to https://cloud.google.com/document-ai/docs/file-types for supported file types

def process_document_summarizer_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
) -> None:
    # For supported options, refer to:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1beta3/projects.locations.processors.processorVersions#summaryoptions
    summary_options = documentai.SummaryOptions(
        length=documentai.SummaryOptions.Length.BRIEF,
        format=documentai.SummaryOptions.Format.BULLETS,
    )

    properties = [
        documentai.DocumentSchema.EntityType.Property(
            name="summary",
            value_type="string",
            occurrence_type=documentai.DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_ONCE,
            property_metadata=documentai.PropertyMetadata(
                field_extraction_metadata=documentai.FieldExtractionMetadata(
                    summary_options=summary_options
                )
            ),
        )
    ]

    # Optional: Request specific summarization format other than the default
    # for the processor version.
    process_options = documentai.ProcessOptions(
        schema_override=documentai.DocumentSchema(
            entity_types=[
                documentai.DocumentSchema.EntityType(
                    name="summary_document_type",
                    base_types=["document"],
                    properties=properties,
                )
            ]
        )
    )

    # Online processing request to Document AI
    document = process_document(
        project_id,
        location,
        processor_id,
        processor_version,
        file_path,
        mime_type,
        process_options=process_options,
    )

    for entity in document.entities:
        print_entity(entity)
        # Print Nested Entities (if any)
        for prop in entity.properties:
            print_entity(prop)


def print_entity(entity: documentai.Document.Entity) -> None:
    # Fields detected. For a full list of fields for each processor see
    # the processor documentation:
    # https://cloud.google.com/document-ai/docs/processors-list
    key = entity.type_

    # Some other value formats in addition to text are availible
    # e.g. dates: `entity.normalized_value.date_value.year`
    text_value = entity.text_anchor.content
    confidence = entity.confidence
    normalized_value = entity.normalized_value.text
    print(f"    * {repr(key)}: {repr(text_value)}({confidence:.1%} confident)")

    if normalized_value:
        print(f"    * Normalized Value: {repr(normalized_value)}")


def process_document(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
    process_options: Optional[documentai.ProcessOptions] = None,
) -> documentai.Document:
    # You must set the `api_endpoint` if you use a location other than "us".
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    # The full resource name of the processor version, e.g.:
    # `projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}`
    # You must create a processor before running this sample.
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version
    )

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(content=image_content, mime_type=mime_type),
        # Only supported for Document OCR processor
        process_options=process_options,
    )

    result = client.process_document(request=request)

    # For a full list of `Document` object attributes, reference this page:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
    return result.document

분할 및 분류

다음은 다양한 유형의 문서와 양식이 포함된 10페이지의 합성 PDF입니다.

PDF 다운로드

다음은 대출 문서 분할기 및 분류 기준에서 반환한 전체 문서 객체입니다.

JSON 다운로드

스플리터에 의해 감지된 각 문서는 entity로 표시됩니다. 예를 들면 다음과 같습니다.

  {
    "entities": [
      {
        "textAnchor": {
          "textSegments": [
            {
              "startIndex": "13936",
              "endIndex": "21108"
            }
          ]
        },
        "type": "1040se_2020",
        "confidence": 0.76257163,
        "pageAnchor": {
          "pageRefs": [
            {
              "page": "6"
            },
            {
              "page": "7"
            }
          ]
        }
      }
    ]
  }

Entity.pageAnchor는 이 문서가 2페이지임을 나타냅니다. pageRefs[].page는 0부터 시작하며 document.pages[] 필드의 색인입니다.

주의: 0 pageRefs[].page 값은 생략됩니다. API에서 페이지 값 "0"를 감지하면 해당 좌표는 JSON 응답에서 생략됩니다. 예를 들어 문서의 처음 두 페이지에 대한 응답은 "pageRefs": [{}, {"page": "1"}]입니다.
Entity.type는 이 문서가 1040 Schedule SE 양식임을 지정합니다. 식별할 수 있는 문서 유형의 전체 목록은 프로세서 문서의 식별된 문서 유형을 참고하세요.

자세한 내용은 문서 분할기 동작을 참고하세요.

코드 샘플

분할기는 페이지 경계를 식별하지만 실제로 입력 문서를 분할하지는 않습니다. Document AI 도구 상자를 사용하여 페이지 경계를 사용하여 PDF 파일을 물리적으로 분할할 수 있습니다. 다음 코드 샘플은 PDF를 분할하지 않고 페이지 범위를 인쇄합니다.

Java

자세한 내용은 Document AI Java API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


import com.google.cloud.documentai.v1beta3.Document;
import com.google.cloud.documentai.v1beta3.DocumentProcessorServiceClient;
import com.google.cloud.documentai.v1beta3.DocumentProcessorServiceSettings;
import com.google.cloud.documentai.v1beta3.ProcessRequest;
import com.google.cloud.documentai.v1beta3.ProcessResponse;
import com.google.cloud.documentai.v1beta3.RawDocument;
import com.google.protobuf.ByteString;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

public class ProcessSplitterDocument {
  public static void processSplitterDocument()
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String location = "your-project-location"; // Format is "us" or "eu".
    String processerId = "your-processor-id";
    String filePath = "path/to/input/file.pdf";
    processSplitterDocument(projectId, location, processerId, filePath);
  }

  public static void processSplitterDocument(
      String projectId, String location, String processorId, String filePath)
      throws IOException, InterruptedException, ExecutionException, TimeoutException {
    // Initialize client that will be used to send requests. This client only needs
    // to be created
    // once, and can be reused for multiple requests. After completing all of your
    // requests, call
    // the "close" method on the client to safely clean up any remaining background
    // resources.
    String endpoint = String.format("%s-documentai.googleapis.com:443", location);
    DocumentProcessorServiceSettings settings =
        DocumentProcessorServiceSettings.newBuilder().setEndpoint(endpoint).build();
    try (DocumentProcessorServiceClient client = DocumentProcessorServiceClient.create(settings)) {
      // The full resource name of the processor, e.g.:
      // projects/project-id/locations/location/processor/processor-id
      // You must create new processors in the Cloud Console first
      String name =
          String.format("projects/%s/locations/%s/processors/%s", projectId, location, processorId);

      // Read the file.
      byte[] imageFileData = Files.readAllBytes(Paths.get(filePath));

      // Convert the image data to a Buffer and base64 encode it.
      ByteString content = ByteString.copyFrom(imageFileData);

      RawDocument document =
          RawDocument.newBuilder().setContent(content).setMimeType("application/pdf").build();

      // Configure the process request.
      ProcessRequest request =
          ProcessRequest.newBuilder().setName(name).setRawDocument(document).build();

      // Recognizes text entities in the PDF document
      ProcessResponse result = client.processDocument(request);
      Document documentResponse = result.getDocument();

      System.out.println("Document processing complete.");

      // Read the splitter output from the document splitter processor:
      // https://cloud.google.com/document-ai/docs/processors-list#processor_doc-splitter
      // This processor only provides text for the document and information on how
      // to split the document on logical boundaries. To identify and extract text,
      // form elements, and entities please see other processors like the OCR, form,
      // and specalized processors.
      List<Document.Entity> entities = documentResponse.getEntitiesList();
      System.out.printf("Found %d subdocuments:\n", entities.size());
      for (Document.Entity entity : entities) {
        float entityConfidence = entity.getConfidence();
        String pagesRangeText = pageRefsToString(entity.getPageAnchor().getPageRefsList());
        String subdocumentType = entity.getType();
        if (subdocumentType.isEmpty()) {
          System.out.printf(
              "%.2f%% confident that %s a subdocument.\n", entityConfidence * 100, pagesRangeText);
        } else {
          System.out.printf(
              "%.2f%% confident that %s a '%s' subdocument.\n",
              entityConfidence * 100, pagesRangeText, subdocumentType);
        }
      }
    }
  }

  // Converts page reference(s) to a string describing the page or page range.
  private static String pageRefsToString(List<Document.PageAnchor.PageRef> pageRefs) {
    if (pageRefs.size() == 1) {
      return String.format("page %d is", pageRefs.get(0).getPage() + 1);
    } else {
      long start = pageRefs.get(0).getPage() + 1;
      long end = pageRefs.get(1).getPage() + 1;
      return String.format("pages %d to %d are", start, end);
    }
  }
}

Node.js

자세한 내용은 Document AI Node.js API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.

/**
 * TODO(developer): Uncomment these variables before running the sample.
 */
// const projectId = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION'; // Format is 'us' or 'eu'
// const processorId = 'YOUR_PROCESSOR_ID'; // Create processor in Cloud Console
// const filePath = '/path/to/local/pdf';

const {DocumentProcessorServiceClient} =
  require('@google-cloud/documentai').v1beta3;

// Instantiates a client
const client = new DocumentProcessorServiceClient();

async function processDocument() {
  // The full resource name of the processor, e.g.:
  // projects/project-id/locations/location/processor/processor-id
  // You must create new processors in the Cloud Console first
  const name = `projects/${projectId}/locations/${location}/processors/${processorId}`;

  // Read the file into memory.
  const fs = require('fs').promises;
  const imageFile = await fs.readFile(filePath);

  // Convert the image data to a Buffer and base64 encode it.
  const encodedImage = Buffer.from(imageFile).toString('base64');

  const request = {
    name,
    rawDocument: {
      content: encodedImage,
      mimeType: 'application/pdf',
    },
  };

  // Recognizes text entities in the PDF document
  const [result] = await client.processDocument(request);

  console.log('Document processing complete.');

  // Read fields specificly from the specalized US drivers license processor:
  // https://cloud.google.com/document-ai/docs/processors-list#processor_us-driver-license-parser
  // retriving data from other specalized processors follow a similar pattern.
  // For a complete list of processors see:
  // https://cloud.google.com/document-ai/docs/processors-list
  //
  // OCR and other data is also present in the quality processor's response.
  // Please see the OCR and other samples for how to parse other data in the
  // response.
  const {document} = result;
  console.log(`Found ${document.entities.length} subdocuments:`);
  for (const entity of document.entities) {
    const conf = entity.confidence * 100;
    const pagesRange = pageRefsToRange(entity.pageAnchor.pageRefs);
    if (entity.type !== '') {
      console.log(
        `${conf.toFixed(2)}% confident that ${pagesRange} a "${
          entity.type
        }" subdocument.`
      );
    } else {
      console.log(
        `${conf.toFixed(2)}% confident that ${pagesRange} a subdocument.`
      );
    }
  }
}

// Converts a page ref to a string describing the page or page range.
const pageRefsToRange = pageRefs => {
  if (pageRefs.length === 1) {
    const num = parseInt(pageRefs[0].page) + 1 || 1;
    return `page ${num} is`;
  } else {
    const start = parseInt(pageRefs[0].page) + 1 || 1;
    const end = parseInt(pageRefs[1].page) + 1;
    return `pages ${start} to ${end} are`;
  }
};

Python

자세한 내용은 Document AI Python API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


from typing import Optional, Sequence

from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# TODO(developer): Uncomment these variables before running the sample.
# project_id = "YOUR_PROJECT_ID"
# location = "YOUR_PROCESSOR_LOCATION" # Format is "us" or "eu"
# processor_id = "YOUR_PROCESSOR_ID" # Create processor before running sample
# processor_version = "rc" # Refer to https://cloud.google.com/document-ai/docs/manage-processor-versions for more information
# file_path = "/path/to/local/pdf"
# mime_type = "application/pdf" # Refer to https://cloud.google.com/document-ai/docs/file-types for supported file types


def process_document_splitter_sample(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
) -> None:
    # Online processing request to Document AI
    document = process_document(
        project_id, location, processor_id, processor_version, file_path, mime_type
    )

    # Read the splitter output from a document splitter/classifier processor:
    # e.g. https://cloud.google.com/document-ai/docs/processors-list#processor_procurement-document-splitter
    # This processor only provides text for the document and information on how
    # to split the document on logical boundaries. To identify and extract text,
    # form elements, and entities please see other processors like the OCR, form,
    # and specalized processors.

    print(f"Found {len(document.entities)} subdocuments:")
    for entity in document.entities:
        conf_percent = f"{entity.confidence:.1%}"
        pages_range = page_refs_to_string(entity.page_anchor.page_refs)

        # Print subdocument type information, if available
        if entity.type_:
            print(
                f"{conf_percent} confident that {pages_range} a '{entity.type_}' subdocument."
            )
        else:
            print(f"{conf_percent} confident that {pages_range} a subdocument.")


def page_refs_to_string(
    page_refs: Sequence[documentai.Document.PageAnchor.PageRef],
) -> str:
    """Converts a page ref to a string describing the page or page range."""
    pages = [str(int(page_ref.page) + 1) for page_ref in page_refs]
    if len(pages) == 1:
        return f"page {pages[0]} is"
    else:
        return f"pages {', '.join(pages)} are"




def process_document(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version: str,
    file_path: str,
    mime_type: str,
    process_options: Optional[documentai.ProcessOptions] = None,
) -> documentai.Document:
    # You must set the `api_endpoint` if you use a location other than "us".
    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )

    # The full resource name of the processor version, e.g.:
    # `projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}`
    # You must create a processor before running this sample.
    name = client.processor_version_path(
        project_id, location, processor_id, processor_version
    )

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name,
        raw_document=documentai.RawDocument(content=image_content, mime_type=mime_type),
        # Only supported for Document OCR processor
        process_options=process_options,
    )

    result = client.process_document(request=request)

    # For a full list of `Document` object attributes, reference this page:
    # https://cloud.google.com/document-ai/docs/reference/rest/v1/Document
    return result.document

다음 코드 샘플에서는 Document AI Toolbox를 사용하여 처리된 Document의 페이지 경계를 사용하여 PDF 파일을 분할합니다.

Python

자세한 내용은 Document AI Python API 참조 문서를 참고하세요.

Document AI에 인증하려면 애플리케이션 기본 사용자 인증 정보를 설정합니다. 자세한 내용은 로컬 개발 환경의 인증 설정을 참조하세요.


from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a local document.proto or sharded document.proto from a splitter/classifier in path
# document_path = "path/to/local/document.json"
# pdf_path = "path/to/local/document.pdf"
# output_path = "resources/output/"


def split_pdf_sample(document_path: str, pdf_path: str, output_path: str) -> None:
    wrapped_document = document.Document.from_document_path(document_path=document_path)

    output_files = wrapped_document.split_pdf(
        pdf_path=pdf_path, output_path=output_path
    )

    print("Document Successfully Split")
    for output_file in output_files:
        print(output_file)

Document AI 도구 상자

Document AI Toolbox는 문서 응답에서 정보를 관리, 조작, 추출하기 위한 유틸리티 함수를 제공하는 Python용 SDK입니다. Cloud Storage의 JSON 파일, 로컬 JSON 파일 또는 process_document() 메서드에서 직접 출력된 처리된 문서 응답에서 '래핑된' 문서 객체를 만듭니다.

다음 작업을 실행할 수 있습니다.

일괄 처리의 단편화된 Document JSON 파일을 단일 '래핑된' 문서로 결합합니다.

샤드를 통합된 Document로 내보냅니다.

다음에서 Document 출력을 가져옵니다.
Layout 정보를 처리하지 않고 Pages, Lines, Paragraphs, FormFields, Tables의 텍스트에 액세스합니다.
타겟 문자열이 포함되어 있거나 정규 표현식과 일치하는 Pages를 검색합니다.
이름으로 FormFields를 검색합니다.
유형별로 Entities를 검색합니다.
Tables를 Pandas Dataframe 또는 CSV로 변환합니다.
Entities 및 FormFields를 BigQuery 테이블에 삽입합니다.
분할기/분류기 프로세서의 출력을 기반으로 PDF 파일을 분할합니다.
Document 경계 상자에서 이미지 Entities를 추출합니다.
Documents를 일반적으로 사용되는 형식으로 변환합니다.
- Cloud Vision API AnnotateFileResponse
- hOCR
- 서드 파티 문서 처리 형식
Cloud Storage 폴더에서 처리할 문서 일괄을 만듭니다.

코드 샘플

다음 코드 샘플은 Document AI Toolbox를 사용하는 방법을 보여줍니다.

빠른 시작

from typing import Optional

from google.cloud import documentai
from google.cloud.documentai_toolbox import document, gcs_utilities

# TODO(developer): Uncomment these variables before running the sample.
# Given a Document JSON or sharded Document JSON in path gs://bucket/path/to/folder
# gcs_bucket_name = "bucket"
# gcs_prefix = "path/to/folder"

# Or, given a Document JSON in path gs://bucket/path/to/folder/document.json
# gcs_uri = "gs://bucket/path/to/folder/document.json"

# Or, given a Document JSON in path local/path/to/folder/document.json
# document_path = "local/path/to/folder/document.json"

# Or, given a Document object from Document AI
# documentai_document = documentai.Document()

# Or, given a BatchProcessMetadata object from Document AI
# operation = client.batch_process_documents(request)
# operation.result(timeout=timeout)
# batch_process_metadata = documentai.BatchProcessMetadata(operation.metadata)

# Or, given a BatchProcessOperation name from Document AI
# batch_process_operation = "projects/project_id/locations/location/operations/operation_id"


def quickstart_sample(
    gcs_bucket_name: Optional[str] = None,
    gcs_prefix: Optional[str] = None,
    gcs_uri: Optional[str] = None,
    document_path: Optional[str] = None,
    documentai_document: Optional[documentai.Document] = None,
    batch_process_metadata: Optional[documentai.BatchProcessMetadata] = None,
    batch_process_operation: Optional[str] = None,
) -> document.Document:
    if gcs_bucket_name and gcs_prefix:
        # Load from Google Cloud Storage Directory
        print("Document structure in Cloud Storage")
        gcs_utilities.print_gcs_document_tree(
            gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix
        )

        wrapped_document = document.Document.from_gcs(
            gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix
        )
    elif gcs_uri:
        # Load a single Document from a Google Cloud Storage URI
        wrapped_document = document.Document.from_gcs_uri(gcs_uri=gcs_uri)
    elif document_path:
        # Load from local `Document` JSON file
        wrapped_document = document.Document.from_document_path(document_path)
    elif documentai_document:
        # Load from `documentai.Document` object
        wrapped_document = document.Document.from_documentai_document(
            documentai_document
        )
    elif batch_process_metadata:
        # Load Documents from `BatchProcessMetadata` object
        wrapped_documents = document.Document.from_batch_process_metadata(
            metadata=batch_process_metadata
        )
        wrapped_document = wrapped_documents[0]
    elif batch_process_operation:
        wrapped_documents = document.Document.from_batch_process_operation(
            location="us", operation_name=batch_process_operation
        )
        wrapped_document = wrapped_documents[0]
    else:
        raise ValueError("No document source provided.")

    # For all properties and methods, refer to:
    # https://cloud.google.com/python/docs/reference/documentai-toolbox/latest/google.cloud.documentai_toolbox.wrappers.document.Document

    print("Document Successfully Loaded!")
    print(f"\t Number of Pages: {len(wrapped_document.pages)}")
    print(f"\t Number of Entities: {len(wrapped_document.entities)}")

    for page in wrapped_document.pages:
        print(f"Page {page.page_number}")
        for block in page.blocks:
            print(block.text)
        for paragraph in page.paragraphs:
            print(paragraph.text)
        for line in page.lines:
            print(line.text)
        for token in page.tokens:
            print(token.text)

        # Only supported with Form Parser processor
        # https://cloud.google.com/document-ai/docs/form-parser
        for form_field in page.form_fields:
            print(f"{form_field.field_name} : {form_field.field_value}")

        # Only supported with Enterprise Document OCR version `pretrained-ocr-v2.0-2023-06-02`
        # https://cloud.google.com/document-ai/docs/process-documents-ocr#enable_symbols
        for symbol in page.symbols:
            print(symbol.text)

        # Only supported with Enterprise Document OCR version `pretrained-ocr-v2.0-2023-06-02`
        # https://cloud.google.com/document-ai/docs/process-documents-ocr#math_ocr
        for math_formula in page.math_formulas:
            print(math_formula.text)

    # Only supported with Entity Extraction processors
    # https://cloud.google.com/document-ai/docs/processors-list
    for entity in wrapped_document.entities:
        print(f"{entity.type_} : {entity.mention_text}")
        if entity.normalized_text:
            print(f"\tNormalized Text: {entity.normalized_text}")

    # Only supported with Layout Parser
    for chunk in wrapped_document.chunks:
        print(f"Chunk {chunk.chunk_id}: {chunk.content}")

    for block in wrapped_document.document_layout_blocks:
        print(f"Document Layout Block {block.block_id}")

        if block.text_block:
            print(f"{block.text_block.type_}: {block.text_block.text}")
        if block.list_block:
            print(f"{block.list_block.type_}: {block.list_block.list_entries}")
        if block.table_block:
            print(block.table_block.header_rows, block.table_block.body_rows)

테이블


from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a local document.proto or sharded document.proto in path
# document_path = "path/to/local/document.json"
# output_file_prefix = "output/table"


def table_sample(document_path: str, output_file_prefix: str) -> None:
    wrapped_document = document.Document.from_document_path(document_path=document_path)

    print("Tables in Document")
    for page in wrapped_document.pages:
        for table_index, table in enumerate(page.tables):
            # Convert table to Pandas Dataframe
            # Refer to https://pandas.pydata.org/docs/reference/frame.html for all supported methods
            df = table.to_dataframe()
            print(df)

            output_filename = f"{output_file_prefix}-{page.page_number}-{table_index}"

            # Write Dataframe to CSV file
            df.to_csv(f"{output_filename}.csv", index=False)

            # Write Dataframe to HTML file
            df.to_html(f"{output_filename}.html", index=False)

            # Write Dataframe to Markdown file
            df.to_markdown(f"{output_filename}.md", index=False)

BigQuery 내보내기


from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a document.proto or sharded document.proto in path gs://bucket/path/to/folder
# gcs_bucket_name = "bucket"
# gcs_prefix = "path/to/folder"
# dataset_name = "test_dataset"
# table_name = "test_table"
# project_id = "YOUR_PROJECT_ID"


def entities_to_bigquery_sample(
    gcs_bucket_name: str,
    gcs_prefix: str,
    dataset_name: str,
    table_name: str,
    project_id: str,
) -> None:
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix
    )

    job = wrapped_document.entities_to_bigquery(
        dataset_name=dataset_name, table_name=table_name, project_id=project_id
    )

    # Also supported:
    # job = wrapped_document.form_fields_to_bigquery(
    #     dataset_name=dataset_name, table_name=table_name, project_id=project_id
    # )

    print("Document entities loaded into BigQuery")
    print(f"Job ID: {job.job_id}")
    print(f"Table: {job.destination.path}")

PDF 분할


from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a local document.proto or sharded document.proto from a splitter/classifier in path
# document_path = "path/to/local/document.json"
# pdf_path = "path/to/local/document.pdf"
# output_path = "resources/output/"


def split_pdf_sample(document_path: str, pdf_path: str, output_path: str) -> None:
    wrapped_document = document.Document.from_document_path(document_path=document_path)

    output_files = wrapped_document.split_pdf(
        pdf_path=pdf_path, output_path=output_path
    )

    print("Document Successfully Split")
    for output_file in output_files:
        print(output_file)

이미지 추출


from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a local document.proto or sharded document.proto from an identity processor in path
# document_path = "path/to/local/document.json"
# output_path = "resources/output/"
# output_file_prefix = "exported_photo"
# output_file_extension = "png"


def export_images_sample(
    document_path: str,
    output_path: str,
    output_file_prefix: str,
    output_file_extension: str,
) -> None:
    wrapped_document = document.Document.from_document_path(document_path=document_path)

    output_files = wrapped_document.export_images(
        output_path=output_path,
        output_file_prefix=output_file_prefix,
        output_file_extension=output_file_extension,
    )
    print("Images Successfully Exported")
    for output_file in output_files:
        print(output_file)

비전 전환


from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a document.proto or sharded document.proto in path gs://bucket/path/to/folder
# gcs_bucket_name = "bucket"
# gcs_prefix = "path/to/folder"


def convert_document_to_vision_sample(
    gcs_bucket_name: str,
    gcs_prefix: str,
) -> None:
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix
    )

    # Converting wrapped_document to vision AnnotateFileResponse
    annotate_file_response = (
        wrapped_document.convert_document_to_annotate_file_response()
    )

    print("Document converted to AnnotateFileResponse!")
    print(
        f"Number of Pages : {len(annotate_file_response.responses[0].full_text_annotation.pages)}"
    )

hOCR 변환


from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a document.proto or sharded document.proto in path gs://bucket/path/to/folder
# document_path = "path/to/local/document.json"
# document_title = "your-document-title"


def convert_document_to_hocr_sample(document_path: str, document_title: str) -> str:
    wrapped_document = document.Document.from_document_path(document_path=document_path)

    # Converting wrapped_document to hOCR format
    hocr_string = wrapped_document.export_hocr_str(title=document_title)

    print("Document converted to hOCR!")
    return hocr_string

서드 파티 전환


from google.cloud.documentai_toolbox import converter

# TODO(developer): Uncomment these variables before running the sample.
# This sample will convert external annotations to the Document.json format used by Document AI Workbench for training.
# To process this the external annotation must have these type of objects:
#       1) Type
#       2) Text
#       3) Bounding Box (bounding boxes must be 1 of the 3 optional types)
#
# This is the bare minimum requirement to convert the annotations but for better accuracy you will need to also have:
#       1) Document width & height
#
# Bounding Box Types:
#   Type 1:
#       bounding_box:[{"x":1,"y":2},{"x":2,"y":2},{"x":2,"y":3},{"x":1,"y":3}]
#   Type 2:
#       bounding_box:{ "Width": 1, "Height": 1, "Left": 1, "Top": 1}
#   Type 3:
#       bounding_box: [1,2,2,2,2,3,1,3]
#
#   Note: If these types are not sufficient you can propose a feature request or contribute the new type and conversion functionality.
#
# Given a folders in gcs_input_path with the following structure :
#
# gs://path/to/input/folder
#   ├──test_annotations.json
#   ├──test_config.json
#   └──test.pdf
#
# An example of the config is in sample-converter-configs/Azure/form-config.json
#
# location = "us",
# processor_id = "my_processor_id"
# gcs_input_path = "gs://path/to/input/folder"
# gcs_output_path = "gs://path/to/input/folder"


def convert_external_annotations_sample(
    location: str,
    processor_id: str,
    project_id: str,
    gcs_input_path: str,
    gcs_output_path: str,
) -> None:
    converter.convert_from_config(
        project_id=project_id,
        location=location,
        processor_id=processor_id,
        gcs_input_path=gcs_input_path,
        gcs_output_path=gcs_output_path,
    )

문서 일괄 처리


from google.cloud import documentai
from google.cloud.documentai_toolbox import gcs_utilities

# TODO(developer): Uncomment these variables before running the sample.
# Given unprocessed documents in path gs://bucket/path/to/folder
# gcs_bucket_name = "bucket"
# gcs_prefix = "path/to/folder"
# batch_size = 50


def create_batches_sample(
    gcs_bucket_name: str,
    gcs_prefix: str,
    batch_size: int = 50,
) -> None:
    # Creating batches of documents for processing
    batches = gcs_utilities.create_batches(
        gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix, batch_size=batch_size
    )

    print(f"{len(batches)} batch(es) created.")
    for batch in batches:
        print(f"{len(batch.gcs_documents.documents)} files in batch.")
        print(batch.gcs_documents.documents)

        # Use as input for batch_process_documents()
        # Refer to https://cloud.google.com/document-ai/docs/send-request
        # for how to send a batch processing request
        request = documentai.BatchProcessRequest(
            name="processor_name", input_documents=batch
        )
        print(request)

문서 샤드 병합


from google.cloud import documentai
from google.cloud.documentai_toolbox import document

# TODO(developer): Uncomment these variables before running the sample.
# Given a document.proto or sharded document.proto in path gs://bucket/path/to/folder
# gcs_bucket_name = "bucket"
# gcs_prefix = "path/to/folder"
# output_file_name = "path/to/folder/file.json"


def merge_document_shards_sample(
    gcs_bucket_name: str, gcs_prefix: str, output_file_name: str
) -> None:
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix
    )

    merged_document = wrapped_document.to_merged_documentai_document()

    with open(output_file_name, "w") as f:
        f.write(documentai.Document.to_json(merged_document))

    print(f"Document with {len(wrapped_document.shards)} shards successfully merged.")

Send a process request

Long-running operations (LRO)