Class Document (0.6.0a0)

Document(
    shards: List[google.cloud.documentai_v1.types.document.Document],
    gcs_bucket_name: Optional[str] = None,
    gcs_prefix: Optional[str] = None,
    gcs_input_uri: Optional[str] = None,
)

Represents a wrapped Document.

This class hides the complexity of working with the Document protobuf response returned by the BatchProcessDocuments or ProcessDocument methods, and provides convenience methods for searching and extracting information within the Document.

Attributes

Name Description
gcs_bucket_name Optional[str]
Optional. The name of the gcs bucket. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_bucket_name={bucket_name}.
gcs_prefix Optional[str]
Optional. The prefix of the JSON files in the target_folder. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_prefix={optional_folder}/{target_folder}. For more information, see https://cloud.google.com/storage/docs/json_api/v1/objects/list .
entities List[Entity]
A list of Entities in the Document.
pages List[Page]
A list of Pages in the Document.

Methods

convert_document_to_annotate_file_response

convert_document_to_annotate_file_response()

Convert OCR data from Document.proto to AnnotateFileResponse.proto for Vision API.

Returns
Type Description
AnnotateFileResponse
Proto with TextAnnotations.

entities_to_bigquery

entities_to_bigquery(
    dataset_name: str, table_name: str, project_id: Optional[str] = None
)

Adds extracted entities to a BigQuery table.

Parameters
Name Description
dataset_name str

Required. Name of the BigQuery dataset.

table_name str

Required. Name of the BigQuery table.

project_id Optional[str]

Optional. Project ID containing the BigQuery table. If not passed, falls back to the default inferred from the environment.

Returns
Type Description
bigquery.job.LoadJob
The BigQuery LoadJob for adding the entities.

entities_to_dict

entities_to_dict()

Returns Dictionary of entities in document.

Returns
Type Description
Dict
The Dict of the entities indexed by type.
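
To make "indexed by type" concrete, the following is a minimal sketch in plain Python, not the library's implementation. The `Entity` dataclass is a hypothetical stand-in for the toolbox Entity wrapper (only `type_` and an assumed `mention_text` field are modelled), and collecting repeated types into a list is an assumption; the library may handle repeated types differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entity:
    """Hypothetical stand-in for the toolbox Entity wrapper."""

    type_: str
    mention_text: str


def index_entities_by_type(entities: List[Entity]) -> Dict[str, List[str]]:
    """Group the mention text of each entity under its type (assumed shape)."""
    indexed: Dict[str, List[str]] = {}
    for entity in entities:
        indexed.setdefault(entity.type_, []).append(entity.mention_text)
    return indexed


# Made-up sample data for illustration only.
sample_entities = [
    Entity("invoice_id", "INV-001"),
    Entity("line_item", "Widget"),
    Entity("line_item", "Gadget"),
]
entities_by_type = index_entities_by_type(sample_entities)
```

With this shape, a lookup such as `entities_by_type["line_item"]` yields every value extracted for that type.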

export_images

export_images(
    output_path: str, output_file_prefix: str, output_file_extension: str
)

Exports images from Document to files.

Parameters
Name Description
output_path str

Required. The path to the output directory.

output_file_prefix str

Required. The output file name prefix.

output_file_extension str

Required. The output file extension. Format: png, jpg, etc.

Returns
Type Description
List[str]
A list of output image file names. Format: {output_path}/{output_file_prefix}_{index}_{Entity.type_}.{output_file_extension}
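
The file name format above can be illustrated with a short sketch. The helper `build_image_names` is hypothetical and only composes the strings; it does not export anything. Numbering from 0 and the sample entity types are assumptions made for the example.

```python
from typing import List


def build_image_names(
    output_path: str,
    output_file_prefix: str,
    output_file_extension: str,
    entity_types: List[str],
) -> List[str]:
    """Compose names as {output_path}/{prefix}_{index}_{type}.{extension}."""
    return [
        f"{output_path}/{output_file_prefix}_{index}_{entity_type}.{output_file_extension}"
        for index, entity_type in enumerate(entity_types)  # index start is assumed
    ]


# Made-up entity types for illustration only.
image_names = build_image_names("out", "id_card", "png", ["Portrait", "Signature"])
```

Note that output_file_extension is passed without a leading dot, matching the "png, jpg" format described for the parameter.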

form_fields_to_bigquery

form_fields_to_bigquery(
    dataset_name: str, table_name: str, project_id: Optional[str] = None
)

Adds extracted form fields to a BigQuery table.

Parameters
Name Description
dataset_name str

Required. Name of the BigQuery dataset.

table_name str

Required. Name of the BigQuery table.

project_id Optional[str]

Optional. Project ID containing the BigQuery table. If not passed, falls back to the default inferred from the environment.

Returns
Type Description
bigquery.job.LoadJob
The BigQuery LoadJob for adding the form fields.

form_fields_to_dict

form_fields_to_dict()

Returns Dictionary of form fields in document.

Returns
Type Description
Dict
The Dict of the form fields indexed by type.

from_batch_process_metadata

from_batch_process_metadata(
    metadata: google.cloud.documentai_v1.types.document_processor_service.BatchProcessMetadata,
)

Loads Documents from Cloud Storage, using the output from BatchProcessMetadata.

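.. code-block:: python

    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document

    # `client`, `request`, and `timeout` are defined elsewhere.
    operation = client.batch_process_documents(request)
    operation.result(timeout=timeout)  # Wait for the batch operation to finish.
    metadata = documentai.BatchProcessMetadata(operation.metadata)
    wrapped_documents = document.Document.from_batch_process_metadata(metadata)

Parameter
Name Description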
metadata documentai.BatchProcessMetadata

Required. The operation metadata after a batch_process_documents() operation completes.

Returns
Type Description
List[Document]
A list of wrapped documents from gcs. Each document corresponds to an input file.

from_batch_process_operation

from_batch_process_operation(location: str, operation_name: str)

Loads Documents from Cloud Storage, using the operation name returned from batch_process_documents().

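.. code-block:: python

    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document

    # `client`, `request`, and `location` are defined elsewhere.
    operation = client.batch_process_documents(request)
    operation_name = operation.operation.name
    wrapped_documents = document.Document.from_batch_process_operation(location, operation_name)

Parameters
Name Description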
location str

Required. The location of the processor used for batch_process_documents().

operation_name str

Required. The fully qualified operation name for a batch_process_documents() operation.

Returns
Type Description
List[Document]
A list of wrapped documents from gcs. Each document corresponds to an input file.
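
Because location must match the processor location, it can help to check it against the operation name. The sketch below assumes the fully qualified name follows the usual Google Cloud long-running operation layout, projects/{project}/locations/{location}/operations/{operation_id}; the helper `location_from_operation_name` and the sample name are hypothetical and not part of the toolbox.

```python
import re

# Assumed layout of a fully qualified operation name.
_OPERATION_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/operations/(?P<operation_id>[^/]+)$"
)


def location_from_operation_name(operation_name: str) -> str:
    """Return the location segment of a fully qualified operation name."""
    match = _OPERATION_NAME.match(operation_name)
    if match is None:
        raise ValueError(f"Not a fully qualified operation name: {operation_name!r}")
    return match.group("location")


# Made-up operation name for illustration only.
location = location_from_operation_name(
    "projects/123456/locations/us/operations/987654321"
)
```

A bare operation ID (without the projects/ and locations/ segments) is rejected with a ValueError in this sketch.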

from_document_path

from_document_path(document_path: str)

Loads Document from local document_path.

.. code-block:: python

    from google.cloud.documentai_toolbox import document

    document_path = "/path/to/local/file.json"
    wrapped_document = document.Document.from_document_path(document_path)

Parameter
Name Description
document_path str

Required. The path to the document.json file.

Returns
Type Description
Document
A document from local document_path.

from_documentai_document

from_documentai_document(
    documentai_document: google.cloud.documentai_v1.types.document.Document,
)

Loads Document from local documentai_document.

.. code-block:: python

    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document

    # `client` is a documentai.DocumentProcessorServiceClient; `request` is defined elsewhere.
    documentai_document = client.process_document(request).document
    wrapped_document = document.Document.from_documentai_document(documentai_document)

Parameter
Name Description
documentai_document documentai.Document

Required. The Document.proto response.

Returns
Type Description
Document
A document from local documentai_document.

from_gcs

from_gcs(
    gcs_bucket_name: str, gcs_prefix: str, gcs_input_uri: Optional[str] = None
)

Loads Document from Cloud Storage.

Parameters
Name Description
gcs_bucket_name str

Required. The gcs bucket. Format: Given gs://{bucket_name}/{optional_folder}/{operation_id}/ where gcs_bucket_name={bucket_name}.

gcs_prefix str

Required. The prefix to the location of the target folder. Format: Given gs://{bucket_name}/{optional_folder}/{target_folder} where gcs_prefix={optional_folder}/{target_folder}.

gcs_input_uri Optional[str]

Optional. The gcs uri to the original input file. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/{file_name}.pdf

Returns
Type Description
Document
A document from gcs.
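
The relationship between a gs:// URI and the gcs_bucket_name and gcs_prefix arguments can be sketched as follows. `split_gcs_uri` is a hypothetical helper, not a toolbox function, and the bucket and folder names are made up for the example.

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split gs://{bucket_name}/{prefix}/ into (bucket_name, prefix)."""
    scheme = "gs://"
    if not gcs_uri.startswith(scheme):
        raise ValueError(f"Expected a gs:// URI, got {gcs_uri!r}")
    # Everything up to the first "/" is the bucket; the rest is the prefix.
    bucket_name, _, prefix = gcs_uri[len(scheme):].partition("/")
    return bucket_name, prefix.rstrip("/")


# Made-up output location for illustration only.
gcs_bucket_name, gcs_prefix = split_gcs_uri("gs://my-bucket/output/12345/")
```

Here gcs_bucket_name is "my-bucket" and gcs_prefix is "output/12345", which matches the {optional_folder}/{target_folder} form described above.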

get_entity_by_type

get_entity_by_type(target_type: str)

Returns the list of Entities of target_type.

Parameter
Name Description
target_type str

Required. The entity type to match.

Returns
Type Description
List[Entity]
A list of Entity matching target_type.
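
As an illustration of the lookup, the sketch below filters a list of entities by type. The `Entity` dataclass is a hypothetical stand-in for the toolbox wrapper, and exact, case-sensitive matching on `type_` is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    """Hypothetical stand-in for the toolbox Entity wrapper."""

    type_: str
    mention_text: str


def filter_entities_by_type(entities: List[Entity], target_type: str) -> List[Entity]:
    """Keep only entities whose type_ equals target_type (assumed exact match)."""
    return [entity for entity in entities if entity.type_ == target_type]


# Made-up sample data for illustration only.
sample_entities = [
    Entity("supplier_name", "Acme Corp"),
    Entity("line_item", "Widget"),
    Entity("line_item", "Gadget"),
]
line_items = filter_entities_by_type(sample_entities, "line_item")
```

The result is always a list, so a type that does not occur yields an empty list rather than an error in this sketch.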

get_form_field_by_name

get_form_field_by_name(target_field: str)

Returns the list of FormFields named target_field.

Parameter
Name Description
target_field str

Required. Target field name.

Returns
Type Description
List[FormField]
A list of FormField matching target_field.

search_pages

search_pages(target_string: Optional[str] = None, pattern: Optional[str] = None)

Returns the list of Pages containing target_string or text matching pattern.

Parameters
Name Description
target_string Optional[str]

Optional. The string to search for in each page.

pattern Optional[str]

Optional. A regular expression to match against the text of each page.

Returns
Type Description
List[Page]
A list of Pages.
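
The difference between the two search modes can be sketched in plain Python. The `Page` dataclass and the `find_pages` helper below are hypothetical stand-ins, not the toolbox implementation, and requiring exactly one of the two arguments is an assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    """Hypothetical stand-in for the toolbox Page wrapper."""

    page_number: int
    text: str


def find_pages(
    pages: List[Page],
    target_string: Optional[str] = None,
    pattern: Optional[str] = None,
) -> List[Page]:
    """Return pages containing target_string or text matching pattern."""
    # Assumed rule: exactly one of the two arguments must be given.
    if (target_string is None) == (pattern is None):
        raise ValueError("Pass exactly one of target_string or pattern.")
    if target_string is not None:
        return [page for page in pages if target_string in page.text]
    regex = re.compile(pattern)
    return [page for page in pages if regex.search(page.text)]


# Made-up page text for illustration only.
sample_pages = [
    Page(1, "Invoice INV-001"),
    Page(2, "Terms and conditions"),
    Page(3, "Invoice INV-002 total"),
]
by_string = find_pages(sample_pages, target_string="Terms")
by_pattern = find_pages(sample_pages, pattern=r"INV-\d{3}")
```

A plain string is matched as a literal substring, while the pattern is compiled as a regular expression, so characters such as "." or "(" only need escaping in the pattern form.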

split_pdf

split_pdf(pdf_path: str, output_path: str)

Splits local PDF file into multiple PDF files based on output from a Splitter/Classifier processor.

Parameters
Name Description
pdf_path str

Required. The path to the PDF file.

output_path str

Required. The path to the output directory.

Returns
Type Description
List[str]
A list of output pdf files.