Module document (0.12.2a0)

Wrappers for the Document AI Document type.

Classes

Document

Document(
    shards: typing.List[google.cloud.documentai_v1.types.document.Document],
    gcs_bucket_name: typing.Optional[str] = None,
    gcs_prefix: typing.Optional[str] = None,
    gcs_input_uri: typing.Optional[str] = None,
)

Represents a wrapped Document.

This class hides the complexity of working with the Document protobuf response returned by the BatchProcessDocuments or ProcessDocument methods, and provides convenience methods for searching and extracting information within the Document.

Module Functions

_apply_text_offset

_apply_text_offset(
    documentai_object: typing.Union[typing.Dict[str, typing.Dict], typing.List],
    text_offset: int,
) -> None

Applies a text offset to all text_segments in documentai_object.

Parameters
Name    Description
documentai_object object

Required. Document AI object to apply text_offset to.

text_offset int

Required. Text offset to apply. Taken from Document.shard_info.text_offset.
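The idea can be illustrated with a minimal sketch. It is not the library's implementation: the function name `apply_text_offset_sketch` is hypothetical, and it assumes the object is a JSON-style dict or list that uses the camelCase keys `textSegments`, `startIndex`, and `endIndex`, with indices possibly encoded as strings or omitted when zero.

```python
from typing import Any


def apply_text_offset_sketch(documentai_object: Any, text_offset: int) -> None:
    """Shift every text segment found under documentai_object, in place."""
    if isinstance(documentai_object, dict):
        for key, value in documentai_object.items():
            if key == "textSegments":  # assumed camelCase JSON key
                for segment in value:
                    # Assumption: indices may be strings, and a zero index may be absent.
                    segment["startIndex"] = int(segment.get("startIndex", 0)) + text_offset
                    segment["endIndex"] = int(segment.get("endIndex", 0)) + text_offset
            else:
                # Keep walking nested dicts and lists.
                apply_text_offset_sketch(value, text_offset)
    elif isinstance(documentai_object, list):
        for item in documentai_object:
            apply_text_offset_sketch(item, text_offset)
    # Scalars (str, int, None, ...) have nothing to shift.


# Usage: shift the segments of one layout object from a later shard.
layout = {"textAnchor": {"textSegments": [{"startIndex": "5", "endIndex": "9"}]}}
apply_text_offset_sketch(layout, 100)
```

Shifting by the shard's offset makes each segment index refer to the concatenated text of all shards rather than to the text of a single shard.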

_bigquery_column_name

_bigquery_column_name(input_string: str) -> str

Converts a string into a BigQuery column name. See https://cloud.google.com/bigquery/docs/schemas#column_names for the naming rules.

Parameter
Name    Description
input_string str

Required. The string to convert.
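A minimal sketch of such a conversion follows. The exact character mapping used by the library is not documented here, so the rules below are assumptions: the hypothetical `to_column_name_sketch` keeps only letters, digits, and underscores, lowercases the result, and makes sure the name does not start with a digit.

```python
import re


def to_column_name_sketch(input_string: str) -> str:
    """Turn an arbitrary label into a conservative BigQuery-style column name."""
    # Assumption: collapse every run of other characters into a single underscore.
    name = re.sub(r"[^0-9A-Za-z_]+", "_", input_string).strip("_").lower()
    # Column names must start with a letter or an underscore.
    if not name or name[0].isdigit():
        name = "_" + name
    return name


# Usage: entity types often contain spaces or punctuation.
column = to_column_name_sketch("Total Amount (USD)")
```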

_dict_to_bigquery

_dict_to_bigquery(
    dic: typing.Dict[str, typing.Union[str, typing.List[str]]],
    dataset_name: str,
    table_name: str,
    project_id: typing.Optional[str],
) -> google.cloud.bigquery.job.load.LoadJob

Loads dictionary to a BigQuery table.

Parameters
Name    Description
dic Dict[str, Union[str, List[str]]]

Required. The dictionary to insert.

dataset_name str

Required. Name of the BigQuery dataset.

table_name str

Required. Name of the BigQuery table.

project_id Optional[str]

Optional. Project ID containing the BigQuery table. If not passed, falls back to the default inferred from the environment.

Returns
Type    Description
bigquery.job.LoadJob    The BigQuery LoadJob for adding the dictionary.
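As a rough sketch of what loading a single dictionary can look like with the `google-cloud-bigquery` client, assuming the dictionary is sent as one newline-delimited JSON row with schema autodetection. The names `table_id_sketch` and `load_dict_sketch` are hypothetical, and the job configuration is an assumption rather than the library's actual settings.

```python
from typing import Dict, List, Optional, Union


def table_id_sketch(
    dataset_name: str, table_name: str, project_id: Optional[str] = None
) -> str:
    """Build a table ID; without a project, the client's default project applies."""
    parts = [dataset_name, table_name]
    if project_id:
        parts.insert(0, project_id)
    return ".".join(parts)


def load_dict_sketch(
    dic: Dict[str, Union[str, List[str]]],
    dataset_name: str,
    table_name: str,
    project_id: Optional[str] = None,
):
    """Start a load job that appends dic as a single row (needs credentials)."""
    # Imported lazily so table_id_sketch works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # assumption: let BigQuery infer the schema
    )
    return client.load_table_from_json(
        [dic],
        table_id_sketch(dataset_name, table_name, project_id),
        job_config=job_config,
    )
```

Calling `result()` on the returned job blocks until the load finishes and raises if it failed.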

_entities_from_shards

_entities_from_shards(
    shards: typing.List[google.cloud.documentai_v1.types.document.Document],
) -> typing.List[google.cloud.documentai_toolbox.wrappers.entity.Entity]

Returns a list of Entities and Properties from a list of documentai.Document shards.

Parameter
Name    Description
shards List[google.cloud.documentai.Document]

Required. List of document shards.

Returns
Type    Description
List[Entity]    A list of Entities.

_get_batch_process_metadata

_get_batch_process_metadata(
    operation_name: str, timeout: typing.Optional[float] = None
) -> google.cloud.documentai_v1.types.document_processor_service.BatchProcessMetadata

Gets BatchProcessMetadata from a batch_process_documents() long-running operation.

Parameters
Name    Description
operation_name str

Required. The fully qualified operation name for a batch_process_documents() operation.

timeout Optional[float]

Optional. Default None. Time in seconds to wait for the operation to complete. If None, waits indefinitely.

Returns
Type    Description
documentai.BatchProcessMetadata    Metadata from the batch process.
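The timeout semantics described above can be sketched as a generic polling loop. This is an illustration only, not the library's code: `wait_until_done_sketch`, its `is_done` callback, the `poll_interval` parameter, and the choice of `TimeoutError` are all assumptions.

```python
import time
from typing import Callable, Optional


def wait_until_done_sketch(
    is_done: Callable[[], bool],
    timeout: Optional[float] = None,
    poll_interval: float = 1.0,
) -> None:
    """Poll is_done() until it returns True; timeout=None waits indefinitely."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while not is_done():
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"operation did not complete within {timeout} seconds")
        time.sleep(poll_interval)


# Usage: a stand-in operation that reports completion on the third check.
checks = {"count": 0}


def fake_operation_done() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


wait_until_done_sketch(fake_operation_done, timeout=None, poll_interval=0)
```

In real use, `is_done` would fetch the long-running operation by its fully qualified name and report its completion status.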

_get_shards

_get_shards(
    gcs_bucket_name: str, gcs_prefix: str
) -> typing.List[google.cloud.documentai_v1.types.document.Document]

Returns a list of documentai.Document shards from a Cloud Storage folder.

Parameters
Name    Description
gcs_bucket_name str

Required. The name of the Cloud Storage bucket. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_bucket_name={bucket_name}.

gcs_prefix str

Required. The prefix of the JSON files in the target_folder. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_prefix={optional_folder}/{target_folder}.

Returns
Type    Description
List[google.cloud.documentai.Document]    A list of documentai.Documents.
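To make the relationship between a full Cloud Storage URI and the two arguments concrete, here is a small sketch that splits a `gs://` URI into a bucket name and a prefix in the format described above. The helper name `split_gcs_uri_sketch` and the example URI are hypothetical.

```python
from typing import Tuple


def split_gcs_uri_sketch(gcs_uri: str) -> Tuple[str, str]:
    """Split gs://{bucket_name}/{prefix}/ into (bucket_name, prefix)."""
    scheme = "gs://"
    if not gcs_uri.startswith(scheme):
        raise ValueError(f"not a Cloud Storage URI: {gcs_uri!r}")
    bucket_name, _, prefix = gcs_uri[len(scheme):].partition("/")
    if not bucket_name:
        raise ValueError(f"missing bucket name: {gcs_uri!r}")
    # The documented prefix format has no trailing slash.
    return bucket_name, prefix.rstrip("/")


# Usage with a made-up output location.
gcs_bucket_name, gcs_prefix = split_gcs_uri_sketch("gs://my-bucket/output/batch-1/0/")
```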

_insert_into_dictionary_with_list

_insert_into_dictionary_with_list(
    dic: typing.Dict[str, typing.Union[str, typing.List[str]]], key: str, value: str
) -> typing.Dict[str, typing.Union[str, typing.List[str]]]

Inserts value into a dictionary that can contain lists.

Parameters
Name    Description
dic Dict[str, Union[str, List[str]]]

Required. The dictionary to insert into.

key str

Required. The key to be created or inserted into.

value str

Required. The value to be inserted.

Returns
Type    Description
Dict[str, Union[str, List[str]]]    The dictionary after adding the key-value pair.
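One plausible reading of "a dictionary that can contain lists" is sketched below: the first value for a key is stored as a plain string, and later values for the same key turn it into a list. This behavior and the name `insert_with_list_sketch` are assumptions for illustration, not the library's confirmed logic.

```python
from typing import Dict, List, Union

StrOrList = Union[str, List[str]]


def insert_with_list_sketch(
    dic: Dict[str, StrOrList], key: str, value: str
) -> Dict[str, StrOrList]:
    """Insert value under key, promoting repeated keys to a list of values."""
    if key not in dic:
        dic[key] = value  # first value: keep it as a plain string
    elif isinstance(dic[key], list):
        dic[key].append(value)  # already a list: extend it
    else:
        dic[key] = [dic[key], value]  # second value: promote to a list
    return dic


# Usage: repeated entity types accumulate into a list.
row: Dict[str, StrOrList] = {}
insert_with_list_sketch(row, "invoice_id", "A-17")
insert_with_list_sketch(row, "line_item", "Widget")
insert_with_list_sketch(row, "line_item", "Gadget")
```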

_pages_from_shards

_pages_from_shards(
    shards: typing.List[google.cloud.documentai_v1.types.document.Document],
) -> typing.List[google.cloud.documentai_toolbox.wrappers.page.Page]

Returns a list of Pages from a list of documentai.Document shards.

Parameter
Name    Description
shards List[google.cloud.documentai.Document]

Required. List of document shards.

Returns
Type    Description
List[Page]    A list of Pages.