Before you start
To ingest sample documents into the Document AI Warehouse, see the Quickstart Guide.
Define your data for search
When defining your document schemas and creating your documents, it's important to consider what properties you want to define and how they're going to be used with search, if at all.
Mark a property filterable if you want to use that property to include or exclude a portion of documents for a search. For example, you might make a property that represents a "Vendor" filterable because your users want to search for invoices from a specific vendor.
If you want to construct a histogram (see the example later in this topic) on a property, then the property needs to be filterable.
Mark a property searchable if it has data that your users will want to query during a keyword search.
Full text search
Full text search is the process of retrieving all documents that match the search keywords in their searchable text. The user provides a list of keywords (words separated by spaces), typically typed into a search field in the UI. In Document AI Warehouse, the keywords are processed and converted into a query. This processing strips stopwords (such as "the", "in", and "an") and stems the remaining words. Stemming reduces each word to a common root so that variations of the word match one another. For example, "work", "working", and "worked" all match.
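The keyword processing described above can be pictured with a toy sketch. This is not the service's implementation: the stopword set contains only the three examples named above, and the suffix list and the naive_stem and normalize_keywords helpers are hypothetical names invented for illustration.

```python
# Toy illustration only: the real stopword list and stemmer used by
# Document AI Warehouse are not described on this page.
STOPWORDS = {"the", "in", "an"}  # the three examples given above
SUFFIXES = ("ing", "ed")  # hypothetical, deliberately tiny suffix list


def naive_stem(word):
    """Strip one known suffix so simple variations share a stem."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalize_keywords(text):
    """Lowercase, drop stopwords, and stem the remaining keywords."""
    words = text.lower().split()
    return [naive_stem(word) for word in words if word not in STOPWORDS]


print(normalize_keywords("working in the office"))
```

A production stemmer handles far more cases than suffix stripping; the sketch only shows why "work", "working", and "worked" can end up matching the same documents.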
What data gets searched?
- The document's plain_text.
- If you are importing a Document AI object, the embedded cloud_ai_document.text.
- The document's display_name.
- All searchable properties.
The query partially supports Google AIP style syntax. Specifically, the query supports literals, logical operators, negation operators, comparison operators, and functions.
- Literals: A bare literal value (examples: "42", "Hugo") is a value to be matched against. It searches over the full text of the document and the searchable properties.
- Logical operators: "AND", "and", "OR", and "or" are binary logical operators (example: "engineer OR developer").
- Negation operators: "NOT" and "!" are negation operators (example: "NOT software").
- Comparison operators: The binary comparison operators =, !=, <, >, <=, and >= are supported for string, numeric, enum, and boolean values. The like operator ~~ is also supported for strings. It provides semantic search functionality by parsing, stemming, and expanding synonyms in the input query. To specify a property in the query, the left-hand side of the comparison must be the property ID including the parent. The right-hand side must be a literal. For example, \"projects/123/locations/us\".property_a < 1 matches results whose property_a is less than 1 in project 123 and location us. Literals and comparison expressions can be combined in a single query (example: software engineer \"projects/123/locations/us\".salary > 100).
- Functions: The supported functions are LOWER([property_name]), which performs a case-insensitive match, and EMPTY([property_name]), which filters on the existence of a key.
- Nested expressions: Expressions can be nested using parentheses and connected with logical operators. The default logical operator is AND if there is no operator between expressions.
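As a minimal sketch of the comparison syntax, the following Python builds a query that mixes keywords with a property comparison. The property_ref helper is a hypothetical name, and the sketch assumes that the backslashes in the examples above are JSON string escaping rather than part of the query itself.

```python
import json


def property_ref(parent, property_id):
    """Return the left-hand side of a comparison: "<parent>".<property_id>."""
    return f'"{parent}".{property_id}'


# Keywords followed by a comparison. With no operator between them,
# the expressions are joined by the default AND.
parent = "projects/123/locations/us"  # sample parent from the text above
query = f"software engineer {property_ref(parent, 'salary')} > 100"
print(query)

# Serializing to JSON escapes the double quotes around the parent.
print(json.dumps({"query": query}))
```

Building the left-hand side in one place avoids quoting mistakes when several comparisons appear in the same query.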
The query can be used with other filters, for example time_filters and folder_name_filter. They are connected with the AND operator under the hood.
Search queries can be filtered by additional parameters such as property, time, schema, folder, and creator.
Call to a search request
To call the search service, you must use a search request, which is defined as follows:
{
"requestMetadata": {
object (RequestMetadata)
},
"documentQuery": {
object (DocumentQuery)
},
"offset": integer,
"pageSize": integer,
"pageToken": string,
"orderBy": string,
"histogramQueries": [
{
object (HistogramQuery)
}
],
"requireTotalSize": boolean,
"totalResultSize": enum (TotalResultSize),
"qaSizeLimit": integer
}
The parent field must be filled in with the format:
/projects/PROJECT_ID/locations/LOCATION
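As a minimal sketch, a request body that uses a few of the fields defined above can be assembled as a Python dictionary and serialized to JSON. The build_search_request helper and the sample keywords are hypothetical. requestMetadata is omitted because its shape is not described on this page, although a real request may need it.

```python
import json


def build_search_request(keywords, page_size=10):
    """Assemble a search request body from fields defined above."""
    return {
        "documentQuery": {"query": keywords},
        "pageSize": page_size,
        "orderBy": "relevance desc",
        "requireTotalSize": True,
    }


body = build_search_request("invoice vendor", page_size=20)
print(json.dumps(body, indent=2))
```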
Response to a search request
The search response is defined as follows:
{
"matchingDocuments": [
{
object (MatchingDocument)
}
],
"nextPageToken": string,
"totalSize": integer,
"metadata": {
object (ResponseMetadata)
},
"histogramQueryResults": [
{
object (HistogramQueryResult)
}
]
}
Document Query
The document_query field is defined as follows:
{
"query": string,
"isNlQuery": boolean,
"customPropertyFilter": string,
"timeFilters": [
{
object (TimeFilter)
}
],
"documentSchemaNames": [
string
],
"propertyFilter": [
{
object (PropertyFilter)
}
],
"fileTypeFilter": {
object (FileTypeFilter)
},
"folderNameFilter": string,
"queryContext": [
string
],
"documentCreatorFilter": [
string
],
"customWeightsMetadata": {
object (CustomWeightsMetadata)
}
}
The query field is for the requesting user's search query words. Typically, these come from the search field in the UI.
Filters
Document AI Warehouse offers a variety of filters.
Document time filter
The create and update time filters restrict the search to documents that match the keywords and were created or last updated within a specified time period.
A TimeFilter object is used to specify the time range. It is defined as follows:
{
"timeRange": {
object (Interval)
},
"timeField": enum (TimeField)
}
The time_field field is where you specify whether the time range given in time_range applies to the document's creation time or to the document's last update time.
The time_range field specifies the time range as an Interval. An Interval is defined as:
{
"startTime": string,
"endTime": string
}
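The following sketch builds a TimeFilter-shaped dictionary covering a 30-day window. It assumes that startTime and endTime take RFC 3339 timestamp strings and that CREATE_TIME is a valid TimeField value; neither detail is stated on this page. The to_rfc3339 and build_time_filter helpers are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def to_rfc3339(moment):
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_time_filter(start, end, time_field):
    """Return a TimeFilter-shaped dictionary for the given range."""
    if start > end:
        raise ValueError("start must not be later than end")
    return {
        "timeRange": {
            "startTime": to_rfc3339(start),
            "endTime": to_rfc3339(end),
        },
        # Assumed enum name; check the TimeField reference for valid values.
        "timeField": time_field,
    }


end = datetime(2024, 1, 31, tzinfo=timezone.utc)
start = end - timedelta(days=30)
print(build_time_filter(start, end, "CREATE_TIME"))
```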
Creator filter
To search for documents that were created by a specific user or users, use the creator filter. For example:
{
  "documentQuery": {
    "query": "videogames director",
    "documentCreatorFilter": [
      "diane@some_company.com",
      "frank@some_company.com"
    ]
  }
}
Property filter
The property filter lets you specify filters on any of the properties that you have specified in a schema, as long as that property has been configured to be filterable.
For example, in the legal industry you might filter on a property called COURT to search only documents from a particular court.
Property filters use a PropertyFilter object. You can have more than one property filter. When you use multiple property filters, they are combined using the OR operator.
A property filter is defined as follows:
{
"documentSchemaName": string,
"condition": string
}
Properties are defined in schemas. Thus, the documentSchemaName field is where you specify the schema for the property that you use for filtering. In the condition field, you specify the desired logic. For examples of using the documentSchemaName and condition fields, see the preceding examples on this page.
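As a rough sketch, two property filters on the COURT property could be built as follows. The schema resource name and the condition syntax (a property name, an operator, and a quoted string) are assumptions rather than details confirmed on this page, and build_property_filter is a hypothetical helper.

```python
def build_property_filter(schema_name, condition):
    """Return a PropertyFilter-shaped dictionary."""
    return {"documentSchemaName": schema_name, "condition": condition}


# Hypothetical schema resource name; substitute your own schema.
schema = "projects/123/locations/us/documentSchemas/SCHEMA_ID"

# Assumed condition syntax. Multiple property filters are combined
# with OR, so this matches documents from either court.
filters = [
    build_property_filter(schema, 'COURT = "Supreme Court"'),
    build_property_filter(schema, 'COURT = "Court of Appeals"'),
]

document_query = {"query": "contract dispute", "propertyFilter": filters}
print(document_query)
```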
Matching document
A matching document contains a Document and a snippet (discussed later). The returned document in MatchingDocument is not a fully filled-in document. It contains minimal data for displaying a search results list to the requesting user. If the full document is desired (for example, if the user clicked on a search result), then the full document should be retrieved via the GetDocument API.
The following Document fields are filled in: Project number, Document id, Document schema id, Create time, Update time, Display name, Raw document file type, Reference id, and Filterable properties.
A matching document would look like this:
{
"document": {
object (Document)
},
"searchTextSnippet": string,
"qaResult": {
object (QAResult)
}
}
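A results list can be derived from the response with a few dictionary lookups, as in the following sketch. The summarize_results helper is hypothetical, the name and displayName keys inside the document are assumed JSON field names, and the sample response is hand-written for illustration, not real service output.

```python
def summarize_results(response):
    """Extract the fields a results list typically shows for each match."""
    rows = []
    for match in response.get("matchingDocuments", []):
        document = match.get("document", {})
        rows.append(
            {
                # Assumed field names within the Document object.
                "name": document.get("name", ""),
                "title": document.get("displayName", ""),
                "snippet": match.get("searchTextSnippet", ""),
            }
        )
    return rows


# Hand-written sample response, for illustration only.
sample_response = {
    "matchingDocuments": [
        {
            "document": {"name": "DOCUMENT_NAME", "displayName": "Invoice 42"},
            "searchTextSnippet": "invoice from vendor",
        }
    ]
}
for row in summarize_results(sample_response):
    print(row["title"], "-", row["snippet"])
```

The name value kept in each row is what a client would pass along when it later retrieves the full document.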
Ranking/sort
The search request lets you specify how you want the results sorted. To sort, use the order_by field in the search request. The possible values for this field include:
- relevance desc: relevance descending, that is, the best matches are on top.
- upload_date desc: the date the document was created in descending order (newest on top).
- upload_date: the date the document was created in ascending order (oldest on top).
- update_date desc: the date the document was last updated in descending order (newest on top).
- update_date: the date the document was last updated in ascending order (oldest on top).
If you don't specify a sort, but you supply search keywords, then the sort is by relevance descending (the best matches on top). If neither the sort nor keywords are provided, then the default sort is by update time descending (the latest documents on top).
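A client that shows the active sort order in its UI may want to mirror these defaults locally. The following sketch restates the rule above in Python; effective_order_by is a hypothetical helper and does not call the service.

```python
def effective_order_by(order_by, query):
    """Return the sort order that applies, following the defaults above."""
    if order_by:
        return order_by  # an explicit sort always wins
    if query.strip():
        return "relevance desc"  # keywords given, no sort: best matches first
    return "update_date desc"  # neither given: latest documents first


print(effective_order_by("", "invoice"))
print(effective_order_by("", ""))
print(effective_order_by("upload_date", "invoice"))
```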
Pagination
Pagination is useful for displaying a page worth of data to the end user. Here you can specify the size of the page and get a total count of the result size to display back to the user (for example, "Showing 50 documents of 300").
Set the page_size field to the desired number of results that you want to receive with the search request. This might correspond to the requirements of the UI search result display size.
There are two mechanisms: offset and page token.
An offset is the zero-based index into the list of returnable documents at which the results should start. For example, an offset of 5 means that you want the sixth document onward. To get the next page of results, increment the offset by the page size.
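The offset arithmetic can be written out as a small sketch. The page_offset helper is a hypothetical name, and page numbers are assumed to start at 0.

```python
def page_offset(page_number, page_size):
    """Return the offset of the first result on a zero-based page."""
    if page_number < 0 or page_size <= 0:
        raise ValueError("page_number must be >= 0 and page_size must be > 0")
    return page_number * page_size


# With 50 results per page, successive pages start at these offsets.
for page in range(3):
    print(page, page_offset(page, 50))
```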
Alternatively, you can use a page token and not have to worry about calculating the next offset. After making your first search request, you get a search response that contains the next_page_token field. If this field is empty, then there are no more results. If the field is not empty, use this token in your next search request by setting the page_token field.
Some UIs display the count of documents found by the search, for example, "You are viewing 10 documents of 120". To get a document count returned, set the request's boolean require_total_size field to true.
Tip: Setting require_total_size to true carries a performance penalty. Set it to true on the first page query, then set it to false on all subsequent requests, keeping the total count in a local variable.
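The page-token loop and the tip above can be combined in one sketch. Here search stands for any callable that takes a request dictionary and returns a response dictionary. The fetch_all and fake_search functions are hypothetical, and fake_search is an in-memory stand-in so that the example runs without calling the service.

```python
def fetch_all(search, base_request):
    """Follow page tokens to the end; request the total on page one only."""
    total_size = 0
    documents = []
    request = dict(base_request, requireTotalSize=True)
    while True:
        response = search(request)
        if request["requireTotalSize"]:
            total_size = response.get("totalSize", 0)  # keep the count locally
        documents.extend(response.get("matchingDocuments", []))
        token = response.get("nextPageToken", "")
        if not token:  # an empty token means there are no more results
            return total_size, documents
        request = dict(base_request, pageToken=token, requireTotalSize=False)


def fake_search(request):
    """In-memory stand-in for the service: five documents in total."""
    items = [{"document": {"displayName": f"doc-{i}"}} for i in range(5)]
    start = int(request.get("pageToken") or 0)
    end = start + request.get("pageSize", 2)
    response = {"matchingDocuments": items[start:end]}
    if end < len(items):
        response["nextPageToken"] = str(end)
    if request.get("requireTotalSize"):
        response["totalSize"] = len(items)
    return response


total, docs = fetch_all(
    fake_search, {"documentQuery": {"query": "invoice"}, "pageSize": 2}
)
print(f"Showing {len(docs)} documents of {total}")
```

In an interactive UI you would normally fetch one page at a time rather than all pages at once; the loop only shows how the token and the total count are carried between requests.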
Code Samples
Python
For more information, see the Document AI Warehouse Python API reference documentation.
To authenticate to Document AI Warehouse, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Java
For more information, see the Document AI Warehouse Java API reference documentation.
To authenticate to Document AI Warehouse, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Next steps
- Proceed to Advanced search to learn how to use the advanced search features.
- Visit the REST reference