Advanced search

This section discusses more of the search feature's nuances and advanced topics.

Custom synonyms

Document AI Warehouse provides a feature called "Custom Synonyms" that lets customers supply their own synonyms for their specific domains. A synonym, as the name implies, is a word with a similar meaning that is used during search. If the user searches for "television", then the following synonyms can be added to the query automatically, without the user's involvement: "TV", "Video Monitor", and "Video Screen". The search is then performed with both the user's original search terms and the synonyms.

This feature helps expand a user's search and return the expected results. Typical usage includes company or industry terms, acronyms, lingo, and vernacular.

Synonym, Context, and SynonymSet

Document AI Warehouse introduces three major terms for synonym customization:

  • Synonym. A Synonym represents a set of words where all the words have similar meaning.

  • Context. A Context represents a group of users (like industry, division, or organization users) that have specific synonyms that are not used by other groups. For example, the Finance and Health Care departments likely use a completely different SynonymSet. The contexts can be specified in the queryContext field of the Search API. Thus, using different contexts for the same search query terms could yield different search results.

  • SynonymSet. A SynonymSet is a collection of synonyms for a specific context.

{
  "name": string,
  "context": string,
  "synonyms": [
    {
      object (Synonym)
    }
  ]
}
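
As a rough illustration of the structure above, the following Python sketch assembles a SynonymSet-shaped body from rows of words. The "words" key inside each synonym and the "electronics" context name are assumptions made for this example; the text above does not spell out the fields of the Synonym object.

```python
import json


def build_synonym_set(context, rows):
    """Return a SynonymSet-shaped dict for one context.

    Each row is a list of words that should be treated as synonyms.
    The "words" key is an assumption about the Synonym object.
    """
    return {
        "context": context,
        "synonyms": [{"words": list(row)} for row in rows],
    }


# "electronics" is a hypothetical context name used only for illustration.
synonym_set = build_synonym_set(
    "electronics",
    [["television", "TV", "Video Monitor", "Video Screen"]],
)
print(json.dumps(synonym_set, indent=2))
```

Keeping each row as its own synonym entry mirrors the idea that all words in one Synonym share a meaning, while the SynonymSet groups those rows under a single context.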

Example use case

SynonymSet with context: "finance"

"Currency","Foreign exchange","dollar","euro","yen"
"Inflation","CPI","economic expansion","economic boom","higher prices"
"IRS","Internal Revenue Service","US Treasury"
"Tax return","1040","1120","1099","W-2"

In the preceding example, when the user queries for 'Currency' and the queryContext is 'finance,' then the other synonyms in that row (that is, Foreign exchange, dollar, euro, yen) are added implicitly to the search query. Similarly, if the user queries any term in that row, all other terms are added to the final query. Using the same example, if the user queries for 'euro,' then Currency, Foreign exchange, dollar, and yen are added to the final query.

SynonymSet with context: "healthcare"

"provider","clinic","hospital","doctor","therapist","specialist"
"Medical claim","Claim","1500","Diagnosis","ICD-9","ICD9","ICD-10","ICD10","CPT","HCPCS"
"injury","trauma","hurt","wound","sore","bruise","cut","laceration","lesion","abrasion","contusion"

Search with custom synonyms expansion

To search documents with custom synonym expansion, specify one or more query_context values in the search request. For details, see the SearchDocuments API documentation.

Folder search

As the name suggests, folder search searches only in a particular folder and its subfolders.

An example search request would look like this:

  {
    document_query: {
      query: "songs",
      folder_name_filter: "projects/PROJECT_NUM/locations/LOCATION/documents/888"
    }
  }

Histograms

Histograms are an advanced feature that aggregates counts over specified data, for example, how many documents of each schema match the user's query. For a government-related database, if the user searches for "Orange County", a document schema histogram might return the number of driver's licenses, marriage certificates, or deeds that match the search criteria. Histograms honor the requesting user's data access permissions, so only documents that the requesting user has access to are counted.

Histograms can be a powerful resource; however, gathering and aggregating all the data takes time.

Histograms are not affected by the search request's pagination fields.

General histogram query format

The HistogramQuery is defined as:

{
  "histogramQuery": string,
  "requirePreciseResultSize": boolean,
  "filters": {
    object (HistogramQueryPropertyNameFilter)
  }
}

The histogram_query field has the format COUNT('<item to count>'). A search request can carry more than one histogram query, because its histogram_queries field is repeated.

The require_precise_result_size field is unimplemented.

The filters field is discussed in the Property histograms section.

Document schema histograms

You can create document schema or document type histograms by adding this histogram query:

  {
    document_query: {
      query: "test"
    },
    histogram_queries: [
      {
        histogram_query: "count('DocumentSchemaId')"
      }
    ]
  }

For example, the map inside HistogramQueryResult of this query would look like the following:

  histogramQueryResults: [
    {
      histogramQuery: "DocumentSchemaId",
      histogram:
        {
          "projects/1234/locations/us-west/documentSchemas/5543": "22",
          "projects/1234/locations/us-west/documentSchemas/5544": "2",
          "projects/1234/locations/us-west/documentSchemas/5545": "4",
          "projects/1234/locations/us-west/documentSchemas/5546": "122",
          "projects/1234/locations/us-west/documentSchemas/5547": "256",
          "projects/1234/locations/us-west/documentSchemas/5548": "1",
          "projects/1234/locations/us-west/documentSchemas/5549": "5",
          "projects/1234/locations/us-west/documentSchemas/5550": "15"
        }
    }
  ]

This example shows that eight document schemas match the given search query, and it shows the number of documents per document schema.
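
Because the counts in the sample response above arrive as strings, a client typically converts them before doing arithmetic. The following Python sketch shows one way to do that; the helper name schema_counts is hypothetical, and the input is the histogram map copied from the sample response.

```python
# Histogram map copied from the sample response above.
histogram = {
    "projects/1234/locations/us-west/documentSchemas/5543": "22",
    "projects/1234/locations/us-west/documentSchemas/5544": "2",
    "projects/1234/locations/us-west/documentSchemas/5545": "4",
    "projects/1234/locations/us-west/documentSchemas/5546": "122",
    "projects/1234/locations/us-west/documentSchemas/5547": "256",
    "projects/1234/locations/us-west/documentSchemas/5548": "1",
    "projects/1234/locations/us-west/documentSchemas/5549": "5",
    "projects/1234/locations/us-west/documentSchemas/5550": "15",
}


def schema_counts(histogram):
    """Map each trailing schema ID to an integer count, largest count first."""
    counts = {
        name.rsplit("/", 1)[-1]: int(value)
        for name, value in histogram.items()
    }
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))


counts = schema_counts(histogram)
print(len(counts))           # number of matching schemas: 8
print(sum(counts.values()))  # total matching documents: 427
print(next(iter(counts)))    # schema with the most matches: 5547
```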

Property histograms

Property histograms display the counts of specified filterable properties. Property histograms have the format:

  COUNT('SomeSchema.SomeProp')

  // Also, you can try with:

  COUNT('SomeSchema.ParentProp.SubProp')

An example response follows:

  histogramQueryResults: [
    {
      histogramQuery: "5678.text_prop",
      histogram: {
        some_text: "1",
        More_text: "55",
        Additional_text: "19"
      }
    }
  ]

The count('FilterableProperties') histogram returns counts for all filterable properties that meet the HistogramQueryPropertyNameFilter criteria. It returns the count of the property usage (as opposed to the count of the values).

The HistogramQueryPropertyNameFilter is defined as:

  {
    "documentSchemas": [
      string
    ],
    "propertyNames": [
      string
    ],
    "yAxis": enum (HistogramYAxis)
  }

You can limit the results to a set of document schemas by filling in the repeated document_schemas field with up to 10 schema IDs. You can optionally narrow the aggregated properties by using the repeated property_names field.

The y_axis field determines how the properties are counted. If it's not set or is set to HISTOGRAM_YAXIS_DOCUMENT, then the histogram counts the number of documents in which the property is used. If y_axis is set to HISTOGRAM_YAXIS_PROPERTY, then the histogram counts every value of the property. For example:

Document 1: Payments_property: [AMEX, VISA]

Document 2: Payments_property: [MC]

HISTOGRAM_YAXIS_DOCUMENT would return:
  Payments_property: 2
  Explanation: The Payments_property is found in two documents.

HISTOGRAM_YAXIS_PROPERTY would return:
  Payments_property: 3
  Explanation: The Payments_property has three values in the documents found.
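
The difference between the two y_axis settings can be reproduced with a small local Python sketch. It models the counting rule only and is not how the service computes histograms; the function name count_property is hypothetical.

```python
# The two example documents from the text above.
documents = [
    {"Payments_property": ["AMEX", "VISA"]},
    {"Payments_property": ["MC"]},
]


def count_property(documents, prop, y_axis="HISTOGRAM_YAXIS_DOCUMENT"):
    """Count a property per document or per value, depending on y_axis."""
    if y_axis == "HISTOGRAM_YAXIS_PROPERTY":
        # Every value of the property counts once.
        return sum(len(doc.get(prop, [])) for doc in documents)
    # Default: each document that has the property counts once.
    return sum(1 for doc in documents if doc.get(prop))


by_document = count_property(documents, "Payments_property")
by_property = count_property(
    documents, "Payments_property", "HISTOGRAM_YAXIS_PROPERTY"
)
print(by_document)  # 2
print(by_property)  # 3
```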

A simple example query follows:

  {
    document_query: {
    },
    histogram_queries: [
      {
        histogram_query: "count('FilterableProperties')"
      }
    ]
  }

A sample response to the preceding query follows:

  histogramQueryResults: [
    {
      histogramQuery: "FilterableProperties",
      histogram: {
        456.int_prop: "4",
        456.text_prop: "26"
      }
    }
  ]

To filter the count('FilterableProperties') results by schemas, see the following request:

  {
    document_query: {
    },
    histogram_queries: [
      {
        histogram_query: "count('FilterableProperties')",
        filters: {
          document_schemas: [
            "projects/1234/locations/us-west/documentSchemas/678",
            "projects/1234/locations/us-west/documentSchemas/456"
          ]
        }
      }
    ]
  }

To filter the count('FilterableProperties') results for specific properties, see the following request:

  {
    document_query: {
    },
    histogram_queries: [
      {
        histogram_query: "count('FilterableProperties')",
        filters: {
          property_names: [
            "678.MORTAGE_TYPE",
            "456.language_code"
          ]
        }
      }
    ]
  }

To see the property count for count('FilterableProperties'), set y_axis to HISTOGRAM_YAXIS_PROPERTY as follows:

  {
    document_query: {
    },
    histogram_queries: [
      {
        histogram_query: "count('FilterableProperties')",
        filters: {
          y_axis: "HISTOGRAM_YAXIS_PROPERTY"
        }
      }
    ]
  }
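
The filter variants shown in the preceding requests can be assembled programmatically. The following Python sketch builds the request body as a plain dictionary using the field names from the examples above; the helper histogram_query is hypothetical, and the sketch does not send the request or use any client library.

```python
import json

MAX_DOCUMENT_SCHEMAS = 10  # limit on document_schemas stated in the text above


def histogram_query(item="FilterableProperties", document_schemas=None,
                    property_names=None, y_axis=None):
    """Build one histogram query entry, adding only the filters provided."""
    query = {"histogram_query": f"count('{item}')"}
    filters = {}
    if document_schemas:
        if len(document_schemas) > MAX_DOCUMENT_SCHEMAS:
            raise ValueError("document_schemas accepts at most 10 entries")
        filters["document_schemas"] = list(document_schemas)
    if property_names:
        filters["property_names"] = list(property_names)
    if y_axis:
        filters["y_axis"] = y_axis
    if filters:
        query["filters"] = filters
    return query


request = {
    "document_query": {},
    "histogram_queries": [
        histogram_query(
            document_schemas=[
                "projects/1234/locations/us-west/documentSchemas/456",
            ],
            y_axis="HISTOGRAM_YAXIS_PROPERTY",
        ),
    ],
}
print(json.dumps(request, indent=2))
```

Omitting the filters key when no filter is supplied keeps the unfiltered case identical to the simple example query shown earlier.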

Next steps