Model adaptation

Very often, what a user says strongly depends on the particular context they are in. To highlight a few examples, the context can include options presented to the user, the conversation topic, the time of day or location, or prior information about the user (such as favorite radio stations or top contacts).

Model adaptation is used to provide this context to the recognizer. It changes the underlying probabilities of the Speech-to-Text model so that contextual words or phrases are more likely to be considered by the recognizer than other options that might otherwise be selected.

Relevant contextual phrases can significantly improve recognition performance. Irrelevant phrases risk degrading recognition performance: if users don't speak those phrases, the recognizer is guided in the wrong direction.

To increase the probability that the libgspeech library recognizes the words and phrases you need when it transcribes your audio data, pass them as phrases within the phrase_sets field of a SpeechAdaptation object. Assign the SpeechAdaptation object to the adaptation field of the RecognitionConfig object in your request:

adaptation: {
  phrase_sets {
    phrases: "weather is hot"
    phrases: "weather is cold"
  }
}
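
If the contextual phrases are only known at runtime (for example, options currently presented to the user), your application can generate this configuration programmatically. The following Python sketch is a minimal illustration under the assumption that your application assembles the configuration text itself; build_adaptation_config is a hypothetical helper that only formats strings and does not call any libgspeech API.

# Hypothetical helper: emit the adaptation text shown above from a runtime
# phrase list. This is only string formatting, not a libgspeech call.
def build_adaptation_config(phrases):
    lines = ["adaptation: {", "  phrase_sets {"]
    for phrase in phrases:
        # Escape embedded quotes so the emitted text stays well formed.
        escaped = phrase.replace('"', '\\"')
        lines.append(f'    phrases: "{escaped}"')
    lines.extend(["  }", "}"])
    return "\n".join(lines)

print(build_adaptation_config(["weather is hot", "weather is cold"]))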

Use class tokens to bias the model

Classes represent common concepts that occur in natural language, such as numeric values and calendar dates. A class lets you improve transcription accuracy for large groups of words that map to a common concept but don't always include identical words or phrases.

For example, suppose that your audio data includes recordings of people saying their street address. You might have an audio recording of someone saying "My house is 123 Main Street, the fourth house on the left." In this case, you want Speech-to-Text to recognize the first sequence of numerals ("123") as an address rather than as an ordinal number ("one-hundred twenty-third"). However, not all people live at "123 Main Street." It's impractical to list every possible street address in phrases. Instead, you can use a class to indicate that a street number should be recognized no matter what the number actually is.

To use class tokens, include them in your speech adaptation phrases. You can use classes either as stand-alone items in the phrases array or embed them in longer multi-word phrases. For example, to improve the transcription of address numbers from your source audio, use the $ADDRESSNUM class. You can indicate an address number in a larger phrase by including the class token in a string: "my address is $ADDRESSNUM". However, this phrase doesn't help in cases where the audio contains a similar but non-identical phrase, such as "I am at 123 Main Street". To aid recognition of similar phrases, it's important to also add the class token by itself:

adaptation: {
  phrase_sets {
    phrases: "my address is $ADDRESSNUM"
    phrases: "$ADDRESSNUM"
  }
}

To learn which class tokens are available in your locale of interest, contact Google.

Improve recognition using predefined classes

A custom class is a customized list of related items or values. Google provides several predefined custom classes (such as contacts or navigation) that we recommend for use with the libgspeech library. These predefined classes are likely to cover the phrases your application uses and typically yield better recognition accuracy than classes you create yourself.

Predefined custom classes are grouped into two categories, which you need to reference differently in your requests:

  • Regular custom classes for which you need to provide phrases and items, for example, contacts.

  • Phraseless custom classes, which are referenced by custom_class_id only, for example, navigation.

To use a regular custom class that requires phrases and items, create a CustomClass object that includes each value in items and reference this class by its custom_class_id in your phrases. For example:

adaptation: {
  custom_classes {
    custom_class_id: "contacts"
    items: "Asia"
    items: "Alex"
    items: "Nuno Pereira"
  }
  phrase_sets {
    phrases: "call ${contacts}"
  }
}
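
Conceptually, a phrase that references a custom class behaves roughly as if it covered every item in that class, so "call ${contacts}" biases the recognizer toward "call Asia", "call Alex", and "call Nuno Pereira". The Python sketch below illustrates that expansion as a mental model only; expand_class_phrases is a hypothetical helper and does not reflect how the recognizer is implemented.

# Hypothetical illustration: expand a class reference into the phrases it
# conceptually covers. A mental model only, not the recognizer's behavior.
def expand_class_phrases(phrase, class_id, items):
    token = "${" + class_id + "}"
    return [phrase.replace(token, item) for item in items]

contacts = ["Asia", "Alex", "Nuno Pereira"]
print(expand_class_phrases("call ${contacts}", "contacts", contacts))
# ['call Asia', 'call Alex', 'call Nuno Pereira']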

You don't need to provide items or phrases for phraseless custom classes. In that case, Google provides the list of the most common phrases users are likely to say in that context (for example, "take highway" for navigation).

adaptation: {
  custom_classes {
    custom_class_id: "navigation"
  }
}

You can provide several custom classes in your request. In that case, each phrase set should only contain phrases that correspond to a single custom class. For example:

adaptation: {
  # An example of a regular custom class.
  custom_classes {
    custom_class_id: "radio-stations"
    items: "90s Rock & Hip-Hop"
    items: "Z100"
  }
  # An example of a phraseless custom class. (Note that no items or phrases
  # need to be provided.)
  custom_classes {
    custom_class_id: "navigation"
  }
  custom_classes {
    custom_class_id: "contacts"
    items: "Nuno Pereira"
    items: "Carl Jung"
  }
  phrase_sets {
    phrases: "play ${radio-stations}"
    phrases: "tune to ${radio-stations}"
  }
  phrase_sets {
    phrases: "message ${contacts}"
  }
  phrase_sets {
    phrases: "send email to ${contacts}"
  }
}
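
If your application builds phrase sets dynamically, you may want to verify the rule above, that each phrase set references only a single custom class, before sending the request. The following Python sketch is a hypothetical check that scans phrases for ${...} references; it assumes the phrase sets are available as plain lists of strings rather than parsed objects.

import re

# Matches class references such as ${contacts} or ${radio-stations}.
CLASS_REF = re.compile(r"\$\{([\w-]+)\}")

def check_single_class_per_set(phrase_sets):
    """Return (index, classes) for phrase sets that mix multiple classes."""
    problems = []
    for index, phrases in enumerate(phrase_sets):
        classes = {ref for phrase in phrases for ref in CLASS_REF.findall(phrase)}
        if len(classes) > 1:
            problems.append((index, sorted(classes)))
    return problems

phrase_sets = [
    ["play ${radio-stations}", "tune to ${radio-stations}"],
    ["message ${contacts}"],
    ["send email to ${contacts}"],
]
print(check_single_class_per_set(phrase_sets))  # [] means no phrase set mixes classes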

Create custom classes

If you have a specific business need that isn't met by the predefined custom classes, you can create custom classes.

For example, you might want to transcribe audio data that is likely to include the name of any one of several hundred regional restaurants. Restaurant names are relatively rare in general speech and therefore less likely to be chosen as "correct" by the recognition model, so specify the names in a custom class. For example:

adaptation: {
  custom_classes {
    custom_class_id: "restaurants"
    items: "sushido"
    items: "taneda sushi"
    items: "altura"
  }
  phrase_sets {
    phrases: "visit restaurants like ${restaurants}"
  }
}

Recommendations

The recommended maximum numbers of phrases and custom classes in your requests are as follows:

SpeechAdaptation {
   CustomClass { [max of 20]
      class id
      items      [max of 100]
   }
   PhraseSet {   [max of 5]
      phrases    [max of 300 across all 5 PhraseSets]
   }
}

Exceeding these limits may result in the following issues:

  • Quality degradation:
    • Overtriggering, or recognizing biased phrases that are not present in the audio.
    • Stuckiness, or truncation of parts of a transcript.
    • Indeterminism, or flakiness of transcription results for the same audio. This is caused by providing many similar or irrelevant speech adaptation phrases, which can lead the recognizer to choose between them at random.
  • Increased latency (possibly by seconds) and increased memory use as the underlying models become larger.
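
Before sending a request, you could run a pre-flight check of your adaptation against these recommended limits. The following Python sketch is a hypothetical helper based only on the limits listed above; it assumes the custom classes and phrase sets are available as plain Python containers rather than parsed objects.

# Hypothetical pre-flight check against the recommended limits above:
# at most 20 custom classes, 100 items per class, 5 phrase sets, and
# 300 phrases across all phrase sets.
def check_adaptation_limits(custom_classes, phrase_sets):
    issues = []
    if len(custom_classes) > 20:
        issues.append("more than 20 custom classes")
    for class_id, items in custom_classes.items():
        if len(items) > 100:
            issues.append(f"class '{class_id}' has more than 100 items")
    if len(phrase_sets) > 5:
        issues.append("more than 5 phrase sets")
    if sum(len(phrases) for phrases in phrase_sets) > 300:
        issues.append("more than 300 phrases across all phrase sets")
    return issues

custom_classes = {"contacts": ["Nuno Pereira", "Carl Jung"]}
phrase_sets = [["message ${contacts}"], ["send email to ${contacts}"]]
print(check_adaptation_limits(custom_classes, phrase_sets))  # [] means within limits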

Example adaptation configurations

The following are sample adaptation configurations for common use cases.

Contacts

An example of adapting the recognizer to a list of contacts:

adaptation: {
  custom_classes {
    custom_class_id: "contacts"
    items: "Nuno Pereira"
    items: "Carl Jung"
  }
  phrase_sets {
    phrases: "message ${contacts}"
  }
  phrase_sets {
    phrases: "send email to ${contacts}"
  }
}

See the Improve recognition using predefined classes section for more information.