Embeddings APIs overview

This guide provides an overview of text and multimodal embeddings in Vertex AI, covering the following topics:

  • What are embeddings?
  • Text embeddings use cases
  • Multimodal embeddings use cases

What are embeddings?

Embeddings are numerical representations of items like text, images, or videos that capture the relationships between them. Machine learning models, particularly generative AI models, create embeddings by identifying patterns in large datasets. Your applications can use these embeddings to understand complex meanings and semantic relationships in your content. You interact with embeddings when you use Google Search or receive music recommendations.

Embeddings convert items like text, images, and videos into arrays of floating-point numbers called vectors. These vectors capture the semantic meaning of the original item. The number of floating-point numbers in the vector is its dimensionality. For example, a passage of text might be represented by a vector with hundreds of dimensions. By calculating the numerical distance between the vectors for two items, an application can determine how similar they are.
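
As a minimal illustration of this idea, the following sketch compares short, made-up vectors with cosine similarity using NumPy. Real embeddings typically have hundreds of dimensions, but the comparison works the same way; the vector values here are placeholders, not output from any model.

```python
# Minimal sketch: comparing embedding vectors with cosine similarity.
# The 4-dimensional vectors below are made-up placeholders.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return the cosine similarity between two vectors (closer to 1.0 = more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical embeddings for three short texts.
embedding_cat = np.array([0.8, 0.1, 0.3, 0.5])
embedding_kitten = np.array([0.7, 0.2, 0.4, 0.5])
embedding_car = np.array([0.1, 0.9, 0.6, 0.0])

print(cosine_similarity(embedding_cat, embedding_kitten))  # higher score: related concepts
print(cosine_similarity(embedding_cat, embedding_car))     # lower score: unrelated concepts
```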

Vertex AI supports two types of embeddings models: text and multimodal. The following table compares the two types.

Embedding type | Description | Use case
Text | Generates numerical representations for text data only. | Semantic search, text classification, clustering, and conversational interfaces.
Multimodal | Generates numerical representations for multiple types of data, such as text, images, and video, within a shared semantic space. | Image search, video content search, and cross-modal recommendations (for example, searching images with text).
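
To illustrate the difference, here is a minimal sketch that calls both model types through the Vertex AI SDK for Python. It assumes the google-cloud-aiplatform package is installed and that the model names shown (text-embedding-004 and multimodalembedding@001) are available in your project; check the model reference for current versions, and replace PROJECT_ID and the image path with your own values.

```python
# Hedged sketch: getting a text embedding and a multimodal image embedding.
# PROJECT_ID, the location, and the image path are placeholders to replace.
import vertexai
from vertexai.language_models import TextEmbeddingModel
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="PROJECT_ID", location="us-central1")

# Text embedding model: text input only.
text_model = TextEmbeddingModel.from_pretrained("text-embedding-004")
text_result = text_model.get_embeddings(["A black summer dress"])
print(len(text_result[0].values))  # dimensionality of the text vector

# Multimodal embedding model: image and text land in a shared semantic space.
mm_model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
mm_result = mm_model.get_embeddings(
    image=Image.load_from_file("dress.jpg"),
    contextual_text="A black summer dress",
)
print(len(mm_result.image_embedding), len(mm_result.text_embedding))
```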

Text embeddings use cases

Common use cases for text embeddings include the following:

  • Semantic search: Search text ranked by semantic similarity (see the sketch after this list).
  • Classification: Return the class of items whose text attributes are similar to the given text.
  • Clustering: Cluster items whose text attributes are similar to the given text.
  • Outlier detection: Return items where text attributes are least related to the given text.
  • Conversational interface: Cluster groups of sentences that can lead to similar responses, such as in a conversation-level embedding space.
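
The following sketch shows the semantic search pattern over a small, made-up corpus: embed the documents and the query with the text embedding model, then rank the documents by cosine similarity. The project ID and model name are assumptions; adjust them for your environment.

```python
# Hedged sketch: semantic search over a small, made-up corpus.
import numpy as np
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="PROJECT_ID", location="us-central1")
model = TextEmbeddingModel.from_pretrained("text-embedding-004")

documents = [
    "How to return an item you bought online",
    "Troubleshooting a router that keeps dropping its connection",
    "Recipes for a quick weeknight pasta dinner",
]
query = "my wifi keeps disconnecting"

# Embed the documents and the query in the same vector space.
doc_vectors = np.array([e.values for e in model.get_embeddings(documents)])
query_vector = np.array(model.get_embeddings([query])[0].values)

# Rank documents by cosine similarity to the query.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```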

Example use case: Develop a book recommendation chatbot

To develop a book recommendation chatbot, you can use a deep neural network (DNN) to convert each book into an embedding vector. You can input the book title, the text content, or both, along with other metadata like the genre.

The embeddings in this example could consist of thousands of book titles, summaries, and their genres. The vector representations for books like Wuthering Heights by Emily Brontë and Persuasion by Jane Austen would be close to each other in the embedding space, meaning there is a small numerical distance between them. In contrast, the vector for The Great Gatsby by F. Scott Fitzgerald would be further away, because its time period, genre, and summary are less similar.

The data you use as input determines the structure of the embedding space. For example, if you only use book titles as input, two books with similar titles but very different summaries might be close together. However, if you include both the title and the summary, these same books are represented as less similar and are further apart in the embedding space.
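
The following sketch illustrates this effect: it embeds each book twice, once using only the title and once using the title plus a short summary, and prints the pairwise similarity matrix for each version. The summaries are abbreviated stand-ins, and the project ID and model name are assumptions.

```python
# Hedged sketch: how the input you embed shapes the embedding space.
import numpy as np
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="PROJECT_ID", location="us-central1")
model = TextEmbeddingModel.from_pretrained("text-embedding-004")

# Abbreviated stand-in summaries, not full book metadata.
books = {
    "Wuthering Heights": "A brooding romance set on the Yorkshire moors.",
    "Persuasion": "A quiet second-chance romance in Regency England.",
    "The Great Gatsby": "Wealth and disillusionment in Jazz Age New York.",
}


def embed(texts):
    return np.array([e.values for e in model.get_embeddings(texts)])


def similarity_matrix(vectors):
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return normed @ normed.T


# Embedding titles only vs. titles plus summaries changes which books
# end up close together in the space.
title_vectors = embed(list(books))
title_summary_vectors = embed([f"{title}. {summary}" for title, summary in books.items()])

print(similarity_matrix(title_vectors))
print(similarity_matrix(title_summary_vectors))
```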

Using generative AI, the chatbot can then summarize and suggest books that you might like based on your query.

Multimodal embeddings use cases

Common use cases for multimodal embeddings include the following:

Image and text use cases

  • Image classification: Takes an image as input and predicts one or more classes (labels).
  • Image search: Search for relevant or similar images.
  • Recommendations: Generate product or ad recommendations based on images.

Image, text, and video use cases

  • Recommendations: Generate product or advertisement recommendations based on videos (similarity search).
  • Video content search: Search for content within videos. This can be done in the following ways:
    • Semantic search: Use text as input to return a set of ranked frames that match the query (see the sketch after this list).
    • Similarity search: Use an image or video as input to return a set of videos that match the query.
  • Video classification: Takes a video as input and predicts one or more classes.
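
As a sketch of semantic video search, the following example assumes that each video segment has already been embedded with the multimodal embedding model and stored in an index; the stored vectors below are random placeholders, and the video file names are hypothetical. It embeds the text query in the same shared space and returns the segments ranked by similarity.

```python
# Hedged sketch: semantic search over precomputed video-segment embeddings.
import numpy as np
import vertexai
from vertexai.vision_models import MultiModalEmbeddingModel

vertexai.init(project="PROJECT_ID", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

# Placeholder index: (video_id, start_sec, end_sec) -> precomputed segment embedding.
# Real segment embeddings would come from the multimodal model, not np.random.
segment_index = {
    ("intro.mp4", 0, 16): np.random.rand(1408),
    ("intro.mp4", 16, 32): np.random.rand(1408),
    ("demo.mp4", 0, 16): np.random.rand(1408),
}

# Embed the text query in the same shared space as the video segments.
query = "presenter unboxes the product"
query_vector = np.array(model.get_embeddings(contextual_text=query).text_embedding)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Return segments ranked by similarity to the query.
ranked = sorted(segment_index, key=lambda key: cosine(segment_index[key], query_vector), reverse=True)
for video_id, start, end in ranked:
    print(video_id, start, end)
```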

Example use case: Online retail experience

Online retailers increasingly use multimodal embeddings to enhance the customer experience. Every time you see personalized product recommendations or get visual results from a text search, you are interacting with an embedding.

To create a multimodal embedding for an online retail use case, you can start by processing each product image to generate a unique image embedding. This embedding is a mathematical representation of its visual style, color palette, and key details. Simultaneously, you can convert product descriptions, customer reviews, and other relevant text into text embeddings that capture their semantic meaning.

By merging these image and text embeddings into a unified search and recommendation engine, the store can offer personalized recommendations of visually similar items based on a customer's browsing history and preferences. This approach also lets customers search for products using natural language. The engine can then retrieve and display the most visually similar items that match their query. For example, if a customer searches for "black summer dress," the search engine can display dresses that are black, have summer-style cuts, are made of lighter material, and might be sleeveless.
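
A minimal sketch of this text-to-image flow is shown below. It assumes a small, hypothetical product catalog and the multimodal embedding model: product photos are embedded once offline, the shopper's query is embedded at request time, and the closest products are returned. The SKUs, image paths, and project ID are placeholders.

```python
# Hedged sketch: text-to-image product search in a shared embedding space.
import numpy as np
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

vertexai.init(project="PROJECT_ID", location="us-central1")
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

# Offline step: embed each product photo once and keep the vectors in an index.
product_images = {
    "sku-101": "images/black-linen-dress.jpg",
    "sku-102": "images/red-wool-coat.jpg",
    "sku-103": "images/white-cotton-shirt.jpg",
}
catalog = {
    sku: np.array(model.get_embeddings(image=Image.load_from_file(path)).image_embedding)
    for sku, path in product_images.items()
}

# Online step: embed the shopper's query and return the closest products.
query_vector = np.array(
    model.get_embeddings(contextual_text="black summer dress").text_embedding
)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


for sku, vector in sorted(catalog.items(), key=lambda item: cosine(item[1], query_vector), reverse=True):
    print(sku, round(cosine(vector, query_vector), 3))
```

In practice, the precomputed image vectors would live in a vector database or a service such as Vector Search rather than an in-memory dictionary, but the ranking logic is the same.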

This combination of visual and textual understanding creates a streamlined shopping experience that can enhance customer engagement and satisfaction.

What's next