Embeddings APIs overview

This guide provides an overview of text and multimodal embeddings in Vertex AI: what embeddings are, how the two supported model types compare, and common use cases for each.

What are embeddings?

Embeddings are numerical representations of items such as text, images, or videos that capture the relationships between them. Machine learning models, particularly generative AI models, create embeddings by identifying patterns in large datasets. Your applications can use these embeddings to understand complex meanings and semantic relationships in your content. You interact with embeddings every time you use Google Search or receive music recommendations.

Embeddings convert items such as text, images, and videos into arrays of floating-point numbers called vectors. These vectors capture the semantic meaning of the original item. The number of floating-point numbers in a vector is its dimensionality; for example, a passage of text might be represented by a vector with hundreds of dimensions. By calculating the numerical distance between the vectors for two items, an application can determine how similar the items are.
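As a minimal sketch of that last point, the following Python snippet compares two vectors with cosine similarity, one common measure of embedding distance. The three-dimensional vectors here are made up for illustration; real embeddings have hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 means very similar, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy three-dimensional "embeddings"; real vectors are much longer.
gothic_novel = [0.91, 0.12, 0.05]
romance_novel = [0.88, 0.18, 0.09]
jazz_age_novel = [0.20, 0.75, 0.40]

print(cosine_similarity(gothic_novel, romance_novel))   # high score: similar items
print(cosine_similarity(gothic_novel, jazz_age_novel))  # lower score: less similar
```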
Vertex AI supports two types of embeddings models, text and multimodal. The following table compares the two types.

| Embedding type | Description | Use cases |
|----------------|-------------|-----------|
| Text | Generates numerical representations for text data only. | Semantic search, text classification, clustering, and conversational interfaces. |
| Multimodal | Generates numerical representations for multiple types of data, such as text, images, and video, within a shared semantic space. | Image search, video content search, and cross-modal recommendations (for example, searching images with text). |
Text embeddings use cases

Common use cases for text embeddings include semantic search, text classification, clustering, and conversational interfaces such as chatbots.

Example use case: Develop a book recommendation chatbot

To develop a book recommendation chatbot, you can use a deep neural network (DNN) to convert each book into an embedding vector. As input, you can use the book title, the text content, or both, along with other metadata such as the genre. The embeddings in this example could consist of thousands of book titles, summaries, and their genres.

The vector representations of books like Wuthering Heights by Emily Brontë and Persuasion by Jane Austen would be close to each other in the embedding space, meaning there is a small numerical distance between them. In contrast, the vector for The Great Gatsby by F. Scott Fitzgerald would be further away, because its time period, genre, and summary are less similar.

The data that you use as input determines the structure of the embedding space. For example, if you use only book titles as input, two books with similar titles but very different summaries might be close together. However, if you include both the title and the summary, these same books are represented as less similar and sit further apart in the embedding space. Using generative AI, the chatbot can then summarize and suggest books that you might like based on your query.
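The following Python sketch shows one way to generate such book embeddings with the Vertex AI SDK and compare them by cosine similarity. The project ID, model version, and book summaries are placeholders, so treat this as an illustrative starting point rather than a complete recommendation pipeline:

```python
import numpy as np
import vertexai
from vertexai.language_models import TextEmbeddingModel

# Placeholder project and location; replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")

# Model name is an example; check the Vertex AI docs for current versions.
model = TextEmbeddingModel.from_pretrained("text-embedding-004")

books = [
    "Wuthering Heights by Emily Brontë: a gothic tale of passion on the moors.",
    "Persuasion by Jane Austen: a second chance at love in Regency England.",
    "The Great Gatsby by F. Scott Fitzgerald: wealth and longing in the Jazz Age.",
]

# Each returned embedding exposes its vector as a list of floats in `.values`.
embeddings = [np.array(e.values) for e in model.get_embeddings(books)]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two 19th-century romances should score closer to each other
# than either does to The Great Gatsby.
print(cosine_similarity(embeddings[0], embeddings[1]))
print(cosine_similarity(embeddings[0], embeddings[2]))
```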
Multimodal embeddings use cases

Common use cases for multimodal embeddings cover image and text, as well as image, text, and video. They include image search, video content search, and cross-modal recommendations, such as retrieving images with a text query.

Example use case: Online retail experience

Online retailers increasingly use multimodal embeddings to enhance the customer experience. Every time you see personalized product recommendations or get visual results from a text search, you are interacting with an embedding.

To create multimodal embeddings for an online retail use case, you can start by processing each product image to generate a unique image embedding: a mathematical representation of the image's visual style, color palette, and key details. Simultaneously, you can convert product descriptions, customer reviews, and other relevant text into text embeddings that capture their semantic meaning. By merging these image and text embeddings into a unified search and recommendation engine, the store can offer personalized recommendations of visually similar items based on a customer's browsing history and preferences.

This approach also lets customers search for products by using natural language, and the engine can retrieve and display the most visually similar items that match the query. For example, if a customer searches for "black summer dress," the engine can display dresses that are black, have summer-style cuts, are made of lighter material, and might be sleeveless. This combination of visual and textual understanding creates a streamlined shopping experience that can enhance customer engagement and satisfaction.
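As a rough sketch of this pattern with the Vertex AI SDK, the snippet below embeds a product photo at indexing time and a shopper's text query at search time; because both vectors live in the same semantic space, they can be compared directly. The file name, project ID, and model version are placeholders:

```python
import numpy as np
import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Placeholder project and location; replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")

# Model name is an example; check the Vertex AI docs for current versions.
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

# Indexing time: embed a product photo (placeholder file name).
product = model.get_embeddings(image=Image.load_from_file("dress-photo.png"))
product_vector = np.array(product.image_embedding)

# Search time: embed the shopper's text query into the same space.
query = model.get_embeddings(contextual_text="black summer dress")
query_vector = np.array(query.text_embedding)

# Rank products by cosine similarity between the query and image vectors.
score = float(
    product_vector @ query_vector
    / (np.linalg.norm(product_vector) * np.linalg.norm(query_vector))
)
print(f"similarity: {score:.3f}")
```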