Llama models on Vertex AI offer fully managed and serverless models as APIs. To use a Llama model on Vertex AI, send a request directly to the Vertex AI API endpoint. Because Llama models use a managed API, there's no need to provision or manage infrastructure.
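For example, here is a minimal Python sketch of such a request, assuming the OpenAI-compatible chat completions surface that Vertex AI exposes for managed Llama models. The project ID, region, and model ID below are placeholders to replace with your own values.

```python
import google.auth
import google.auth.transport.requests
import openai

# Placeholder values: substitute your own project and a supported region.
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"

# Authenticate with Application Default Credentials and use the access
# token as the API key for the OpenAI-compatible endpoint.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

client = openai.OpenAI(
    base_url=(
        f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/openapi"
    ),
    api_key=credentials.token,
)

response = client.chat.completions.create(
    model="meta/llama-3.1-405b-instruct-maas",  # model ID varies by model
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response.choices[0].message.content)
```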
You can stream your responses to reduce the end-user latency perception. A streamed response uses server-sent events (SSE) to incrementally stream the response.
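Continuing the sketch above, setting stream=True on the same call returns the response incrementally:

```python
# Reusing the client from the previous example; stream=True returns
# server-sent events that you can iterate over as chunks arrive.
stream = client.chat.completions.create(
    model="meta/llama-3.1-405b-instruct-maas",
    messages=[{"role": "user", "content": "Write a haiku about the ocean."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```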
Available Llama models
The following Llama models are available from Meta for use in Vertex AI. To access a Llama model, go to its Model Garden model card.
Models that are in Preview also have a self-deploy option. If you require a production-ready service, use the self-deployed Llama models.
Llama 4 Maverick 17B-128E
Llama 4 Maverick 17B-128E is the largest and most capable Llama 4 model, offering coding, reasoning, and image capabilities. It features a Mixture-of-Experts (MoE) architecture with 17 billion active parameters out of 400 billion total parameters and 128 experts. Llama 4 Maverick 17B-128E uses alternating dense and MoE layers, where each token activates a shared expert plus one of the 128 routed experts. The model is pretrained on 200 languages and optimized for high-quality chat interactions through a refined post-training pipeline.
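To make the routing concrete, the following is an illustrative PyTorch sketch of this shared-plus-routed pattern, not Llama 4's actual implementation: every token passes through a shared expert, and a top-1 router adds the output of exactly one routed expert.

```python
import torch
import torch.nn as nn

class SharedPlusRoutedMoE(nn.Module):
    """Illustrative MoE block: every token goes through a shared expert,
    and a top-1 router picks one of `num_experts` routed experts.
    Simplified sketch, not Llama 4's actual implementation."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):  # x: (tokens, d_model)
        out = self.shared(x)                     # shared expert sees every token
        scores = self.router(x).softmax(dim=-1)  # routing probabilities
        top = scores.argmax(dim=-1)              # top-1 routed expert per token
        for i, expert in enumerate(self.experts):
            mask = top == i
            if mask.any():
                # Add the routed expert's output, weighted by its router score.
                out[mask] = out[mask] + scores[mask, i, None] * expert(x[mask])
        return out
```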
Llama 4 Maverick 17B-128E is multimodal and is suited for advanced image captioning, analysis, precise image understanding, visual question answering, creative text generation, general-purpose AI assistants, and sophisticated chatbots requiring top-tier intelligence and image understanding.
Considerations
- You can include a maximum of three images per request (see the request sketch after this list).
- Unlike previous versions, the MaaS endpoint doesn't use Llama Guard. To use Llama Guard, deploy it from Model Garden and then send the prompts and responses to that endpoint. However, compared to Llama 4, Llama Guard has a more limited context window (128,000 tokens) and can only process requests with a single image at the beginning of the prompt.
- Batch predictions aren't supported.
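As a sketch of a multimodal request, reusing the client from the first example: the model ID and image URL below are illustrative placeholders, and image parts are assumed to use OpenAI-style image_url content blocks (check the model card for supported image sources).

```python
# Reusing `client` from the first example above.
response = client.chat.completions.create(
    model="meta/llama-4-maverick-17b-128e-instruct-maas",  # example model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},  # placeholder
            {"type": "text",
             "text": "Describe the trend shown in this chart."},
        ],
    }],
)
print(response.choices[0].message.content)
```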
Llama 4 Scout 17B-16E
Llama 4 Scout 17B-16E delivers state-of-the-art results for its size class, outperforming previous Llama generations and other open and proprietary models on several benchmarks. It features a MoE architecture with 17 billion active parameters out of 109 billion total parameters and 16 experts.
Llama 4 Scout 17B-16E is suited for retrieval tasks within long contexts and tasks that demand reasoning over large amounts of information, such as summarizing multiple large documents, analyzing extensive user interaction logs for personalization, and reasoning across large codebases.
Considerations
- You can include a maximum of three images per request.
- Unlike previous versions, the MaaS endpoint doesn't use Llama Guard. To use Llama Guard, deploy it from Model Garden and then send the prompts and responses to that endpoint. However, compared to Llama 4, Llama Guard has a more limited context window (128,000 tokens) and can only process requests with a single image at the beginning of the prompt.
- Batch predictions aren't supported.
Llama 3.3
Llama 3.3 is a text-only 70B instruction-tuned model that provides enhanced performance relative to Llama 3.1 70B and to Llama 3.2 90B when used for text-only applications.
Go to the Llama 3.3 70B model card
During the Preview period, you are charged as you use the model (pay as you go). For pay-as-you-go pricing, see Llama model pricing on the Vertex AI pricing page.
Llama 3.2
Llama 3.2 lets developers build and deploy the latest generative AI models and applications that use Llama's latest capabilities, such as image reasoning. Llama 3.2 is also designed to be more accessible for on-device applications.
Go to the Llama 3.2 90B model card
There are no charges during the Preview period. If you require a production-ready service, use the self-hosted Llama models.
Considerations
When using llama-3.2-90b-vision-instruct-maas, there are no restrictions when you send text-only prompts. However, if you include an image in your prompt, the image must be at the beginning of your prompt, and you can include only one image. You cannot, for example, include some text and then an image.
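For example, a valid multimodal message in the OpenAI-compatible format sketched earlier places the single image part before the text (the image URL is a placeholder):

```python
# Valid: exactly one image, placed before the text part.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder
        {"type": "text", "text": "What is in this image?"},
    ],
}]
```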
Llama 3.1
Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.
Llama 3.1 405B is Generally Available. You are charged as you use the model (pay as you go). For pay-as-you-go pricing, see Llama model pricing on the Vertex AI pricing page.
The other Llama 3.1 models are in Preview. There are no charges for the Preview models. If you require a production-ready service, use the self-hosted Llama models.
Go to the Llama 3.1 model card
What's next
Learn how to use Llama models.