Fully-managed Llama models

This page provides an overview of the fully-managed Llama models available on Vertex AI, and covers the following topics:

  • Available Llama models: A summary and comparison of the different Llama model families to help you choose the right one for your needs.
  • Llama 4 Maverick 17B-128E: Learn about the largest and most capable Llama 4 model for advanced multimodal tasks.
  • Llama 4 Scout 17B-16E: Discover the efficient Llama 4 model suited for retrieval and reasoning over long contexts.
  • Llama 3.3: An enhanced text-only model for high-performance text applications.
  • Llama 3.2: A model designed for image reasoning and on-device applications.
  • Llama 3.1: An auto-regressive language model family available for pay-as-you-go use.

Llama models on Vertex AI are available as fully managed, serverless APIs. To use a Llama model, you send a request directly to the Vertex AI API endpoint. With a managed API, you don't need to provision or manage infrastructure.
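
As a minimal sketch, such a request can be built as an OpenAI-style chat completions call. The region, project ID, model ID, and exact endpoint path below are placeholder assumptions; confirm the authoritative URL and available model IDs in the Vertex AI API reference before use.

```python
# Sketch: building a request to a managed Llama endpoint on Vertex AI.
# REGION, PROJECT_ID, and MODEL_ID are placeholders, and the endpoint
# path is an assumption -- verify both against the Vertex AI docs.

REGION = "us-central1"                            # placeholder
PROJECT_ID = "my-project"                         # placeholder
MODEL_ID = "meta/llama-3.1-405b-instruct-maas"    # placeholder model ID

def build_request(prompt: str) -> tuple[str, dict]:
    """Return the endpoint URL and JSON body for a chat completion call."""
    url = (
        f"https://{REGION}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi/"
        "chat/completions"
    )
    body = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, body

url, body = build_request("What is a mixture-of-experts model?")
# To actually send it, attach an OAuth access token, for example:
#   requests.post(url, headers={"Authorization": f"Bearer {token}"}, json=body)
```

Because the API is fully managed, this request is all that is needed; there is no endpoint to deploy or infrastructure to provision first.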

You can stream responses to improve the user experience. A streamed response uses server-sent events (SSE) to incrementally stream the response.
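
A client consumes such a stream by reading SSE `data:` lines and concatenating the text deltas. The sketch below assumes an OpenAI-style chunk schema (`choices[0].delta.content`, with `data: [DONE]` terminating the stream); verify the exact schema against the API reference.

```python
# Sketch: parsing server-sent events (SSE) from a streamed response.
# Each event is a "data: <json>" line; "data: [DONE]" ends the stream.
# The chunk schema here is an assumption modeled on OpenAI-style APIs.
import json

def parse_sse_lines(lines):
    """Yield the text deltas from an iterable of raw SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            yield delta

stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(parse_sse_lines(stream)))  # -> Hello
```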

Available Llama models

The following table provides a high-level comparison of the available Llama model families.

| Model Family | Description | Key Features | Primary Use Case |
| --- | --- | --- | --- |
| Llama 4 Maverick 17B-128E | The largest and most capable Llama 4 model. | Multimodal (text and image), Mixture-of-Experts (MoE) architecture. | Advanced image captioning, visual Q&A, creative text generation, sophisticated chatbots. |
| Llama 4 Scout 17B-16E | A highly efficient model that delivers state-of-the-art results for its size. | Mixture-of-Experts (MoE) architecture. | Retrieval tasks in long contexts, summarizing large documents, analyzing logs, reasoning across large codebases. |
| Llama 3.3 | A text-only 70B instruction-tuned model with enhanced performance. | Text-only, high performance for its class. | Text-only applications requiring better performance than previous Llama 3 generations. |
| Llama 3.2 | A model designed for image reasoning and accessibility for on-device applications. | Multimodal (text and image), optimized for on-device use. | Applications that require image reasoning. |
| Llama 3.1 | An auto-regressive language model family aligned with human preferences for helpfulness and safety. | Pay-as-you-go availability for the 405B model, tuned with SFT and RLHF. | General-purpose language tasks. |

The following Llama models from Meta are available on Vertex AI. To access a model, go to its model card in the Model Garden.

Models that are in Preview also have a self-deploy option. If you require a production-ready service, use the self-deployed Llama models.

Llama 4 Maverick 17B-128E

Llama 4 Maverick 17B-128E is the largest and most capable Llama 4 model, offering coding, reasoning, and image capabilities. It features a Mixture-of-Experts (MoE) architecture with 17 billion active parameters out of 400 billion total parameters and 128 experts. Llama 4 Maverick 17B-128E uses alternating dense and MoE layers, where each token activates a shared expert plus one of the 128 routed experts. The model is pretrained on 200 languages and optimized for high-quality chat interactions through a refined post-training pipeline.
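
The routing behavior described above can be illustrated with a toy sketch (not Meta's implementation): every token always passes through the shared expert, and a router selects exactly one of the 128 routed experts.

```python
# Illustrative sketch of top-1 MoE routing with a shared expert, as
# described above. This is a toy model of the routing decision only;
# the real router is a learned layer inside the network.
NUM_ROUTED_EXPERTS = 128

def route_token(router_scores):
    """Pick the routed expert with the highest score (top-1 routing)."""
    assert len(router_scores) == NUM_ROUTED_EXPERTS
    routed = max(range(NUM_ROUTED_EXPERTS), key=lambda i: router_scores[i])
    # The token's output combines the shared expert with one routed expert,
    # so only a fraction of the 400B total parameters is active per token.
    return {"shared": True, "routed_expert": routed}

scores = [0.0] * NUM_ROUTED_EXPERTS
scores[42] = 1.0
print(route_token(scores))  # routed_expert -> 42
```

This is why the model has 400 billion total parameters but only 17 billion active parameters per token: most experts sit idle for any given token.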

Llama 4 Maverick 17B-128E is multimodal and is suited for advanced image captioning, analysis, precise image understanding, visual question answering, creative text generation, general-purpose AI assistants, and sophisticated chatbots that require top-tier intelligence and image understanding.

Considerations

  • You can include a maximum of three images per request.
  • The managed API endpoint doesn't use Llama Guard. To use Llama Guard, deploy it from Model Garden and then send your prompts and responses to that endpoint. Llama Guard has a more limited context (128,000 tokens) than Llama 4 and can only process requests that have a single image at the beginning of the prompt.
  • Batch predictions aren't supported.
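
A client can enforce the three-image limit when assembling a request. The content-part shape below (`type`/`text`/`image_url`) is an assumed OpenAI-style schema; check the API reference for the exact format the managed endpoint accepts.

```python
# Sketch: building a multimodal user message while enforcing the
# three-image-per-request limit noted above. The part schema is an
# assumption (OpenAI-style); verify against the Vertex AI API reference.
MAX_IMAGES_PER_REQUEST = 3

def build_multimodal_message(text, image_urls):
    """Return a user message with up to three images plus a text part."""
    if len(image_urls) > MAX_IMAGES_PER_REQUEST:
        raise ValueError(
            f"At most {MAX_IMAGES_PER_REQUEST} images per request; "
            f"got {len(image_urls)}"
        )
    parts = [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    parts.append({"type": "text", "text": text})
    return {"role": "user", "content": parts}
```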

Go to the Llama 4 model card

Llama 4 Scout 17B-16E

Llama 4 Scout 17B-16E delivers state-of-the-art results for its size class, outperforming previous Llama generations and other open and proprietary models on several benchmarks. It features a Mixture-of-Experts (MoE) architecture with 17 billion active parameters out of 109 billion total parameters and 16 experts.

Llama 4 Scout 17B-16E is suited for retrieval tasks within long contexts and tasks that demand reasoning over large amounts of information, such as summarizing multiple large documents, analyzing extensive user interaction logs for personalization, and reasoning across large codebases.

Considerations

  • You can include a maximum of three images per request.
  • The managed API endpoint doesn't use Llama Guard. To use Llama Guard, deploy it from Model Garden and then send your prompts and responses to that endpoint. Llama Guard has a more limited context (128,000 tokens) than Llama 4 and can only process requests that have a single image at the beginning of the prompt.
  • Batch predictions aren't supported.

Go to the Llama 4 model card

Llama 3.3

Llama 3.3 is a 70-billion parameter, text-only model that is instruction-tuned. For text-only applications, it provides enhanced performance compared to Llama 3.1 70B and Llama 3.2 90B.

Go to the Llama 3.3 70B model card

During the Preview period, you are charged as you use the model (pay as you go). For pay-as-you-go pricing, see Llama model pricing on the Vertex AI pricing page.

Llama 3.2

With Llama 3.2, you can build and deploy generative AI applications that use capabilities such as image reasoning. Llama 3.2 is also designed to be more accessible for on-device applications.

Go to the Llama 3.2 90B model card

There are no charges during the Preview period. If you require a production-ready service, use the self-deployed Llama models.

Considerations

When you use llama-3.2-90b-vision-instruct-maas, keep the following in mind:

  • There are no restrictions for text-only prompts.
  • If your prompt includes an image, it must be the first item in the prompt. You can only include one image.
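
These two constraints can be checked client-side before sending a request. The content-part schema below is an assumed OpenAI-style format; verify it against the API reference.

```python
# Sketch: validating llama-3.2-90b-vision-instruct-maas prompt content
# against the constraints above -- at most one image, and if an image is
# present it must be the first item. The part schema is an assumption.
def validate_llama32_content(parts):
    """Raise ValueError if the content violates the model's image rules."""
    image_positions = [
        i for i, p in enumerate(parts) if p.get("type") == "image_url"
    ]
    if len(image_positions) > 1:
        raise ValueError("Only one image is allowed per prompt.")
    if image_positions and image_positions[0] != 0:
        raise ValueError("The image must be the first item in the prompt.")
    return parts
```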

Llama 3.1

Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Llama 3.1 405B is Generally Available. You are charged as you use the model (pay as you go). For pay-as-you-go pricing, see Llama model pricing on the Vertex AI pricing page.

The other Llama 3.1 models are in Preview. There are no charges for the Preview models. If you require a production-ready service, use the self-deployed Llama models.

Go to the Llama 3.1 model card

What's next