AI21 Labs models

This page shows you how to get started with AI21 Labs models in Vertex AI. It covers the following:

  • Available models: Learn about the features and use cases of the available AI21 Labs models.
  • Use the models: Find out how to send streaming and non-streaming requests to the models.
  • Model quotas and regions: View the regions where the models are available and the associated quotas.

The AI21 Labs models on Vertex AI are fully managed and serverless, which means you don't need to provision or manage any infrastructure. To use an AI21 Labs model, send a request directly to the Vertex AI API endpoint.

You can stream responses to reduce end-user latency. A streamed response uses server-sent events (SSE) to incrementally return the output.

You pay for AI21 Labs models as you use them (pay as you go). For pay-as-you-go pricing, see AI21 Labs model pricing on the Vertex AI pricing page.

Available models

The following AI21 Labs models are available in Vertex AI. To access a model, go to its Model Garden model card.

  • Jamba 1.5 Mini: A smaller, efficient model built on a hybrid architecture that balances quality, throughput, and low cost. It features a 256,000-token context window. Ideal for data-heavy tasks like document summarization and Q&A where cost-effectiveness is a key consideration.
  • Jamba 1.5 Large: A larger, more powerful model with 94B active parameters, designed for high quality and high throughput. It also has a 256,000-token context window. Best for enterprise workflows requiring the highest accuracy and thoroughness for tasks like in-depth analysis and complex Q&A.

Jamba 1.5 Mini

AI21 Labs's Jamba 1.5 Mini is a small foundation model built on a hybrid architecture that combines the Mamba and Transformer architectures to achieve high quality at a competitive price.

With its SSM-Transformer hybrid architecture and a 256,000-token context window, Jamba 1.5 Mini efficiently handles a variety of text generation and comprehension use cases.

Jamba 1.5 Mini is ideal for data-heavy enterprise workflows that require a model to ingest a large amount of information to produce an accurate and thorough response, such as summarizing lengthy documents or enabling question answering across an extensive organizational knowledge base. Jamba 1.5 Mini offers a balance of quality, throughput, and low cost.

Go to the Jamba 1.5 Mini model card

Jamba 1.5 Large

AI21 Labs's Jamba 1.5 Large is a foundation model built on a hybrid architecture that combines the Mamba and Transformer architectures to achieve high quality at a competitive price.

With its SSM-Transformer hybrid architecture and a 256,000-token context window, Jamba 1.5 Large efficiently handles a variety of text generation and comprehension use cases. With 94B active parameters and 398B total parameters, Jamba 1.5 Large is designed for highly accurate responses.

Jamba 1.5 Large is ideal for data-heavy enterprise workflows that require a model to ingest a large amount of information to produce an accurate and thorough response, such as summarizing lengthy documents or enabling question answering across an extensive organizational knowledge base. Jamba 1.5 Large is designed for high-quality responses, high throughput, and competitive pricing.

Go to the Jamba 1.5 Large model card

Use the models

You can use curl commands to send requests to the Vertex AI endpoint using the following model names:

  • For Jamba 1.5 Mini, use jamba-1.5-mini
  • For Jamba 1.5 Large, use jamba-1.5-large

We recommend that you use the model versions that include a suffix starting with an @ symbol, because model versions can differ in behavior. If you don't specify a model version, the latest version is always used, which can inadvertently affect your workflows when the model version changes.
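
For example, a pinned version is specified as part of the model name in the request URL. The @001 suffix in the following sketch is hypothetical; check the model's Model Garden card for the version identifiers that actually exist.

# Hypothetical example: pin a model version with an @ suffix in the URL.
# "@001" is illustrative only -- look up real version identifiers on the
# model's Model Garden card. As noted later on this page, exclude the
# version suffix from the "model" field in the request body.
MODEL="jamba-1.5-mini@001"

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/ai21/models/${MODEL}:rawPredict"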

Before you begin

To use AI21 Labs models with Vertex AI, you must perform the following steps. The Vertex AI API (aiplatform.googleapis.com) must be enabled. If you already have a project with the Vertex AI API enabled, you can use that project instead of creating a new one.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. Enable the Vertex AI API.

    Enable the API

  5. Go to the Model Garden model card for the model you want to use, and then click Enable.
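
If you prefer the command line, you can complete the project and API steps with the gcloud CLI, as in the following sketch (the billing and Model Garden steps still use the console):

# Sketch: perform the console setup steps with the gcloud CLI instead.
# Replace PROJECT_ID with your project ID.
gcloud auth login
gcloud config set project PROJECT_ID

# Enable the Vertex AI API for the selected project.
gcloud services enable aiplatform.googleapis.com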

Send a streaming request

The following sample shows how to send a streaming request to an AI21 Labs model.

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Your Google Cloud project ID.
  • LOCATION: A region that supports AI21 Labs models.
  • MODEL: The model name you want to use. In the request body, exclude the @ model version number.
  • ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response (see the sketch after this list).
  • STREAM: A boolean that specifies whether the response is streamed. Streaming reduces perceived end-user latency. Set to true to stream the response and false to return the response all at once; this sample sets it to true.
  • CONTENT: The content, such as text, of the user or assistant message.
  • MAX_TOKENS: The maximum number of tokens that can be generated in the response. A token is approximately 3.5 characters; 100 tokens correspond to roughly 60-80 words.

    Specify a lower value for shorter responses and a higher value for potentially longer responses.
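
As noted for ROLE, ending the messages array with an assistant message constrains how the response begins. The following body fragment is an illustrative sketch of that technique:

{
  "messages": [
    {
      "role": "user",
      "content": "List the three main risks discussed in the report."
    },
    {
      "role": "assistant",
      "content": "The three main risks are:"
    }
  ]
}

Because the final message uses the assistant role, the model's response continues directly from "The three main risks are:".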

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/ai21/models/MODEL:streamRawPredict

Request JSON body:

{
  "model": "MODEL",
  "messages": [
   {
    "role": "ROLE",
    "content": "CONTENT"
   }],
  "max_tokens": MAX_TOKENS,
  "stream": true
}
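
For example, a filled-in request.json for Jamba 1.5 Mini might look like the following (the prompt and token limit are illustrative):

{
  "model": "jamba-1.5-mini",
  "messages": [
    {
      "role": "user",
      "content": "Summarize the key trade-offs between SSDs and hard drives in five bullet points."
    }
  ],
  "max_tokens": 512,
  "stream": true
}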

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/ai21/models/MODEL:streamRawPredict"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/ai21/models/MODEL:streamRawPredict" | Select-Object -Expand Content

You should receive a streamed JSON response, delivered incrementally as server-sent events.
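
To consume the stream programmatically, you can filter the server-sent events emitted by the curl example above. The following sketch assumes a chat-completions-style event payload whose incremental text is at .choices[0].delta.content; that field path is an assumption, so inspect the events you actually receive before relying on it. It also assumes GNU sed and jq are available.

# Sketch: print the streamed text as it arrives. The jq path is an
# assumption -- verify it against your actual events. Parse errors from
# any non-JSON sentinel events are discarded.
curl -sN -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/ai21/models/MODEL:streamRawPredict" \
| sed -un 's/^data: //p' \
| jq -j '.choices[0].delta.content // empty' 2>/dev/null
echo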

Send a non-streaming request

The following sample shows how to send a non-streaming request to an AI21 Labs model.

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Your Google Cloud project ID.
  • LOCATION: A region that supports AI21 Labs models.
  • MODEL: The model name you want to use. In the request body, exclude the @ model version number.
  • ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
  • STREAM: A boolean that specifies whether the response is streamed. Streaming reduces perceived end-user latency. Set to true to stream the response and false to return the response all at once; this sample sets it to false.
  • CONTENT: The content, such as text, of the user or assistant message.
  • MAX_TOKENS: The maximum number of tokens that can be generated in the response. A token is approximately 3.5 characters; 100 tokens correspond to roughly 60-80 words.

    Specify a lower value for shorter responses and a higher value for potentially longer responses.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/ai21/models/MODEL:rawPredict

Request JSON body:

{
  "model": "MODEL",
  "messages": [
   {
    "role": "ROLE",
    "content": "CONTENT"
   }],
  "max_tokens": MAX_TOKENS,
  "stream": false
}

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/ai21/models/MODEL:rawPredict"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/ai21/models/MODEL:rawPredict" | Select-Object -Expand Content

You should receive a single JSON response after the model finishes generating the full output.
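
To extract only the generated text from the response, you can pipe the curl output through jq. The .choices[0].message.content path below is an assumption based on the chat-style request format; confirm it against the response you actually receive.

# Sketch: extract the generated text from the non-streaming response.
# The jq path is an assumption -- inspect your response to confirm it.
curl -s -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/ai21/models/MODEL:rawPredict" \
| jq -r '.choices[0].message.content'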

Model quotas and regions

A quota applies for each region where an AI21 Labs model is available. The quota is specified in queries per minute (QPM) and tokens per minute (TPM). TPM includes both input and output tokens.

Model            Region        Quotas              Context length (tokens)
Jamba 1.5 Mini   us-central1   50 QPM, 60,000 TPM  256,000
                 europe-west4  50 QPM, 60,000 TPM  256,000
Jamba 1.5 Large  us-central1   20 QPM, 20,000 TPM  256,000
                 europe-west4  20 QPM, 20,000 TPM  256,000
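
Because TPM counts input and output tokens together, a single large request can consume most of a region's per-minute budget. For example, one Jamba 1.5 Mini request with a 50,000-token prompt and a 2,000-token response uses 52,000 of the 60,000 TPM available in that region for that minute.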

If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.