Mistral AI models on Vertex AI are offered as fully managed, serverless APIs. To use a Mistral AI model on Vertex AI, send a request directly to the Vertex AI API endpoint. Because Mistral AI models use a managed API, there's no need to provision or manage infrastructure.

You can stream your responses to reduce the end-user latency perception. A streamed response uses server-sent events (SSE) to incrementally stream the response.

You pay for Mistral AI models as you use them (pay as you go). For pay-as-you-go pricing, see Mistral AI model pricing on the Vertex AI pricing page.

Available Mistral AI models

The following models are available from Mistral AI to use in Vertex AI. To access a Mistral AI model, go to its Model Garden model card.

- Mistral OCR (25.05)
- Mistral Small 3.1 (25.03)
- Mistral Large (24.11)
- Codestral (25.01)

Mistral OCR (25.05)

Mistral OCR (25.05) is an Optical Character Recognition API for document understanding. It excels at understanding complex document elements, including interleaved imagery, mathematical expressions, tables, and advanced layouts such as LaTeX formatting. The model enables deeper understanding of rich documents such as scientific papers with charts, graphs, equations, and figures.

Mistral OCR (25.05) is an ideal model to use in combination with a RAG system that takes multimodal documents (such as slides or complex PDFs) as input. You can couple Mistral OCR (25.05) with other Mistral models to reformat the results. This combination ensures that the extracted content is not only accurate but also presented in a structured and coherent manner, making it suitable for various downstream applications and analyses.

Go to the Mistral OCR (25.05) model card

Mistral Small 3.1 (25.03)

Mistral Small 3.1 (25.03) features multimodal capabilities and a context length of up to 128,000 tokens. The model can process and understand visual inputs and long documents, further expanding its range of applications compared to the previous Mistral AI Small model. Mistral Small 3.1 (25.03) is a versatile model designed for tasks such as programming, mathematical reasoning, document understanding, and dialogue, and it is built for low-latency applications, delivering best-in-class efficiency compared to models of the same quality.

Mistral Small 3.1 (25.03) has undergone a full post-training process to align the model with human preferences and needs, making it usable out of the box for applications that require chat or precise instruction following.

Go to the Mistral Small 3.1 (25.03) model card

Mistral Large (24.11)

Mistral Large (24.11) is the latest version of Mistral AI's Large model, now with improved reasoning and function calling capabilities.

Go to the Mistral Large (24.11) model card

Codestral (25.01)

Codestral (25.01) is designed for code generation tasks. It helps developers write and interact with code through a shared instruction and completion API endpoint. Because it masters code and can converse in a variety of languages, you can use Codestral (25.01) to design advanced AI applications for software developers.

Go to the Codestral (25.01) model card

Use Mistral AI models

You can use curl commands to send requests to the Vertex AI endpoint using the following model names:

- mistral-ocr-2505
- mistral-small-2503
- mistral-large-2411
- mistral-nemo
- codestral-2501

For more information about using the Mistral AI SDK, see the Mistral AI Vertex AI documentation.

Before you begin

To use Mistral AI models with Vertex AI, you must perform the following steps. The Vertex AI API (aiplatform.googleapis.com) must be enabled to use Vertex AI. If you already have an existing project with the Vertex AI API enabled, you can use that project instead of creating a new project.

1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
2. Verify that billing is enabled for your Google Cloud project.
3. Enable the Vertex AI API.
Make a streaming call to a Mistral AI model

The following sample makes a streaming call to a Mistral AI model.

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

- LOCATION: A region that supports Mistral AI models.
- PROJECT_ID: Your Google Cloud project ID.
- MODEL: The model name you want to use. In the request body, exclude the @ model version number.
- ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
- STREAM: A boolean that specifies whether the response is streamed or not. Stream your response to reduce the end-user latency perception. Set to true to stream the response and false to return the response all at once.
- CONTENT: The content, such as text, of the user or assistant message.
- MAX_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately 3.5 characters. 100 tokens correspond to roughly 60-80 words. Specify a lower value for shorter responses and a higher value for potentially longer responses.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:streamRawPredict

Request JSON body:
{
"model": MODEL,
"messages": [
{
"role": "ROLE",
"content": "CONTENT"
}],
"max_tokens": MAX_TOKENS,
"stream": true
}
To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
  "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:streamRawPredict"

PowerShell

Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:streamRawPredict" | Select-Object -Expand Content
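Each line of a streamRawPredict response is a server-sent event. As an illustration (not an official client), the following Python sketch assembles the streamed text from the SSE `data:` lines; the chunk layout assumed here (`choices[0].delta.content`, terminated by a `[DONE]` sentinel) follows Mistral's OpenAI-style streaming format, so verify it against your actual responses:

```python
import json

def assemble_sse_text(sse_lines):
    """Collect the streamed text from server-sent event lines.

    Expects each event as a 'data: <json>' line, with a final
    'data: [DONE]' sentinel (OpenAI-style chat-chunk layout --
    an assumption; check it against real streamRawPredict output).
    """
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))
    return "".join(parts)

# Example with canned events:
events = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print(assemble_sse_text(events))  # Hello, world
```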
Make a unary call to a Mistral AI model
The following sample makes a unary call to a Mistral AI model.
REST
After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.
Before using any of the request data, make the following replacements:
- LOCATION: A region that supports Mistral AI models.
- PROJECT_ID: Your Google Cloud project ID.
- MODEL: The model name you want to use. In the request body, exclude the @ model version number.
- ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. The models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
- STREAM: A boolean that specifies whether the response is streamed or not. Stream your response to reduce the end-user latency perception. Set to true to stream the response and false to return the response all at once.
- CONTENT: The content, such as text, of the user or assistant message.
- MAX_TOKENS: Maximum number of tokens that can be generated in the response. A token is approximately 3.5 characters. 100 tokens correspond to roughly 60-80 words. Specify a lower value for shorter responses and a higher value for potentially longer responses.
HTTP method and URL:
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:rawPredict
Request JSON body:
{
  "model": MODEL,
  "messages": [
    {
      "role": "ROLE",
      "content": "CONTENT"
    }],
  "max_tokens": MAX_TOKENS,
  "stream": false
}
To send your request, choose one of these options:
curl
Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:rawPredict"
PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/mistralai/models/MODEL:rawPredict" | Select-Object -Expand Content
You should receive a JSON response body.
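The same rawPredict request can also be assembled from Python. This is a minimal sketch, not an official SDK sample: it builds the endpoint URL and request body shown above; actually sending the request (commented out) assumes the gcloud CLI and the third-party requests package are available:

```python
import json

def build_endpoint(location, project_id, model, method="rawPredict"):
    """Build the publisher-model endpoint URL used in the curl samples."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/"
        f"publishers/mistralai/models/{model}:{method}"
    )

def build_body(model, content, role="user", max_tokens=200, stream=False):
    """Build the request JSON body shown above (unary: stream=False)."""
    return {
        "model": model,
        "messages": [{"role": role, "content": content}],
        "max_tokens": max_tokens,
        "stream": stream,
    }

url = build_endpoint("us-central1", "my-project", "mistral-large-2411")
body = build_body("mistral-large-2411", "Say hello.")
print(url)
print(json.dumps(body, indent=2))

# To actually send the request (requires gcloud credentials on PATH
# and the requests package):
#   import subprocess, requests
#   token = subprocess.check_output(
#       ["gcloud", "auth", "print-access-token"], text=True).strip()
#   resp = requests.post(url, json=body,
#                        headers={"Authorization": f"Bearer {token}"})
#   print(resp.json())
```

The project ID and model values here are placeholders; substitute your own.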
Mistral AI model region availability and quotas
For Mistral AI models, a quota applies for each region where the model is available. The quota is specified in queries per minute (QPM) and tokens per minute (TPM). TPM includes both input and output tokens.
Model | Region | Quotas | Context length
---|---|---|---
Mistral OCR (25.05) | us-central1 | | 30 pages
Mistral OCR (25.05) | europe-west4 | | 30 pages
Mistral Small 3.1 (25.03) | us-central1 | | 128,000
Mistral Small 3.1 (25.03) | europe-west4 | | 128,000
Mistral Large (24.11) | us-central1 | | 128,000
Mistral Large (24.11) | europe-west4 | | 128,000
Mistral Nemo | us-central1 | | 128,000
Mistral Nemo | europe-west4 | | 128,000
Codestral (25.01) | us-central1 | | 32,000
Codestral (25.01) | europe-west4 | | 32,000
If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.
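The token heuristics quoted earlier (a token is approximately 3.5 characters, and TPM counts both input and output tokens) give a quick way to sanity-check a planned workload against a regional quota. The QPM and TPM limits in this sketch are hypothetical placeholders, not published quota values:

```python
def estimate_tokens(text):
    """Rough token estimate using the ~3.5 characters-per-token heuristic."""
    return max(1, round(len(text) / 3.5))

def fits_quota(requests_per_minute, input_text, max_output_tokens,
               qpm_limit, tpm_limit):
    """Check a planned workload against QPM and TPM limits.

    TPM counts both input and output tokens, so each request is
    budgeted at its input estimate plus its max_tokens ceiling.
    """
    tokens_per_request = estimate_tokens(input_text) + max_output_tokens
    tpm_needed = requests_per_minute * tokens_per_request
    return requests_per_minute <= qpm_limit and tpm_needed <= tpm_limit

# Hypothetical limits for illustration only -- check your project's
# actual quotas in the Google Cloud console.
print(fits_quota(30, "Summarize this paragraph." * 10, 500,
                 qpm_limit=60, tpm_limit=100_000))  # True
```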