Visual question and answering (VQA)
Imagen for Captioning & VQA (imagetext) is the name of the model that supports image question
answering. Imagen for Captioning & VQA answers a question provided for a given image, even
if the model hasn't seen the image before.
To explore this model in the console, see the Imagen for Captioning & VQA model card in
the Model Garden.

Use cases

Some common use cases for image question and answering include:

Empower users to engage with visual content with Q&A.
Enable customers to engage with product images shown on retail apps and websites.
Provide accessibility options for visually impaired users.
{"instances":[{"prompt":string,"image":{// Union field can be only one of the following:"bytesBase64Encoded":string,"gcsUri":string,// End of list of possible types for union field."mimeType":string}}],"parameters":{"sampleCount":integer,"seed":integer}}
Sample request

Before using any of the request data, make the following replacements:

PROJECT_ID: Your Google Cloud project ID.
LOCATION: Your project's region. For example, us-central1, europe-west2, or asia-northeast3. For a list of available regions, see Generative AI on Vertex AI locations.
VQA_PROMPT: The question you want to get answered about your image.
What color is this shoe?
What type of sleeves are on the shirt?
B64_IMAGE: The image you want to ask questions about. The image must be
specified as a base64-encoded byte string (see the encoding sketch after this list). Size
limit: 10 MB.
RESPONSE_COUNT: The number of answers you want to generate. Accepted integer
values: 1-3.
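As noted in the B64_IMAGE item above, the image must be supplied as a base64-encoded byte string of at most 10 MB. The following is a minimal sketch of producing that value with the Python standard library; the function name and the local file path are assumptions, not part of the API.

```python
import base64
from pathlib import Path

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # documented 10 MB size limit


def encode_image(path: str) -> str:
    """Read a local image and return it as a base64-encoded string (B64_IMAGE)."""
    data = Path(path).read_bytes()
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"Image is {len(data)} bytes, which exceeds the 10 MB limit.")
    return base64.b64encode(data).decode("utf-8")


# Hypothetical usage:
# b64_image = encode_image("shoe.png")
```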
HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict

Request JSON body:

{
  "instances": [
    {
      "prompt": "VQA_PROMPT",
      "image": {
        "bytesBase64Encoded": "B64_IMAGE"
      }
    }
  ],
  "parameters": {
    "sampleCount": RESPONSE_COUNT
  }
}

To send the request, save the request body in a file named request.json and execute the following curl command:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict"
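Besides the curl command above, the same call can be made from Python with the google-auth and requests packages. The sketch below is an alternative, not the documented method: the variable names and placeholder values are assumptions, while the URL shape and request body mirror the documentation above.

```python
import google.auth
import requests
from google.auth.transport.requests import Request

PROJECT_ID = "your-project-id"  # placeholder: your Google Cloud project ID
LOCATION = "us-central1"        # placeholder: a supported region

# Obtain an access token from Application Default Credentials
# (roughly equivalent to `gcloud auth print-access-token`).
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(Request())

url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
    f"/locations/{LOCATION}/publishers/google/models/imagetext:predict"
)

request_body = {
    "instances": [
        {
            "prompt": "What is this?",
            "image": {"bytesBase64Encoded": "B64_IMAGE"},  # placeholder byte string
        }
    ],
    "parameters": {"sampleCount": 2},
}

response = requests.post(
    url,
    headers={
        "Authorization": f"Bearer {credentials.token}",
        "Content-Type": "application/json; charset=utf-8",
    },
    json=request_body,
)
response.raise_for_status()
print(response.json()["predictions"])
```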
The following sample response is for a request with
"sampleCount": 2 and "prompt": "What is this?". The response returns
two prediction string answers.
{
"predictions": [
"cappuccino",
"coffee"
]
}
Response body
{"predictions":[string]}
Response elements:

predictions: A list of text strings representing the VQA answers, sorted by confidence.
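Because predictions is just a list of strings sorted by confidence, reading the answers out of a decoded response is simple. The helper below is illustrative; its name is an assumption, and it ignores any additional fields (such as deployedModelId in the sample response below).

```python
def extract_answers(response_json: dict) -> list[str]:
    """Return the VQA answers, highest-confidence first."""
    return list(response_json.get("predictions", []))


# With the sample response shown below:
# extract_answers({"predictions": ["cappuccino", "coffee"]}) == ["cappuccino", "coffee"]
```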
Sample response
The following sample response is for a request with "sampleCount": 2 and
"prompt": "What is this?". The response returns two prediction string answers.
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-29 UTC."],[],[],null,["# Visual question and answering (VQA)\n\nImagen for Captioning \\& VQA (`imagetext`) is the name of the model that supports image question and\nanswering. Imagen for Captioning \\& VQA answers a question provided for a given image, even\nif it hasn't been seen before by the model.\n\nTo explore this model in the console, see the Imagen for Captioning \\& VQA model card in\nthe Model Garden.\n\n\n[View Imagen for Captioning \\& VQA model card](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/imagetext)\n\nUse cases\n---------\n\nSome common use cases for image question and answering include:\n\n- Empower users to engage with visual content with Q\\&A.\n- Enable customers to engage with product images shown on retail apps and websites.\n- Provide accessibility options for visually impaired users.\n\nHTTP request\n------------\n\n POST https://us-central1-aiplatform.googleapis.com/v1/projects/\u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e/locations/us-central1/publishers/google/models/imagetext:predict\n\nRequest body\n------------\n\n {\n \"instances\": [\n {\n \"prompt\": string,\n \"image\": {\n // Union field can be only one of the following:\n \"bytesBase64Encoded\": string,\n \"gcsUri\": string,\n // End of list of possible types for union field.\n \"mimeType\": string\n }\n }\n ],\n \"parameters\": {\n \"sampleCount\": integer,\n \"seed\": integer\n }\n }\n\nUse the following parameters for the visual Q\\&A generation model `imagetext`.\nFor more information, see [Use Visual Question Answering (VQA)](/vertex-ai/generative-ai/docs/image/visual-question-answering).\n\nSample request\n--------------\n\n\nBefore using any of the request data,\nmake the following replacements:\n\n- \u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e: Your Google Cloud [project ID](/resource-manager/docs/creating-managing-projects#identifiers).\n- \u003cvar translate=\"no\"\u003eLOCATION\u003c/var\u003e: Your project's region. For example, `us-central1`, `europe-west2`, or `asia-northeast3`. For a list of available regions, see [Generative AI on Vertex AI locations](/vertex-ai/generative-ai/docs/learn/locations-genai).\n- \u003cvar translate=\"no\"\u003eVQA_PROMPT\u003c/var\u003e: The question you want to get answered about your image.\n - *What color is this shoe?*\n - *What type of sleeves are on the shirt?*\n- \u003cvar translate=\"no\"\u003eB64_IMAGE\u003c/var\u003e: The image to get captions for. The image must be specified as a [base64-encoded](/vertex-ai/generative-ai/docs/image/base64-encode) byte string. Size limit: 10 MB.\n- \u003cvar translate=\"no\"\u003eRESPONSE_COUNT\u003c/var\u003e: The number of answers you want to generate. 
Accepted integer values: 1-3.\n\n\nHTTP method and URL:\n\n```\nPOST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict\n```\n\n\nRequest JSON body:\n\n```\n{\n \"instances\": [\n {\n \"prompt\": \"VQA_PROMPT\",\n \"image\": {\n \"bytesBase64Encoded\": \"B64_IMAGE\"\n }\n }\n ],\n \"parameters\": {\n \"sampleCount\": RESPONSE_COUNT\n }\n}\n```\n\nTo send your request, choose one of these options: \n\n#### curl\n\n| **Note:** The following command assumes that you have logged in to the `gcloud` CLI with your user account by running [`gcloud init`](/sdk/gcloud/reference/init) or [`gcloud auth login`](/sdk/gcloud/reference/auth/login) , or by using [Cloud Shell](/shell/docs), which automatically logs you into the `gcloud` CLI . You can check the currently active account by running [`gcloud auth list`](/sdk/gcloud/reference/auth/list).\n\n\nSave the request body in a file named `request.json`,\nand execute the following command:\n\n```\ncurl -X POST \\\n -H \"Authorization: Bearer $(gcloud auth print-access-token)\" \\\n -H \"Content-Type: application/json; charset=utf-8\" \\\n -d @request.json \\\n \"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict\"\n```\n\n#### PowerShell\n\n| **Note:** The following command assumes that you have logged in to the `gcloud` CLI with your user account by running [`gcloud init`](/sdk/gcloud/reference/init) or [`gcloud auth login`](/sdk/gcloud/reference/auth/login) . You can check the currently active account by running [`gcloud auth list`](/sdk/gcloud/reference/auth/list).\n\n\nSave the request body in a file named `request.json`,\nand execute the following command:\n\n```\n$cred = gcloud auth print-access-token\n$headers = @{ \"Authorization\" = \"Bearer $cred\" }\n\nInvoke-WebRequest `\n -Method POST `\n -Headers $headers `\n -ContentType: \"application/json; charset=utf-8\" `\n -InFile request.json `\n -Uri \"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/imagetext:predict\" | Select-Object -Expand Content\n```\nThe following sample responses are for a request with `\"sampleCount\": 2` and `\"prompt\": \"What is this?\"`. The response returns two prediction string answers.\n\n```\n{\n \"predictions\": [\n \"cappuccino\",\n \"coffee\"\n ]\n}\n```\n\n\u003cbr /\u003e\n\nResponse body\n-------------\n\n\n {\n \"predictions\": [\n string\n ]\n }\n\nSample response\n---------------\n\nThe following sample responses is for a request with `\"sampleCount\": 2` and\n`\"prompt\": \"What is this?\"`. The response returns two prediction string answers. \n\n {\n \"predictions\": [\n \"cappuccino\",\n \"coffee\"\n ],\n \"deployedModelId\": \"DEPLOYED_MODEL_ID\",\n \"model\": \"projects/PROJECT_ID/locations/us-central1/models/MODEL_ID\",\n \"modelDisplayName\": \"MODEL_DISPLAYNAME\",\n \"modelVersionId\": \"1\"\n }"]]