Starting April 29, 2025, Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have no prior usage of these models, including new projects. For details, see Model versions and lifecycle.
When you build interactive applications, response time, also known as latency,
plays a crucial role in the user experience. This section explores the concept
of latency in the context of Vertex AI LLM APIs and provides
actionable strategies to minimize it and improve the response time of
your AI-powered applications.
Understanding latency metrics for LLMs
Latency refers to the time it takes for a model to process your input
prompt and generate a corresponding output response.
When examining latency with a model, consider the following:
Time to first token (TTFT) is the time it takes for the model to produce
the first token of the response after receiving the prompt. TTFT is particularly
relevant for applications that use streaming, where providing immediate
feedback is crucial.
Time to last token (TTLT) measures the total time the model takes to process
the prompt and generate the complete response. The sketch that follows shows one way to measure both.
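To see where the time goes, you can measure both metrics yourself. The following is a minimal sketch, assuming the Vertex AI SDK for Python (vertexai) and placeholder project ID, region, and model name; with streaming, the arrival of the first chunk approximates TTFT.

```python
# Minimal sketch: measure TTFT and TTLT with a streamed request.
# The project ID, region, and model name are placeholders.
import time

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-2.0-flash-001")

start = time.monotonic()
ttft = None

# Stream the response so the first chunk arrives as soon as it is generated.
for chunk in model.generate_content("Summarize why latency matters.", stream=True):
    if ttft is None:
        ttft = time.monotonic() - start  # first chunk received: approximate TTFT
ttlt = time.monotonic() - start          # last chunk received: TTLT

print(f"TTFT: {ttft:.2f}s, TTLT: {ttlt:.2f}s")
```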
Strategies to reduce latency
You can use several strategies with Vertex AI
to minimize latency and improve the responsiveness of your applications:
Choose the right model for your use case
Vertex AI provides a diverse range of models with varying
capabilities and performance characteristics. Carefully evaluate your
requirements regarding speed and output quality to choose the model that best
aligns with your use case. For a list of available models, see
Explore all models.
Optimize prompt and output length
The number of tokens in both your input prompt and expected output directly
impacts processing time. Minimize your token count to reduce
latency.
Craft clear and concise prompts that effectively convey your intent without
unnecessary details or redundancy. Shorter prompts reduce your time to first token.
Use system instructions to control the length of the response. Instruct the
model to provide concise answers or limit the output to a specific number of
sentences or paragraphs. This strategy can reduce your time to last token.
Adjust the temperature. Experiment with the temperature parameter to
control the randomness of the output. Lower temperature values can lead to
shorter, more focused responses, while higher values can result in more
diverse, but potentially longer, outputs. For more information,
see temperature in the model parameters reference.
Restrict output by setting a limit. Use the max_output_tokens parameter to
set a maximum limit on the length of the generated response, preventing
overly long output. However, be cautious, because this might cut off responses
mid-sentence. The sketch after this list combines these output controls.
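The following is a minimal sketch that combines a system instruction, a lower temperature, and max_output_tokens, assuming the Vertex AI SDK for Python; the project ID, region, model name, prompt, and parameter values are illustrative placeholders.

```python
# Minimal sketch: constrain output length to reduce time to last token.
# Project ID, region, and model name are placeholders.
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel(
    "gemini-2.0-flash-001",
    # A system instruction that asks for brevity helps keep responses short.
    system_instruction="Answer in at most two sentences.",
)

response = model.generate_content(
    "Explain what time to first token means.",
    generation_config=GenerationConfig(
        temperature=0.2,        # lower temperature tends toward shorter, focused output
        max_output_tokens=128,  # hard cap; responses may be truncated mid-sentence
    ),
)
print(response.text)
```

If responses come back truncated, consider raising max_output_tokens or relaxing the system instruction rather than removing the limit entirely.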
Stream responses
With streaming, the model starts sending its response before it generates the
complete output. This enables real-time processing of the output, and you can
immediately update your user interface and perform other concurrent tasks.
Streaming enhances perceived responsiveness and creates a more interactive user
experience.
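The following is a minimal sketch of streaming with the Vertex AI SDK for Python, again using placeholder project ID, region, and model name.

```python
# Minimal sketch: stream a response so the UI can update as chunks arrive.
# Project ID, region, and model name are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-2.0-flash-001")

# Each chunk is available as soon as the model produces it; render it
# immediately instead of waiting for the complete response.
for chunk in model.generate_content("Write a short product description.", stream=True):
    print(chunk.text, end="", flush=True)
print()
```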
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-29 UTC."],[],[],null,["# Best practices with large language models (LLMs)\n\nMultimodal prompts\n------------------\n\nFor information on best practices for multimodal prompts, see the following\npages based on the modality that you're working with:\n\n- [Image understanding](/vertex-ai/generative-ai/docs/multimodal/image-understanding)\n- [Video understanding](/vertex-ai/generative-ai/docs/multimodal/video-understanding)\n- [Audio understanding](/vertex-ai/generative-ai/docs/multimodal/audio-understanding)\n- [Document understanding](/vertex-ai/generative-ai/docs/multimodal/document-understanding)\n\nReduce latency\n--------------\n\nWhen you build interactive applications, response time, also known as latency,\nplays a crucial role in the user experience. This section explores the concept\nof latency in the context of Vertex AI LLM APIs and provides\nactionable strategies to minimize it and improve the response time of\nyour AI-powered applications.\n\n### Understanding latency metrics for LLMs\n\nLatency refers to the time it takes for a model to process your input\nprompt and generate a corresponding output response.\n\nWhen examining latency with a model, consider the following:\n\n*Time to first token (TTFT)* is the time that it takes for the model to produce\nthe first token of the response after receiving the prompt. TTFT is particularly\nrelevant for applications utilizing streaming, where providing immediate\nfeedback is crucial.\n\n*Time to last token (TTLT)* measures the overall time taken by the model to process\nthe prompt and generate the response.\n\n### Strategies to reduce latency\n\nYou can utilize several strategies with Vertex AI\nto minimize latency and enhance the responsiveness of your applications:\n\n#### Choose the right model for your use case\n\nVertex AI provides a diverse range of models with varying\ncapabilities and performance characteristics. Carefully evaluate your\nrequirements regarding speed and output quality to choose the model that best\naligns with your use case. For a list of available models, see\n[Explore all models](/vertex-ai/generative-ai/docs/model-garden/explore-models).\n\n#### Optimize prompt and output length\n\nThe number of tokens in both your input prompt and expected output directly\nimpacts processing time. Minimize your token count to reduce\nlatency.\n\n- Craft clear and concise prompts that effectively convey your intent without\n unnecessary details or redundancy. Shorter prompts reduce your time to first token.\n\n- Use *system instructions* to control the length of the response. Instruct the\n model to provide concise answers or limit the output to a specific number of\n sentences or paragraphs. This strategy can reduce your time to last token.\n\n- Adjust the `temperature`. Experiment with the `temperature` parameter to\n control the randomness of the output. Lower `temperature` values can lead to\n shorter, more focused responses, while higher values can result in more\n diverse, but potentially longer, outputs. 
For more information,\n see [`temperature` in the model parameters reference](/vertex-ai/generative-ai/docs/model-reference/gemini#parameters).\n\n- Restrict output by setting a limit. Use the `max_output_tokens` parameter to\n set a maximum limit on the length of the generated response length, preventing\n overly long output. However, be cautious as this might cut off responses\n mid-sentence.\n\n#### Stream responses\n\nWith streaming, the model starts sending its response before it generates the\ncomplete output. This enables real-time processing of the output, and you can\nimmediately update your user interface and perform other concurrent tasks.\n\nStreaming enhances perceived responsiveness and creates a more interactive user\nexperience.\n\nWhat's next\n-----------\n\n- Learn [general prompt design strategies](/vertex-ai/generative-ai/docs/learn/prompt-design-strategies).\n- See some [sample prompts](/vertex-ai/generative-ai/docs/prompt-gallery).\n- Learn how to [send chat prompts](/vertex-ai/generative-ai/docs/multimodal/send-chat-prompts-gemini).\n- Learn about [responsible AI best practices and Vertex AI's safety filters](/vertex-ai/generative-ai/docs/learn/responsible-ai).\n- Learn how to [tune a model](/vertex-ai/generative-ai/docs/models/tune-models).\n- Learn about [Provisioned Throughput](/vertex-ai/generative-ai/docs/provisioned-throughput) to assure production workloads."]]