Multimodal prompts
For information on best practices for multimodal prompts, see the following pages based on the modality that you're working with:
Reduce latency
When you build interactive applications, response time, also known as latency, plays a crucial role in the user experience. This section explores the concept of latency in the context of Vertex AI LLM APIs and provides actionable strategies to minimize it and improve the response time of your AI-powered applications.
Understanding latency metrics for LLMs
Latency refers to the time it takes for a model to process your input prompt and generate a corresponding output response.
When examining latency with a model, consider the following:
Time to first token (TTFT) is the time that it takes for the model to produce the first token of the response after receiving the prompt. TTFT is particularly relevant for applications utilizing streaming, where providing immediate feedback is crucial.
Time to last token (TTLT) measures the overall time taken by the model to process the prompt and generate the response.
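As a rough illustration of the two metrics above, the following sketch times any iterator of text chunks. The helper name `measure_stream_latency` and the `fake_stream` generator are hypothetical and not part of any Vertex AI library. The arrival of the first chunk is used as an approximation of the first token, which assumes the stream is lazy and only sends the request once iteration begins.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream_latency(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Consume a stream of text chunks and return (ttft, ttlt, full_text).

    The timer starts when this function is called, so pass a lazy iterator
    that does its work only when iterated.
    """
    start = clock()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            # First chunk arrived: approximate time to first token.
            ttft = clock() - start
        parts.append(chunk)
    # Stream exhausted: time to last token.
    ttlt = clock() - start
    if ttft is None:
        # Empty stream: nothing arrived before the end.
        ttft = ttlt
    return ttft, ttlt, "".join(parts)


def fake_stream(delay_first: float = 0.05, delay_rest: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(delay_first)
    yield "Hello"
    for piece in [", ", "world", "!"]:
        time.sleep(delay_rest)
        yield piece


if __name__ == "__main__":
    ttft, ttlt, text = measure_stream_latency(fake_stream())
    print(f"TTFT={ttft:.3f}s TTLT={ttlt:.3f}s text={text!r}")
```

In a real application, the `fake_stream()` argument would be replaced by whatever iterator your client library returns for a streaming request.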
Strategies to reduce latency
You can utilize several strategies with Vertex AI to minimize latency and enhance the responsiveness of your applications:
Choose the right model for your use case
Vertex AI provides a diverse range of models with varying capabilities and performance characteristics. Select the model that best suits your specific needs.
Gemini 1.5 Flash: A multimodal model designed for high volume, cost-effective applications. Gemini 1.5 Flash delivers speed and efficiency to build fast, lower-cost applications that don't compromise on quality. It supports the following modalities: text, code, images, audio, video with and without audio, PDFs, or a combination of any of these.
Gemini 1.5 Pro: A more capable multimodal model with support for larger context. It supports the following modalities: text, code, images, audio, video with and without audio, PDFs, or a combination of any of these.
Gemini 1.0 Pro: If speed is a top priority and your prompts contain only text, then consider using this model. This model offers fast response times while still delivering impressive results.
Carefully evaluate your requirements regarding speed and output quality to choose the model that best aligns with your use case. For a list of available models, see Explore all models.
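The guidance above can be restated as a small decision helper. This is only an illustrative sketch: the function `pick_model` and its three flags are hypothetical, the order of the checks is an assumption, and the model ID strings are assumed to match the models named above. A real choice should rest on your own speed and quality evaluation.

```python
def pick_model(text_only: bool, speed_first: bool, needs_large_context: bool) -> str:
    """Map coarse requirements to one of the models described above.

    Assumption: the need for a larger context outweighs the other flags.
    """
    if needs_large_context:
        # More capable model with support for larger context.
        return "gemini-1.5-pro"
    if text_only and speed_first:
        # Text-only prompts where speed is the top priority.
        return "gemini-1.0-pro"
    # High-volume, cost-effective multimodal default.
    return "gemini-1.5-flash"


if __name__ == "__main__":
    print(pick_model(text_only=True, speed_first=True, needs_large_context=False))
```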
Optimize prompt and output length
The number of tokens in both your input prompt and expected output directly impacts processing time. Minimize your token count to reduce latency.
Craft clear and concise prompts that effectively convey your intent without unnecessary details or redundancy. Shorter prompts reduce your time to first token.
Use system instructions to control the length of the response. Instruct the model to provide concise answers or limit the output to a specific number of sentences or paragraphs. This strategy can reduce your time to last token.
Adjust the temperature. Experiment with the temperature parameter to control the randomness of the output. Lower temperature values can lead to shorter, more focused responses, while higher values can result in more diverse, but potentially longer, outputs. For more information, see temperature in the model parameters reference.
Restrict output by setting a limit. Use the max_output_tokens parameter to set a maximum length for the generated response, preventing overly long output. However, be cautious, because this limit might cut off responses mid-sentence.
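The output-length controls above might be combined as in the following sketch. It assumes the Vertex AI SDK for Python (`vertexai.generative_models`) is installed and that `vertexai.init(...)` has already been called for your project. The helper names `build_request_settings` and `generate_concise`, the default values, and the wording of the system instruction are illustrative assumptions, not recommended settings.

```python
from typing import Any, Dict


def build_request_settings(
    max_sentences: int = 3,
    temperature: float = 0.2,
    max_output_tokens: int = 256,
) -> Dict[str, Any]:
    """Collect the length-related settings in one place (hypothetical helper)."""
    if max_sentences < 1 or max_output_tokens < 1:
        raise ValueError("max_sentences and max_output_tokens must be positive")
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    return {
        # System instruction asking for a short answer.
        "system_instruction": f"Answer concisely, in at most {max_sentences} sentences.",
        # Sampling randomness and a hard cap on generated tokens.
        "generation_config": {
            "temperature": temperature,
            "max_output_tokens": max_output_tokens,
        },
    }


def generate_concise(prompt: str, model_name: str = "gemini-1.5-flash") -> str:
    """Send one request with the settings above.

    Assumes vertexai.init(project=..., location=...) was called beforehand.
    """
    # Imported lazily so build_request_settings works without the SDK installed.
    from vertexai.generative_models import GenerationConfig, GenerativeModel

    settings = build_request_settings()
    model = GenerativeModel(
        model_name,
        system_instruction=settings["system_instruction"],
    )
    response = model.generate_content(
        prompt,
        generation_config=GenerationConfig(**settings["generation_config"]),
    )
    return response.text
```

Keeping the token cap comfortably above the length requested in the system instruction makes it less likely that a response is truncated mid-sentence.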
Stream responses
With streaming, the model starts sending its response before it generates the complete output. This enables real-time processing of the output, and you can immediately update your user interface and perform other concurrent tasks.
Streaming enhances perceived responsiveness and creates a more interactive user experience.
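A minimal sketch of consuming a streamed response is shown below. The helper `relay_stream` is hypothetical and works with any iterator of text chunks. The `vertex_text_chunks` generator assumes the Vertex AI SDK for Python is installed and initialized, and that every streamed chunk carries text, which may not hold for every response.

```python
from typing import Callable, Iterable, Iterator


def relay_stream(chunks: Iterable[str], on_chunk: Callable[[str], None]) -> str:
    """Hand each chunk to a callback as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        on_chunk(chunk)  # for example, append to the user interface
        parts.append(chunk)
    return "".join(parts)


def vertex_text_chunks(prompt: str, model_name: str = "gemini-1.5-flash") -> Iterator[str]:
    """Yield text chunks from a streaming request.

    Assumes vertexai.init(project=..., location=...) was called beforehand.
    """
    # Imported lazily so relay_stream can be used without the SDK installed.
    from vertexai.generative_models import GenerativeModel

    model = GenerativeModel(model_name)
    for response in model.generate_content(prompt, stream=True):
        # Assumption: each chunk exposes text; handle exceptions in real code.
        yield response.text


if __name__ == "__main__":
    # Local demonstration with a plain list instead of a model call.
    relay_stream(["Stream", "ing ", "works"], lambda c: print(c, end="", flush=True))
    print()
```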
What's next
- Learn general prompt design strategies.
- See some sample prompts.
- Learn how to send chat prompts.
- Learn about responsible AI best practices and Vertex AI's safety filters.
- Learn how to tune a model.
- Learn about Provisioned Throughput to assure production workloads.