Best practices with large language models (LLMs)

Multimodal prompts

For information on best practices for multimodal prompts, see Multimodal best practices.

Reduce latency

When you build interactive applications, response time, also known as latency, plays a crucial role in the user experience. This section explores the concept of latency in the context of Vertex AI LLM APIs and provides actionable strategies to minimize it and improve the response time of your AI-powered applications.

Understanding latency metrics for LLMs

Latency refers to the time it takes for a model to process your input prompt and generate a corresponding output response.

When examining latency with a model, consider the following:

  • Time to first token (TTFT) is the time that it takes for the model to produce the first token of the response after receiving the prompt. TTFT is particularly relevant for applications utilizing streaming, where providing immediate feedback is crucial.

  • Time to last token (TTLT) measures the overall time taken by the model to process the prompt and generate the response.
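
As a minimal sketch of how these two metrics can be measured on the client side, the following Python example times a stream of text chunks. The `measure_latency` helper and the `fake_stream` generator are hypothetical names introduced for illustration; `fake_stream` only simulates a streaming response and is not part of any Vertex AI API. In a real application, start the clock immediately before you send the request.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_latency(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Return (ttft_seconds, ttlt_seconds, full_text) for a stream of text chunks."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            # The first chunk has arrived: approximates time to first token.
            ttft = time.perf_counter() - start
        parts.append(chunk)
    # The stream is exhausted: this is the time to last token.
    ttlt = time.perf_counter() - start
    if ttft is None:
        ttft = ttlt  # Empty stream: no first chunk was ever received.
    return ttft, ttlt, "".join(parts)


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    for piece in ["Hello", ", ", "world"]:
        time.sleep(0.01)  # Simulated generation delay per chunk.
        yield piece


ttft, ttlt, text = measure_latency(fake_stream())
print(f"TTFT={ttft:.3f}s TTLT={ttlt:.3f}s text={text!r}")
```

Because chunks can contain more than one token, the first value is an approximation of TTFT as seen by the client, including network time.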

Strategies to reduce latency

You can utilize several strategies with Vertex AI to minimize latency and enhance the responsiveness of your applications:

Choose the right model for your use case

Vertex AI provides a diverse range of models with varying capabilities and performance characteristics. Select the model that best suits your specific needs.

  • Gemini 1.5 Flash: A multimodal model designed for high-volume, cost-effective applications. Gemini 1.5 Flash delivers speed and efficiency to build fast, lower-cost applications that don't compromise on quality. It supports the following modalities: text, code, images, audio, video with and without audio, PDFs, or a combination of any of these.

  • Gemini 1.5 Pro: A more capable multimodal model with support for larger context. It supports the following modalities: text, code, images, audio, video with and without audio, PDFs, or a combination of any of these.

  • Gemini 1.0 Pro: If speed is a top priority and your prompts contain only text, then consider using this model. This model offers fast response times while still delivering impressive results.

Carefully evaluate your requirements regarding speed and output quality to choose the model that best aligns with your use case. For a list of available models, see Explore all models.

Optimize prompt and output length

The number of tokens in both your input prompt and expected output directly impacts processing time. Minimize your token count to reduce latency.

  • Craft clear and concise prompts that effectively convey your intent without unnecessary details or redundancy. Shorter prompts reduce your time to first token.

  • Use system instructions to control the length of the response. Instruct the model to provide concise answers or limit the output to a specific number of sentences or paragraphs. This strategy can reduce your time to last token.

  • Adjust the temperature. Experiment with the temperature parameter to control the randomness of the output. Lower temperature values can lead to shorter, more focused responses, while higher values can result in more diverse, but potentially longer, outputs. For more information, see Temperature.

  • Restrict output by setting a limit. Use the max_output_tokens parameter to cap the length of the generated response and prevent overly long output. However, be cautious, because this limit might cut off responses mid-sentence.
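
The length controls above can be sketched as plain Python data. This is an illustrative example, not the definitive way to configure a request: `build_length_controls` and `trim_to_last_sentence` are hypothetical helpers, the default values are arbitrary, and how the dictionary maps onto an actual request depends on the client library you use. The second helper shows one possible way to handle a response that was cut off by the output limit.

```python
def build_length_controls(max_sentences: int = 3,
                          temperature: float = 0.2,
                          max_output_tokens: int = 256) -> dict:
    """Collect the settings that influence response length (hypothetical helper)."""
    return {
        # System instruction that asks the model for a short answer.
        "system_instruction": f"Answer in at most {max_sentences} sentences.",
        "generation_config": {
            # Lower values make the output less random.
            "temperature": temperature,
            # Hard cap on the number of generated tokens.
            "max_output_tokens": max_output_tokens,
        },
    }


def trim_to_last_sentence(text: str) -> str:
    """Drop a trailing partial sentence, for example after a token-limit cutoff."""
    end = max(text.rfind(mark) for mark in ".!?")
    # If no sentence-ending punctuation exists, return the text unchanged.
    return text[: end + 1] if end != -1 else text


controls = build_length_controls(max_sentences=2, max_output_tokens=128)
print(controls["system_instruction"])
print(trim_to_last_sentence("First point. Second poi"))
```

Trimming on punctuation is a simple heuristic; it can misfire on abbreviations or decimal numbers, so treat it as a starting point.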

Stream responses

With streaming, the model starts sending its response before it generates the complete output. This enables real-time processing of the output, and you can immediately update your user interface and perform other concurrent tasks.

Streaming enhances perceived responsiveness and creates a more interactive user experience. For more information, see Stream responses from Generative AI models.
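
To illustrate the idea, here is a minimal sketch of consuming a streamed response chunk by chunk and updating a display as each chunk arrives. The `render_stream` helper and the `demo_chunks` generator are hypothetical and only simulate a stream. In the Vertex AI SDK for Python, streaming is typically requested by passing `stream=True` to `generate_content`, which yields partial responses; check the SDK reference for the exact call before relying on it.

```python
from typing import Callable, Iterable, Iterator


def render_stream(chunks: Iterable[str], on_update: Callable[[str], None]) -> str:
    """Call on_update with the text received so far each time a chunk arrives."""
    text = ""
    for chunk in chunks:
        text += chunk
        on_update(text)  # For example, redraw the chat bubble in your UI.
    return text


def demo_chunks() -> Iterator[str]:
    """Hypothetical stand-in for the partial responses of a streaming call."""
    yield from ["Streaming ", "shows ", "partial ", "output."]


updates = []
final = render_stream(demo_chunks(), updates.append)
print(final)
```

The user sees the first words as soon as the first chunk arrives, even though the total generation time is unchanged.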

What's next