Live API reference

This document provides a reference for the Live API and shows you how to use it for bidirectional streaming with Gemini. It covers the following topics:

  • Capabilities: Learn about the key features of the Live API, including multimodality and low-latency interaction.
  • Get started: Find a quick-start example for using the API with text-to-text generation.
  • Integration guide: Understand the core concepts for integrating with the API, such as sessions, message types, and function calling.
  • Limitations: Review the current limitations of the API, including session duration and authentication.
  • Messages and events: See the detailed reference for all client and server messages.

The Live API lets you have low-latency, bidirectional voice and video conversations with Gemini. You can create natural, human-like interactions for your users, including the ability to interrupt the model's responses with voice commands. The Live API processes text, audio, and video input, and provides text and audio output.

For more information about the Live API, see Live API.

Capabilities

The Live API includes the following key capabilities:

  • Multimodality: The model can see, hear, and speak.
  • Low-latency, real-time interaction: Get fast responses from the model for a fluid conversational experience.
  • Session memory: The model retains memory of all interactions within a single session, recalling previously heard or seen information.
  • Support for function calling, code execution, and Search as a Tool: You can integrate the model with external services and data sources.

The Live API is designed for server-to-server communication. For web and mobile apps, we recommend that you use the integration from our partners at Daily.

Supported models

Get started

To try the Live API, go to the Vertex AI Studio, and then click Start Session.

The Live API is a stateful API that uses WebSockets.

This section provides an example of how to use the Live API for text-to-text generation with Python 3.9 or later.

Python

Install

pip install --upgrade google-genai

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

import asyncio

from google import genai
from google.genai.types import (Content, HttpOptions, LiveConnectConfig,
                                Modality, Part)

client = genai.Client(http_options=HttpOptions(api_version="v1beta1"))
model_id = "gemini-2.0-flash-live-preview-04-09"


async def main() -> None:
    async with client.aio.live.connect(
        model=model_id,
        config=LiveConnectConfig(response_modalities=[Modality.TEXT]),
    ) as session:
        text_input = "Hello? Gemini, are you there?"
        print("> ", text_input, "\n")
        await session.send_client_content(
            turns=Content(role="user", parts=[Part(text=text_input)])
        )

        response = []

        async for message in session.receive():
            if message.text:
                response.append(message.text)

        print("".join(response))


asyncio.run(main())
# Example output:
# >  Hello? Gemini, are you there?
# Yes, I'm here. What would you like to talk about?

Integration guide

This section describes how to integrate your application with the Live API.

Sessions

A WebSocket connection establishes a session between the client and the Gemini server. After you initiate a new connection, your client can exchange messages with the server to do the following:

  • Send text, audio, or video to the Gemini server.
  • Receive audio, text, or function call requests from the Gemini server.

Send the session configuration in the first message after you establish a connection. A session configuration includes the model, generation parameters, system instructions, and tools.

The following is an example configuration:


{
  "model": string,
  "generationConfig": {
    "candidateCount": integer,
    "maxOutputTokens": integer,
    "temperature": number,
    "topP": number,
    "topK": integer,
    "presencePenalty": number,
    "frequencyPenalty": number,
    "responseModalities": [string],
    "speechConfig": object
  },

  "systemInstruction": string,
  "tools": [object]
}

For more information, see BidiGenerateContentSetup.
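
The following is a minimal sketch of opening a session over a raw WebSocket connection using the third-party websockets Python package. The endpoint URL and model resource name are placeholders, and authentication is omitted; in practice, route the connection through your application server as described in the Limitations section.

import asyncio
import json

import websockets  # assumed third-party dependency: pip install websockets

# Placeholder values: substitute the Live API WebSocket endpoint and the model
# resource name for your project and region.
LIVE_API_ENDPOINT = "wss://LIVE_API_ENDPOINT"
MODEL_NAME = "projects/PROJECT_ID/locations/LOCATION/publishers/google/models/MODEL_ID"

async def open_session() -> None:
    async with websockets.connect(LIVE_API_ENDPOINT) as ws:
        # The session configuration must be the first message on the connection.
        setup_message = {
            "setup": {
                "model": MODEL_NAME,
                "generationConfig": {"responseModalities": ["TEXT"]},
            }
        }
        await ws.send(json.dumps(setup_message))

        # Wait for the server's setupComplete acknowledgement before sending
        # any other messages.
        print(await ws.recv())

asyncio.run(open_session())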

Send messages

Messages are JSON-formatted objects that are exchanged over the WebSocket connection.

To send a message, your client sends a JSON object over an open WebSocket connection. The JSON object must have exactly one of the fields from the following object set:


{
  "setup": BidiGenerateContentSetup,
  "clientContent": BidiGenerateContentClientContent,
  "realtimeInput": BidiGenerateContentRealtimeInput,
  "toolResponse": BidiGenerateContentToolResponse
}
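
For example, the following sketch sends a clientContent message over an open WebSocket connection (ws); note that the top-level object sets exactly one field.

# Each client message sets exactly one top-level field -- here, clientContent.
client_message = {
    "clientContent": {
        "turns": [{"role": "user", "parts": [{"text": "Hello"}]}],
        "turnComplete": True,
    }
}
await ws.send(json.dumps(client_message))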

Supported client messages

The following table describes the messages that you can send to the server.

| Message | Description | Use Case |
| --- | --- | --- |
| BidiGenerateContentSetup | Session configuration that you send in the first message. | Send once at the beginning of a new session to configure the model, tools, and other parameters. |
| BidiGenerateContentClientContent | An incremental content update of the current conversation from the client. | Send text input, establish or restore session context, or provide turn-by-turn interactions. |
| BidiGenerateContentRealtimeInput | Real-time audio or video input. | Stream continuous media data (like voice) to the model without waiting for a turn to complete. |
| BidiGenerateContentToolResponse | A response to a ToolCallMessage from the server. | Provide the results back to the model after executing a function call requested by the server. |

Receive messages

To receive messages from Gemini, listen for the WebSocket message event. Then, parse the result according to the definitions of the supported server messages.

For example:

ws.addEventListener("message", async (evt) => {
  if (evt.data instanceof Blob) {
    // Process the received data (audio, video, etc.)
  } else {
    // Process JSON response
  }
});

Server messages have exactly one of the fields from the following object set:


{
  "setupComplete": BidiGenerateContentSetupComplete,
  "serverContent": BidiGenerateContentServerContent,
  "toolCall": BidiGenerateContentToolCall,
  "toolCallCancellation": BidiGenerateContentToolCallCancellation,
  "usageMetadata": UsageMetadata,
  "goAway": GoAway,
  "sessionResumptionUpdate": SessionResumptionUpdate,
  "inputTranscription": BidiGenerateContentTranscription,
  "outputTranscription": BidiGenerateContentTranscription
}
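
As a sketch, a client that parses JSON frames might dispatch on whichever field is present, as in the following example. Only the branching is shown; binary frames (for example, audio data) need separate handling, as in the JavaScript listener above.

import json

def handle_server_message(raw: str) -> None:
    # Dispatch on the single field that the server message carries.
    message = json.loads(raw)
    if "setupComplete" in message:
        pass  # The session is configured and ready for interaction.
    elif "serverContent" in message:
        pass  # Model output (text or audio) to deliver to the user.
    elif "toolCall" in message:
        pass  # Execute the requested function calls and send a toolResponse.
    elif "toolCallCancellation" in message:
        pass  # Cancel the calls listed in message["toolCallCancellation"]["ids"].
    elif "goAway" in message:
        pass  # Prepare for the connection to be terminated.
    elif "sessionResumptionUpdate" in message:
        pass  # Store the new handle so the session can be resumed later.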

Supported server messages

The following table describes the messages that you can receive from the server.

| Message | Description | Use Case |
| --- | --- | --- |
| BidiGenerateContentSetupComplete | Acknowledges a BidiGenerateContentSetup message from the client when setup is complete. | Confirms to the client that the initial configuration was received and the session is ready for interaction. |
| BidiGenerateContentServerContent | Content generated by the model in response to a client message. | Delivers the model's response (text or audio) to the client. |
| BidiGenerateContentToolCall | A request for the client to run function calls and return the responses. | Instructs the client to execute a specific tool or function with provided arguments. |
| BidiGenerateContentToolCallCancellation | Sent when a function call is canceled due to the user interrupting model output. | Notifies the client to cancel a pending tool call. |
| UsageMetadata | A report of the number of tokens used by the session so far. | Monitor token consumption during a session. |
| GoAway | A signal that the current connection will soon be terminated. | Allows the client to prepare for a graceful disconnection. |
| SessionResumptionUpdate | A session checkpoint that can be resumed. | Provides a handle to the client to resume a disconnected session. |
| BidiGenerateContentTranscription | A transcription of either the user's or model's speech. | Provides a real-time text version of spoken audio from either the user or the model. |

Incremental content updates

Use incremental updates to send text input, establish session context, or restore session context. For short contexts, you can send turn-by-turn interactions to represent the exact sequence of events. For longer contexts, we recommend that you provide a single message summary to free up the context window for follow-up interactions.

The following is an example of a context message:

{
  "clientContent": {
    "turns": [
      {
        "parts": [
          {
            "text": ""
          }
        ],
        "role": "user"
      },
      {
        "parts": [
          {
            "text": ""
          }
        ],
        "role": "model"
      }
    ],
    "turnComplete": true
  }
}

While content parts can be of a functionResponse type, don't use BidiGenerateContentClientContent to respond to function calls from the model. Instead, use BidiGenerateContentToolResponse. Use BidiGenerateContentClientContent only to establish previous context or provide text input to the conversation.
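
The following is a minimal sketch of restoring context with the Gen AI SDK, assuming that send_client_content also accepts a list of turns; the turn text is hypothetical.

from google.genai.types import Content, Part

# Hypothetical prior turns used to restore context at the start of a session.
history = [
    Content(role="user", parts=[Part(text="My favorite color is teal.")]),
    Content(role="model", parts=[Part(text="Noted: your favorite color is teal.")]),
]
await session.send_client_content(turns=history, turn_complete=True)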

Streaming audio and video

Code execution

To learn more about code execution, see Code execution.

Function calling

You must declare all functions at the start of the session by sending tool definitions as part of the BidiGenerateContentSetup message.

You define functions by using JSON, specifically with a select subset of the OpenAPI schema format.

A single function declaration can include the following parameters:

  • name (string): The unique identifier for the function within the API call.
  • description (string): A comprehensive explanation of the function's purpose and capabilities.
  • parameters (object): Defines the input data required by the function.
    • type (string): Specifies the overall data type, such as object.
    • properties (object): Lists individual parameters, each with:
      • type (string): The data type of the parameter, such as string, integer, or boolean.
      • description (string): A clear explanation of the parameter's purpose and expected format.
    • required (array): An array of strings listing the parameter names that are mandatory for the function to operate.

For code examples of a function declaration using curl commands, see Function calling with the Gemini API. For examples of how to create function declarations using the Gemini API SDKs, see the Function calling tutorial.
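
For illustration, the following is a minimal sketch of a single function declaration; the function name and parameters are hypothetical.

# A hypothetical tool declaration sent as part of the session configuration.
get_weather_tool = {
    "function_declarations": [
        {
            "name": "get_current_weather",
            "description": "Returns the current weather for the given city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city name, for example 'Paris'.",
                    }
                },
                "required": ["city"],
            },
        }
    ]
}

# Include it in the session configuration, for example (SDK usage is a sketch):
# config = LiveConnectConfig(response_modalities=[Modality.TEXT], tools=[get_weather_tool])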

From a single prompt, the model can generate multiple function calls and the code to chain their outputs. This code executes in a sandbox environment and generates subsequent BidiGenerateContentToolCall messages. Execution pauses until the results of each function call are available, which provides sequential processing.

Your client should respond with BidiGenerateContentToolResponse.
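
The following is a minimal sketch of returning a function result with the Gen AI SDK, assuming the session's send_tool_response method and the FunctionResponse type; the result payload is hypothetical.

from google.genai.types import FunctionResponse

async for message in session.receive():
    if message.tool_call:  # BidiGenerateContentToolCall
        responses = [
            FunctionResponse(
                id=call.id,      # match each response to its call by id
                name=call.name,
                response={"result": "sunny, 22 C"},  # hypothetical function output
            )
            for call in message.tool_call.function_calls
        ]
        await session.send_tool_response(function_responses=responses)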

To learn more, see Introduction to function calling.

Audio formats

See the list of supported audio formats.

System instructions

You can provide system instructions to better control the model's output and specify the tone and sentiment of audio responses.

System instructions are added to the prompt before the interaction begins and remain in effect for the entire session. You can only set them at the beginning of a session, immediately after the initial connection. To provide more input to the model during the session, use incremental content updates.
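
The following is a minimal sketch of setting a system instruction in the session configuration with the Gen AI SDK, assuming the system_instruction field of LiveConnectConfig; the instruction text is hypothetical.

from google.genai.types import Content, LiveConnectConfig, Modality, Part

config = LiveConnectConfig(
    response_modalities=[Modality.AUDIO],
    # Hypothetical instruction that shapes tone for the whole session.
    system_instruction=Content(
        parts=[Part(text="You are a friendly assistant. Keep answers brief.")]
    ),
)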

Interruptions

Users can interrupt the model's output at any time. When Voice activity detection (VAD) detects an interruption, the ongoing generation is canceled and discarded. Only the information that has already been sent to the client is retained in the session history. The server then sends a BidiGenerateContentServerContent message to report the interruption.

In addition, the Gemini server discards any pending function calls and sends a BidiGenerateContentServerContent message with the IDs of the canceled calls.

Voices

To specify a voice, set the voiceName within the speechConfig object as part of your session configuration.

The following is a JSON representation of a speechConfig object:

{
  "voiceConfig": {
    "prebuiltVoiceConfig": {
      "voiceName": "VOICE_NAME"
    }
  }
}

To see the list of supported voices, see Change voice and language settings.
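
With the Gen AI SDK, the equivalent configuration might look like the following sketch, assuming the SpeechConfig, VoiceConfig, and PrebuiltVoiceConfig types; replace VOICE_NAME with a supported voice.

from google.genai.types import (LiveConnectConfig, Modality, PrebuiltVoiceConfig,
                                SpeechConfig, VoiceConfig)

config = LiveConnectConfig(
    response_modalities=[Modality.AUDIO],
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(
            prebuilt_voice_config=PrebuiltVoiceConfig(voice_name="VOICE_NAME")
        )
    ),
)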

Limitations

Consider the following limitations of the Live API and Gemini 2.0 when you plan your project.

Client authentication

The Live API provides only server-to-server authentication and isn't recommended for direct client use. Route client input through an intermediate application server for secure authentication with the Live API.

Maximum session duration

The default maximum length of a conversation session is 10 minutes. For more information, see Session length.

Voice activity detection (VAD)

By default, the model automatically performs voice activity detection (VAD) on a continuous audio input stream. You can configure VAD with the RealtimeInputConfig.AutomaticActivityDetection field of the setup message.

The API supports two VAD modes:

  • Automatic VAD (default): The server detects user speech.

    • If the audio stream is paused for more than a second (for example, when the user mutes the microphone), your client should send an AudioStreamEnd event to flush any cached audio.
    • Your client can resume sending audio data at any time.
  • Manual VAD: To use this mode, set RealtimeInputConfig.AutomaticActivityDetection.disabled to true in the setup message.

    • Your client is responsible for detecting user speech and sending ActivityStart and ActivityEnd messages at the appropriate times.
    • An AudioStreamEnd event isn't sent in this configuration. Instead, an ActivityEnd message marks any interruption of the stream.
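
The following is a minimal sketch of configuring manual VAD with the Gen AI SDK, assuming the RealtimeInputConfig, AutomaticActivityDetection, ActivityStart, and ActivityEnd types and the send_realtime_input method.

from google.genai.types import (ActivityEnd, ActivityStart,
                                AutomaticActivityDetection, LiveConnectConfig,
                                Modality, RealtimeInputConfig)

config = LiveConnectConfig(
    response_modalities=[Modality.TEXT],
    realtime_input_config=RealtimeInputConfig(
        # Disable automatic (server-side) activity detection.
        automatic_activity_detection=AutomaticActivityDetection(disabled=True)
    ),
)

# With manual VAD, the client brackets user speech with activity signals:
# await session.send_realtime_input(activity_start=ActivityStart())
# ... stream audio chunks ...
# await session.send_realtime_input(activity_end=ActivityEnd())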

Additional limitations

  • Manual endpointing isn't supported.
  • Audio inputs and outputs can negatively impact the model's ability to use function calling.

Token count

Token count isn't supported.

Rate limits

The following rate limits apply:

  • 5,000 concurrent sessions per API key
  • 4M tokens per minute

Messages and events

BidiGenerateContentClientContent

An incremental update of the current conversation delivered from the client. All content is unconditionally appended to the conversation history and used as part of the prompt to the model to generate content.

A message interrupts any current model generation.

Fields
turns[]

Content

Optional. The content appended to the current conversation with the model.

For single-turn queries, this is a single instance. For multi-turn queries, this is a repeated field that contains the conversation history and the latest request.

turn_complete

bool

Optional. If true, indicates that the server content generation should start with the currently accumulated prompt. Otherwise, the server will await additional messages before starting generation.

BidiGenerateContentRealtimeInput

User input that is sent in real time.

This differs from ClientContentUpdate in the following ways:

  • Can be sent continuously without interruption to model generation.
  • If data is interleaved across ClientContentUpdate and RealtimeUpdate messages, the server attempts to optimize for the best response, but there are no guarantees.
  • End of turn is not explicitly specified, but is rather derived from user activity (for example, end of speech).
  • Even before the end of turn, the data is processed incrementally to optimize for a fast start of the response from the model.
  • Is always assumed to be the user's input and can't be used to populate conversation history.

Fields
media_chunks[]

Blob

Optional. Inlined bytes data for media input.

activity_start

ActivityStart

Optional. Marks the start of user activity. This can only be sent if automatic (i.e. server-side) activity detection is disabled.

activity_end

ActivityEnd

Optional. Marks the end of user activity. This can only be sent if automatic (i.e. server-side) activity detection is disabled.

ActivityEnd

This type has no fields.

Marks the end of user activity.

ActivityStart

This type has no fields.

Marks the start of user activity.

BidiGenerateContentServerContent

An incremental server update generated by the model in response to client messages.

Content is generated as quickly as possible, not in real time. Clients can choose to buffer the content and play it in real time.

Fields
turn_complete

bool

Output only. If true, indicates that the model is done generating. Generation will only start in response to additional client messages. Can be set alongside content, indicating that the content is the last in the turn.

interrupted

bool

Output only. If true, indicates that a client message has interrupted current model generation. If the client is playing out the content in real time, this is a good signal to stop and empty the current playback queue.

generation_complete

bool

Output only. If true, indicates that the model is done generating.

If the model is interrupted while generating, there is no generation_complete message in the interrupted turn; instead, the turn goes through interrupted > turn_complete.

When the model assumes real-time playback, there is a delay between generation_complete and turn_complete because the model waits for playback to finish.

grounding_metadata

GroundingMetadata

Output only. Metadata specifies sources used to ground generated content.

input_transcription

Transcription

Optional. Input transcription. The transcription is independent of the model turn, which means it doesn't imply any ordering between the transcription and the model turn.

output_transcription

Transcription

Optional. Output transcription. The transcription is independent of the model turn, which means it doesn't imply any ordering between the transcription and the model turn.

model_turn

Content

Output only. The content that the model has generated as part of the current conversation with the user.

Transcription

Audio transcription message.

Fields
text

string

Optional. Transcription text.

finished

bool

Optional. Indicates the end of the transcription.

BidiGenerateContentSetup

A message to be sent in the first, and only the first, client message. It contains the configuration that applies for the duration of the streaming session.

Clients should wait for a BidiGenerateContentSetupComplete message before sending any more messages.

Fields
model

string

Required. The fully qualified name of the publisher model.

Publisher model format: projects/{project}/locations/{location}/publishers/*/models/*

generation_config

GenerationConfig

Optional. Generation config.

The following fields aren't supported:

  • response_logprobs
  • response_mime_type
  • logprobs
  • response_schema
  • stop_sequence
  • routing_config
  • audio_timestamp

system_instruction

Content

Optional. The user-provided system instructions for the model. Note: Use only text in parts; the content in each part will be in a separate paragraph.

tools[]

Tool

Optional. A list of Tools the model may use to generate the next response.

A Tool is a piece of code that enables the system to interact with external systems to perform an action, or set of actions, outside of knowledge and scope of the model.

session_resumption

SessionResumptionConfig

Optional. Configures the session resumption mechanism. If included, the server sends periodic SessionResumptionUpdate messages to the client.

context_window_compression

ContextWindowCompressionConfig

Optional. Configures context window compression mechanism.

If included, the server compresses the context window to fit into the given length.

realtime_input_config

RealtimeInputConfig

Optional. Configures the handling of realtime input.

input_audio_transcription

AudioTranscriptionConfig

Optional. The transcription of the input aligns with the input audio language.

output_audio_transcription

AudioTranscriptionConfig

Optional. The transcription of the output aligns with the language code specified for the output audio.

AudioTranscriptionConfig

This type has no fields.

The audio transcription configuration.

BidiGenerateContentSetupComplete

This type has no fields.

Sent in response to a BidiGenerateContentSetup message from the client.

BidiGenerateContentToolCall

A request for the client to execute the function_calls and return the responses with the matching ids.

Fields
function_calls[]

FunctionCall

Output only. The function calls to be executed.

BidiGenerateContentToolCallCancellation

A notification for the client that a previously issued ToolCallMessage with the specified ids shouldn't be executed and should be canceled. If there were side effects to those tool calls, clients can attempt to undo them. This message occurs only when clients interrupt server turns.

Fields
ids[]

string

Output only. The ids of the tool calls to be cancelled.

BidiGenerateContentToolResponse

A client-generated response to a ToolCall from the server. Individual FunctionResponse objects are matched to their respective FunctionCall objects by the id field.

In the unary and server-streaming GenerateContent APIs, function calling occurs by exchanging Content parts. In the bidirectional GenerateContent APIs, function calling occurs over this dedicated set of messages.

Fields
function_responses[]

FunctionResponse

Optional. The response to the function calls.

RealtimeInputConfig

Configures the real-time input behavior in BidiGenerateContent.

Fields
automatic_activity_detection

AutomaticActivityDetection

Optional. If not set, automatic activity detection is enabled by default. If automatic voice detection is disabled, the client must send activity signals.

activity_handling

ActivityHandling

Optional. Defines what effect activity has.

turn_coverage

TurnCoverage

Optional. Defines which input is included in the user's turn.

ActivityHandling

The different ways of handling user activity.

Enums
ACTIVITY_HANDLING_UNSPECIFIED If unspecified, the default behavior is START_OF_ACTIVITY_INTERRUPTS.
START_OF_ACTIVITY_INTERRUPTS If true, start of activity will interrupt the model's response (also called "barge in"). The model's current response will be cut-off in the moment of the interruption. This is the default behavior.
NO_INTERRUPTION The model's response will not be interrupted.

AutomaticActivityDetection

Configures automatic detection of activity.

Fields
start_of_speech_sensitivity

StartSensitivity

Optional. Determines how likely speech is to be detected.

end_of_speech_sensitivity

EndSensitivity

Optional. Determines how likely detected speech is ended.

prefix_padding_ms

int32

Optional. The required duration of detected speech before start-of-speech is committed. A lower value makes start-of-speech detection more sensitive and lets shorter speech be recognized. However, a lower value also increases the probability of false positives.

silence_duration_ms

int32

Optional. The required duration of detected silence (or non-speech) before end-of-speech is committed. A larger value lets speech gaps be longer without interrupting the user's activity, but it also increases the model's latency.

disabled

bool

Optional. If enabled, detected voice and text input count as activity. If disabled, the client must send activity signals.

EndSensitivity

End of speech sensitivity.

Enums
END_SENSITIVITY_UNSPECIFIED The default is END_SENSITIVITY_LOW.
END_SENSITIVITY_HIGH Automatic detection ends speech more often.
END_SENSITIVITY_LOW Automatic detection ends speech less often.

StartSensitivity

Start of speech sensitivity.

Enums
START_SENSITIVITY_UNSPECIFIED The default is START_SENSITIVITY_LOW.
START_SENSITIVITY_HIGH Automatic detection will detect the start of speech more often.
START_SENSITIVITY_LOW Automatic detection will detect the start of speech less often.

TurnCoverage

Options about which input is included in the user's turn.

Enums
TURN_COVERAGE_UNSPECIFIED If unspecified, the default behavior is TURN_INCLUDES_ALL_INPUT.
TURN_INCLUDES_ONLY_ACTIVITY The user's turn only includes activity since the last turn, excluding inactivity (e.g. silence on the audio stream).
TURN_INCLUDES_ALL_INPUT The user's turn includes all realtime input since the last turn, including inactivity (e.g. silence on the audio stream). This is the default behavior.

UsageMetadata

Metadata on the usage of the cached content.

Fields
total_token_count

int32

Total number of tokens that the cached content consumes.

text_count

int32

Number of text characters.

image_count

int32

Number of images.

video_duration_seconds

int32

Duration of video in seconds.

audio_duration_seconds

int32

Duration of audio in seconds.

GoAway

Indicates that the server will soon be unable to service the client.

Fields
time_left

Duration

The remaining time before the connection will be terminated as ABORTED. The minimal time returned here is specified differently together with the rate limits for a given model.

SessionResumptionUpdate

Update of the session resumption state.

Only sent if BidiGenerateContentSetup.session_resumption was set.

Fields
new_handle

string

New handle that represents state that can be resumed. Empty if resumable=false.

resumable

bool

True if the session can be resumed at this point.

It might not be possible to resume a session at some points. In that case, the server sends an update with an empty new_handle and resumable=false. An example of this is when the model is executing function calls or generating a response. Resuming a session in such a state by using the previous session token results in some data loss.

last_consumed_client_message_index

int64

The index of the last message sent by the client that is included in the state represented by this SessionResumptionToken. This is only sent when SessionResumptionConfig.transparent is set.

The presence of this index lets users transparently reconnect and avoid losing parts of real-time audio or video input. If a client wants to temporarily disconnect (for example, after receiving GoAway), they can do so without losing state by buffering the messages sent since the last SessionResumptionUpdate. This field lets them limit buffering and avoid keeping all requests in RAM.

It isn't used for 'resumption to restore state' at a later time, because in those cases, partial audio and video frames are likely not needed.
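
The following is a minimal sketch of storing resumption handles and reconnecting with the Gen AI SDK, assuming the SessionResumptionConfig type and the session_resumption_update field on received messages.

from google.genai.types import LiveConnectConfig, Modality, SessionResumptionConfig

resume_handle = None  # Pass a previously stored handle here to resume a session.

config = LiveConnectConfig(
    response_modalities=[Modality.TEXT],
    session_resumption=SessionResumptionConfig(handle=resume_handle),
)

async with client.aio.live.connect(model=model_id, config=config) as session:
    async for message in session.receive():
        update = message.session_resumption_update
        if update and update.resumable and update.new_handle:
            # Keep the latest handle so a later connection can resume this session.
            resume_handle = update.new_handle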

What's next