Live API

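The Live API enables low-latency, two-way voice and video interactions with Gemini. Use the Live API to give end users natural, human-like voice conversations, including the ability to interrupt the model's responses with voice commands.

This document covers the basics of using the Live API, including its capabilities, starter examples, and code examples for basic use cases. To learn how to start an interactive conversation with the Live API, see Interactive conversations with the Live API. To learn which tools the Live API can use, see Built-in tools.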


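Supported models

Both the Google Gen AI SDK and Vertex AI Studio support the Live API. Some features (such as text input and output) are available only through the Gen AI SDK.

You can use the Live API with the following models:

Model version	Availability level
`gemini-live-2.5-flash`	Private GA*
`gemini-live-2.5-flash-preview-native-audio-09-2025`	Public preview
`gemini-live-2.5-flash-preview-native-audio`	Public preview; discontinuation date: October 18, 2025

* To request access, contact your Google account team representative.

For more information, including technical specifications and limits, see the Live API reference guide.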

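Live API capabilities

Real-time multimodal understanding: With built-in support for streaming audio and video, converse with Gemini about what it sees in a video feed or through screen sharing.
Built-in tool use: Integrate tools such as function calling and Grounding with Google Search into your conversations for more practical and dynamic interactions.
Low-latency interactions: Have low-latency, human-like interactions with Gemini.
Multilingual support: Converse in 24 supported languages.
(GA versions only) Provisioned Throughput support: Use a fixed-cost, fixed-term subscription, available in several term lengths, to reserve throughput for supported generative AI models on Vertex AI, including the Live API.
High-quality transcription: The Live API supports text transcription of both input and output audio.

Gemini 2.5 Flash with Live API also includes native audio, which is offered in public preview. Native audio introduces the following capabilities:

Affective dialog: The Live API understands the user's tone of voice and responds accordingly. The same words spoken in different ways can lead to very different, more nuanced conversations.
Proactive audio and context awareness: The Live API intelligently disregards ambient conversations and other irrelevant audio, understanding when to listen and when to remain silent.

To learn more about native audio, see Built-in tools.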

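Supported audio formats

The Live API supports the following audio formats:

Input audio: Raw 16-bit PCM audio at 16 kHz, little-endian
Output audio: Raw 16-bit PCM audio at 24 kHz, little-endian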
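
To make the byte layout of these formats concrete, the following is a minimal sketch that uses only the Python standard library. The helper names `floats_to_pcm16` and `pcm16_duration_seconds` are hypothetical and not part of any SDK, and the sketch assumes mono audio with float samples in the range [-1.0, 1.0].

```python
import struct

INPUT_SAMPLE_RATE = 16_000   # Hz, input format listed above
OUTPUT_SAMPLE_RATE = 24_000  # Hz, output format listed above
BYTES_PER_SAMPLE = 2         # 16-bit samples


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to raw 16-bit little-endian PCM bytes.

    Hypothetical helper for illustration; assumes mono audio.
    """
    ints = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # clip out-of-range values
        ints.append(int(round(clipped * 32767)))
    # "<" selects little-endian byte order, "h" is a signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm16_duration_seconds(pcm_bytes, sample_rate):
    """Return the duration in seconds of mono 16-bit PCM audio."""
    return len(pcm_bytes) / (BYTES_PER_SAMPLE * sample_rate)
```

For example, one second of input audio at 16 kHz occupies 32,000 bytes under these assumptions, and one second of output audio at 24 kHz occupies 48,000 bytes.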

Supported video formats

The Live API accepts video frames as input at a rate of 1 frame per second (FPS). For best results, use the native resolution of 768x768 at 1 FPS.
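
One way to prepare a higher-frame-rate source for these limits is sketched below. Both helpers (`fit_within` and `frames_to_send`) are hypothetical names used for illustration. Scaling frames to fit inside 768x768 while keeping the aspect ratio is an assumption of this sketch, not documented Live API behavior.

```python
def fit_within(width, height, max_side=768):
    """Scale (width, height) to fit inside max_side x max_side.

    Keeps the aspect ratio and never upscales. Hypothetical helper.
    """
    scale = min(max_side / width, max_side / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def frames_to_send(total_frames, source_fps, target_fps=1.0):
    """Return indices of source frames to keep to approximate target_fps."""
    step = source_fps / target_fps  # source frames per frame sent
    indices = []
    position = 0.0
    while round(position) < total_frames:
        indices.append(round(position))
        position += step
    return indices
```

With a 3-second clip recorded at 30 FPS, `frames_to_send(90, 30)` keeps one frame per second of video, and `fit_within(1920, 1080)` gives the reduced frame size for a 1080p source.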

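Starter examples

You can get started with the Live API using one of the following notebook tutorials, demo applications, or guides.

Notebook tutorials

Download these notebook tutorials from GitHub, or open them in the environment of your choice.

Use WebSockets with the Live API

Streaming audio and video

Demo applications and guides

More examples

To get more out of the Live API, try the following examples, which use its audio processing, transcription, and voice response capabilities.

Get text responses from audio input

You can convert audio to 16-bit PCM, 16 kHz, mono format, then send it and receive a text response. The following example reads a WAV file and sends it in the correct format: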

Python

# Test file: https://storage.googleapis.com/generativeai-downloads/data/16000.wav
# Install helpers for converting files: pip install librosa soundfile

import asyncio
import io
from pathlib import Path
from google import genai
from google.genai import types
import soundfile as sf
import librosa

client = genai.Client(
    vertexai=True,
    project=GOOGLE_CLOUD_PROJECT,
    location=GOOGLE_CLOUD_LOCATION,
)
model = "gemini-live-2.5-flash"
config = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:

        buffer = io.BytesIO()
        y, sr = librosa.load("sample.wav", sr=16000)
        sf.write(buffer, y, sr, format="RAW", subtype="PCM_16")
        buffer.seek(0)
        audio_bytes = buffer.read()

        # If already in correct format, you can use this:
        # audio_bytes = Path("sample.pcm").read_bytes()

        await session.send_realtime_input(
            audio=types.Blob(data=audio_bytes, mime_type="audio/pcm;rate=16000")
        )

        async for response in session.receive():
            if response.text is not None:
                print(response.text)

if __name__ == "__main__":
    asyncio.run(main())
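
The example above sends the whole file as a single blob. For longer recordings or live capture, you might prefer to send audio in smaller pieces. The following is a minimal sketch of splitting PCM bytes into fixed-duration chunks. The helper name `split_pcm16` and the 100 ms default are assumptions chosen for illustration, not values required by the Live API.

```python
def split_pcm16(pcm_bytes, sample_rate=16_000, chunk_ms=100):
    """Yield consecutive chunks of mono 16-bit PCM, each about chunk_ms long.

    Hypothetical helper; the final chunk may be shorter than the rest.
    """
    # Samples per chunk, times 2 bytes per 16-bit sample, so every chunk
    # boundary falls on a whole sample.
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2
    if bytes_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm_bytes), bytes_per_chunk):
        yield pcm_bytes[start:start + bytes_per_chunk]
```

Each chunk could then be wrapped in a `types.Blob` and passed to `session.send_realtime_input`, in the same way the complete `audio_bytes` buffer is sent in the example above.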

Get voice responses from text input

Use this example to send text input and receive a synthesized speech response:

Python

import asyncio
import numpy as np
from IPython.display import Audio, Markdown, display
from google import genai
from google.genai.types import (
  Content,
  LiveConnectConfig,
  HttpOptions,
  Modality,
  Part,
  SpeechConfig,
  VoiceConfig,
  PrebuiltVoiceConfig,
)

client = genai.Client(
  vertexai=True,
  project=GOOGLE_CLOUD_PROJECT,
  location=GOOGLE_CLOUD_LOCATION,
)

voice_name = "Aoede"

config = LiveConnectConfig(
  response_modalities=["AUDIO"],
  speech_config=SpeechConfig(
      voice_config=VoiceConfig(
          prebuilt_voice_config=PrebuiltVoiceConfig(
              voice_name=voice_name,
          )
      ),
  ),
)

async with client.aio.live.connect(
  model="gemini-live-2.5-flash",
  config=config,
) as session:
  text_input = "Hello? Gemini are you there?"
  display(Markdown(f"**Input:** {text_input}"))

  await session.send_client_content(
      turns=Content(role="user", parts=[Part(text=text_input)]))

  audio_data = []
  async for message in session.receive():
      if (
          message.server_content
          and message.server_content.model_turn
          and message.server_content.model_turn.parts
      ):
          for part in message.server_content.model_turn.parts:
              if part.inline_data:
                  audio_data.append(
                      np.frombuffer(part.inline_data.data, dtype=np.int16)
                  )

  if audio_data:
      display(Audio(np.concatenate(audio_data), rate=24000, autoplay=True))
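
The example above plays the audio inline with IPython, which works only in a notebook. Outside a notebook, you could instead save the returned PCM data to a WAV file. The following is a minimal sketch that uses the standard library `wave` module. The helper name `save_pcm16_as_wav` is hypothetical, and treating the output as mono is an assumption of this sketch.

```python
import wave


def save_pcm16_as_wav(pcm_chunks, path, sample_rate=24_000, channels=1):
    """Write raw 16-bit little-endian PCM chunks to a WAV file.

    Hypothetical helper. pcm_chunks is an iterable of bytes objects;
    channels=1 (mono) is an assumption.
    """
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)          # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"".join(pcm_chunks))
```

To use it with the example above, you could collect each `part.inline_data.data` value in a list as it arrives and pass that list to `save_pcm16_as_wav` after the loop ends.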

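For more examples of sending text, see our getting started guide.

Transcribe audio

The Live API can transcribe both input and output audio. Use the following example to enable transcription: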

Python

import asyncio
from google import genai
from google.genai import types

client = genai.Client(
    vertexai=True,
    project=GOOGLE_CLOUD_PROJECT,
    location=GOOGLE_CLOUD_LOCATION,
)
model = "gemini-live-2.5-flash"

config = {
    "response_modalities": ["AUDIO"],
    "input_audio_transcription": {},
    "output_audio_transcription": {}
}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        message = "Hello? Gemini are you there?"

        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": message}]}, turn_complete=True
        )

        async for response in session.receive():
            server_content = response.server_content
            if server_content is None:
                continue  # Skip messages that carry no server content
            if server_content.model_turn:
                print("Model turn:", server_content.model_turn)
            if server_content.input_transcription:
                print("Input transcript:", server_content.input_transcription.text)
            if server_content.output_transcription:
                print("Output transcript:", server_content.output_transcription.text)

if __name__ == "__main__":
    asyncio.run(main())

WebSockets

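# Imports used by this snippet. SERVICE_URL and bearer_token are assumed to be
# defined earlier (for example, in the notebook's setup cells).
import base64
import json

import numpy as np
from IPython.display import Audio, Markdown, display
from websockets.asyncio.client import connect

# Set model generation_config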
CONFIG = {
    'response_modalities': ['AUDIO'],
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {bearer_token[0]}",
}

# Connect to the server
async with connect(SERVICE_URL, additional_headers=headers) as ws:
    # Setup the session
    await ws.send(
        json.dumps(
            {
                "setup": {
                    "model": "gemini-2.0-flash-live-preview-04-09",
                    "generation_config": CONFIG,
                    'input_audio_transcription': {},
                    'output_audio_transcription': {}
                }
            }
        )
    )

    # Receive setup response
    raw_response = await ws.recv(decode=False)
    setup_response = json.loads(raw_response.decode("ascii"))

    # Send text message
    text_input = "Hello? Gemini are you there?"
    display(Markdown(f"**Input:** {text_input}"))

    msg = {
        "client_content": {
            "turns": [{"role": "user", "parts": [{"text": text_input}]}],
            "turn_complete": True,
        }
    }

    await ws.send(json.dumps(msg))

    responses = []
    input_transcriptions = []
    output_transcriptions = []

    # Receive chunks of server response
    async for raw_response in ws:
        response = json.loads(raw_response.decode())
        server_content = response.pop("serverContent", None)
        if server_content is None:
            break

        if (input_transcription := server_content.get("inputTranscription")) is not None:
            if (text := input_transcription.get("text")) is not None:
                input_transcriptions.append(text)
        if (output_transcription := server_content.get("outputTranscription")) is not None:
            if (text := output_transcription.get("text")) is not None:
                output_transcriptions.append(text)

        model_turn = server_content.pop("modelTurn", None)
        if model_turn is not None:
            parts = model_turn.pop("parts", None)
            if parts is not None:
                for part in parts:
                    pcm_data = base64.b64decode(part["inlineData"]["data"])
                    responses.append(np.frombuffer(pcm_data, dtype=np.int16))

        # End of turn
        turn_complete = server_content.pop("turnComplete", None)
        if turn_complete:
            break

    if input_transcriptions:
        display(Markdown(f"**Input transcription >** {''.join(input_transcriptions)}"))

    if responses:
        # Play the returned audio message
        display(Audio(np.concatenate(responses), rate=24000, autoplay=True))

    if output_transcriptions:
        display(Markdown(f"**Output transcription >** {''.join(output_transcriptions)}"))
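
The transcription handling in the WebSockets example can also be separated from the network code, which makes it easier to test. The following is a minimal sketch that works on already-decoded JSON messages. The function name `collect_transcripts` is hypothetical, and the field names (`serverContent`, `inputTranscription`, `outputTranscription`, `turnComplete`) are taken from the example above.

```python
def collect_transcripts(messages):
    """Join transcription fragments from decoded server messages.

    Hypothetical helper. messages is an iterable of dicts shaped like the
    JSON responses in the WebSockets example. Returns a tuple of
    (input_transcript, output_transcript) for the first completed turn.
    """
    input_parts, output_parts = [], []
    for message in messages:
        server_content = message.get("serverContent")
        if server_content is None:
            continue  # skip messages that carry no server content
        for key, parts in (
            ("inputTranscription", input_parts),
            ("outputTranscription", output_parts),
        ):
            text = (server_content.get(key) or {}).get("text")
            if text is not None:
                parts.append(text)
        if server_content.get("turnComplete"):
            break  # end of the model's turn
    return "".join(input_parts), "".join(output_parts)
```

Unlike the example above, this sketch skips messages without `serverContent` instead of stopping at the first one. That is a design choice for illustration, so adjust it if your application should stop on such messages.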

Pricing for Live API transcription is based on the number of text output tokens. For details, see the Vertex AI pricing page.
