边界框检测

在此实验性发布中,我们为开发者提供了一款强大的工具,助力开发者在图片和视频中进行对象检测和定位。通过使用边界框准确识别和划分对象,开发者可以解锁各种应用并提升项目的智能化水平。

主要优势:

  • 简单:无论您是否具备计算机视觉专业知识,都可以轻松地将对象检测功能集成到您的应用中。
  • 可自定义:根据自定义指令(例如“I want to see bounding boxes of all the green objects in this image”)生成边界框,而无需训练自定义模型。

技术详情:

  • 输入:提示和关联的图片或视频帧。
  • 输出:边界框,采用 [y_min, x_min, y_max, x_max] 格式。左上角是原点。x 轴是水平轴,y 轴是垂直轴。每个图片的坐标值都进行标准化处理,范围为 0-1000。
  • 可视化:AI Studio 用户将在界面中看到绘制的边界框。Vertex AI 用户应通过自定义可视化代码直观呈现其边界框。

Python

安装

pip install --upgrade google-genai

如需了解详情,请参阅 SDK 参考文档

设置环境变量以将 Gen AI SDK 与 Vertex AI 搭配使用:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_VERTEXAI=True

import requests
from google import genai
from google.genai.types import (
    GenerateContentConfig,
    HarmBlockThreshold,
    HarmCategory,
    HttpOptions,
    Part,
    SafetySetting,
)
from PIL import Image, ImageColor, ImageDraw
from pydantic import BaseModel

# Helper class to represent a bounding box
class BoundingBox(BaseModel):
    """
    Represents a bounding box with its 2D coordinates and associated label.

    Attributes:
        box_2d (list[int]): A list of integers representing the 2D coordinates of the bounding box,
                            typically in the format [y_min, x_min, y_max, x_max].
        label (str): A string representing the label or class associated with the object within the bounding box.
    """

    box_2d: list[int]
    label: str

# Helper function to plot bounding boxes on an image
def plot_bounding_boxes(image_uri: str, bounding_boxes: list[BoundingBox]) -> None:
    """
    Plots bounding boxes on an image with labels, using PIL and normalized coordinates.

    Args:
        image_uri: The URI of the image file.
        bounding_boxes: A list of BoundingBox objects. Each box's coordinates are in
                        normalized [y_min, x_min, y_max, x_max] format.
    """
    with Image.open(requests.get(image_uri, stream=True, timeout=10).raw) as im:
        width, height = im.size
        draw = ImageDraw.Draw(im)

        colors = list(ImageColor.colormap.keys())

        for i, bbox in enumerate(bounding_boxes):
            # Scale normalized coordinates to image dimensions
            abs_y_min = int(bbox.box_2d[0] / 1000 * height)
            abs_x_min = int(bbox.box_2d[1] / 1000 * width)
            abs_y_max = int(bbox.box_2d[2] / 1000 * height)
            abs_x_max = int(bbox.box_2d[3] / 1000 * width)

            color = colors[i % len(colors)]

            # Draw the rectangle using the correct (x, y) pairs
            draw.rectangle(
                ((abs_x_min, abs_y_min), (abs_x_max, abs_y_max)),
                outline=color,
                width=4,
            )
            if bbox.label:
                # Position the text at the top-left corner of the box
                draw.text((abs_x_min + 8, abs_y_min + 6), bbox.label, fill=color)

        im.show()

client = genai.Client(http_options=HttpOptions(api_version="v1"))

config = GenerateContentConfig(
    system_instruction="""
    Return bounding boxes as an array with labels.
    Never return masks. Limit to 25 objects.
    If an object is present multiple times, give each object a unique label
    according to its distinct characteristics (colors, size, position, etc..).
    """,
    temperature=0.5,
    safety_settings=[
        SafetySetting(
            category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
            threshold=HarmBlockThreshold.BLOCK_ONLY_HIGH,
        ),
    ],
    response_mime_type="application/json",
    response_schema=list[BoundingBox],
)

image_uri = "https://storage.googleapis.com/generativeai-downloads/images/socks.jpg"

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        Part.from_uri(
            file_uri=image_uri,
            mime_type="image/jpeg",
        ),
        "Output the positions of the socks with a face. Label according to position in the image.",
    ],
    config=config,
)
print(response.text)
plot_bounding_boxes(image_uri, response.parsed)

# Example response:
# [
#     {"box_2d": [6, 246, 386, 526], "label": "top-left light blue sock with cat face"},
#     {"box_2d": [234, 649, 650, 863], "label": "top-right light blue sock with cat face"},
# ]