マルチモーダルエンベディングを取得する

マルチモーダルエンベディングモデルは、提供された入力に基づいて 1,408 次元のベクトル* を生成します。これには、画像、テキスト、動画データの組み合わせが含まれます。エンベディングベクトルは、画像分類や動画コンテンツのモデレーションなどの後続のタスクに使用できます。

画像エンベディングベクトルとテキストエンベディングベクトルは、同じ次元を持つ同じセマンティック空間にあります。そのため、これらのベクトルは、テキストによる画像の検索や画像による動画の検索などのユースケースでも使用できます。

テキストのみのエンベディングでは、Vertex AI Text-embeddings API を使用することをおすすめします。たとえば、Text-embeddings API は、テキストベースのセマンティック検索、クラスタリング、長時間のドキュメント分析、その他のテキスト取得や質問応答のユースケースに適しています。詳細については、テキストエンベディングを取得するをご覧ください。

サポートされているモデル

次のモデルを使用して、マルチモーダルエンベディングを取得できます。

multimodalembedding

ベストプラクティス

マルチモーダルエンベディングモデルを使用する場合は、入力時の次のことを考慮してください。

画像内のテキスト - このモデルは、光学式文字認識（OCR）と同様に、画像内のテキストを認識できます。画像コンテンツの説明と画像内のテキストを区別する必要がある場合は、プロンプトエンジニアリングを使用してターゲットコンテンツを指定することを検討してください。たとえば、ユースケースに応じて単に「cat」ではなく「picture of a cat」または「the text 'cat'」と指定します。

テキスト「cat」
猫の写真	^{画像クレジット: Manja Vitolic、Unsplash。}

エンベディングの類似度 - エンベディングのドット積は、調整された確率ではありません。ドット積は類似度指標であり、ユースケースによってスコア分布が異なる場合があります。品質を測定するために固定値のしきい値を使用することは避けてください。代わりに、取得にランキングアプローチを使用するか、分類にシグモイドを使用します。

API の使用

API の上限

テキストと画像のエンベディングに multimodalembedding モデルを使用する場合は、次の上限が適用されます。

上限	値と説明
	テキストと画像のデータ
各プロジェクト 1 分あたりの最大 API リクエスト数	120
テキストの最大長	32 トークン（約 32 ワード）テキストの最大長は 32 トークン（約 32 ワード）です。入力が 32 トークンを超える場合、この長さになるまで内部で入力が短縮されます。
言語	英語
画像形式	BMP、GIF、JPG、PNG
イメージサイズ	Base64 エンコード画像: 20 MB（PNG にコード変換する場合） Cloud Storage 画像: 20 MB（元のファイル形式）使用できる画像サイズは最大 20 MB です。ネットワークのレイテンシの増加を回避するには、より小さいイメージを使用してください。さらに、このモデルは画像を 512 x 512 ピクセルの解像度に変更します。そのため、より高い解像度の画像を用意する必要はありません。
	動画データ
音声対応	なし - このモデルでは、動画エンベディングの生成時に音声コンテンツが考慮されません
動画フォーマット	AVI、FLV、MKV、MOV、MP4、MPEG、MPG、WEBM、WMV
動画の最大長（Cloud Storage）	上限なし。ただし、一度に分析できるコンテンツは 2 分間のみです。

始める前に

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Enable the API

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Enable the API

環境の認証を設定します。

Select the tab for how you plan to use the samples on this page:
Java

ローカル開発環境でこのページの Java サンプルを使用するには、gcloud CLI をインストールして初期化し、ユーザー認証情報を使用してアプリケーションのデフォルト認証情報を設定します。
1. Install the Google Cloud CLI.
2. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
3. To initialize the gcloud CLI, run the following command:
  gcloud init
4. After initializing the gcloud CLI, update it and install the required components:
  gcloud components update gcloud components install beta
5. If you're using a local shell, then create local authentication credentials for your user account:
  gcloud auth application-default login
  You don't need to do this if you're using Cloud Shell.
  
  If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.
詳細については、 Google Cloud 認証ドキュメントのローカル開発環境の ADC の設定をご覧ください。
Node.js

ローカル開発環境でこのページの Node.js サンプルを使用するには、gcloud CLI をインストールして初期化し、ユーザー認証情報を使用してアプリケーションのデフォルト認証情報を設定します。
1. Install the Google Cloud CLI.
2. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
3. To initialize the gcloud CLI, run the following command:
  gcloud init
4. After initializing the gcloud CLI, update it and install the required components:
  gcloud components update gcloud components install beta
5. If you're using a local shell, then create local authentication credentials for your user account:
  gcloud auth application-default login
  You don't need to do this if you're using Cloud Shell.
  
  If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.
詳細については、 Google Cloud 認証ドキュメントのローカル開発環境の ADC の設定をご覧ください。
Python

ローカル開発環境でこのページの Python サンプルを使用するには、gcloud CLI をインストールして初期化し、ユーザー認証情報を使用してアプリケーションのデフォルト認証情報を設定します。
1. Install the Google Cloud CLI.
2. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
3. To initialize the gcloud CLI, run the following command:
  gcloud init
4. After initializing the gcloud CLI, update it and install the required components:
  gcloud components update gcloud components install beta
5. If you're using a local shell, then create local authentication credentials for your user account:
  gcloud auth application-default login
  You don't need to do this if you're using Cloud Shell.
  
  If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.
詳細については、 Google Cloud 認証ドキュメントのローカル開発環境の ADC の設定をご覧ください。
REST

このページの REST API サンプルをローカル開発環境で使用するには、gcloud CLI に指定した認証情報を使用します。
1. Install the Google Cloud CLI.
2. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
3. To initialize the gcloud CLI, run the following command:
  gcloud init
4. After initializing the gcloud CLI, update it and install the required components:
  gcloud components update gcloud components install beta
詳細については、 Google Cloud 認証ドキュメントの REST を使用して認証するをご覧ください。
Python SDK を使用するには、Vertex AI SDK for Python をインストールするの手順に沿って操作します。詳細については、Vertex AI SDK for Python API のリファレンスドキュメントをご覧ください。
省略可。この機能の料金を確認します。エンベディングの料金は、送信するデータの種類（画像やテキストなど）によって異なります。また、特定のデータタイプ（Video Plus、Video Standard、Video Essentials など）で使用するモードによっても異なります。

ロケーション

ロケーションは、データの保存場所を制御するためにリクエストで指定できるリージョンです。使用可能なリージョンの一覧については、Vertex AI の生成 AI のロケーションをご覧ください。

エラーメッセージ

割り当て超過エラー

google.api_core.exceptions.ResourceExhausted: 429 Quota exceeded for
aiplatform.googleapis.com/online_prediction_requests_per_base_model with base
model: multimodalembedding. Please submit a quota increase request.

このエラーが初めて表示された場合は、Google Cloud コンソールでプロジェクトの割り当ての増加をリクエストします。増加をリクエストする前に、次のフィルタを使用します。

Service ID: aiplatform.googleapis.com
metric: aiplatform.googleapis.com/online_prediction_requests_per_base_model
base_model:multimodalembedding

[割り当て] に移動

割り当て増加リクエストをすでに送信している場合は、しばらく待ってから次のリクエストを送信してください。さらに割り当てを増やす必要がある場合は、割り当てリクエストの正当性を示しながら、割り当て増加リクエストを繰り返します。

低ディメンションのエンベディングを指定する

デフォルトでは、エンベディングリクエストはデータ型の 1,408 浮動小数点ベクトルを返します。テキストデータと画像データには、低次元のエンベディング（128、256、512 の浮動小数点ベクトル）を指定することもできます。このオプションを使用すると、エンベディングの使用方法に基づいてレイテンシ、ストレージ、品質を最適化できます。低ディメンションのエンベディングでは、後続のエンベディングタスク（検索やレコメンデーションなど）でストレージニーズが減少し、レイテンシが低くなります。また、高ディメンションのエンベディングでは、同じタスクに対して高い精度が実現されます。

REST

低ディメンションにアクセスするには、parameters.dimension フィールドを追加します。このパラメータは、128、256、512、1408 のいずれかの値を受け入れます。レスポンスには、そのディメンションのエンベディングが含まれます。

リクエストのデータを使用する前に、次のように置き換えます。

LOCATION: プロジェクトのリージョン。たとえば、us-central1、europe-west2、asia-northeast3 です。使用可能なリージョンの一覧については、Vertex AI の生成 AI のロケーションをご覧ください。
PROJECT_ID: 実際の Google Cloud プロジェクト ID。
IMAGE_URI: エンベディングを取得するターゲット画像の Cloud Storage URI。例: gs://my-bucket/embeddings/supermarket-img.png
また、画像を base64 エンコードのバイト文字列として指定することもできます。
```
[...]
"image": {
  "bytesBase64Encoded": "B64_ENCODED_IMAGE"
}
[...]
```
TEXT: エンベディングを取得するターゲットテキスト。例: a cat
EMBEDDING_DIMENSION: エンベディングディメンションの数。値を小さくすると、後続のタスクでこれらのエンベディングを使用する際のレイテンシが低くなります。値が高いほど、精度が向上します。使用可能な値: 128、256、512、1408（デフォルト）。

HTTP メソッドと URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict

リクエストの本文（JSON）:

{
  "instances": [
    {
      "image": {
        "gcsUri": "IMAGE_URI"
      },
      "text": "TEXT"
    }
  ],
  "parameters": {
    "dimension": EMBEDDING_DIMENSION
  }
}

リクエストを送信するには、次のいずれかのオプションを選択します。

curl

注: 次のコマンドは、gcloud init または gcloud auth login を実行して、ユーザーアカウントで gcloud CLI にログインしているか、Cloud Shell を使用して自動的に gcloud CLI にログインしていることを前提としています。gcloud auth list を実行すると、現在アクティブなアカウントを確認できます。

リクエスト本文を request.json という名前のファイルに保存して、次のコマンドを実行します。

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict"

PowerShell

注: 次のコマンドは、gcloud init または gcloud auth login を実行して、ご自分のユーザーアカウントで gcloud CLI にログインしていることを前提としています。gcloud auth list を実行すると、現在アクティブなアカウントを確認できます。

リクエスト本文を request.json という名前のファイルに保存して、次のコマンドを実行します。

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict" | Select-Object -Expand Content

モデルのエンベディングは、指定したディメンションの浮動小数点ベクトルを返します。スペースの関係上、次のサンプルレスポンスは短縮されています。

128 ディメンション:

{
  "predictions": [
    {
      "imageEmbedding": [
        0.0279239565,
        [...128 dimension vector...]
        0.00403284049
      ],
      "textEmbedding": [
        0.202921599,
        [...128 dimension vector...]
        -0.0365431122
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}

256 ディメンション:

{
  "predictions": [
    {
      "imageEmbedding": [
        0.248620048,
        [...256 dimension vector...]
        -0.0646447465
      ],
      "textEmbedding": [
        0.0757875815,
        [...256 dimension vector...]
        -0.02749932
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}

512 ディメンション:

{
  "predictions": [
    {
      "imageEmbedding": [
        -0.0523675755,
        [...512 dimension vector...]
        -0.0444030389
      ],
      "textEmbedding": [
        -0.0592851527,
        [...512 dimension vector...]
        0.0350437127
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}

Python

import vertexai

from vertexai.vision_models import Image, MultiModalEmbeddingModel

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

# TODO(developer): Try different dimenions: 128, 256, 512, 1408
embedding_dimension = 128

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file(
    "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
)

embeddings = model.get_embeddings(
    image=image,
    contextual_text="Colosseum",
    dimension=embedding_dimension,
)

print(f"Image Embedding: {embeddings.image_embedding}")
print(f"Text Embedding: {embeddings.text_embedding}")

# Example response:
# Image Embedding: [0.0622573346, -0.0406507477, 0.0260440577, ...]
# Text Embedding: [0.27469793, -0.146258667, 0.0222803634, ...]

Go

import (
	"context"
	"encoding/json"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/structpb"
)

// generateWithLowerDimension shows how to generate lower-dimensional embeddings for text and image inputs.
func generateWithLowerDimension(w io.Writer, project, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
	if err != nil {
		return fmt.Errorf("failed to construct API client: %w", err)
	}
	defer client.Close()

	model := "multimodalembedding@001"
	endpoint := fmt.Sprintf("projects/%s/locations/%s/publishers/google/models/%s", project, location, model)

	// This is the input to the model's prediction call. For schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body
	instance, err := structpb.NewValue(map[string]any{
		"image": map[string]any{
			// Image input can be provided either as a Google Cloud Storage URI or as
			// base64-encoded bytes using the "bytesBase64Encoded" field.
			"gcsUri": "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png",
		},
		"text": "Colosseum",
	})
	if err != nil {
		return fmt.Errorf("failed to construct request payload: %w", err)
	}

	// TODO(developer): Try different dimenions: 128, 256, 512, 1408
	outputDimensionality := 128
	params, err := structpb.NewValue(map[string]any{
		"dimension": outputDimensionality,
	})
	if err != nil {
		return fmt.Errorf("failed to construct request params: %w", err)
	}

	req := &aiplatformpb.PredictRequest{
		Endpoint: endpoint,
		// The model supports only 1 instance per request.
		Instances:  []*structpb.Value{instance},
		Parameters: params,
	}

	resp, err := client.Predict(ctx, req)
	if err != nil {
		return fmt.Errorf("failed to generate embeddings: %w", err)
	}

	instanceEmbeddingsJson, err := protojson.Marshal(resp.GetPredictions()[0])
	if err != nil {
		return fmt.Errorf("failed to convert protobuf value to JSON: %w", err)
	}
	// For response schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#response-body
	var instanceEmbeddings struct {
		ImageEmbeddings []float32 `json:"imageEmbedding"`
		TextEmbeddings  []float32 `json:"textEmbedding"`
	}
	if err := json.Unmarshal(instanceEmbeddingsJson, &instanceEmbeddings); err != nil {
		return fmt.Errorf("failed to unmarshal JSON: %w", err)
	}

	imageEmbedding := instanceEmbeddings.ImageEmbeddings
	textEmbedding := instanceEmbeddings.TextEmbeddings

	fmt.Fprintf(w, "Text embedding (length=%d): %v\n", len(textEmbedding), textEmbedding)
	fmt.Fprintf(w, "Image embedding (length=%d): %v\n", len(imageEmbedding), imageEmbedding)
	// Example response:
	// Text Embedding (length=128): [0.27469793 -0.14625867 0.022280363 ... ]
	// Image Embedding (length=128): [0.06225733 -0.040650766 0.02604402 ... ]

	return nil
}

エンベディングリクエストを送信する（画像とテキスト）

次のコードサンプルを使用して、画像データとテキストデータのエンベディングリクエストを送信します。サンプルでは、両方のデータ型でリクエストを送信する方法を示していますが、データ型別にサービスを使用することもできます。

テキストと画像のエンベディングを取得する

REST

multimodalembedding モデルリクエストの詳細については、multimodalembedding モデル API リファレンスをご覧ください。

リクエストのデータを使用する前に、次のように置き換えます。

LOCATION: プロジェクトのリージョン。たとえば、us-central1、europe-west2、asia-northeast3 です。使用可能なリージョンの一覧については、Vertex AI の生成 AI のロケーションをご覧ください。
PROJECT_ID: 実際の Google Cloud プロジェクト ID。
TEXT: エンベディングを取得するターゲットテキスト。例: a cat
B64_ENCODED_IMG: エンベディングを取得するターゲット画像。画像は base64 でエンコードされたバイト文字列として指定する必要があります。

HTTP メソッドと URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict

リクエストの本文（JSON）:

{
  "instances": [
    {
      "text": "TEXT",
      "image": {
        "bytesBase64Encoded": "B64_ENCODED_IMG"
      }
    }
  ]
}

リクエストを送信するには、次のいずれかのオプションを選択します。

curl

リクエスト本文を request.json という名前のファイルに保存して、次のコマンドを実行します。

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict"

PowerShell

リクエスト本文を request.json という名前のファイルに保存して、次のコマンドを実行します。

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict" | Select-Object -Expand Content

モデルが返すエンベディングは、1,408 個の浮動小数点ベクトルです。スペースの関係上、次のサンプルレスポンスの一部は表示されていません。

{
  "predictions": [
    {
      "textEmbedding": [
        0.010477379,
        -0.00399621,
        0.00576670747,
        [...]
        -0.00823613815,
        -0.0169572588,
        -0.00472954148
      ],
      "imageEmbedding": [
        0.00262696808,
        -0.00198890246,
        0.0152047109,
        -0.0103145819,
        [...]
        0.0324628279,
        0.0284924973,
        0.011650892,
        -0.00452344026
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}

Vertex AI SDK for Python

Vertex AI SDK for Python のインストールまたは更新の方法については、Vertex AI SDK for Python をインストールするをご覧ください。詳細については、Vertex AI SDK for Python API のリファレンスドキュメントをご覧ください。

import vertexai
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
image = Image.load_from_file(
    "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
)

embeddings = model.get_embeddings(
    image=image,
    contextual_text="Colosseum",
    dimension=1408,
)
print(f"Image Embedding: {embeddings.image_embedding}")
print(f"Text Embedding: {embeddings.text_embedding}")
# Example response:
# Image Embedding: [-0.0123147098, 0.0727171078, ...]
# Text Embedding: [0.00230263756, 0.0278981831, ...]

Node.js

このサンプルを試す前に、Vertex AI クイックスタート: クライアントライブラリの使用にある Node.js の設定手順を完了してください。詳細については、Vertex AI Node.js API のリファレンスドキュメントをご覧ください。

Vertex AI に対する認証を行うには、アプリケーションのデフォルト認証情報を設定します。詳細については、ローカル開発環境の認証を設定するをご覧ください。

/**
 * TODO(developer): Uncomment these variables before running the sample.\
 * (Not necessary if passing values as arguments)
 */
// const project = 'YOUR_PROJECT_ID';
// const location = 'YOUR_PROJECT_LOCATION';
// const bastImagePath = "YOUR_BASED_IMAGE_PATH"
// const textPrompt = 'YOUR_TEXT_PROMPT';
const aiplatform = require('@google-cloud/aiplatform');

// Imports the Google Cloud Prediction service client
const {PredictionServiceClient} = aiplatform.v1;

// Import the helper module for converting arbitrary protobuf.Value objects.
const {helpers} = aiplatform;

// Specifies the location of the api endpoint
const clientOptions = {
  apiEndpoint: 'us-central1-aiplatform.googleapis.com',
};
const publisher = 'google';
const model = 'multimodalembedding@001';

// Instantiates a client
const predictionServiceClient = new PredictionServiceClient(clientOptions);

async function predictImageFromImageAndText() {
  // Configure the parent resource
  const endpoint = `projects/${project}/locations/${location}/publishers/${publisher}/models/${model}`;

  const fs = require('fs');
  const imageFile = fs.readFileSync(baseImagePath);

  // Convert the image data to a Buffer and base64 encode it.
  const encodedImage = Buffer.from(imageFile).toString('base64');

  const prompt = {
    text: textPrompt,
    image: {
      bytesBase64Encoded: encodedImage,
    },
  };
  const instanceValue = helpers.toValue(prompt);
  const instances = [instanceValue];

  const parameter = {
    sampleCount: 1,
  };
  const parameters = helpers.toValue(parameter);

  const request = {
    endpoint,
    instances,
    parameters,
  };

  // Predict request
  const [response] = await predictionServiceClient.predict(request);
  console.log('Get image embedding response');
  const predictions = response.predictions;
  console.log('\tPredictions :');
  for (const prediction of predictions) {
    console.log(`\t\tPrediction : ${JSON.stringify(prediction)}`);
  }
}

await predictImageFromImageAndText();

Java

このサンプルを試す前に、Vertex AI クイックスタート: クライアントライブラリの使用にある Java の設定手順を完了してください。詳細については、Vertex AI Java API のリファレンスドキュメントをご覧ください。


import com.google.cloud.aiplatform.v1beta1.EndpointName;
import com.google.cloud.aiplatform.v1beta1.PredictResponse;
import com.google.cloud.aiplatform.v1beta1.PredictionServiceClient;
import com.google.cloud.aiplatform.v1beta1.PredictionServiceSettings;
import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PredictImageFromImageAndTextSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace this variable before running the sample.
    String project = "YOUR_PROJECT_ID";
    String textPrompt = "YOUR_TEXT_PROMPT";
    String baseImagePath = "YOUR_BASE_IMAGE_PATH";

    // Learn how to use text prompts to update an image:
    // https://cloud.google.com/vertex-ai/docs/generative-ai/image/edit-images
    Map<String, Object> parameters = new HashMap<String, Object>();
    parameters.put("sampleCount", 1);

    String location = "us-central1";
    String publisher = "google";
    String model = "multimodalembedding@001";

    predictImageFromImageAndText(
        project, location, publisher, model, textPrompt, baseImagePath, parameters);
  }

  // Update images using text prompts
  public static void predictImageFromImageAndText(
      String project,
      String location,
      String publisher,
      String model,
      String textPrompt,
      String baseImagePath,
      Map<String, Object> parameters)
      throws IOException {
    final String endpoint = String.format("%s-aiplatform.googleapis.com:443", location);
    final PredictionServiceSettings predictionServiceSettings =
        PredictionServiceSettings.newBuilder().setEndpoint(endpoint).build();

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (PredictionServiceClient predictionServiceClient =
        PredictionServiceClient.create(predictionServiceSettings)) {
      final EndpointName endpointName =
          EndpointName.ofProjectLocationPublisherModelName(project, location, publisher, model);

      // Convert the image to Base64
      byte[] imageData = Base64.getEncoder().encode(Files.readAllBytes(Paths.get(baseImagePath)));
      String encodedImage = new String(imageData, StandardCharsets.UTF_8);

      JsonObject jsonInstance = new JsonObject();
      jsonInstance.addProperty("text", textPrompt);
      JsonObject jsonImage = new JsonObject();
      jsonImage.addProperty("bytesBase64Encoded", encodedImage);
      jsonInstance.add("image", jsonImage);

      Value instanceValue = stringToValue(jsonInstance.toString());
      List<Value> instances = new ArrayList<>();
      instances.add(instanceValue);

      Gson gson = new Gson();
      String gsonString = gson.toJson(parameters);
      Value parameterValue = stringToValue(gsonString);

      PredictResponse predictResponse =
          predictionServiceClient.predict(endpointName, instances, parameterValue);
      System.out.println("Predict Response");
      System.out.println(predictResponse);
      for (Value prediction : predictResponse.getPredictionsList()) {
        System.out.format("\tPrediction: %s\n", prediction);
      }
    }
  }

  // Convert a Json string to a protobuf.Value
  static Value stringToValue(String value) throws InvalidProtocolBufferException {
    Value.Builder builder = Value.newBuilder();
    JsonFormat.parser().merge(value, builder);
    return builder.build();
  }
}

Go

このサンプルを試す前に、Vertex AI クイックスタート: クライアントライブラリの使用にある Go の設定手順を完了してください。詳細については、Vertex AI Go API のリファレンスドキュメントをご覧ください。

import (
	"context"
	"encoding/json"
	"fmt"
	"io"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/structpb"
)

// generateForTextAndImage shows how to use the multimodal model to generate embeddings for
// text and image inputs.
func generateForTextAndImage(w io.Writer, project, location string) error {
	// location = "us-central1"
	ctx := context.Background()
	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
	if err != nil {
		return fmt.Errorf("failed to construct API client: %w", err)
	}
	defer client.Close()

	model := "multimodalembedding@001"
	endpoint := fmt.Sprintf("projects/%s/locations/%s/publishers/google/models/%s", project, location, model)

	// This is the input to the model's prediction call. For schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body
	instance, err := structpb.NewValue(map[string]any{
		"image": map[string]any{
			// Image input can be provided either as a Google Cloud Storage URI or as
			// base64-encoded bytes using the "bytesBase64Encoded" field.
			"gcsUri": "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png",
		},
		"text": "Colosseum",
	})
	if err != nil {
		return fmt.Errorf("failed to construct request payload: %w", err)
	}

	req := &aiplatformpb.PredictRequest{
		Endpoint: endpoint,
		// The model supports only 1 instance per request.
		Instances: []*structpb.Value{instance},
	}

	resp, err := client.Predict(ctx, req)
	if err != nil {
		return fmt.Errorf("failed to generate embeddings: %w", err)
	}

	instanceEmbeddingsJson, err := protojson.Marshal(resp.GetPredictions()[0])
	if err != nil {
		return fmt.Errorf("failed to convert protobuf value to JSON: %w", err)
	}
	// For response schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#response-body
	var instanceEmbeddings struct {
		ImageEmbeddings []float32 `json:"imageEmbedding"`
		TextEmbeddings  []float32 `json:"textEmbedding"`
	}
	if err := json.Unmarshal(instanceEmbeddingsJson, &instanceEmbeddings); err != nil {
		return fmt.Errorf("failed to unmarshal JSON: %w", err)
	}

	imageEmbedding := instanceEmbeddings.ImageEmbeddings
	textEmbedding := instanceEmbeddings.TextEmbeddings

	fmt.Fprintf(w, "Text embedding (length=%d): %v\n", len(textEmbedding), textEmbedding)
	fmt.Fprintf(w, "Image embedding (length=%d): %v\n", len(imageEmbedding), imageEmbedding)
	// Example response:
	// Text embedding (length=1408): [0.0023026613 0.027898183 -0.011858357 ... ]
	// Image embedding (length=1408): [-0.012314269 0.07271844 0.00020170923 ... ]

	return nil
}

エンベディングリクエストを送信する（動画、画像、テキスト）

エンベディングリクエストを送信するときに、入力動画のみを指定することも、動画、画像、テキストデータの組み合わせを指定することもできます。

動画エンベディングモード

動画エンベディングでは、Essentials、Standard、Plus の 3 つのモードを使用できます。モードは、生成されるエンベディングの密度に対応します。これは、リクエストの interval_sec 構成で指定できます。動画の間隔（interval_sec）ごとにエンベディングが生成されます。動画の最小間隔は 4 秒です。間隔が 120 秒を超えると、生成されたエンベディングの品質に悪影響を及ぼす可能性があります。

動画エンベディングの料金は、使用するモードによって異なります。詳細は、料金をご覧ください。

次の表に、動画のエンベディングに使用できる 3 つのモードをまとめます。

モード	1 分あたりの最大エンベディング数	動画のエンベディング間隔（最小値）
Essential	4	15 これは `intervalSec` >= 15 に対応します
Standard	8	8 これは 8 <= `intervalSec` < 15 に対応します
Plus	15	4 これは 4 <= `intervalSec` < 8 に対応します。

動画エンベディングのベストプラクティス

動画エンベディングのリクエストを送信する場合は、次の点を考慮してください。

任意の長さの入力動画の最初の 2 分間に単一のエンベディングを生成するには、次の videoSegmentConfig 設定を使用します。

request.json:
```
// other request body content
"videoSegmentConfig": {
  "intervalSec": 120
}
// other request body content
```

長さが 2 分を超える動画のエンベディングを生成するには、videoSegmentConfig で開始時間と終了時間を指定する複数のリクエストを送信します。

request1.json:

// other request body content
"videoSegmentConfig": {
  "startOffsetSec": 0,
  "endOffsetSec": 120
}
// other request body content

request2.json:

// other request body content
"videoSegmentConfig": {
  "startOffsetSec": 120,
  "endOffsetSec": 240
}
// other request body content

動画のエンベディングを取得する

動画コンテンツのみのエンベディングを取得するには、次のサンプルを使用します。

REST

multimodalembedding モデルリクエストの詳細については、multimodalembedding モデル API リファレンスをご覧ください。

次の例では、Cloud Storage にある動画を使用します。video.bytesBase64Encoded フィールドを使用して、動画の base64 エンコード文字列表現を指定することもできます。

リクエストのデータを使用する前に、次のように置き換えます。

LOCATION: プロジェクトのリージョン。たとえば、us-central1、europe-west2、asia-northeast3 です。使用可能なリージョンの一覧については、Vertex AI の生成 AI のロケーションをご覧ください。
PROJECT_ID: 実際の Google Cloud プロジェクト ID。
VIDEO_URI: エンベディングを取得するターゲット動画の Cloud Storage URI。例: gs://my-bucket/embeddings/supermarket-video.mp4
また、動画を base64 エンコードのバイト文字列として指定することもできます。
```
[...]
"video": {
  "bytesBase64Encoded": "B64_ENCODED_VIDEO"
}
[...]
```
videoSegmentConfig (START_SECOND, END_SECOND, INTERVAL_SECONDS)。省略可。エンベディングが生成される特定の動画セグメント（秒単位）。
videoSegmentConfig.intervalSec に設定した値は、課金される料金ティアに影響します。詳細については、動画エンベディングモードのセクションと料金のページをご覧ください。

例:
```
[...]
"videoSegmentConfig": {
  "startOffsetSec": 10,
  "endOffsetSec": 60,
  "intervalSec": 10
}
[...]
```
この構成を使用して、10 秒から 60 秒までの動画データを指定し、[10, 20), [20, 30), [30, 40), [40, 50), [50, 60) の 10 秒の動画間隔のエンベディングを生成します。この動画間隔（"intervalSec": 10）は、Standard 動画エンベディングモードになり、ユーザーには Standard モードの料金レートで請求されます。

videoSegmentConfig を省略すると、サービスはデフォルト値の "videoSegmentConfig": { "startOffsetSec": 0, "endOffsetSec": 120, "intervalSec": 16 } を使用します。この動画間隔（"intervalSec": 16）は、Essential 動画エンベディングモードになり、ユーザーには Essential モードの料金レートで請求されます。

HTTP メソッドと URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict

リクエストの本文（JSON）:

{
  "instances": [
    {
      "video": {
        "gcsUri": "VIDEO_URI",
        "videoSegmentConfig": {
          "startOffsetSec": START_SECOND,
          "endOffsetSec": END_SECOND,
          "intervalSec": INTERVAL_SECONDS
        }
      }
    }
  ]
}

リクエストを送信するには、次のいずれかのオプションを選択します。

curl

リクエスト本文を request.json という名前のファイルに保存して、次のコマンドを実行します。

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict"

PowerShell

リクエスト本文を request.json という名前のファイルに保存して、次のコマンドを実行します。

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict" | Select-Object -Expand Content

モデルが返すエンベディングは、1,408 個の浮動小数点ベクトルです。スペースの関係上、次のサンプルレスポンスは短縮されています。

レスポンス（7 秒の動画、videoSegmentConfig の指定なし）:

{
  "predictions": [
    {
      "videoEmbeddings": [
        {
          "endOffsetSec": 7,
          "embedding": [
            -0.0045467657,
            0.0258095954,
            0.0146885719,
            0.00945400633,
            [...]
            -0.0023291884,
            -0.00493789,
            0.00975185353,
            0.0168156829
          ],
          "startOffsetSec": 0
        }
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}

レスポンス（59 秒の動画、動画セグメントの構成: "videoSegmentConfig": { "startOffsetSec": 0, "endOffsetSec": 60, "intervalSec": 10 }）:

{
  "predictions": [
    {
      "videoEmbeddings": [
        {
          "endOffsetSec": 10,
          "startOffsetSec": 0,
          "embedding": [
            -0.00683252793,
            0.0390476175,
            [...]
            0.00657121744,
            0.013023301
          ]
        },
        {
          "startOffsetSec": 10,
          "endOffsetSec": 20,
          "embedding": [
            -0.0104404651,
            0.0357737206,
            [...]
            0.00509833824,
            0.0131902946
          ]
        },
        {
          "startOffsetSec": 20,
          "embedding": [
            -0.0113538112,
            0.0305239167,
            [...]
            -0.00195809244,
            0.00941874553
          ],
          "endOffsetSec": 30
        },
        {
          "embedding": [
            -0.00299320649,
            0.0322436653,
            [...]
            -0.00993082579,
            0.00968887936
          ],
          "startOffsetSec": 30,
          "endOffsetSec": 40
        },
        {
          "endOffsetSec": 50,
          "startOffsetSec": 40,
          "embedding": [
            -0.00591270532,
            0.0368893594,
            [...]
            -0.00219071587,
            0.0042470959
          ]
        },
        {
          "embedding": [
            -0.00458270218,
            0.0368121453,
            [...]
            -0.00317760976,
            0.00595594104
          ],
          "endOffsetSec": 59,
          "startOffsetSec": 50
        }
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}

Vertex AI SDK for Python

import vertexai

from vertexai.vision_models import MultiModalEmbeddingModel, Video
from vertexai.vision_models import VideoSegmentConfig

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

embeddings = model.get_embeddings(
    video=Video.load_from_file(
        "gs://cloud-samples-data/vertex-ai-vision/highway_vehicles.mp4"
    ),
    video_segment_config=VideoSegmentConfig(end_offset_sec=1),
)

# Video Embeddings are segmented based on the video_segment_config.
print("Video Embeddings:")
for video_embedding in embeddings.video_embeddings:
    print(
        f"Video Segment: {video_embedding.start_offset_sec} - {video_embedding.end_offset_sec}"
    )
    print(f"Embedding: {video_embedding.embedding}")

# Example response:
# Video Embeddings:
# Video Segment: 0.0 - 1.0
# Embedding: [-0.0206376351, 0.0123456789, ...]

Go

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"time"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/structpb"
)

// generateForVideo shows how to use the multimodal model to generate embeddings for video input.
func generateForVideo(w io.Writer, project, location string) error {
	// location = "us-central1"

	// The default context timeout may be not enough to process a video input.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
	if err != nil {
		return fmt.Errorf("failed to construct API client: %w", err)
	}
	defer client.Close()

	model := "multimodalembedding@001"
	endpoint := fmt.Sprintf("projects/%s/locations/%s/publishers/google/models/%s", project, location, model)

	// This is the input to the model's prediction call. For schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body
	instances, err := structpb.NewValue(map[string]any{
		"video": map[string]any{
			// Video input can be provided either as a Google Cloud Storage URI or as base64-encoded
			// bytes using the "bytesBase64Encoded" field.
			"gcsUri": "gs://cloud-samples-data/vertex-ai-vision/highway_vehicles.mp4",
			"videoSegmentConfig": map[string]any{
				"startOffsetSec": 1,
				"endOffsetSec":   5,
			},
		},
	})
	if err != nil {
		return fmt.Errorf("failed to construct request payload: %w", err)
	}

	req := &aiplatformpb.PredictRequest{
		Endpoint: endpoint,
		// The model supports only 1 instance per request.
		Instances: []*structpb.Value{instances},
	}
	resp, err := client.Predict(ctx, req)
	if err != nil {
		return fmt.Errorf("failed to generate embeddings: %w", err)
	}

	instanceEmbeddingsJson, err := protojson.Marshal(resp.GetPredictions()[0])
	if err != nil {
		return fmt.Errorf("failed to convert protobuf value to JSON: %w", err)
	}
	// For response schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#response-body
	var instanceEmbeddings struct {
		VideoEmbeddings []struct {
			Embedding      []float32 `json:"embedding"`
			StartOffsetSec float64   `json:"startOffsetSec"`
			EndOffsetSec   float64   `json:"endOffsetSec"`
		} `json:"videoEmbeddings"`
	}
	if err := json.Unmarshal(instanceEmbeddingsJson, &instanceEmbeddings); err != nil {
		return fmt.Errorf("failed to unmarshal json: %w", err)
	}
	// Get the embedding for our single video segment (`.videoEmbeddings` object has one entry per
	// each processed segment).
	videoEmbedding := instanceEmbeddings.VideoEmbeddings[0]

	fmt.Fprintf(w, "Video embedding (seconds: %.f-%.f; length=%d): %v\n",
		videoEmbedding.StartOffsetSec,
		videoEmbedding.EndOffsetSec,
		len(videoEmbedding.Embedding),
		videoEmbedding.Embedding,
	)
	// Example response:
	// Video embedding (seconds: 1-5; length=1408): [-0.016427778 0.032878537 -0.030755188 ... ]

	return nil
}

画像、テキスト、動画のエンベディングを取得する

次のサンプルを使用して、動画、テキスト、画像コンテンツのエンベディングを取得します。

REST

multimodalembedding モデルリクエストの詳細については、multimodalembedding モデル API リファレンスをご覧ください。

次の例では、画像、テキスト、動画のデータを使用します。リクエストの本文で、これらのデータ型を任意に組み合わせて使用できます。

また、このサンプルでは Cloud Storage にある動画を使用しています。video.bytesBase64Encoded フィールドを使用して、動画の base64 エンコード文字列表現を指定することもできます。

リクエストのデータを使用する前に、次のように置き換えます。

LOCATION: プロジェクトのリージョン。たとえば、us-central1、europe-west2、asia-northeast3 です。使用可能なリージョンの一覧については、Vertex AI の生成 AI のロケーションをご覧ください。
PROJECT_ID: 実際の Google Cloud プロジェクト ID。
TEXT: エンベディングを取得するターゲットテキスト。例: a cat
IMAGE_URI: エンベディングを取得するターゲット画像の Cloud Storage URI。例: gs://my-bucket/embeddings/supermarket-img.png
また、画像を base64 エンコードのバイト文字列として指定することもできます。
```
[...]
"image": {
  "bytesBase64Encoded": "B64_ENCODED_IMAGE"
}
[...]
```
VIDEO_URI: エンベディングを取得するターゲット動画の Cloud Storage URI。例: gs://my-bucket/embeddings/supermarket-video.mp4
また、動画を base64 エンコードのバイト文字列として指定することもできます。
```
[...]
"video": {
  "bytesBase64Encoded": "B64_ENCODED_VIDEO"
}
[...]
```
videoSegmentConfig (START_SECOND, END_SECOND, INTERVAL_SECONDS)。省略可。エンベディングが生成される特定の動画セグメント（秒単位）。
videoSegmentConfig.intervalSec に設定した値は、課金される料金ティアに影響します。詳細については、動画エンベディングモードのセクションと料金のページをご覧ください。

例:
```
[...]
"videoSegmentConfig": {
  "startOffsetSec": 10,
  "endOffsetSec": 60,
  "intervalSec": 10
}
[...]
```
この構成を使用して、10 秒から 60 秒までの動画データを指定し、[10, 20), [20, 30), [30, 40), [40, 50), [50, 60) の 10 秒の動画間隔のエンベディングを生成します。この動画間隔（"intervalSec": 10）は、Standard 動画エンベディングモードになり、ユーザーには Standard モードの料金レートで請求されます。

videoSegmentConfig を省略すると、サービスはデフォルト値の "videoSegmentConfig": { "startOffsetSec": 0, "endOffsetSec": 120, "intervalSec": 16 } を使用します。この動画間隔（"intervalSec": 16）は、Essential 動画エンベディングモードになり、ユーザーには Essential モードの料金レートで請求されます。

HTTP メソッドと URL:

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict

リクエストの本文（JSON）:

{
  "instances": [
    {
      "text": "TEXT",
      "image": {
        "gcsUri": "IMAGE_URI"
      },
      "video": {
        "gcsUri": "VIDEO_URI",
        "videoSegmentConfig": {
          "startOffsetSec": START_SECOND,
          "endOffsetSec": END_SECOND,
          "intervalSec": INTERVAL_SECONDS
        }
      }
    }
  ]
}

リクエストを送信するには、次のいずれかのオプションを選択します。

curl

リクエスト本文を request.json という名前のファイルに保存して、次のコマンドを実行します。

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict"

PowerShell

リクエスト本文を request.json という名前のファイルに保存して、次のコマンドを実行します。

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/multimodalembedding@001:predict" | Select-Object -Expand Content

モデルが返すエンベディングは、1,408 個の浮動小数点ベクトルです。スペースの関係上、次のサンプルレスポンスの一部は表示されていません。

{
  "predictions": [
    {
      "textEmbedding": [
        0.0105433334,
        -0.00302835181,
        0.00656806398,
        0.00603460241,
        [...]
        0.00445805816,
        0.0139605571,
        -0.00170318608,
        -0.00490092579
      ],
      "videoEmbeddings": [
        {
          "startOffsetSec": 0,
          "endOffsetSec": 7,
          "embedding": [
            -0.00673126569,
            0.0248149596,
            0.0128901172,
            0.0107588246,
            [...]
            -0.00180952181,
            -0.0054573305,
            0.0117037306,
            0.0169312079
          ]
        }
      ],
      "imageEmbedding": [
        -0.00728622358,
        0.031021487,
        -0.00206603738,
        0.0273937676,
        [...]
        -0.00204976718,
        0.00321615417,
        0.0121978866,
        0.0193375275
      ]
    }
  ],
  "deployedModelId": "DEPLOYED_MODEL_ID"
}

Vertex AI SDK for Python

import vertexai

from vertexai.vision_models import Image, MultiModalEmbeddingModel, Video
from vertexai.vision_models import VideoSegmentConfig

# TODO(developer): Update & uncomment line below
# PROJECT_ID = "your-project-id"
vertexai.init(project=PROJECT_ID, location="us-central1")

model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

image = Image.load_from_file(
    "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"
)
video = Video.load_from_file(
    "gs://cloud-samples-data/vertex-ai-vision/highway_vehicles.mp4"
)

embeddings = model.get_embeddings(
    image=image,
    video=video,
    video_segment_config=VideoSegmentConfig(end_offset_sec=1),
    contextual_text="Cars on Highway",
)

print(f"Image Embedding: {embeddings.image_embedding}")

# Video Embeddings are segmented based on the video_segment_config.
print("Video Embeddings:")
for video_embedding in embeddings.video_embeddings:
    print(
        f"Video Segment: {video_embedding.start_offset_sec} - {video_embedding.end_offset_sec}"
    )
    print(f"Embedding: {video_embedding.embedding}")

print(f"Text Embedding: {embeddings.text_embedding}")
# Example response:
# Image Embedding: [-0.0123144267, 0.0727186054, 0.000201397663, ...]
# Video Embeddings:
# Video Segment: 0.0 - 1.0
# Embedding: [-0.0206376351, 0.0345234685, ...]
# Text Embedding: [-0.0207006838, -0.00251058186, ...]

Go

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"time"

	aiplatform "cloud.google.com/go/aiplatform/apiv1beta1"
	aiplatformpb "cloud.google.com/go/aiplatform/apiv1beta1/aiplatformpb"
	"google.golang.org/api/option"
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/structpb"
)

// generateForImageTextAndVideo shows how to use the multimodal model to generate embeddings for
// image, text and video data.
func generateForImageTextAndVideo(w io.Writer, project, location string) error {
	// location = "us-central1"

	// The default context timeout may be not enough to process a video input.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	apiEndpoint := fmt.Sprintf("%s-aiplatform.googleapis.com:443", location)
	client, err := aiplatform.NewPredictionClient(ctx, option.WithEndpoint(apiEndpoint))
	if err != nil {
		return fmt.Errorf("failed to construct API client: %w", err)
	}
	defer client.Close()

	model := "multimodalembedding@001"
	endpoint := fmt.Sprintf("projects/%s/locations/%s/publishers/google/models/%s", project, location, model)

	// This is the input to the model's prediction call. For schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#request_body
	instance, err := structpb.NewValue(map[string]any{
		"text": "Domestic cats in natural conditions",
		"image": map[string]any{
			// Image and video inputs can be provided either as a Google Cloud Storage URI or as
			// base64-encoded bytes using the "bytesBase64Encoded" field.
			"gcsUri": "gs://cloud-samples-data/generative-ai/image/320px-Felis_catus-cat_on_snow.jpg",
		},
		"video": map[string]any{
			"gcsUri": "gs://cloud-samples-data/video/cat.mp4",
		},
	})
	if err != nil {
		return fmt.Errorf("failed to construct request payload: %w", err)
	}

	req := &aiplatformpb.PredictRequest{
		Endpoint: endpoint,
		// The model supports only 1 instance per request.
		Instances: []*structpb.Value{instance},
	}

	resp, err := client.Predict(ctx, req)
	if err != nil {
		return fmt.Errorf("failed to generate embeddings: %w", err)
	}

	instanceEmbeddingsJson, err := protojson.Marshal(resp.GetPredictions()[0])
	if err != nil {
		return fmt.Errorf("failed to convert protobuf value to JSON: %w", err)
	}
	// For response schema, see:
	// https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/multimodal-embeddings-api#response-body
	var instanceEmbeddings struct {
		ImageEmbeddings []float32 `json:"imageEmbedding"`
		TextEmbeddings  []float32 `json:"textEmbedding"`
		VideoEmbeddings []struct {
			Embedding      []float32 `json:"embedding"`
			StartOffsetSec float64   `json:"startOffsetSec"`
			EndOffsetSec   float64   `json:"endOffsetSec"`
		} `json:"videoEmbeddings"`
	}
	if err := json.Unmarshal(instanceEmbeddingsJson, &instanceEmbeddings); err != nil {
		return fmt.Errorf("failed to unmarshal JSON: %w", err)
	}

	imageEmbedding := instanceEmbeddings.ImageEmbeddings
	textEmbedding := instanceEmbeddings.TextEmbeddings
	// Get the embedding for our single video segment (`.videoEmbeddings` object has one entry per
	// each processed segment).
	videoEmbedding := instanceEmbeddings.VideoEmbeddings[0].Embedding

	fmt.Fprintf(w, "Image embedding (length=%d): %v\n", len(imageEmbedding), imageEmbedding)
	fmt.Fprintf(w, "Text embedding (length=%d): %v\n", len(textEmbedding), textEmbedding)
	fmt.Fprintf(w, "Video embedding (length=%d): %v\n", len(videoEmbedding), videoEmbedding)
	// Example response:
	// Image embedding (length=1408): [-0.01558477 0.0258355 0.016342038 ... ]
	// Text embedding (length=1408): [-0.005894961 0.008349559 0.015355394 ... ]
	// Video embedding (length=1408): [-0.018867437 0.013997682 0.0012682161 ... ]

	return nil
}

次のステップ

ブログ「マルチモーダル検索とは: ビジョンを持つ LLM によるビジネスの変更」を読む。
テキストエンベディングを取得するで、テキストのみのユースケース（テキストベースのセマンティック検索、クラスタリング、長時間のドキュメント分析、その他のテキスト取得や質問応答のユースケース）を確認する。
Vertex AI のイメージの概要で、Vertex AI のすべての画像生成 AI サービスを確認する。
Model Garden で、その他の事前トレーニング済みモデルを確認する。
Vertex AI での責任ある AI のベストプラクティスと安全フィルタについて学習する。

マルチモーダル エンベディングを取得する コレクションでコンテンツを整理 必要に応じて、コンテンツの保存と分類を行います。

サポートされているモデル

ベスト プラクティス

API の使用

API の上限

始める前に

Java

Node.js

Python

REST

ロケーション

エラー メッセージ

割り当て超過エラー

低ディメンションのエンベディングを指定する

REST

curl

PowerShell

Python

Go

エンベディング リクエストを送信する（画像とテキスト）

テキストと画像のエンベディングを取得する

REST

curl

PowerShell

Vertex AI SDK for Python

Node.js

Java

Go

エンベディング リクエストを送信する（動画、画像、テキスト）

動画エンベディング モード

動画エンベディングのベスト プラクティス

動画のエンベディングを取得する

REST

curl

PowerShell

Vertex AI SDK for Python

Go

画像、テキスト、動画のエンベディングを取得する

REST

curl

PowerShell

Vertex AI SDK for Python

Go

次のステップ

マルチモーダルエンベディングを取得する

ベストプラクティス

エラーメッセージ

エンベディングリクエストを送信する（画像とテキスト）

エンベディングリクエストを送信する（動画、画像、テキスト）

動画エンベディングモード

動画エンベディングのベストプラクティス