Generate embeddings

This document describes how to invoke the embedding models to generate text and multimodal embeddings by using the Vertex AI SDK for ABAP.

Embeddings are numerical representations of text, images, or videos that capture how the inputs are related. Applications use these representations to understand and generate language, recognizing even complex meanings and relationships within your specific content. The process works by transforming text, images, and videos into lists of numbers, known as vectors, which are designed to capture the meaning of the original content.

Some common use cases for text embeddings include:

  • Semantic search: Search text ranked by semantic similarity.
  • Classification: Return the class of items whose text attributes are similar to the given text.
  • Clustering: Cluster items whose text attributes are similar to the given text.
  • Outlier detection: Return items where text attributes are least related to the given text.
  • Conversational interface: Cluster groups of sentences that can lead to similar responses, as in a conversation-level embedding space.

With the Vertex AI SDK for ABAP, you can generate embeddings from the ABAP application logic by using the classes and methods that are shipped with the SDK. The SDK also provides out-of-the-box methods to push the generated embeddings to the following datastores:

  • Cloud Storage: You can use the embeddings from a Cloud Storage bucket for building vector indexes and performing Vector Search.
  • BigQuery: You can use the embeddings from a BigQuery dataset as a vector database for your enterprise data.

You can also publish the embeddings to a Pub/Sub topic that can be routed to a BigQuery dataset or to a subscriber system.

Before you begin

Before using the Vertex AI SDK for ABAP with the embedding models, make sure that you or your administrators have completed the prerequisites for setting up the SDK, including configuring the model generation parameters for the embeddings model.

Generate embeddings

This section explains how to generate embeddings by using the Vertex AI SDK for ABAP.

Instantiate the multimodal embeddings class

To invoke the Vertex AI multimodal embeddings models by using text or multimodal inputs, you can use the /GOOG/CL_EMBEDDINGS_MODEL class. You instantiate the class by passing the model key configured in the model generation parameters.

DATA(lo_embeddings_model) = NEW /goog/cl_embeddings_model( iv_model_key = 'MODEL_KEY' ).

Replace MODEL_KEY with the model key name, which is configured in the model generation parameters.
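
All SDK calls, including instantiation, can raise exceptions. The following is a minimal sketch of instantiation with error handling; it assumes the SDK's standard exception class /GOOG/CX_SDK, so adapt the handling to your own error-handling approach.

TRY.
    DATA(lo_embeddings_model) = NEW /goog/cl_embeddings_model( iv_model_key = 'MODEL_KEY' ).
  CATCH /goog/cx_sdk INTO DATA(lo_cx_sdk).
    " Handle the error in a way that fits your application,
    " for example by logging or displaying the message text.
    MESSAGE lo_cx_sdk->get_text( ) TYPE 'E'.
ENDTRY.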

Generate text embeddings

To generate embeddings for a snippet of text, you can use the GEN_TEXT_EMBEDDINGS method of the /GOOG/CL_EMBEDDINGS_MODEL class. You can also optionally specify a dimension for the output embeddings.

DATA(ls_addln_params) = VALUE /goog/cl_embeddings_model=>ty_addln_params(
                                output_dimensionality = 'DIMENSION' ).
DATA(lt_embeddings) = lo_embeddings_model->gen_text_embeddings(
                                             iv_content      = 'INPUT_TEXT'
                                             is_addln_params = ls_addln_params
                                        )->get_vector( ).

Replace the following:

  • DIMENSION: Optional. The dimensionality of the output embeddings. The default dimension is 768.
  • INPUT_TEXT: The text for which embeddings are to be generated.
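
The GET_VECTOR method returns the embeddings as an internal table of vector component values. As an illustration, the following sketch checks the dimensionality of the result and loops over the components; the exact line type of the table depends on the SDK type definitions in your system.

" Illustrative only: inspect the generated embedding vector.
DATA(lv_dimension) = lines( lt_embeddings ).
LOOP AT lt_embeddings INTO DATA(lv_component).
  " Process each vector component, for example to build a custom
  " serialization before storing the vector in your own datastore.
ENDLOOP.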

You can also generate embeddings for a snippet of text by using the out-of-the-box template /GOOG/CL_EMBEDDINGS_MODEL=>TY_EMBEDDINGS_TEMPLATE, which is shipped with the SDK. This template lets you capture enterprise-specific schematic information in the generated embeddings file along with the embeddings.

To generate embeddings for a snippet of text, based on the /GOOG/CL_EMBEDDINGS_MODEL=>TY_EMBEDDINGS_TEMPLATE template, you can use the GEN_TEXT_EMBEDDINGS_BY_STRUCT method.

DATA(ls_embedding_template) = VALUE /goog/cl_embeddings_model=>ty_embeddings_template(
                                      id      = ENTITY_ID
                                      content = INPUT_TEXT
                                      source  = SOURCE_MODULE ).
DATA(ls_addln_params) = VALUE /goog/cl_embeddings_model=>ty_addln_params(
                          output_dimensionality = 'DIMENSION' ).
DATA(ls_embeddings) = lo_embeddings_model->gen_text_embeddings_by_struct(
                                             is_input        = ls_embedding_template
                                             is_addln_params = ls_addln_params
                                        )->get_vector_by_struct( ).

Replace the following:

  • ENTITY_ID: Entity ID for the embeddings record.
  • INPUT_TEXT: The text for which embeddings are to be generated.
  • SOURCE_MODULE: Source module of the embeddings content.
  • DIMENSION: Optional. The dimensionality of the output embeddings. The default dimension is 768.
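
The GET_VECTOR_BY_STRUCT method returns the result as a structure based on the template, so the schematic fields travel with the embeddings. The following is a minimal sketch of reading the returned structure; any component names beyond the template fields shown above are assumptions, so check the TY_EMBEDDINGS_TEMPLATE definition in your system.

" Hypothetical sketch: read the schematic fields of the returned structure.
" ID, CONTENT, and SOURCE follow the template fields shown above.
WRITE: / |ID: { ls_embeddings-id }|,
       / |Source: { ls_embeddings-source }|.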

Generate image embeddings

To generate embeddings for an input image, you can use the GEN_IMAGE_EMBEDDINGS method of the /GOOG/CL_EMBEDDINGS_MODEL class. You can pass either the raw data of an image or the Cloud Storage URI of an image file. You can also optionally specify contextual text for the image and a dimension for the output embeddings.

DATA(ls_image) = VALUE /goog/cl_embeddings_model=>ty_image( gcs_uri = 'IMAGE_URI' ).
DATA(lt_embeddings) = lo_embeddings_model->gen_image_embeddings( iv_image           = ls_image
                                                                 iv_contextual_text = 'CONTEXTUAL_TEXT'
                                        )->get_vector( ).

Replace the following:

  • IMAGE_URI: The Cloud Storage URI of the target image to get embeddings for.
  • CONTEXTUAL_TEXT: Optional. Additional text that provides context and meaning for the image content to the embeddings model.
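
As mentioned above, you can also pass the raw data of an image instead of a Cloud Storage URI. The following is a hypothetical sketch: the field name BYTES_BASE64_ENCODED mirrors the underlying Vertex AI API and the variable lv_image_base64 is assumed to hold the base64-encoded image data, so verify the field name against the TY_IMAGE type definition shipped with the SDK.

" Hypothetical sketch: pass raw image data instead of a Cloud Storage URI.
" The field name BYTES_BASE64_ENCODED is an assumption based on the
" underlying Vertex AI API; lv_image_base64 holds base64-encoded image data.
DATA(ls_image) = VALUE /goog/cl_embeddings_model=>ty_image(
                         bytes_base64_encoded = lv_image_base64 ).
DATA(lt_embeddings) = lo_embeddings_model->gen_image_embeddings( iv_image = ls_image
                                        )->get_vector( ).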

You can also generate embeddings for images by using the out-of-the-box template /GOOG/CL_EMBEDDINGS_MODEL=>TY_EMBEDDINGS_TEMPLATE, which is shipped with the SDK. This template lets you capture enterprise-specific schematic information in the generated embeddings file along with the embeddings.

To generate embeddings for an image, based on the /GOOG/CL_EMBEDDINGS_MODEL=>TY_EMBEDDINGS_TEMPLATE template, you can use the GEN_IMAGE_EMBEDDINGS_BY_STRUCT method.

DATA(ls_image) = VALUE /goog/cl_embeddings_model=>ty_image( gcs_uri = 'IMAGE_URI' ).
DATA(ls_embedding_template) = VALUE /goog/cl_embeddings_model=>ty_embeddings_template(
                                      id      = ENTITY_ID
                                      content = INPUT_TEXT
                                      source  = SOURCE_MODULE ).
DATA(ls_embeddings) = lo_embeddings_model->gen_image_embeddings_by_struct(
                                             iv_image = ls_image
                                             is_input = ls_embedding_template
                                        )->get_vector_by_struct( ).

Replace the following:

  • IMAGE_URI: The Cloud Storage URI of the target image to get embeddings for.
  • ENTITY_ID: Entity ID for the embeddings record.
  • INPUT_TEXT: The text for which embeddings are to be generated.
  • SOURCE_MODULE: Source module of the embeddings content.

To retrieve embeddings for a contextual text, use the following code:

DATA(lt_context_embeddings) = lo_embeddings_model->get_context_text_vector( ).

This option is available only for single image embedding creation.

Generate video embeddings

To generate embeddings for an input video, you can use the GEN_VIDEO_EMBEDDINGS method of the /GOOG/CL_EMBEDDINGS_MODEL class. You can pass the Cloud Storage URI of a video file along with optional start and end offset times in seconds. You can also optionally specify contextual text for the video and a dimension for the output embeddings.

DATA(ls_video) = VALUE /goog/cl_embeddings_model=>ty_video( gcs_uri = 'VIDEO_URI' ).
DATA(lt_embeddings) = lo_embeddings_model->gen_video_embeddings( iv_video           = ls_video
                                                                 iv_contextual_text = 'CONTEXTUAL_TEXT'
                                                                 iv_dimension       = 'DIMENSION'
                                        )->get_vector( ).

Replace the following:

  • VIDEO_URI: The Cloud Storage URI of the target video to get embeddings for.
  • CONTEXTUAL_TEXT: Optional. Additional text that provides context and meaning for the video content to the embeddings model.
  • DIMENSION: Optional. The dimensionality of the output embeddings. The available dimensions are: 128, 256, 512, and 1408 (default).

The GET_VECTOR method returns the embeddings only for the first segment of the video.
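
To restrict the embeddings to a specific segment, you can pass the optional start and end offsets mentioned earlier. The following is a hypothetical sketch; the offset field names are assumptions, so verify them against the TY_VIDEO type definition shipped with the SDK.

" Hypothetical sketch: embed only one segment of the video.
" The offset field names are assumptions; check the TY_VIDEO definition.
DATA(ls_video) = VALUE /goog/cl_embeddings_model=>ty_video( gcs_uri          = 'VIDEO_URI'
                                                            start_offset_sec = 30
                                                            end_offset_sec   = 60 ).
DATA(lt_embeddings) = lo_embeddings_model->gen_video_embeddings( iv_video = ls_video
                                        )->get_vector( ).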

To retrieve the embedding for contextual text, use the following code:

DATA(lt_context_embeddings) = lo_embeddings_model->get_context_text_vector( ).

This option is available only for single video embedding creation.

Collect all generated embeddings

To collect all generated embeddings in an internal table of type /GOOG/CL_EMBEDDINGS_MODEL=>TY_T_EMBEDDINGS_TEMPLATE, you can use the COLLECT method of the /GOOG/CL_EMBEDDINGS_MODEL class in combination with the methods GEN_TEXT_EMBEDDINGS_BY_STRUCT and GEN_IMAGE_EMBEDDINGS_BY_STRUCT.

This is useful when you need to generate embeddings for an array of items (text or image) in a loop and then retrieve all of the embeddings at once in an internal table after the iteration. You can use the GET_VECTOR_BY_TABLE method to get the final internal table of embeddings.

LOOP AT ....
lo_embeddings_model->gen_text_embeddings_by_struct( is_input        = ls_embedding_template
                                                    is_addln_params = ls_addln_params
                  )->collect( ).

ENDLOOP.

DATA(lt_embeddings) = lo_embeddings_model->get_vector_by_table( ).
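
As a complete, hedged illustration of this pattern, the following sketch iterates over a hypothetical internal table lt_texts (with fields ID and CONTENT, both assumptions) and collects the embeddings for each row:

" Illustrative sketch: lt_texts, with fields ID and CONTENT, is hypothetical.
LOOP AT lt_texts INTO DATA(ls_text).
  DATA(ls_embedding_template) = VALUE /goog/cl_embeddings_model=>ty_embeddings_template(
                                        id      = ls_text-id
                                        content = ls_text-content
                                        source  = 'SOURCE_MODULE' ).
  lo_embeddings_model->gen_text_embeddings_by_struct( is_input = ls_embedding_template
                    )->collect( ).
ENDLOOP.

" After the loop, retrieve all collected embeddings at once.
DATA(lt_all_embeddings) = lo_embeddings_model->get_vector_by_table( ).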

Send embeddings to a datastore

You can send the generated embeddings to a Cloud Storage bucket or a BigQuery dataset by using the template that is shipped with the SDK.

Store embeddings in Cloud Storage

To send the generated embeddings to a Cloud Storage bucket, you can use the SEND_STRUCT_TO_GCS method of the /GOOG/CL_EMBEDDINGS_MODEL class.

Before sending embeddings to Cloud Storage, make sure that you have a Cloud Storage bucket that you want to send the embeddings to.

Send individual embeddings to a Cloud Storage bucket

The following code sample illustrates how to send individual image embeddings to a Cloud Storage bucket:

DATA(ls_image) = VALUE /goog/cl_embeddings_model=>ty_image( gcs_uri = 'IMAGE_URI' ).
lo_embeddings_model->gen_image_embeddings_by_struct( iv_image        = ls_image
                                                     is_input        = ls_embedding_template
                                                     is_addln_params = ls_addln_params
                  )->send_struct_to_gcs( iv_key         = 'CLIENT_KEY'
                                         iv_bucket_name = 'BUCKET_NAME'
                                         iv_file_name   = 'FILE_NAME' ).

Replace the following:

  • IMAGE_URI: The Cloud Storage URI of the target image to get embeddings for.
  • CLIENT_KEY: Client key for invoking the Cloud Storage API.
  • BUCKET_NAME: Target Cloud Storage bucket name.
  • FILE_NAME: Embeddings filename.

Send collected embeddings to a Cloud Storage bucket

The following code sample illustrates how to send collected embeddings to a Cloud Storage bucket:

LOOP AT ....
lo_embeddings_model->gen_text_embeddings_by_struct( is_input        = ls_embedding_template
                                                    is_addln_params = ls_addln_params
                  )->collect( ).

ENDLOOP.

lo_embeddings_model->send_struct_to_gcs( iv_key         = 'CLIENT_KEY'
                                         iv_bucket_name = 'BUCKET_NAME'
                                         iv_file_name   = 'FILE_NAME' ).

Replace the following:

  • CLIENT_KEY: Client key for invoking the Cloud Storage API.
  • BUCKET_NAME: Target Cloud Storage bucket name.
  • FILE_NAME: Embeddings filename.

Store embeddings in BigQuery

To send the generated embeddings to a BigQuery dataset, you can use the SEND_STRUCT_TO_BQ method of the /GOOG/CL_EMBEDDINGS_MODEL class.

Before sending the embeddings to BigQuery, make sure that you have a BigQuery dataset and a table that you want to send the embeddings to.

Send individual embeddings to a BigQuery dataset

The following code sample illustrates how to send individual image embeddings to a BigQuery dataset:

lo_embeddings_model->gen_image_embeddings_by_struct( iv_image        = ls_image
                                                     is_input        = ls_embedding_template
                                                     is_addln_params = ls_addln_params
                  )->send_struct_to_bq( iv_key        = 'CLIENT_KEY'
                                        iv_dataset_id = 'DATASET_ID'
                                        iv_table_id   = 'TABLE_ID' ).

Replace the following:

  • CLIENT_KEY: Client key for invoking the BigQuery API.
  • DATASET_ID: BigQuery dataset ID.
  • TABLE_ID: BigQuery table ID.

Send collected embeddings to a BigQuery dataset

The following code sample illustrates how to send collected embeddings to a BigQuery dataset:

LOOP AT ....
lo_embeddings_model->gen_text_embeddings_by_struct( is_input        = ls_embedding_template
                                                    is_addln_params = ls_addln_params
                  )->collect( ).

ENDLOOP.

lo_embeddings_model->send_struct_to_bq( iv_key        = 'CLIENT_KEY'
                                        iv_dataset_id = 'DATASET_ID'
                                        iv_table_id   = 'TABLE_ID' ).

Replace the following:

  • CLIENT_KEY: Client key for invoking the BigQuery API.
  • DATASET_ID: BigQuery dataset ID.
  • TABLE_ID: BigQuery table ID.

Publish embeddings to a Pub/Sub topic

To publish the generated embeddings to a Pub/Sub topic, you can use the SEND_STRUCT_TO_PUBSUB method of the /GOOG/CL_EMBEDDINGS_MODEL class. This can be useful for scenarios where you need to build your own custom pipelines for storing embeddings and building follow-on business processes.

Before publishing the embeddings to a Pub/Sub topic, make sure that you have a Pub/Sub topic that you want to publish the embeddings to.

Publish individual embeddings to a Pub/Sub topic

The following code sample illustrates how to publish individual image embeddings to a Pub/Sub topic:

lo_embeddings_model->gen_image_embeddings_by_struct( iv_image        = ls_image
                                                     is_input        = ls_embedding_template
                                                     is_addln_params = ls_addln_params
                  )->send_struct_to_pubsub( iv_key      = 'CLIENT_KEY'
                                            iv_topic_id = 'TOPIC_ID' ).

Replace the following:

  • CLIENT_KEY: Client key for invoking the Pub/Sub API.
  • TOPIC_ID: Pub/Sub topic ID.

Publish collected embeddings to a Pub/Sub topic

The following code sample illustrates how to publish collected embeddings to a Pub/Sub topic:

LOOP AT ....
lo_embeddings_model->gen_text_embeddings_by_struct( is_input        = ls_embedding_template
                                                    is_addln_params = ls_addln_params
                  )->collect( ).

ENDLOOP.

lo_embeddings_model->send_struct_to_pubsub( iv_key      = 'CLIENT_KEY'
                                            iv_topic_id = 'TOPIC_ID' ).

Replace the following:

  • CLIENT_KEY: Client key for invoking the Pub/Sub API.
  • TOPIC_ID: Pub/Sub topic ID.

What's next