The information in this page applies to custom-trained models and AutoML models. For Model Garden deployment, see Use models in Model Garden.
Private Service Connect lets you deploy your custom-trained Vertex AI model and serve online inferences securely to multiple consumer projects and VPC networks without the need for public IP addresses, public internet access, or an explicitly peered internal IP address range.
We recommend Private Service Connect for online inference use cases that have the following requirements:
- Require private and secure connections
- Require low latency
- Don't need to be publicly accessible
Private Service Connect uses a forwarding rule in your VPC network to send traffic unidirectionally to the Vertex AI online inference service. The forwarding rule connects to a service attachment that exposes the Vertex AI service to your VPC network. For more information, see About accessing Vertex AI services through Private Service Connect. To learn more about setting up Private Service Connect, see the Private Service Connect overview in the Virtual Private Cloud (VPC) documentation.
Dedicated private endpoints support both HTTP and gRPC communication protocols. For gRPC requests, the x-vertex-ai-endpoint-id header must be included for proper endpoint identification. The following APIs are supported:
- Predict
- RawPredict
- StreamRawPredict
- Chat Completion (Model Garden only)
You can send online inference requests to a dedicated private endpoint by using the Vertex AI SDK for Python. For details, see Get online inferences.
Required roles
To get the permission that you need to create a Private Service Connect endpoint, ask your administrator to grant you the Vertex AI User (roles/aiplatform.user) IAM role on your project. This predefined role contains the aiplatform.endpoints.create permission, which is required to create a Private Service Connect endpoint.
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get this permission with custom roles or other predefined roles.
For more information about Vertex AI roles and permissions, see Vertex AI access control with IAM and Vertex AI IAM permissions.
Create the online inference endpoint
Use one of the following methods to create an online inference endpoint with Private Service Connect enabled.
The default request timeout for a Private Service Connect endpoint is 10 minutes. In the Vertex AI SDK for Python, you can optionally set a different request timeout by passing a new inference_timeout value, as shown in the following example. The maximum timeout value is 3600 seconds (1 hour).
Console
- In the Google Cloud console, in Vertex AI, go to the Online prediction page. 
- Click Create. 
- Provide a display name for the endpoint. 
- Select Private. 
- Select Private Service Connect. 
- Click Select project IDs. 
- Select projects to add to the allowlist for the endpoint. 
- Click Continue. 
- Choose your model specifications. For more information, see Deploy a model to an endpoint. 
- Click Create to create your endpoint and deploy your model to it. 
- Make a note of the endpoint ID in the response. 
API
REST
Before using any of the request data, make the following replacements:
- VERTEX_AI_PROJECT_ID: the ID of the Google Cloud project where you're creating the online prediction endpoint.
- REGION: the region where you're using Vertex AI.
- VERTEX_AI_ENDPOINT_NAME: the display name for the online prediction endpoint.
- ALLOWED_PROJECTS: a comma-separated list of Google Cloud project IDs, each enclosed in quotation marks, for example, ["PROJECTID1", "PROJECTID2"]. If a project isn't in this list, you can't send prediction requests to the Vertex AI endpoint from it. Make sure to include VERTEX_AI_PROJECT_ID in this list so that you can call the endpoint from the same project it's in.
- INFERENCE_TIMEOUT_SECS: (Optional) the number of seconds for the optional inferenceTimeout field.
HTTP method and URL:
POST https://REGION-aiplatform.googleapis.com/v1/projects/VERTEX_AI_PROJECT_ID/locations/REGION/endpoints
Request JSON body:
{
  "displayName": "VERTEX_AI_ENDPOINT_NAME",
  "privateServiceConnectConfig": {
    "enablePrivateServiceConnect": true,
    "projectAllowlist": ["ALLOWED_PROJECTS"],
    "clientConnectionConfig": {
      "inferenceTimeout": {
        "seconds": INFERENCE_TIMEOUT_SECS
      }
    }
  }
}
Send the request using a tool such as curl.
You should receive a JSON response similar to the following:
{
  "name": "projects/VERTEX_AI_PROJECT_NUMBER/locations/REGION/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateEndpointOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-11-05T17:45:42.812656Z",
      "updateTime": "2020-11-05T17:45:42.812656Z"
    }
  }
}
Make a note of the ENDPOINT_ID in the returned operation name.
Python
Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Replace the following:
- VERTEX_AI_PROJECT_ID: the ID of the Google Cloud project where you're creating the online inference endpoint
- REGION: the region where you're using Vertex AI
- VERTEX_AI_ENDPOINT_NAME: the display name for the online inference endpoint
- ALLOWED_PROJECTS: a comma-separated list of Google Cloud project IDs, each enclosed in quotation marks, for example, ["PROJECTID1", "PROJECTID2"]. If a project isn't in this list, you can't send inference requests to the Vertex AI endpoint from it. Make sure to include VERTEX_AI_PROJECT_ID in this list so that you can call the endpoint from the same project it's in.
- INFERENCE_TIMEOUT_SECS: (Optional) the number of seconds for the optional inference_timeout value.
PROJECT_ID = "VERTEX_AI_PROJECT_ID"
REGION = "REGION"
VERTEX_AI_ENDPOINT_NAME = "VERTEX_AI_ENDPOINT_NAME"
INFERENCE_TIMEOUT_SECS = INFERENCE_TIMEOUT_SECS  # optional; an integer number of seconds

from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

# Create the PSC-enabled endpoint in the Vertex AI project.
psc_endpoint = aiplatform.PrivateEndpoint.create(
    display_name=VERTEX_AI_ENDPOINT_NAME,
    project=PROJECT_ID,
    location=REGION,
    private_service_connect_config=aiplatform.PrivateEndpoint.PrivateServiceConnectConfig(
        project_allowlist=["ALLOWED_PROJECTS"],
    ),
    inference_timeout=INFERENCE_TIMEOUT_SECS,
)
Make a note of the ENDPOINT_ID at the end of the returned
endpoint URI:
INFO:google.cloud.aiplatform.models:To use this PrivateEndpoint in another session:
INFO:google.cloud.aiplatform.models:endpoint = aiplatform.PrivateEndpoint('projects/VERTEX_AI_PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID')
Create the online inference endpoint with PSC automation (Preview)
Online inference integrates with service connectivity automation, which lets you configure inference endpoints with PSC automation. This simplifies the process by automatically creating PSC endpoints, and is particularly beneficial for ML developers who lack permissions to create network resources such as forwarding rules within a project.
To get started, your network administrator must establish a service connection
policy. This policy is a one-time
configuration per project and network that lets
Vertex AI (service class gcp-vertexai) generate PSC endpoints
within your projects and networks.
Next, you can create endpoints using the PSC automation configuration and then deploy your models. After deployment completes, the relevant PSC endpoint information is available on the endpoint resource.
Limitations
- VPC Service Controls aren't supported.
- A regional limit of 500 endpoints applies to PSC automation configurations.
- PSC automation results are purged when no model is deployed, or being deployed, to the endpoint. After cleanup, any subsequent model deployment produces new automation results with different IP addresses and forwarding rules.
Create a service connection policy
You must be a network administrator to create the
service connection policy.
A service connection policy is required to let Vertex AI
create PSC endpoints in your networks. Without a valid policy, the automation
fails with a CONNECTION_POLICY_MISSING error.
- Create your service connection policy:
  gcloud network-connectivity service-connection-policies create POLICY_NAME \
      --project=VPC_PROJECT \
      --network=projects/PROJECT_ID/global/networks/NETWORK_NAME \
      --service-class=gcp-vertexai \
      --region=REGION \
      --subnets=PSC_SUBNETS
  Replace the following:
  - POLICY_NAME: a user-specified name for the policy.
  - PROJECT_ID: the ID of the service project where you're creating Vertex AI resources.
  - VPC_PROJECT: the project ID where your client VPC is located. For a single-VPC setup, this is the same as PROJECT_ID. For a Shared VPC setup, this is the VPC host project.
  - NETWORK_NAME: the name of the network to deploy to.
  - REGION: the network's region.
  - PSC_SUBNETS: the Private Service Connect subnets to use.
- View your service connection policy:
  gcloud network-connectivity service-connection-policies list \
      --project=VPC_PROJECT \
      --region=REGION
  For a single-VPC setup, a sample policy looks like the following:
  gcloud network-connectivity service-connection-policies create test-policy \
      --network=default \
      --project=YOUR_PROJECT_ID \
      --region=us-central1 \
      --service-class=gcp-vertexai \
      --subnets=default \
      --psc-connection-limit=500 \
      --description=test
Create the online inference endpoint with PSC automation config
In the PSCAutomationConfig, make sure that the projectId is included in the project allowlist.
REST
Before using any of the request data, make the following replacements:
- REGION: the region where you're using Vertex AI.
- VERTEX_AI_PROJECT_ID: the ID of the Google Cloud project where you're creating the online inference endpoint.
- VERTEX_AI_ENDPOINT_NAME: the display name for the online inference endpoint.
- NETWORK_NAME: the name of the network where the PSC endpoint is created. In the network's full resource name, use the project ID, not the project number.
HTTP method and URL:
POST https://REGION-aiplatform.googleapis.com/v1/projects/VERTEX_AI_PROJECT_ID/locations/REGION/endpoints
Request JSON body:
{
  "displayName": "VERTEX_AI_ENDPOINT_NAME",
  "privateServiceConnectConfig": {
    "enablePrivateServiceConnect": true,
    "projectAllowlist": ["VERTEX_AI_PROJECT_ID"],
    "pscAutomationConfigs": [
      {
        "projectId": "VERTEX_AI_PROJECT_ID",
        "network": "projects/VERTEX_AI_PROJECT_ID/global/networks/NETWORK_NAME"
      }
    ]
  }
}
Send the request using a tool such as curl.
You should receive a JSON response similar to the following:
{
  "name": "projects/VERTEX_AI_PROJECT_NUMBER/locations/REGION/endpoints/ENDPOINT_ID/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateEndpointOperationMetadata",
    "genericMetadata": {
      "createTime": "2020-11-05T17:45:42.812656Z",
      "updateTime": "2020-11-05T17:45:42.812656Z"
    }
  }
}
Make a note of the ENDPOINT_ID in the returned operation name.
Python
Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Replace the following:
- VERTEX_AI_PROJECT_ID: the ID of the Google Cloud project where you're creating the online inference endpoint
- REGION: the region where you're using Vertex AI
- VERTEX_AI_ENDPOINT_NAME: the display name for the online inference endpoint
- NETWORK_NAME: the name of the network where the PSC endpoint is created. In the network's full resource name, use the project ID, not the project number.
PROJECT_ID = "VERTEX_AI_PROJECT_ID"
REGION = "REGION"
VERTEX_AI_ENDPOINT_NAME = "VERTEX_AI_ENDPOINT_NAME"
NETWORK_NAME = "NETWORK_NAME"

from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

# Configure PSC automation for the project and network pair.
config = aiplatform.compat.types.service_networking.PrivateServiceConnectConfig(
    enable_private_service_connect=True,
    project_allowlist=[PROJECT_ID],
    psc_automation_configs=[
        aiplatform.compat.types.service_networking.PSCAutomationConfig(
            project_id=PROJECT_ID,
            network=f"projects/{PROJECT_ID}/global/networks/{NETWORK_NAME}",
        )
    ],
)

psc_endpoint = aiplatform.PrivateEndpoint.create(
    display_name=VERTEX_AI_ENDPOINT_NAME,
    private_service_connect_config=config,
)
Deploy the model
After you create your online inference endpoint with Private Service Connect enabled, deploy your model to it, following the steps outlined in Deploy a model to an endpoint.
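If you use the Vertex AI SDK for Python, deploying to a PSC-enabled endpoint looks the same as deploying to any other endpoint. The following is a minimal sketch; the model resource name and machine type are placeholder assumptions, not values from this page:
from google.cloud import aiplatform

aiplatform.init(project="VERTEX_AI_PROJECT_ID", location="REGION")

# MODEL_ID and ENDPOINT_ID are placeholders for your uploaded model and
# the PSC-enabled endpoint that you created earlier.
model = aiplatform.Model("projects/VERTEX_AI_PROJECT_ID/locations/REGION/models/MODEL_ID")
psc_endpoint = aiplatform.PrivateEndpoint("ENDPOINT_ID")

# machine_type is an example value; choose one that fits your model.
model.deploy(
    endpoint=psc_endpoint,
    machine_type="n1-standard-4",
)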
Create a PSC endpoint manually
Get the service attachment URI
When you deploy your model, a service attachment is created for the online
inference endpoint. This service attachment represents the
Vertex AI online inference service that's being exposed to your
VPC network. Run the
gcloud ai endpoints describe command
to get the service attachment URI.
- List only the serviceAttachment value from the endpoint details:
  gcloud ai endpoints describe ENDPOINT_ID \
      --project=VERTEX_AI_PROJECT_ID \
      --region=REGION \
      | grep -i serviceAttachment
  Replace the following:
  - ENDPOINT_ID: the ID of your online inference endpoint
  - VERTEX_AI_PROJECT_ID: the ID of the Google Cloud project where you created your online inference endpoint
  - REGION: the region for this request
  The output is similar to the following:
  serviceAttachment: projects/ac74a9f84c2e5f2a1-tp/regions/us-central1/serviceAttachments/gkedpm-c6e6a854a634dc99472bb802f503c1
- Make a note of the entire string in the serviceAttachment field. This is the service attachment URI.
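If you prefer the Vertex AI SDK for Python, the service attachment URI should also be readable from the endpoint resource. The following is a minimal sketch; it assumes that each deployed model's private_endpoints field exposes a service_attachment value, which you should verify against your SDK version:
from google.cloud import aiplatform

aiplatform.init(project="VERTEX_AI_PROJECT_ID", location="REGION")
endpoint = aiplatform.PrivateEndpoint("ENDPOINT_ID")

# Print the PSC service attachment URI for each deployed model.
for deployed_model in endpoint.gca_resource.deployed_models:
    print(deployed_model.private_endpoints.service_attachment)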
Create a forwarding rule
You can reserve an internal IP address and create a forwarding rule with that address. To create the forwarding rule, you need the service attachment URI from the previous step.
- To reserve an internal IP address for the forwarding rule, use the gcloud compute addresses create command:
  gcloud compute addresses create ADDRESS_NAME \
      --project=VPC_PROJECT_ID \
      --region=REGION \
      --subnet=SUBNETWORK \
      --addresses=INTERNAL_IP_ADDRESS
  Replace the following:
  - ADDRESS_NAME: a name for the internal IP address
  - VPC_PROJECT_ID: the ID of the Google Cloud project that hosts your VPC network. If your online inference endpoint and your Private Service Connect forwarding rule are hosted in the same project, use VERTEX_AI_PROJECT_ID for this parameter.
  - REGION: the Google Cloud region where the Private Service Connect forwarding rule is to be created
  - SUBNETWORK: the name of the VPC subnet that contains the IP address
  - INTERNAL_IP_ADDRESS: the internal IP address to reserve. This parameter is optional.
    - If this parameter is specified, the IP address must be within the subnet's primary IP address range. The IP address can be an RFC 1918 address or an address in a subnet with non-RFC ranges.
    - If this parameter is omitted, an internal IP address is allocated automatically.
    - For more information, see Reserve a new static internal IPv4 or IPv6 address.
- To verify that the IP address is reserved, use the gcloud compute addresses list command:
  gcloud compute addresses list --filter="name=(ADDRESS_NAME)" \
      --project=VPC_PROJECT_ID
  In the response, verify that a RESERVED status appears for the IP address.
- To create the forwarding rule and point it to the online inference service attachment, use the gcloud compute forwarding-rules create command:
  gcloud compute forwarding-rules create PSC_FORWARDING_RULE_NAME \
      --address=ADDRESS_NAME \
      --project=VPC_PROJECT_ID \
      --region=REGION \
      --network=VPC_NETWORK_NAME \
      --target-service-attachment=SERVICE_ATTACHMENT_URI
  Replace the following:
  - PSC_FORWARDING_RULE_NAME: a name for the forwarding rule
  - VPC_NETWORK_NAME: the name of the VPC network where the endpoint is to be created
  - SERVICE_ATTACHMENT_URI: the service attachment that you made a note of earlier
- To verify that the service attachment accepts the endpoint, use the gcloud compute forwarding-rules describe command:
  gcloud compute forwarding-rules describe PSC_FORWARDING_RULE_NAME \
      --project=VPC_PROJECT_ID \
      --region=REGION
  In the response, verify that an ACCEPTED status appears in the pscConnectionStatus field.
Optional: Get the internal IP address
If you didn't specify a value for INTERNAL_IP_ADDRESS when you
created the forwarding rule, you can get the address
that was allocated automatically by using the
gcloud compute forwarding-rules describe command:
gcloud compute forwarding-rules describe PSC_FORWARDING_RULE_NAME \
--project=VERTEX_AI_PROJECT_ID \
--region=REGION \
| grep -i IPAddress
Replace the following:
- VERTEX_AI_PROJECT_ID: your project ID
- REGION: the region name for this request
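As an alternative to gcloud, you can read the forwarding rule with the google-cloud-compute client library. The following is a minimal sketch; the I_p_address attribute name follows that library's generated naming and is an assumption to verify against your installed version:
from google.cloud import compute_v1

# Fetch the forwarding rule and print its allocated internal IP address.
rule = compute_v1.ForwardingRulesClient().get(
    project="VERTEX_AI_PROJECT_ID",
    region="REGION",
    forwarding_rule="PSC_FORWARDING_RULE_NAME",
)
print(rule.I_p_address)  # assumed attribute name for the rule's IP address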
Optional: Get PSC endpoint from PSC automation result
You can get the generated IP address and forwarding rule from the inference endpoint. Here's an example:
"privateServiceConnectConfig": {
  "enablePrivateServiceConnect": true,
  "projectAllowlist": [
    "your-project-id",
  ],
  "pscAutomationConfigs": [
    {
      "projectId": "your-project-id",
      "network": "projects/your-project-id/global/networks/default",
      "ipAddress": "10.128.15.209",
      "forwardingRule": "https://www.googleapis.com/compute/v1/projects/your-project-id/regions/us-central1/forwardingRules/sca-auto-fr-47b0d6a4-eaff-444b-95e6-e4dc1d10101e",
      "state": "PSC_AUTOMATION_STATE_SUCCESSFUL"
    },
  ]
}
Here are some error handling details:
- Automation failure doesn't affect the outcome of the model deployment.
  - The success or failure of the automation is indicated in the state field.
  - If successful, the IP address and forwarding rule are displayed.
  - If unsuccessful, an error message is displayed.
- Automation configurations are removed when no models are deployed, or being deployed, to the endpoint. This results in a change to the IP address and forwarding rule if a model is deployed later.
- Failed automation doesn't recover on its own. If automation fails, you can still create the PSC endpoint manually; see Create a PSC endpoint manually.
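To read the automation results programmatically, you can inspect the endpoint resource with the Vertex AI SDK for Python. The following is a minimal sketch, assuming psc_automation_configs is populated on the endpoint's privateServiceConnectConfig as in the JSON above:
from google.cloud import aiplatform

aiplatform.init(project="VERTEX_AI_PROJECT_ID", location="REGION")
endpoint = aiplatform.PrivateEndpoint("ENDPOINT_ID")

# Print the automation result for each configured project/network pair.
psc_config = endpoint.gca_resource.private_service_connect_config
for automation in psc_config.psc_automation_configs:
    print(automation.ip_address, automation.forwarding_rule, automation.state)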
Get online inferences
Getting online inferences from an endpoint with Private Service Connect is similar to getting online inferences from public endpoints, except for the following considerations:
- The request must be sent from a project that was specified in the projectAllowlist when the online inference endpoint was created.
- If global access isn't enabled, the request must be sent from the same region.
- Two ports are open: port 443, which uses TLS with a self-signed certificate, and port 80, which doesn't use TLS. Both ports support HTTP and gRPC. All traffic stays on your private network and doesn't traverse the public internet.
- To obtain inferences, a connection must be established using the endpoint's static IP address, unless a DNS record is created for the internal IP address. For example, send predict requests to the following endpoint:
  https://INTERNAL_IP_ADDRESS/v1/projects/VERTEX_AI_PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID:predict
  Replace INTERNAL_IP_ADDRESS with the internal IP address that you reserved earlier.
- For gRPC requests: to ensure proper endpoint identification, include the x-vertex-ai-endpoint-id header. This is required because endpoint information isn't conveyed in the request path for gRPC communication.
- Secure connections (port 443): when you establish a secure connection over port 443, the server uses a self-signed certificate. To proceed with the connection, use one of the following approaches:
  - Option 1: certificate validation bypass. Configure the client to ignore certificate validation, and establish the connection using either the IP address of the server or a preferred DNS resolution method.
  - Option 2: trust store integration. Obtain the server's self-signed certificate, add it to the local trust store of the client system, and use a DNS name in the format *.prediction.p.vertexai.goog to establish the connection. This method ensures secure communication through certificate validation; a request sketch that validates against the downloaded certificate follows this list. You can write the server's certificate to PSC_CERTIFICATE_FILE.pem with the following command:
    openssl s_client -showcerts -connect INTERNAL_IP_ADDRESS:443 \
        -servername *.prediction.p.vertexai.goog \
        </dev/null 2>/dev/null | sed -n \
        '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/p' > PSC_CERTIFICATE_FILE.pem
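For Option 2, the following is a minimal sketch using the requests library. It assumes that you saved the certificate as PSC_CERTIFICATE_FILE.pem, that a DNS record resolves DNS_NAME to the internal IP address (see the optional DNS section later on this page), and that the example instance matches your model's input schema:
import requests

url = "https://DNS_NAME/v1/projects/VERTEX_AI_PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID:predict"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer ACCESS_TOKEN",  # replace with a valid access token
}
payload = {"instances": [{"feature_1": 1.0}]}  # example instance; replace with your input

# Validate the server against the downloaded self-signed certificate
# instead of disabling TLS verification.
response = requests.post(url, headers=headers, json=payload, verify="PSC_CERTIFICATE_FILE.pem")
print(response.json())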
The following sections provide examples of how you can send the predict request using Python.
First example
from google.cloud import aiplatform
import json
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

psc_endpoint = aiplatform.PrivateEndpoint("projects/VERTEX_AI_PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID")

REQUEST_FILE = "PATH_TO_INPUT_FILE"
with open(REQUEST_FILE) as json_file:
    data = json.load(json_file)
    # endpoint_override routes the call to the internal IP address.
    response = psc_endpoint.predict(
        instances=data["instances"], endpoint_override=INTERNAL_IP_ADDRESS
    )
print(response)
Replace PATH_TO_INPUT_FILE with a path to a JSON file
containing the request input.
Second example
import json
import requests
import urllib3
import google.auth
import google.auth.transport.requests

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

REQUEST_FILE = "PATH_TO_INPUT_FILE"

# Programmatically get credentials and generate an access token.
# Note: the access token lives for 1 hour by default; after expiration,
# it must be refreshed. See
# https://cloud.google.com/docs/authentication/token-types#access-tokens
# for token lifetimes.
creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)
access_token = creds.token

with open(REQUEST_FILE) as json_file:
    data = json.load(json_file)

url = "https://INTERNAL_IP_ADDRESS/v1/projects/VERTEX_AI_PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID:predict"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {access_token}",  # Add the access token to the headers
}
payload = {
    "instances": data["instances"],
}

response = requests.post(url, headers=headers, json=payload, verify=False)
print(response.json())
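Third example
The Predict gRPC method is also reachable through the forwarding rule. The following is a minimal sketch using the aiplatform_v1 gRPC transport; the required part is the x-vertex-ai-endpoint-id metadata header, while the plaintext port 80 channel, the authorization metadata, and the example instance are assumptions for illustration:
import grpc
import google.auth
import google.auth.transport.requests
from google.cloud import aiplatform_v1
from google.cloud.aiplatform_v1.services.prediction_service.transports import (
    PredictionServiceGrpcTransport,
)
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

# Obtain an access token; a plain gRPC channel doesn't attach credentials
# automatically, so pass the token as authorization metadata.
creds, _ = google.auth.default()
creds.refresh(google.auth.transport.requests.Request())

# Port 80 carries gRPC without TLS; port 443 works with the self-signed certificate.
channel = grpc.insecure_channel("INTERNAL_IP_ADDRESS:80")
client = aiplatform_v1.PredictionServiceClient(
    transport=PredictionServiceGrpcTransport(channel=channel)
)

endpoint_name = "projects/VERTEX_AI_PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID"
instance = json_format.ParseDict({"feature_1": 1.0}, Value())  # example input

response = client.predict(
    endpoint=endpoint_name,
    instances=[instance],
    metadata=[
        ("x-vertex-ai-endpoint-id", "ENDPOINT_ID"),  # required for gRPC routing
        ("authorization", f"Bearer {creds.token}"),
    ],
)
print(response)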
Optional: Create a DNS record for the internal IP address
We recommend that you create a DNS record so that you can get online inferences from your endpoint without needing to specify the internal IP address.
For more information, see Other ways to configure DNS.
- Create a private DNS zone by using the gcloud dns managed-zones create command. This zone is associated with the VPC network that the forwarding rule was created in.
  DNS_NAME_SUFFIX="prediction.p.vertexai.goog."  # DNS names have "." at the end.
  gcloud dns managed-zones create ZONE_NAME \
      --project=VPC_PROJECT_ID \
      --dns-name=$DNS_NAME_SUFFIX \
      --networks=VPC_NETWORK_NAME \
      --visibility=private \
      --description="A DNS zone for Vertex AI endpoints using Private Service Connect."
  Replace ZONE_NAME with the name of the DNS zone.
- To create a DNS record in the zone, use the gcloud dns record-sets create command:
  DNS_NAME=ENDPOINT_ID-REGION-VERTEX_AI_PROJECT_NUMBER.$DNS_NAME_SUFFIX
  gcloud dns record-sets create $DNS_NAME \
      --rrdatas=INTERNAL_IP_ADDRESS \
      --zone=ZONE_NAME \
      --type=A \
      --ttl=60 \
      --project=VPC_PROJECT_ID
  Replace the following:
  - VERTEX_AI_PROJECT_NUMBER: the project number for your VERTEX_AI_PROJECT_ID project. You can locate this project number in the Google Cloud console. For more information, see Identifying projects.
  - INTERNAL_IP_ADDRESS: the internal IP address of your online inference endpoint
  Now you can send your predict requests to:
  https://ENDPOINT_ID-REGION-VERTEX_AI_PROJECT_NUMBER.prediction.p.vertexai.goog/v1/projects/VERTEX_AI_PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID:predict
The following is an example of how you can send the predict request to the DNS zone using Python:
from google.cloud import aiplatform
import json
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

psc_endpoint = aiplatform.PrivateEndpoint("projects/VERTEX_AI_PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID")

REQUEST_FILE = "PATH_TO_INPUT_FILE"
with open(REQUEST_FILE) as json_file:
    data = json.load(json_file)
    # endpoint_override routes the call to the DNS name instead of the IP address.
    response = psc_endpoint.predict(
        instances=data["instances"], endpoint_override=DNS_NAME
    )
print(response)
Replace DNS_NAME with the DNS name that you specified in the
gcloud dns record-sets create command.
Limitations
Vertex AI endpoints with Private Service Connect are subject to the following limitations:
- Deployment of tuned Gemini models isn't supported.
- Private egress from within the endpoint isn't supported. Because Private Service Connect forwarding rules are unidirectional, other private Google Cloud workloads aren't accessible from inside your container.
- An endpoint's projectAllowlist value can't be changed.
- Vertex Explainable AI isn't supported.
- Before you delete an endpoint, you must undeploy your model from that endpoint.
- If all models are undeployed for more than 10
minutes, the service attachment might be deleted. Check the
Private Service Connect connection status;
if it's CLOSED, recreate the forwarding rule.
- After you've deleted your endpoint, you won't be able to reuse that endpoint name for up to 7 days.
- A project can have up to 10 different projectAllowlist values in its Private Service Connect configurations.