Networking

If you're interested in Vertex AI Managed Training, contact your sales representative for access.

Managed Training is a managed Google Cloud service that is provisioned as a Compute Engine instance within your VPC. This deployment model allows the service to connect securely to other workloads within your VPC, Google-managed services, or multi-cloud networks.

Network MTU requirement

To achieve optimal network performance for the training infrastructure, you must configure the Maximum Transmission Unit (MTU) of your VPC network.

The recommended MTU value depends on the GPU machine type in your cluster:

  • For A3 Ultra and A4 nodes: Use an MTU of 8896.
  • For A3 Mega nodes: Use an MTU of 8244.
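
The mapping above can be captured in a small shell helper so that later commands use a consistent value. This is a minimal sketch: the function name `recommended_mtu` is hypothetical, and the machine type names (`a3-ultragpu-*`, `a4-highgpu-*`, `a3-megagpu-*`) are assumed to be the Compute Engine names for A3 Ultra, A4, and A3 Mega.

```shell
# Hypothetical helper: print the MTU recommended above for a GPU machine type.
# Assumption: the patterns below match the A3 Ultra, A4, and A3 Mega machine types.
recommended_mtu() {
  case "$1" in
    a3-ultragpu-*|a4-highgpu-*) echo 8896 ;;   # A3 Ultra and A4 nodes
    a3-megagpu-*)               echo 8244 ;;   # A3 Mega nodes
    *) echo "unknown machine type: $1" >&2; return 1 ;;
  esac
}

# Example use when creating the network (placeholders as elsewhere on this page):
#   gcloud compute networks create NETWORK --project=PROJECT_ID \
#     --subnet-mode=custom --mtu="$(recommended_mtu a3-megagpu-8g)"
```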

You may either create a new VPC or use an existing VPC.

Deploying Managed Training in a new VPC (recommended)

The recommended approach is to deploy the Managed Training cluster into a new, pre-configured VPC network. This ensures the correct MTU setting is applied automatically and avoids impacting existing workloads.

There are two main steps for deploying Managed Training in a new VPC:

  1. Create the VPC network: Create a new VPC network. To enable jumbo frames, set its MTU to 8896.

  2. Deploy the cluster: Deploy the Managed Training cluster into this newly configured network.

When you follow this order, the cluster's VM instances automatically inherit the correct MTU setting on their initial boot.
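
To confirm that a VM picked up the network's MTU, one option is to read the interface MTU from inside the guest. The following is a sketch that assumes a Linux guest exposing sysfs. The function name `iface_mtu` and the `SYSFS_NET_DIR` override are hypothetical, and the interface name on your VMs may differ from the `ens4` used in the example.

```shell
# Sketch: read an interface's MTU from inside a Linux VM via sysfs.
# Assumption: SYSFS_NET_DIR is a hypothetical override, useful only for testing.
SYSFS_NET_DIR="${SYSFS_NET_DIR:-/sys/class/net}"

iface_mtu() {
  # Print the MTU of interface $1, or fail if the interface is not present.
  local mtu_file="$SYSFS_NET_DIR/$1/mtu"
  [ -r "$mtu_file" ] || { echo "no such interface: $1" >&2; return 1; }
  cat "$mtu_file"
}

# Example (interface name is an assumption):
#   iface_mtu ens4
```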

Create and set up a new VPC

  1. Create the VPC network. To enable jumbo frames, set NETWORK_MTU to 8896.
        # create VPC network
        gcloud compute networks create NETWORK \
          --project=PROJECT_ID \
          --subnet-mode=custom \
          --mtu=NETWORK_MTU
        
  2. Create the subnet used to deploy the Managed Training cluster. Update the range to match your environment's requirements; this example uses 192.168.0.0/19 for the Managed Training deployment.
        # create VPC subnet
        gcloud compute networks subnets create SUBNETWORK \
          --project=PROJECT_ID \
          --network=NETWORK \
          --region=REGION \
          --enable-private-ip-google-access \
          --range=192.168.0.0/19
        
  3. Create an IAP firewall rule that allows SSH connectivity to the Managed Training cluster.
        # create IAP firewall rule for SSH
        gcloud compute firewall-rules create allow-ssh-ingress-from-iap \
          --project=PROJECT_ID --network=NETWORK \
          --direction=INGRESS --action=ALLOW --rules=tcp:22 \
          --source-ranges=35.235.240.0/20
        
  4. Create an ingress firewall rule that allows TCP, UDP, and ICMP traffic on all ports from within the Managed Training cluster subnet. If you changed the subnet range in step 2, use the same range for --source-ranges.
        # create internal ingress firewall rule
        gcloud compute firewall-rules create allow-internal \
          --project=PROJECT_ID --network=NETWORK \
          --direction=INGRESS --priority=1000 \
          --action=ALLOW --rules=tcp:1-65535,udp:1-65535,icmp \
          --source-ranges=192.168.0.0/19 --enable-logging
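
Before deploying the cluster into the new network, it can be worth confirming that the network reports the MTU you intended. A minimal sketch, assuming an authenticated gcloud CLI; the function name `check_network_mtu` is hypothetical.

```shell
# Sketch: compare a VPC network's MTU with the expected value.
# Usage: check_network_mtu NETWORK PROJECT_ID EXPECTED_MTU
check_network_mtu() {
  local network="$1" project="$2" expected="$3" actual
  # Read the network's mtu field; return 2 if the describe call itself fails.
  actual="$(gcloud compute networks describe "$network" \
    --project="$project" --format='value(mtu)')" || return 2
  if [ "$actual" = "$expected" ]; then
    echo "OK: $network has MTU $actual"
  else
    echo "MISMATCH: $network has MTU ${actual:-unset}, expected $expected" >&2
    return 1
  fi
}

# Example:
#   check_network_mtu NETWORK PROJECT_ID 8896
```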
       

Deploying Managed Training in an existing VPC

If you are deploying the Managed Training cluster into an existing network with Cloud Storage instances, it's highly recommended that you use jumbo frames (MTU 8896) to ensure optimal performance. Before you begin, verify that the operating systems and applications on your existing VMs can support this change.

Implementing jumbo frames requires updating your VPC's MTU, which must be done during a planned maintenance window to prevent network instability.

The only safe procedure is to first stop all running VM instances in that network. Changing the MTU while VMs are active results in mismatched settings and unreliable connectivity.

Once all VMs are stopped, you can proceed with these steps:

  1. Change the network's MTU to your selected setting (for example, 8896).
  2. Restart all the VMs after the network update is complete.
  3. Manually update the MTU inside the guest OS where needed. A restart alone isn't enough for every operating system: VMs created from public Linux images adopt the new MTU automatically, but you must manually update the MTU setting inside the OS for all Windows VMs and for any custom-image VMs that don't use DHCP for MTU configuration.
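
The stop, update, and restart sequence above could be scripted roughly as follows. This is a sketch, not a supported procedure: the function name `change_mtu_with_restart` and the instance list file (one "NAME ZONE" pair per line, covering every VM in the network) are hypothetical, and the script does not handle the manual in-guest update from step 3.

```shell
# Sketch: stop the listed VMs, change the VPC MTU, then start the VMs again.
# Usage: change_mtu_with_restart NETWORK PROJECT_ID MTU INSTANCES_FILE
# Assumption: INSTANCES_FILE is a hypothetical file with one "NAME ZONE" pair per line.
change_mtu_with_restart() {
  local network="$1" project="$2" mtu="$3" instances_file="$4" name zone

  # Stop every listed VM. The list is read on fd 3 so gcloud cannot consume it from stdin.
  while read -r name zone <&3; do
    [ -n "$name" ] || continue
    gcloud compute instances stop "$name" --zone="$zone" --project="$project" || return 1
  done 3< "$instances_file"

  # Step 1: change the network's MTU while no VM is running.
  gcloud compute networks update "$network" --project="$project" --mtu="$mtu" || return 1

  # Step 2: start the VMs again so they pick up the new MTU.
  while read -r name zone <&3; do
    [ -n "$name" ] || continue
    gcloud compute instances start "$name" --zone="$zone" --project="$project" || return 1
  done 3< "$instances_file"
}
```

Because the function returns on the first failing command, a failed stop leaves the MTU unchanged, which keeps the network out of the mixed state described above.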

Further requirements:

  • Enable Private Google Access in the subnet used to deploy the cluster.
  • Create an ingress firewall rule to grant IAP access to the cluster.
  • Create an ingress firewall rule to permit all traffic to the cluster.
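
For the first requirement, an existing subnet can be checked and updated in place. A minimal sketch, assuming an authenticated gcloud CLI and that the `value(privateIpGoogleAccess)` format prints `True` when the setting is on; the function name `ensure_private_google_access` is hypothetical. The two firewall rules can be created with the same commands shown for a new VPC.

```shell
# Sketch: enable Private Google Access on an existing subnet if it is not already on.
# Usage: ensure_private_google_access SUBNETWORK REGION PROJECT_ID
ensure_private_google_access() {
  local subnet="$1" region="$2" project="$3" enabled
  # Read the subnet's current setting; fail if the subnet cannot be described.
  enabled="$(gcloud compute networks subnets describe "$subnet" \
    --region="$region" --project="$project" \
    --format='value(privateIpGoogleAccess)')" || return 1
  case "$enabled" in
    [Tt]rue)
      echo "Private Google Access already enabled on $subnet"
      ;;
    *)
      gcloud compute networks subnets update "$subnet" \
        --region="$region" --project="$project" \
        --enable-private-ip-google-access
      ;;
  esac
}
```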