This document explains the Virtual Private Cloud network requirements for Dataproc Serverless for Spark batch workloads and interactive sessions.
Virtual Private Cloud subnetwork requirements
Private Google Access
Dataproc Serverless batch workloads and interactive sessions run on VMs with internal IP addresses only, on a regional subnet with Private Google Access (PGA) automatically enabled.
If you don't specify a subnet, Dataproc Serverless selects the default subnet in the batch workload or session region.
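If you want to choose the subnet yourself rather than rely on the default, you can pass it when submitting the workload. A minimal sketch; the bucket path, PROJECT_ID, REGION, and SUBNET_NAME are placeholders, not values from this document:

```shell
# Submit a PySpark batch on a specific VPC subnet instead of the
# region's default subnet. Replace the placeholders with your own
# project, region, subnet, and job file.
gcloud dataproc batches submit pyspark gs://my-bucket/job.py \
    --project=PROJECT_ID \
    --region=REGION \
    --subnet=SUBNET_NAME
```

The subnet must be in the same region you submit the batch to, and it must meet the open-subnet-connectivity requirement described below.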
If your workload requires external network or internet access, for example to download resources such as ML models from PyTorch Hub or Hugging Face, you can set up Cloud NAT to allow outbound traffic using internal IPs on your VPC network.
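As a sketch of that Cloud NAT setup, the following commands create a Cloud Router and a NAT gateway covering all subnet ranges in the network. NETWORK_NAME, REGION, and the router and NAT names are placeholders; adjust them for your environment:

```shell
# Create a Cloud Router in the VPC network and region where the
# batch workload or session runs.
gcloud compute routers create nat-router \
    --network=NETWORK_NAME \
    --region=REGION

# Create a Cloud NAT gateway on that router so that VMs with only
# internal IP addresses can reach the internet for outbound traffic.
gcloud compute routers nats create nat-config \
    --router=nat-router \
    --region=REGION \
    --auto-allocate-nat-external-ips \
    --nat-all-subnet-ip-ranges
```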
Open subnet connectivity
The VPC subnet for the region selected for the Dataproc Serverless batch workload or interactive session must allow internal subnet communication on all ports between VM instances.
The following Google Cloud CLI command creates a firewall rule on a VPC network that allows internal ingress communication among VMs using all protocols on all ports:
gcloud compute firewall-rules create allow-internal-ingress \
    --network=NETWORK_NAME \
    --source-ranges=SUBNET_RANGES \
    --destination-ranges=SUBNET_RANGES \
    --direction=ingress \
    --action=allow \
    --rules=all
Notes:
SUBNET_RANGES: See Allow internal ingress connections between VMs. The default VPC network in a project, with its default-allow-internal firewall rule, allows ingress communication on all ports (tcp:0-65535, udp:0-65535, and icmp) and therefore meets the open-subnet-connectivity requirement. However, that rule also allows ingress from any VM instance on the network.
Dataproc Serverless and VPC-SC networks
With VPC Service Controls, network administrators can define a security perimeter around resources of Google-managed services to control communication to and between those services.
Note the following strategy when using VPC-SC networks with Dataproc Serverless:
Create a custom container image that pre-installs dependencies outside the VPC-SC perimeter, and then submit a Spark batch workload that uses your custom container image.
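A hedged sketch of that submission step, assuming the custom image has already been built and pushed to Artifact Registry; REGION, PROJECT_ID, the repository and image names, the main class, and the jar path are all placeholders:

```shell
# Submit a Spark batch workload that runs inside a custom container
# image with dependencies pre-installed, so nothing needs to be
# downloaded from outside the VPC-SC perimeter at runtime.
gcloud dataproc batches submit spark \
    --region=REGION \
    --container-image=REGION-docker.pkg.dev/PROJECT_ID/REPO/spark-custom:1.0 \
    --class=org.example.MyApp \
    --jars=gs://my-bucket/my-app.jar
```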
For more information, see VPC Service Controls—Dataproc Serverless for Spark.