NAME
    gcloud beta dataproc batches submit pyspark - submit a PySpark batch job

SYNOPSIS
    gcloud beta dataproc batches submit pyspark MAIN_PYTHON_FILE
        [--archives=[ARCHIVE,…]] [--async] [--batch=BATCH]
        [--container-image=CONTAINER_IMAGE] [--deps-bucket=DEPS_BUCKET]
        [--files=[FILE,…]] [--history-server-cluster=HISTORY_SERVER_CLUSTER]
        [--jars=[JAR,…]] [--kms-key=KMS_KEY] [--labels=[KEY=VALUE,…]]
        [--metastore-service=METASTORE_SERVICE]
        [--properties=[PROPERTY=VALUE,…]] [--py-files=[PY,…]]
        [--region=REGION] [--request-id=REQUEST_ID]
        [--service-account=SERVICE_ACCOUNT] [--staging-bucket=STAGING_BUCKET]
        [--tags=[TAGS,…]] [--ttl=TTL] [--version=VERSION]
        [--network=NETWORK | --subnet=SUBNET] [GCLOUD_WIDE_FLAG …]
        [-- JOB_ARG …]

DESCRIPTION
    (BETA) Submit a PySpark batch job.

EXAMPLES
    To submit a PySpark batch job called "my-batch" that runs "my-pyspark.py", run:

        $ gcloud beta dataproc batches submit pyspark my-pyspark.py \
            --batch=my-batch --deps-bucket=gs://my-bucket \
            --region=us-central1 \
            --py-files='path/to/my/python/script.py'

POSITIONAL ARGUMENTS
    MAIN_PYTHON_FILE
        URI of the main Python file to use as the Spark driver. Must be a .py file.

    [-- JOB_ARG …]
        Arguments to pass to the driver.
        The '--' argument must be specified between gcloud-specific args on the left and JOB_ARG on the right.
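        For instance, a sketch of an invocation that forwards two arguments to the driver script (the argument names and values after '--' are illustrative, not part of this reference):

            $ gcloud beta dataproc batches submit pyspark my-pyspark.py \
                --region=us-central1 --deps-bucket=gs://my-bucket \
                -- --input=gs://my-bucket/data.csv --output=gs://my-bucket/out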

FLAGS
    --archives=[ARCHIVE,…]
        Archives to be extracted into the working directory. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.

    --async
        Return immediately without waiting for the operation in progress to complete.

    --batch=BATCH
        The ID of the batch job to submit. The ID must contain only lowercase letters (a-z), numbers (0-9), and hyphens (-). The length of the name must be between 4 and 63 characters. If this argument is not provided, a randomly generated UUID will be used.

    --container-image=CONTAINER_IMAGE
        Optional custom container image to use for the batch/session runtime environment. If not specified, a default container image will be used. The value should follow the container image naming format: {registry}/{repository}/{name}:{tag}, for example, gcr.io/my-project/my-image:1.2.3.

    --deps-bucket=DEPS_BUCKET
        A Cloud Storage bucket to upload workload dependencies.

    --files=[FILE,…]
        Files to be placed in the working directory.

    --history-server-cluster=HISTORY_SERVER_CLUSTER
        Spark History Server configuration for the batch/session job. Resource name of an existing Dataproc cluster to act as a Spark History Server for the workload in the format: "projects/{project_id}/regions/{region}/clusters/{cluster_name}".

    --jars=[JAR,…]
        Comma-separated list of jar files to be provided to the classpaths.

    --kms-key=KMS_KEY
        Cloud KMS key to use for encryption.
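        For example (an illustrative key name; the resource path follows the standard Cloud KMS format):

            --kms-key=projects/my-project/locations/us-central1/keyRings/my-keyring/cryptoKeys/my-key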

    --labels=[KEY=VALUE,…]
        List of label KEY=VALUE pairs to add.
        Keys must start with a lowercase character and contain only hyphens (-), underscores (_), lowercase characters, and numbers. Values must contain only hyphens (-), underscores (_), lowercase characters, and numbers.

    --metastore-service=METASTORE_SERVICE
        Name of a Dataproc Metastore service to be used as an external metastore in the format: "projects/{project-id}/locations/{region}/services/{service-name}".

    --properties=[PROPERTY=VALUE,…]
        Specifies configuration properties for the workload. See Dataproc Serverless for Spark documentation for the list of supported properties.
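        As an illustration (these particular Spark properties are common examples, not taken from this page), properties are passed as comma-separated PROPERTY=VALUE pairs:

            --properties=spark.executor.memory=4g,spark.executor.cores=2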

    --py-files=[PY,…]
        Comma-separated list of Python scripts to be passed to the PySpark framework. Supported file types: .py, .egg, and .zip.
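        For instance (a sketch; the bucket and archive names are illustrative), helper modules packaged as a zip archive can be supplied alongside the main driver file:

            --py-files=gs://my-bucket/helpers.zip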

    Region resource - Dataproc region to use. Each Dataproc region constitutes an independent resource namespace constrained to deploying instances into Compute Engine zones inside the region. This represents a Cloud resource. (NOTE) Some attributes are not given arguments in this group but can be set in other ways.

    To set the project attribute:
      - provide the argument --region on the command line with a fully specified name;
      - set the property dataproc/region with a fully specified name;
      - provide the argument --project on the command line;
      - set the property core/project.

    --region=REGION
        ID of the region or fully qualified identifier for the region.
        To set the region attribute:
          - provide the argument --region on the command line;
          - set the property dataproc/region.
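        For instance, the region can be set once as a gcloud property instead of repeating the flag on every invocation (a sketch using the dataproc/region property named above):

            $ gcloud config set dataproc/region us-central1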

    --request-id=REQUEST_ID
        A unique ID that identifies the request. If the service receives two batch create requests with the same request_id, the second request is ignored and the operation that corresponds to the first batch created and stored in the backend is returned. Recommendation: Always set this value to a UUID. The value must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), and hyphens (-). The maximum length is 40 characters.
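        For example (a sketch; it assumes the uuidgen utility is available in the client shell), a fresh UUID can be generated inline:

            --request-id=$(uuidgen)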

    --service-account=SERVICE_ACCOUNT
        The IAM service account to be used for a batch/session job.

    --staging-bucket=STAGING_BUCKET
        The Cloud Storage bucket to use to store job dependencies, config files, and job driver console output. If not specified, the default staging bucket (https://cloud.google.com/dataproc-serverless/docs/concepts/buckets) is used.

    --tags=[TAGS,…]
        Network tags for traffic control.

    --ttl=TTL
        The duration after which the workload will be unconditionally terminated, for example, '20m' or '1h'. Run gcloud topic datetimes for information on duration formats.

    --version=VERSION
        Optional runtime version. If not specified, a default version will be used.

    At most one of these can be specified:

    --network=NETWORK
        Network URI to connect network to.

    --subnet=SUBNET
        Subnetwork URI to connect network to. Subnet must have Private Google Access enabled.
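        For instance (a sketch; the project and subnetwork names are illustrative), a workload can be attached to a specific subnetwork:

            --subnet=projects/my-project/regions/us-central1/subnetworks/my-subnet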

GCLOUD WIDE FLAGS
    These flags are available to all commands: --access-token-file, --account, --billing-project, --configuration, --flags-file, --flatten, --format, --help, --impersonate-service-account, --log-http, --project, --quiet, --trace-token, --user-output-enabled, --verbosity.

    Run $ gcloud help for details.

NOTES
    This command is currently in beta and might change without notice. This variant is also available:

        $ gcloud dataproc batches submit pyspark