Customize your Spark job runtime environment with Docker on YARN

The Dataproc Docker on YARN feature allows you to create and use a Docker image to customize your Spark job runtime environment. The image can include customizations to Java, Python, and R dependencies, and to your job jar.


This feature is not available or not supported with:

  • Dataproc image versions prior to 2.0.49 (not available in 1.5 images)
  • MapReduce jobs (only supported for Spark jobs)
  • Spark client mode (only supported with Spark cluster mode)
  • Kerberos clusters: cluster creation fails if you create a cluster with Docker on YARN and Kerberos enabled.
  • Customizations of JDK, Hadoop, and Spark: the host JDK, Hadoop, and Spark are used, not your customizations.
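
The image version limit above can be checked in a script before you create a cluster. The following is a minimal sketch, not part of Dataproc tooling: `image_version_supported` and `MIN_VERSION` are hypothetical names, and the sketch assumes full version strings of the form MAJOR.MINOR.PATCH with an optional OS suffix such as `-debian10`. It relies on `sort -V` from GNU coreutils.

```shell
# Hypothetical helper: succeeds if the given Dataproc image version is
# 2.0.49 or later, and fails otherwise.
# Assumption: versions look like "2.0.49" or "2.0.49-debian10".
MIN_VERSION="2.0.49"

image_version_supported() {
  # Strip an OS suffix such as "-debian10".
  version="${1%%-*}"
  # sort -V orders version strings component by component, so the first
  # line of the sorted output is the lower of the two versions.
  lowest="$(printf '%s\n%s\n' "${MIN_VERSION}" "${version}" | sort -V | head -n 1)"
  [ "${lowest}" = "${MIN_VERSION}" ]
}

if image_version_supported "2.0.49-debian10"; then
  echo "image version supports Docker on YARN"
fi
```

A bare track such as `2.0` sorts below `2.0.49`, so this helper reports it as unsupported. Pass a full version string if you use it.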

Create a Docker image

The first step to customize your Spark environment is building a Docker image.


You can use the following Dockerfile as an example, making changes and additions to meet your needs.

FROM debian:10-slim

# Suppress interactive prompts.
ENV DEBIAN_FRONTEND=noninteractive

# Required: Install utilities required by Spark scripts.
RUN apt update && apt install -y procps tini

# Optional: Add extra jars.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'

# Optional: Install and configure Miniconda3.
ENV CONDA_HOME=/opt/miniconda3

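# MINICONDA_INSTALLER is a placeholder for the file name of the Miniconda3
# installer script, which must be present in the Docker build context.
COPY MINICONDA_INSTALLER /tmp/
RUN bash /tmp/MINICONDA_INSTALLER -b -p /opt/miniconda3 \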
  && ${CONDA_HOME}/bin/conda config --system --set always_yes True \
  && ${CONDA_HOME}/bin/conda config --system --set auto_update_conda False \
  && ${CONDA_HOME}/bin/conda config --system --prepend channels conda-forge \
  && ${CONDA_HOME}/bin/conda config --system --set channel_priority strict

# Optional: Install Conda packages.
# The following packages are installed in the default image. It is strongly
# recommended to include all of them.
# Use mamba to install packages quickly.
RUN ${CONDA_HOME}/bin/conda install mamba -n base -c conda-forge \
    && ${CONDA_HOME}/bin/mamba install \
      conda \
      cython \
      fastavro \
      fastparquet \
      gcsfs \
      google-cloud-bigquery-storage \
      google-cloud-bigquery[pandas] \
      google-cloud-bigtable \
      google-cloud-container \
      google-cloud-datacatalog \
      google-cloud-dataproc \
      google-cloud-datastore \
      google-cloud-language \
      google-cloud-logging \
      google-cloud-monitoring \
      google-cloud-pubsub \
      google-cloud-redis \
      google-cloud-spanner \
      google-cloud-speech \
      google-cloud-storage \
      google-cloud-texttospeech \
      google-cloud-translate \
      google-cloud-vision \
      koalas \
      matplotlib \
      nltk \
      numba \
      numpy \
      openblas \
      orc \
      pandas \
      pyarrow \
      pysal \
      pytables \
      python \
      regex \
      requests \
      rtree \
      scikit-image \
      scikit-learn \
      scipy \
      seaborn \
      sqlalchemy \
      sympy

# Optional: Add extra Python modules.
ENV PYTHONPATH=/opt/python/packages
RUN mkdir -p "${PYTHONPATH}"

# Required: Create the 'yarn_docker_user' group/user.
# The GID and UID must be 1099. Home directory is required.
RUN groupadd -g 1099 yarn_docker_user
RUN useradd -u 1099 -g 1099 -d /home/yarn_docker_user -m yarn_docker_user
USER yarn_docker_user

Build and push the image

Use the following commands to build and push the example Docker image. Modify them to match your customizations.

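# Increase the version number when there is a change to avoid referencing
# a cached older image. Avoid reusing the version number, including the default
# `latest` version.
# Replace the placeholders with your registry hostname, project ID, image name,
# and version tag.
IMAGE=HOSTNAME/PROJECT_ID/IMAGE_NAME:TAG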

# Download the BigQuery connector.
gcloud storage cp \
  gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.22.2.jar .

# Download the Miniconda3 installer.

# Python module example. PYTHON_MODULE_FILE is a placeholder for the module
# file name.
cat > PYTHON_MODULE_FILE <<EOF
def hello(name):
  print("hello {}".format(name))

def read_lines(path):
  with open(path) as f:
    return f.readlines()
EOF

# Build and push the image.
docker build -t "${IMAGE}" .
docker push "${IMAGE}"

Create a Dataproc cluster

After creating a Docker image that customizes your Spark environment, create a Dataproc cluster that will use your Docker image when running Spark jobs.


gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --image-version=DP_IMAGE \
    --optional-components=DOCKER \
    --properties=dataproc:yarn.docker.enable=true,dataproc:yarn.docker.image=DOCKER_IMAGE \
    other flags

Replace the following:

  • CLUSTER_NAME: The cluster name.
  • REGION: The cluster region.
  • DP_IMAGE: The Dataproc image version, which must be 2.0.49 or later (--image-version=2.0 uses a qualifying minor version later than 2.0.49).
  • --optional-components=DOCKER: Enables the Docker component on the cluster.
  • --properties flag:
    • dataproc:yarn.docker.enable=true: Required property to enable the Dataproc Docker on YARN feature.
    • dataproc:yarn.docker.image: Optional property that you can add to specify your DOCKER_IMAGE using the following Container Registry image naming format: {hostname}/{project-id}/{image}:{tag}.


      Requirement: You must host your Docker image on Container Registry or Artifact Registry (Dataproc cannot fetch containers from other registries).

      Recommendation: Add this property when you create your cluster to cache your Docker image and avoid YARN timeouts later when you submit a job that uses the image.
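
The naming format and the advice against reusing the `latest` tag can be checked before the image URI is passed to the cluster. The following is a rough sketch under stated assumptions: `valid_image_uri` is a hypothetical helper, the project and image names are made up, and the pattern accepts one or more path segments between the hostname and the image name so that an Artifact Registry repository segment also matches. It doesn't handle hostnames with ports or domain-scoped project IDs that contain a colon.

```shell
# Hypothetical helper: succeeds if the URI looks like
# HOSTNAME/PROJECT_ID/IMAGE:TAG (optionally with extra path segments)
# and the tag is not "latest".
valid_image_uri() {
  uri="$1"
  # Hostname, one or more path segments, image name, then an explicit tag.
  pattern='^[^/:]+(/[^/:]+)+/[^/:]+:[^/:]+$'
  printf '%s\n' "${uri}" | grep -Eq "${pattern}" || return 1
  # Reject the mutable default tag so that a cached older image is not reused.
  [ "${uri##*:}" != "latest" ]
}

# Example with a made-up project and image name.
if valid_image_uri "gcr.io/my-project/my-image:1.0.1"; then
  echo "image URI looks valid"
fi
```

The check is purely syntactic. It doesn't confirm that the image exists or that the cluster can pull it.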

When dataproc:yarn.docker.enable is set to true, Dataproc updates Hadoop and Spark configurations to enable the Docker on YARN feature in the cluster. For example, spark.submit.deployMode is set to cluster, and spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS and spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS are set to mount directories from the host into the container.

Submit a Spark job to the cluster

After creating a Dataproc cluster, submit a Spark job to the cluster that uses your Docker image. The example in this section submits a PySpark job to the cluster.

Set job properties:

# Set the Docker image URI.

# Required: Use `#` as the delimiter for properties to avoid conflicts.

# Required: Set Spark properties with the Docker image.

# Optional: Add custom jars to Spark classpath. Don't set these properties if
# there are no customizations.

# Optional: Set custom PySpark Python path only if there are customizations.

# Optional: Set custom Python module path only if there are customizations.
# Since the `PYTHONPATH` environment variable defined in the Dockerfile is
# overridden by Spark, it must be set as a job property.
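
The comments above describe the properties to set but not the commands. The following is a hedged sketch of one way to assemble them, not the original commands. Assumptions: the image URI is made up; the `^#^` prefix is the gcloud escaping syntax for choosing `#` as the list delimiter; the Spark `spark.yarn.appMasterEnv.*` and `spark.executorEnv.*` properties pass the YARN Docker runtime environment variables to the containers; and the paths match the example Dockerfile.

```shell
# Hypothetical image URI; replace with your own.
DOCKER_IMAGE="gcr.io/my-project/my-image:1.0.1"

# '^#^' tells gcloud to split the property list on '#' instead of ','.
JOB_PROPERTIES='^#^'

# Required: run the application master and the executors in the Docker image.
JOB_PROPERTIES="${JOB_PROPERTIES}spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${DOCKER_IMAGE}"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=${DOCKER_IMAGE}"

# Optional: add the custom jars from the image to the Spark classpath.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.driver.extraClassPath=/opt/spark/jars/*"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executor.extraClassPath=/opt/spark/jars/*"

# Optional: use the Python interpreter installed by Miniconda3 in the image.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.python=/opt/miniconda3/bin/python"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.pyspark.driver.python=/opt/miniconda3/bin/python"

# Optional: set the Python module path as a job property.
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.yarn.appMasterEnv.PYTHONPATH=/opt/python/packages"
JOB_PROPERTIES="${JOB_PROPERTIES}#spark.executorEnv.PYTHONPATH=/opt/python/packages"

# Print the assembled value for inspection.
printf '%s\n' "${JOB_PROPERTIES}"
```

The resulting string can be passed to `gcloud dataproc jobs submit pyspark` with `--properties="${JOB_PROPERTIES}"`. Drop the optional lines if your image has no corresponding customizations.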



Submit the job to the cluster.

gcloud dataproc jobs submit pyspark PYFILE \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=JOB_PROPERTIES

Replace the following:

  • PYFILE: The file path to your PySpark job file. It can be a local file path or the URI of the file in Cloud Storage (gs://BUCKET_NAME/PySpark filename).
  • CLUSTER_NAME: The cluster name.
  • REGION: The cluster region.
  • JOB_PROPERTIES: The job properties described under "Set job properties".