Cluster rotation

Organization security policies, regulatory compliance rules, and other considerations can prompt you to "rotate" your Dataproc clusters at regular intervals by deleting, then recreating clusters on a schedule. As part of cluster rotation, new clusters can be provisioned with the latest Dataproc image versions while retaining the configuration settings of the replaced clusters.

This page shows you how to set up clusters that you plan to rotate ("rotated clusters"), submit jobs to them, and then rotate the clusters as needed.

Custom image cluster rotation: You can apply previous or new customizations to a previous or new Dataproc base image when recreating the custom image cluster.

Set up rotated clusters

To set up rotated clusters, create unique, timestamp-suffixed cluster names to distinguish previous from new clusters, and then attach labels to clusters that indicate if a cluster is part of a rotated cluster pool and actively receiving new job submissions. This example uses cluster-pool and cluster-state=active labels for these purposes, but you can use your own label names.

Set environment variables:
```
PROJECT_ID=project-id \
  REGION=region \
  CLUSTER_POOL=cluster-pool-name \
  CLUSTER_NAME=$CLUSTER_POOL-$(date '+%Y%m%d%H%M') \
  BUCKET=bucket-name
```
Notes:
- cluster-pool-name: The name of the cluster pool associated with one or more clusters. This name is used in the cluster name and with the cluster-pool label attached to the cluster to identify the cluster as part of the pool.
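
The naming scheme above can also be reproduced outside the shell, for example in an orchestration script. The following is a minimal sketch, not part of Dataproc: the helper names `rotated_cluster_name`, `created_at`, and `rotated_cluster_labels` are hypothetical, and the sketch assumes UTC timestamps, whereas the shell `date` command uses the local time zone.

```python
from datetime import datetime, timezone

# Same timestamp layout as the shell snippet: date '+%Y%m%d%H%M'
TIMESTAMP_FORMAT = "%Y%m%d%H%M"


def rotated_cluster_name(cluster_pool, now=None):
    """Return '<pool>-<timestamp>' for a new cluster in the pool."""
    # Assumption: UTC is used so names sort the same way on every machine.
    now = now or datetime.now(timezone.utc)
    return f"{cluster_pool}-{now.strftime(TIMESTAMP_FORMAT)}"


def created_at(cluster_name):
    """Recover the timestamp from a name built by rotated_cluster_name()."""
    # The pool name may itself contain hyphens, so split on the last one.
    _, _, suffix = cluster_name.rpartition("-")
    return datetime.strptime(suffix, TIMESTAMP_FORMAT)


def rotated_cluster_labels(cluster_pool, state="active"):
    """Build the label pair used in this example to mark pool and state."""
    return {"cluster-pool": cluster_pool, "cluster-state": state}
```

Because the suffix is a fixed-width timestamp, `min(names, key=created_at)` picks the oldest cluster in a pool, which can help when deciding which cluster to rotate out first.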

Create the cluster. You can add arguments and use different labels.

```
gcloud dataproc clusters create ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${REGION} \
  --bucket=${BUCKET} \
  --labels="cluster-pool=${CLUSTER_POOL},cluster-state=active"
```

Submit jobs to clusters

The following Google Cloud CLI and Apache Airflow directed acyclic graph (DAG) examples submit an Apache Pig job to a cluster. Cluster labels are used to submit the job to an active cluster within a cluster pool.

gcloud

Submit an Apache Pig job located in Cloud Storage. Pick the cluster using labels.

```
gcloud dataproc jobs submit pig \
    --region=${REGION} \
    --file=gs://${BUCKET}/scripts/script.pig \
    --cluster-labels="cluster-pool=${CLUSTER_POOL},cluster-state=active"
```

Airflow

Submit an Apache Pig job located in Cloud Storage using Airflow. Pick the cluster using labels.

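```
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Declare variables (replace the placeholder values with your own)
project_id = "my-project"
region = "us-central1"
dag_id = "pig_wordcount"
# Must match the labels attached to the active clusters in the pool
cluster_labels = {"cluster-pool": "cluster-pool-name",
                  "cluster-state": "active"}
wordcount_script = "gs://bucket-name/scripts/wordcount.pig"

# Define the DAG. It has no schedule, so it runs only when triggered.
# Airflow 2.4 and later prefer `schedule=None` over `schedule_interval=None`.
dag = DAG(
    dag_id,
    schedule_interval=None,
    start_date=datetime(2023, 8, 16),
    catchup=False
)

# Job definition: placement by cluster labels instead of by cluster name
PIG_JOB = {
    "reference": {"project_id": project_id},
    "placement": {"cluster_labels": cluster_labels},
    "pig_job": {"query_file_uri": wordcount_script},
}

# Submit the Pig job to a cluster that matches the labels
wordcount_task = DataprocSubmitJobOperator(
    task_id='wordcount',
    region=region,
    project_id=project_id,
    job=PIG_JOB,
    dag=dag
)
```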
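
Both examples rely on label-based placement rather than a cluster name. The following sketch only illustrates the selection rule and is not how Dataproc implements it: it assumes a cluster is eligible when it carries every requested label with the same value. The function names and the sample cluster data are hypothetical.

```python
def matches(cluster_labels, requested_labels):
    """True if the cluster carries every requested label with the same value."""
    return all(
        cluster_labels.get(key) == value
        for key, value in requested_labels.items()
    )


def eligible_clusters(clusters, requested_labels):
    """Return the names of clusters whose labels satisfy the request."""
    return [
        name
        for name, labels in clusters.items()
        if matches(labels, requested_labels)
    ]


# Hypothetical pool: one active cluster and one being rotated out.
example_clusters = {
    "etl-pool-202308160905": {
        "cluster-pool": "etl-pool",
        "cluster-state": "active",
    },
    "etl-pool-202307010000": {
        "cluster-pool": "etl-pool",
        "cluster-state": "pendingfordeletion",
    },
}

requested = {"cluster-pool": "etl-pool", "cluster-state": "active"}
print(eligible_clusters(example_clusters, requested))
```

Under this assumption, changing a cluster's cluster-state label is enough to take it out of consideration for new jobs without touching the job definitions.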

Rotate clusters

Update the cluster labels attached to the clusters you are rotating out. This example uses the cluster-state=pendingfordeletion label to signify that the cluster is not receiving new job submissions and is being rotated out, but you can use your own label for this purpose.
```
gcloud dataproc clusters update ${CLUSTER_NAME} \
    --region=${REGION} \
    --update-labels="cluster-state=pendingfordeletion"
```
After the cluster label is updated, the cluster stops receiving new jobs, because jobs are submitted only to clusters in the pool that have the cluster-state=active label (see Submit jobs to clusters).

Delete clusters you are rotating out after they finish running jobs.

Note: You can automate this step with a monitoring script that fetches clusters with the cluster-state=pendingfordeletion label (or other label you added with the previous command), checks that no jobs are running on the cluster, and then deletes the cluster.
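
One possible shape for such a monitoring script is sketched below. It is a minimal sketch under stated assumptions, not an official tool: it shells out to the gcloud CLI (assumed to be installed and authenticated), the function names are hypothetical, and the `run` parameter exists only so the logic can be exercised without calling gcloud. Verify the flags against your gcloud version before relying on it.

```python
import json
import subprocess


def run_gcloud(args):
    """Run gcloud with the given arguments and return its standard output."""
    completed = subprocess.run(
        ["gcloud", *args], check=True, capture_output=True, text=True
    )
    return completed.stdout


def list_json(args, run):
    """Run a gcloud list command and parse its JSON output."""
    return json.loads(run(args + ["--format=json"]) or "[]")


def clusters_pending_deletion(region, run=run_gcloud):
    """Names of clusters labeled cluster-state=pendingfordeletion."""
    clusters = list_json(
        ["dataproc", "clusters", "list", f"--region={region}",
         "--filter=labels.cluster-state=pendingfordeletion"],
        run,
    )
    return [cluster["clusterName"] for cluster in clusters]


def has_active_jobs(cluster_name, region, run=run_gcloud):
    """True if the cluster still has jobs in an active state."""
    jobs = list_json(
        ["dataproc", "jobs", "list", f"--region={region}",
         f"--cluster={cluster_name}", "--state-filter=active"],
        run,
    )
    return bool(jobs)


def delete_idle_clusters(region, run=run_gcloud):
    """Delete rotated-out clusters that have no active jobs."""
    deleted = []
    for name in clusters_pending_deletion(region, run):
        if has_active_jobs(name, region, run):
            continue  # still busy; check again on the next pass
        run(["dataproc", "clusters", "delete", name,
             f"--region={region}", "--quiet"])
        deleted.append(name)
    return deleted
```

Calling `delete_idle_clusters("us-central1")` on a schedule (for example, from cron) makes one pass over the rotated-out clusters; clusters that are still running jobs are left alone until a later pass.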