Hive to Cloud Storage template

Use the Dataproc Serverless Hive to Cloud Storage template to extract data from Hive to Cloud Storage.

Use the template

Run the template using the gcloud CLI or Dataproc API.

gcloud

Before using any of the command data below, make the following replacements:

  • PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
  • REGION: Required. Compute Engine region.
  • TEMPLATE_VERSION: Required. Specify latest for the latest template version, or the date of a specific version, for example, 2023-03-17_v0.1.0-beta (visit gs://dataproc-templates-binaries or run gcloud storage ls gs://dataproc-templates-binaries to list available template versions).
  • SUBNET: Optional. If a subnet is not specified, the subnet in the specified REGION in the default network is selected.

    Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME

  • HOST and PORT: Required. Hostname or IP address and port of the source Hive database host.

    Example: 10.0.0.33

  • TABLE: Required. Hive input table name.
  • DATABASE: Required. Hive input database name.
  • CLOUD_STORAGE_OUTPUT_PATH: Required. Cloud Storage path where output will be stored.

    Example: gs://dataproc-templates/hive_to_cloud_storage_output

  • FORMAT: Optional. Output data format. Options: avro, parquet, csv, or json. Default: avro. Note: If avro, you must add file:///usr/lib/spark/connector/spark-avro.jar to the jars gcloud CLI flag or API field.

    Example (the file:// prefix references a Dataproc Serverless jar file):

    --jars=file:///usr/lib/spark/connector/spark-avro.jar, [, ... other jars]
  • HIVE_PARTITION_COLUMN: Optional. Column to partition Hive data.
  • MODE: Required. Write mode for Cloud Storage output. Options: append, overwrite, ignore, or errorifexists.
  • SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
  • PROPERTY and PROPERTY_VALUE: Optional. Comma-separated list of Spark property=value pairs.
  • LABEL and LABEL_VALUE: Optional. Comma-separated list of label=value pairs.
  • LOG_LEVEL: Optional. Level of logging. Can be one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
  • KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed encryption key.

    Example: projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME

Execute the following command:

Linux, macOS, or Cloud Shell

gcloud dataproc batches submit spark \
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate \
    --version="1.2" \
    --project="PROJECT_ID" \
    --region="REGION" \
    --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" \
    --subnet="SUBNET" \
    --kms-key="KMS_KEY" \
    --service-account="SERVICE_ACCOUNT" \
    --properties="spark.hadoop.hive.metastore.uris=thrift://HOST:PORT,PROPERTY=PROPERTY_VALUE" \
    --labels="LABEL=LABEL_VALUE" \
    -- --template=HIVETOGCS \
    --templateProperty log.level="LOG_LEVEL" \
    --templateProperty hive.input.table="TABLE" \
    --templateProperty hive.input.db="DATABASE" \
    --templateProperty hive.gcs.output.path="CLOUD_STORAGE_OUTPUT_PATH" \
    --templateProperty hive.gcs.output.format="FORMAT" \
    --templateProperty hive.partition.col="HIVE_PARTITION_COLUMN" \
    --templateProperty hive.gcs.save.mode="MODE"

Windows (PowerShell)

gcloud dataproc batches submit spark `
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate `
    --version="1.2" `
    --project="PROJECT_ID" `
    --region="REGION" `
    --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" `
    --subnet="SUBNET" `
    --kms-key="KMS_KEY" `
    --service-account="SERVICE_ACCOUNT" `
    --properties="spark.hadoop.hive.metastore.uris=thrift://HOST:PORT,PROPERTY=PROPERTY_VALUE" `
    --labels="LABEL=LABEL_VALUE" `
    -- --template=HIVETOGCS `
    --templateProperty log.level="LOG_LEVEL" `
    --templateProperty hive.input.table="TABLE" `
    --templateProperty hive.input.db="DATABASE" `
    --templateProperty hive.gcs.output.path="CLOUD_STORAGE_OUTPUT_PATH" `
    --templateProperty hive.gcs.output.format="FORMAT" `
    --templateProperty hive.partition.col="HIVE_PARTITION_COLUMN" `
    --templateProperty hive.gcs.save.mode="MODE"

Windows (cmd.exe)

gcloud dataproc batches submit spark ^
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate ^
    --version="1.2" ^
    --project="PROJECT_ID" ^
    --region="REGION" ^
    --jars="gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar" ^
    --subnet="SUBNET" ^
    --kms-key="KMS_KEY" ^
    --service-account="SERVICE_ACCOUNT" ^
    --properties="spark.hadoop.hive.metastore.uris=thrift://HOST:PORT,PROPERTY=PROPERTY_VALUE" ^
    --labels="LABEL=LABEL_VALUE" ^
    -- --template=HIVETOGCS ^
    --templateProperty log.level="LOG_LEVEL" ^
    --templateProperty hive.input.table="TABLE" ^
    --templateProperty hive.input.db="DATABASE" ^
    --templateProperty hive.gcs.output.path="CLOUD_STORAGE_OUTPUT_PATH" ^
    --templateProperty hive.gcs.output.format="FORMAT" ^
    --templateProperty hive.partition.col="HIVE_PARTITION_COLUMN" ^
    --templateProperty hive.gcs.save.mode="MODE"

REST

Before using any of the request data, make the following replacements:

  • PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
  • REGION: Required. Compute Engine region.
  • TEMPLATE_VERSION: Required. Specify latest for the latest template version, or the date of a specific version, for example, 2023-03-17_v0.1.0-beta (visit gs://dataproc-templates-binaries or run gcloud storage ls gs://dataproc-templates-binaries to list available template versions).
  • SUBNET: Optional. If a subnet is not specified, the subnet in the specified REGION in the default network is selected.

    Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME

  • HOST and PORT: Required. Hostname or IP address and port of the source Hive database host.

    Example: 10.0.0.33

  • TABLE: Required. Hive input table name.
  • DATABASE: Required. Hive input database name.
  • CLOUD_STORAGE_OUTPUT_PATH: Required. Cloud Storage path where output will be stored.

    Example: gs://dataproc-templates/hive_to_cloud_storage_output

  • FORMAT: Optional. Output data format. Options: avro, parquet, csv, or json. Default: avro. Note: If avro, you must add file:///usr/lib/spark/connector/spark-avro.jar to the jars gcloud CLI flag or API field.

    Example (the file:// prefix references a Dataproc Serverless jar file):

    --jars=file:///usr/lib/spark/connector/spark-avro.jar, [, ... other jars]
  • HIVE_PARTITION_COLUMN: Optional. Column to partition Hive data.
  • MODE: Required. Write mode for Cloud Storage output. Options: append, overwrite, ignore, or errorifexists.
  • SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
  • PROPERTY and PROPERTY_VALUE: Optional. Comma-separated list of Spark property=value pairs.
  • LABEL and LABEL_VALUE: Optional. Comma-separated list of label=value pairs.
  • LOG_LEVEL: Optional. Level of logging. Can be one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
  • KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed encryption key.

    Example: projects/PROJECT_ID/regions/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches

Request JSON body:


{
  "environmentConfig":{
    "executionConfig":{
      "subnetworkUri":"SUBNET",
      "kmsKey": "KMS_KEY",
      "serviceAccount": "SERVICE_ACCOUNT"
    }
  },
  "labels": {
    "LABEL": "LABEL_VALUE"
  },
  "runtimeConfig": {
    "version": "1.2",
    "properties": {
      "spark.hadoop.hive.metastore.uris":"thrift://HOST:PORT",
      "PROPERTY": "PROPERTY_VALUE"
    }
  },
  "sparkBatch":{
    "mainClass":"com.google.cloud.dataproc.templates.main.DataProcTemplate",
    "args":[
      "--template","HIVETOGCS",
      "--templateProperty","log.level=LOG_LEVEL",
      "--templateProperty","hive.input.table=TABLE",
      "--templateProperty","hive.input.db=DATABASE",
      "--templateProperty","hive.gcs.output.path=CLOUD_STORAGE_OUTPUT_PATH",
      "--templateProperty","hive.gcs.output.format=FORMAT",
      "--templateProperty","hive.partition.col=HIVE_PARTITION_COLUMN",
      "--templateProperty","hive.gcs.save.mode=MODE"
    ],
    "jarFileUris":[
      "file:///usr/lib/spark/connector/spark-avro.jar",
      "gs://dataproc-templates-binaries/TEMPLATE_VERSION/java/dataproc-templates.jar"
    ]
  }
}

To send your request, expand one of these options:

You should receive a JSON response similar to the following:


{
  "name": "projects/PROJECT_ID/regions/REGION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata",
    "batch": "projects/PROJECT_ID/locations/REGION/batches/BATCH_ID",
    "batchUuid": "de8af8d4-3599-4a7c-915c-798201ed1583",
    "createTime": "2023-02-24T03:31:03.440329Z",
    "operationType": "BATCH",
    "description": "Batch"
  }
}