Connecting your visualization software to Hadoop on Google Cloud

Last reviewed 2024-04-17 UTC

This tutorial is the second part of a series that shows you how to build an end-to-end solution to give data analysts secure access to data when using business intelligence (BI) tools.

This tutorial is intended for operators and IT administrators who set up environments that provide data and processing capabilities to the BI tools used by data analysts.

Tableau is used as the BI tool in this tutorial. To follow along with this tutorial, you must have Tableau Desktop installed on your workstation.

The series is made up of the following parts:

  • The first part of the series, Architecture for connecting visualization software to Hadoop on Google Cloud, defines the architecture of the solution, its components, and how the components interact.
  • This second part of the series tells you how to set up the architecture components that make up the end-to-end Hive topology on Google Cloud. The tutorial uses open source tools from the Hadoop ecosystem, with Tableau as the BI tool.

The code snippets in this tutorial are available in a GitHub repository. The GitHub repository also includes Terraform configuration files to help you set up a working prototype.

Throughout the tutorial, you use the name sara as the fictitious user identity of a data analyst. This user identity is in the LDAP directory that both Apache Knox and Apache Ranger use. You can also choose to configure LDAP groups, but this procedure is outside the scope of this tutorial.

Objectives

  • Create an end-to-end setup that enables a BI tool to use data from a Hadoop environment.
  • Authenticate and authorize user requests.
  • Set up and use secure communication channels between the BI tool and the cluster.

Costs

In this document, you use the following billable components of Google Cloud:

  • Dataproc
  • Compute Engine
  • Cloud SQL
  • Cloud Key Management Service (Cloud KMS)
  • Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Dataproc, Cloud SQL, and Cloud Key Management Service (Cloud KMS) APIs.

    Enable the APIs

Initializing your environment

  1. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

  2. In Cloud Shell, set environment variables with your project ID, and the region and zones of the Dataproc clusters:

    export PROJECT_ID=$(gcloud info --format='value(config.project)')
    export REGION=us-central1
    export ZONE=us-central1-b
    

    You can choose any region and zone, but keep them consistent as you follow this tutorial.

Setting up a service account

  1. In Cloud Shell, create a service account.

    gcloud iam service-accounts create cluster-service-account \
      --description="The service account for the cluster to be authenticated as." \
      --display-name="Cluster service account"
    

    The cluster uses this account to access Google Cloud resources.

  2. Add the following roles to the service account:

    • Dataproc Worker: to create and manage Dataproc clusters.
    • Cloud SQL Editor: for Ranger to connect to its database using Cloud SQL Proxy.
    • Cloud KMS CryptoKey Decrypter: to decrypt the passwords encrypted with Cloud KMS.

      bash -c 'array=( dataproc.worker cloudsql.editor cloudkms.cryptoKeyDecrypter )
      for i in "${array[@]}"
      do
        gcloud projects add-iam-policy-binding ${PROJECT_ID} \
          --member "serviceAccount:cluster-service-account@${PROJECT_ID}.iam.gserviceaccount.com" \
          --role roles/$i
      done'
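
    To confirm that the role bindings were created, you can optionally list the roles granted to the service account. The following command is one way to run this check:

    # List the roles currently bound to the cluster service account.
    gcloud projects get-iam-policy ${PROJECT_ID} \
      --flatten="bindings[].members" \
      --filter="bindings.members:cluster-service-account@${PROJECT_ID}.iam.gserviceaccount.com" \
      --format="value(bindings.role)"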
      

Creating the backend cluster

In this section, you create the backend cluster where Ranger is located. You also create the Ranger database to store the policy rules, and a sample table in Hive to apply the Ranger policies.

Create the Ranger database instance

  1. Create a MySQL instance to store the Apache Ranger policies:

    export CLOUD_SQL_NAME=cloudsql-mysql
    gcloud sql instances create ${CLOUD_SQL_NAME} \
        --tier=db-n1-standard-1 --region=${REGION}
    

    This command creates an instance called cloudsql-mysql with the machine type db-n1-standard-1 located in the region specified by the ${REGION} variable. For more information, see the Cloud SQL documentation.

  2. Set the password for the root user connecting from any host. You can use the example password for demonstration purposes, or create your own. If you create your own password, use a minimum of eight characters, including at least one letter and one number.

    gcloud sql users set-password root \
      --host=% --instance ${CLOUD_SQL_NAME} --password mysql-root-password-99
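
    Optionally, you can look up the instance connection name, which uses the PROJECT_ID:REGION:INSTANCE_NAME format that the backend cluster configuration references later in this tutorial:

    # Print the instance connection name (PROJECT_ID:REGION:INSTANCE_NAME).
    gcloud sql instances describe ${CLOUD_SQL_NAME} \
      --format='value(connectionName)'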
    

Encrypt the passwords

In this section, you create a cryptographic key to encrypt the passwords for Ranger and MySQL. To prevent exfiltration, you store the cryptographic key in Cloud KMS. For security purposes, you can't view, extract, or export the key bits.

You use the cryptographic key to encrypt the passwords and write them into files. You upload these files into a Cloud Storage bucket so that they are accessible to the service account that is acting on behalf of the clusters. The service account can decrypt these files because it has the cloudkms.cryptoKeyDecrypter role and access to the files and the cryptographic key. Even if a file is exfiltrated, the file can't be decrypted without the role and the key.

As an extra security measure, you create a separate password file for each service. This minimizes the potential impact if a password is exfiltrated.

For more information about key management, see the Cloud KMS documentation.

  1. In Cloud Shell, create a Cloud KMS key ring to hold your keys:

    gcloud kms keyrings create my-keyring --location global
    
  2. To encrypt your passwords, create a Cloud KMS cryptographic key:

    gcloud kms keys create my-key \
      --location global \
      --keyring my-keyring \
      --purpose encryption
    
  3. Encrypt your Ranger admin user password using the key. You can use the example password or create your own. Your password must be a minimum of eight characters, including at least one letter and one number.

    echo "ranger-admin-password-99" | \
    gcloud kms encrypt \
      --location=global \
      --keyring=my-keyring \
      --key=my-key \
      --plaintext-file=- \
      --ciphertext-file=ranger-admin-password.encrypted
    
  4. Encrypt your Ranger database admin user password with the key:

    echo "ranger-db-admin-password-99" | \
    gcloud kms encrypt \
      --location=global \
      --keyring=my-keyring \
      --key=my-key \
      --plaintext-file=- \
      --ciphertext-file=ranger-db-admin-password.encrypted
    
  5. Encrypt your MySQL root password with the key:

    echo "mysql-root-password-99" | \
    gcloud kms encrypt \
      --location=global \
      --keyring=my-keyring \
      --key=my-key \
      --plaintext-file=- \
      --ciphertext-file=mysql-root-password.encrypted
    
  6. Create a Cloud Storage bucket to store encrypted password files:

    gsutil mb -l ${REGION} gs://${PROJECT_ID}-ranger
    
  7. Upload the encrypted password files to the Cloud Storage bucket:

    gsutil -m cp *.encrypted gs://${PROJECT_ID}-ranger
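
    If you want to verify that a password file can be decrypted before the clusters use it, you can decrypt one of the files with the same key. This optional check assumes that your own account is allowed to use the key for decryption:

    # Decrypt one of the files and print the plaintext to stdout.
    gcloud kms decrypt \
      --location=global \
      --keyring=my-keyring \
      --key=my-key \
      --ciphertext-file=ranger-admin-password.encrypted \
      --plaintext-file=-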
    

Create the cluster

In this section, you create a backend cluster with Ranger support. For more information about the Ranger optional component in Dataproc, see the Dataproc Ranger Component documentation page.

  1. In Cloud Shell, create a Cloud Storage bucket to store the Apache Solr audit logs:

    gsutil mb -l ${REGION} gs://${PROJECT_ID}-solr
    
  2. Export all the variables required in order to create the cluster:

    export BACKEND_CLUSTER=backend-cluster
    
    export PROJECT_ID=$(gcloud info --format='value(config.project)')
    export REGION=us-central1
    export ZONE=us-central1-b
    export CLOUD_SQL_NAME=cloudsql-mysql
    
    export RANGER_KMS_KEY_URI=\
    projects/${PROJECT_ID}/locations/global/keyRings/my-keyring/cryptoKeys/my-key
    
    export RANGER_ADMIN_PWD_URI=\
    gs://${PROJECT_ID}-ranger/ranger-admin-password.encrypted
    
    export RANGER_DB_ADMIN_PWD_URI=\
    gs://${PROJECT_ID}-ranger/ranger-db-admin-password.encrypted
    
    export MYSQL_ROOT_PWD_URI=\
    gs://${PROJECT_ID}-ranger/mysql-root-password.encrypted
    

    For convenience, some of the variables that you set before are repeated in this command so you can modify them as you require.

    The new variables contain:

    • The name of the backend cluster.
    • The URI of the cryptographic key so that the service account can decrypt the passwords.
    • The URI of the files containing the encrypted passwords.

    If you used a different key ring or key, or different filenames, use the corresponding values in your command.

  3. Create the backend Dataproc cluster:

    gcloud beta dataproc clusters create ${BACKEND_CLUSTER} \
      --optional-components=SOLR,RANGER \
      --region ${REGION} \
      --zone ${ZONE} \
      --enable-component-gateway \
      --scopes=default,sql-admin \
      --service-account=cluster-service-account@${PROJECT_ID}.iam.gserviceaccount.com \
      --properties="\
    dataproc:ranger.kms.key.uri=${RANGER_KMS_KEY_URI},\
    dataproc:ranger.admin.password.uri=${RANGER_ADMIN_PWD_URI},\
    dataproc:ranger.db.admin.password.uri=${RANGER_DB_ADMIN_PWD_URI},\
    dataproc:ranger.cloud-sql.instance.connection.name=${PROJECT_ID}:${REGION}:${CLOUD_SQL_NAME},\
    dataproc:ranger.cloud-sql.root.password.uri=${MYSQL_ROOT_PWD_URI},\
    dataproc:solr.gcs.path=gs://${PROJECT_ID}-solr,\
    hive:hive.server2.thrift.http.port=10000,\
    hive:hive.server2.thrift.http.path=cliservice,\
    hive:hive.server2.transport.mode=http"
    

    This command has the following properties:

    • The final three lines in the command are the Hive properties to configure HiveServer2 in HTTP mode, so that Apache Knox can call Apache Hive through HTTP.
    • The other parameters in the command operate as follows:
      • The --optional-components=SOLR,RANGER parameter enables Apache Ranger and its Solr dependency.
      • The --enable-component-gateway parameter enables the Dataproc Component Gateway to make the Ranger and other Hadoop user interfaces available directly from the cluster page in Google Cloud console. When you set this parameter, there is no need for SSH tunneling to the backend master node.
      • The --scopes=default,sql-admin parameter authorizes Apache Ranger to access its Cloud SQL database.
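
    After the cluster is created, you can optionally confirm that these settings were applied. The following command prints the cluster's software properties, which include the ones you set in the create command:

    # Show the cluster-level properties that were set at creation time.
    gcloud dataproc clusters describe ${BACKEND_CLUSTER} \
      --region ${REGION} \
      --format='value(config.softwareConfig.properties)'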

If you need to create an external Hive metastore that persists beyond the lifetime of any cluster and can be used across multiple clusters, see Using Apache Hive on Dataproc. In that case, you must run the table creation examples directly in Beeline, because the gcloud dataproc jobs submit hive commands use Hive binary transport, which isn't compatible with HiveServer2 when it's configured in HTTP mode.

Create a sample Hive table

  1. In Cloud Shell, create a Cloud Storage bucket to store a sample Apache Parquet file:

    gsutil mb -l ${REGION} gs://${PROJECT_ID}-hive
    
  2. Copy a publicly available sample Parquet file into your bucket:

    gsutil cp gs://hive-solution/part-00000.parquet \
      gs://${PROJECT_ID}-hive/dataset/transactions/part-00000.parquet
    
  3. Connect to the master node of the backend cluster you created in the previous section using SSH:

    gcloud compute ssh --zone ${ZONE} ${BACKEND_CLUSTER}-m
    

    The name of your cluster's master node is the cluster name followed by -m. In high-availability (HA) clusters, the master node names have an additional suffix.

    If it's your first time connecting to your master node from Cloud Shell, you are prompted to generate SSH keys.

  4. In the terminal you opened with SSH, connect to the local HiveServer2 using Apache Beeline, which is pre-installed on the master node:

    beeline -u "jdbc:hive2://localhost:10000/;transportMode=http;httpPath=cliservice admin admin-password"\
      --hivevar PROJECT_ID=$(gcloud info --format='value(config.project)')
    

    This command starts the Beeline command-line tool and passes your Google Cloud project ID to it in a Hive variable, which the CREATE TABLE statement in a later step references.

    Hive doesn't perform any user authentication here, but it requires a user identity to perform most tasks. The admin user is a default user that's configured in Hive. The identity provider that you configure with Apache Knox later in this tutorial handles user authentication for any requests that come from BI tools.

  5. In the Beeline prompt, create a table using the Parquet file you previously copied to your Hive bucket:

    CREATE EXTERNAL TABLE transactions
      (SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
      STORED AS PARQUET
      LOCATION 'gs://${PROJECT_ID}-hive/dataset/transactions';
    
  6. Verify that the table was created correctly:

    SELECT *
      FROM transactions
      LIMIT 10;
    
    SELECT TransactionType, AVG(TransactionAmount) AS AverageAmount
      FROM transactions
      WHERE SubmissionDate = '2017-12-22'
      GROUP BY TransactionType;
    

    The results of the two queries appear in the Beeline prompt.

  7. Exit the Beeline command-line tool:

    !quit
    
  8. Copy the internal DNS name of the backend master:

    hostname -A | tr -d '[:space:]'; echo
    

    You use this name in the next section as backend-master-internal-dns-name to configure the Apache Knox topology. You also use the name to configure a service in Ranger.

  9. Exit the terminal on the node:

    exit
    

Creating the proxy cluster

In this section, you create the proxy cluster that has the Apache Knox initialization action.

Create a topology

  1. In Cloud Shell, clone the Dataproc initialization-actions GitHub repository:

    git clone https://github.com/GoogleCloudDataproc/initialization-actions.git
    
  2. Create a topology for the backend cluster:

    export KNOX_INIT_FOLDER=`pwd`/initialization-actions/knox
    cd ${KNOX_INIT_FOLDER}/topologies/
    mv example-hive-nonpii.xml hive-us-transactions.xml
    

    Apache Knox uses the name of the file as the URL path for the topology. In this step, you change the name to represent a topology called hive-us-transactions. You can then access the fictitious transaction data that you loaded into Hive in Create a sample Hive table. An example of the resulting URL path appears after this procedure.

  3. Edit the topology file:

    vi hive-us-transactions.xml
    

    Examine the topology descriptor file to see how the backend services are configured. The file defines a topology that points to one or more backend services. Two services are configured with sample values: WebHDFS and HIVE. The file also defines the authentication provider for the services in this topology and the authorization ACLs.

  4. Add the data analyst sample LDAP user identity sara.

    <param>
       <name>hive.acl</name>
       <value>admin,sara;*;*</value>
    </param>
    

    Adding the sample identity lets the user access the Hive backend service through Apache Knox.

  5. Change the HIVE URL to point to the backend cluster Hive service. You can find the HIVE service definition at the bottom of the file, under WebHDFS.

    <service>
      <role>HIVE</role>
      <url>http://<backend-master-internal-dns-name>:10000/cliservice</url>
    </service>
    
  6. Replace the <backend-master-internal-dns-name> placeholder with the internal DNS name of the backend cluster that you acquired in Create a sample Hive table.

  7. Save the file and close the editor.

To create additional topologies, repeat the steps in this section. Create one independent XML descriptor for each topology.

In Create the proxy cluster, you copy these files into a Cloud Storage bucket. To create new topologies, or to change them after you create the proxy cluster, modify the files and upload them again to the bucket. The Apache Knox initialization action creates a cron job that regularly copies changes from the bucket to the proxy cluster.
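
As an illustration of how the topology filename becomes part of the URL, Apache Knox with its default settings exposes each topology under the gateway context path on HTTPS port 8443. Assuming those defaults, an SSH tunnel to the proxy, and a hypothetical password for the LDAP user sara, a Beeline connection through the proxy to this topology would look roughly like the following sketch (trust-store options for the certificate are omitted here):

    # Illustrative sketch only: the port, tunnel, and password below are assumptions,
    # and trust-store options for a self-signed certificate are omitted.
    beeline -u "jdbc:hive2://localhost:8443/;ssl=true;transportMode=http;httpPath=gateway/hive-us-transactions/hive" \
      -n sara -p sara-password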

Configure the SSL/TLS certificate

A client uses an SSL/TLS certificate when it communicates with Apache Knox. The initialization action can generate a self-signed certificate, or you can provide your CA-signed certificate.

  1. In Cloud Shell, edit the Apache Knox general configuration file:

    vi ${KNOX_INIT_FOLDER}/knox-config.yaml
    
  2. Replace HOSTNAME with the external DNS name of your proxy master node as the value for the certificate_hostname attribute. For this tutorial, use localhost.

    certificate_hostname: localhost
    

    Later in this tutorial, you create the proxy cluster and an SSH tunnel to it, so clients connect to Apache Knox through localhost, which is why the certificate uses this hostname.

    The Apache Knox general configuration file also contains the master_key that encrypts the certificates BI tools use to communicate with the proxy cluster. By default, this key is the word secret.

  3. If you are providing your own certificate, change the following two properties:

    generate_cert: false
    custom_cert_name: <filename-of-your-custom-certificate>
    
  4. Save the file and close the editor.

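After you create the proxy cluster later in this tutorial and open an SSH tunnel to its master node, you can inspect the certificate that Apache Knox presents. This optional check assumes the Knox default HTTPS port 8443 and that the tunnel forwards that port to localhost:

    # Optional check: print the subject and validity dates of the certificate
    # presented on the tunneled Knox port (8443 is the Knox default).
    openssl s_client -connect localhost:8443 -servername localhost </dev/null 2>/dev/null \
      | openssl x509 -noout -subject -dates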

Create the proxy cluster

  1. In Cloud Shell, create a Cloud Storage bucket: