Kerberized data lake on Dataproc

Last reviewed 2024-04-16 UTC

This document describes the concepts, best practices, and reference architecture for the networking, authentication, and authorization of a Kerberized data lake on Google Cloud using Dataproc on-cluster Key Distribution Center (KDC) and Apache Ranger. Dataproc is Google Cloud's managed Hadoop and Spark service. This document is intended for Apache Hadoop administrators, cloud architects, and big data teams who are migrating their traditional Hadoop and Spark clusters to a modern data lake powered by Dataproc.

A Kerberized data lake on Google Cloud helps organizations with hybrid and multi-cloud deployments to extend and use their existing IT investments in identity and access control management.

On Google Cloud, organizations can provide their teams with as many job-scoped ephemeral clusters as needed. This approach removes much of the complexity of maintaining a single cluster with growing dependencies and software configuration interactions. Organizations can also create longer-running clusters for multiple users and services to access. This document shows how to use industry-standard tools, such as Kerberos and Apache Ranger, to help ensure fine-grained user security (authentication, authorization, and audit) for both cluster models on Dataproc.

Customer use case

Enterprises are migrating their on-premises Hadoop-based data lakes to public cloud platforms to solve the challenges they face in managing their traditional clusters.

One of these organizations, a large technology leader in enterprise software and hardware, decided to migrate their on-premises Hadoop system to Google Cloud. Their on-premises Hadoop environment served the analytics needs of hundreds of teams and business units, including a cybersecurity team with 200 data analytics team members. When one team member ran a large query on the legacy data lake, they experienced issues because of the rigid resource allocation. The organization struggled to keep up with the team's analytics needs using their on-premises environment, so they moved to Google Cloud. By moving to Google Cloud, the organization reduced the number of issues reported on their on-premises data lake by 25% a month.

The foundation of the organization's migration plan to Google Cloud was the decision to reshape and optimize their large monolithic clusters according to teams' workloads, and shift the focus from cluster management to unlocking business value. The few large clusters were broken into smaller, cost-effective Dataproc clusters, while workloads and teams were migrated to the following types of models:

  • Ephemeral job-scoped clusters: With a spin-up time of only a few minutes, the ephemeral model lets a job or a workflow have a dedicated cluster that is shut down upon job completion. This pattern decouples storage from compute nodes by replacing the Hadoop Distributed File System (HDFS) with Cloud Storage, using Dataproc's built-in Cloud Storage connector for Hadoop.
  • Semi-long-running clusters: When ephemeral job-scoped clusters can't serve the use case, Dataproc clusters can be long running. Because the cluster's stateful data is offloaded to Cloud Storage, the cluster can be shut down easily, so it's considered semi-long running. Smart cluster autoscaling also lets these clusters start small and optimize their compute resources for specific applications. This autoscaling replaces management of YARN queues. Both models are sketched in the commands after this list.
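
The following commands are a minimal sketch of both models, assuming placeholder project, region, cluster, and policy names. The --max-idle flag implements the ephemeral pattern by scheduling cluster deletion after an idle period, and --autoscaling-policy attaches a previously created autoscaling policy:

    # Ephemeral job-scoped cluster: deletes itself after 30 minutes of idleness.
    gcloud dataproc clusters create etl-job-cluster \
        --project=example-project \
        --region=us-central1 \
        --max-idle=30m

    # Semi-long-running cluster: starts small and relies on an autoscaling
    # policy (created beforehand) instead of manual YARN queue management.
    gcloud dataproc clusters create team-analytics-cluster \
        --project=example-project \
        --region=us-central1 \
        --num-workers=2 \
        --autoscaling-policy=example-autoscaling-policy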

The hybrid security challenge

In the preceding customer scenario, the customer migrated their substantial data management system to the cloud. However, other parts of the organization's IT needed to remain on-premises (for example, some of the legacy operational systems that feed the data lake).

The security architecture needed to help ensure that the on-premises central LDAP-based identity provider (IdP) remained the authoritative source for the corporate identities that use the data lake. On-premises Hadoop security is based on Kerberos and LDAP for authentication (often as part of the organization's Microsoft Active Directory (AD)) and on several other open source software (OSS) products, such as Apache Ranger. This security approach allows for fine-grained authorization and auditing of user and team activity in the data lake clusters. On Google Cloud, Identity and Access Management (IAM) manages access to specific Google Cloud resources, such as Dataproc and Cloud Storage.

This document discusses a security approach that uses the best of on-premises and OSS Hadoop security (focusing on Kerberos, corporate LDAP, and Apache Ranger) along with IAM to help secure workloads and data both inside and outside the Hadoop clusters.

Architecture

The following diagram shows the high-level architecture:

The high-level architecture of a Kerberized data lake on Google Cloud using Dataproc.

In the preceding diagram, clients run jobs on multi-team or single-team clusters. The clusters use a central Hive metastore and Kerberos authentication with a corporate identity provider.

Components

The architecture proposes a combination of industry-standard open source tools and IAM to authenticate and authorize jobs submitted through the different methods described later in this document. The following are the main components that work together to provide fine-grained security for teams' and users' workloads in the Hadoop clusters:

  • Kerberos: Kerberos is a network authentication protocol that uses secret-key cryptography to provide strong authentication for client/server applications. The Kerberos server is known as the Key Distribution Center (KDC).

    Kerberos is widely used in on-premises systems like AD to authenticate human users, services, and machines (client entities are denoted as user principals). When you enable Kerberos on Dataproc, the cluster uses the free MIT distribution of Kerberos to create an on-cluster KDC. The on-cluster KDC serves user principals' requests to access resources inside the cluster, like Apache Hadoop YARN, HDFS, and Apache Spark (server resources are denoted as service principals). Kerberos cross-realm trust lets you connect the user principals of one realm to another. For a sketch of enabling Kerberos at cluster creation, see the example after this list.

  • Apache Ranger: Apache Ranger provides fine-grained authorization for users to perform specific actions on Hadoop services. It also audits user access and administrative actions. Ranger can synchronize with an on-premises corporate LDAP server or with AD to get user and service identities.

  • Shared Hive metastore: The Hive metastore is a service that stores metadata for Apache Hive and other Hadoop tools. Because many of these tools are built around it, the Hive metastore has become a critical component of many data lakes. In the proposed architecture, a centralized and Kerberized Hive metastore allows multiple clusters to share metadata in a secure manner.
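
As referenced in the Kerberos component, the following command is a minimal sketch of creating a Kerberized Dataproc cluster. The bucket, key ring, key, and cluster names are placeholders, and the root principal password file must be encrypted beforehand with the referenced Cloud KMS key:

    gcloud dataproc clusters create multi-team-cluster \
        --project=example-project \
        --region=us-central1 \
        --kerberos-root-principal-password-uri=gs://example-secrets/root-password.encrypted \
        --kerberos-kms-key=projects/example-project/locations/global/keyRings/example-keyring/cryptoKeys/example-key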

While Kerberos, Ranger, and a shared Hive metastore work together to allow fine-grained user security within the Hadoop clusters, IAM controls access to Google Cloud resources. For example, Dataproc uses the Dataproc Service Account to perform reads and writes on Cloud Storage buckets.

Cluster dimensions

The following dimensions characterize a Dataproc cluster:

  • Tenancy: A cluster is multi-tenant if it serves the requests of more than one human user or service, or single-tenant if it serves the requests of a single user or service.
  • Kerberos: A cluster can be Kerberized if you enable Kerberos on Dataproc or non-Kerberized if you don't enable Kerberos on Dataproc.
  • Lifecycle: A cluster is ephemeral if it's created only for the duration of a specific job or workflow, contains only the resources needed to run the job, and is shut down upon job completion. Otherwise, the cluster is considered semi-long running.

Different combinations of these dimensions determine the use cases that a specific cluster is best suited for. This document discusses the following representative examples:

  1. The sample multi-team clusters shown in the architecture are Kerberized, multi-tenant, semi-long-running clusters. These clusters are best suited for interactive query workloads, for example, long-term data analytics and business intelligence (BI) exploration. In the architecture, the clusters are located in a Google Cloud project that's shared by several teams and serves those teams' requests, hence the name.

    In this document, the term team or application team describes a group of people in an organization who are working on the same software component or acting as one functional team. For example, a team might refer to backend developers of a microservice, BI analysts of a business application, or even cross-functional teams, such as Big Data infrastructure teams.

  2. The sample single-team clusters shown in the architecture are clusters that can be multi-tenant (for members of the same team) or single-tenant.

  • As ephemeral clusters, single-team clusters can be used, for example, by data engineers to run Spark batch processing jobs or by data scientists to run model training jobs.
  • As semi-long-running clusters, single-team clusters can serve data analytics and BI workloads that are scoped for a single team or person.

    The single-team clusters are located in Google Cloud projects that belong to a single team, which simplifies usage auditing, billing, and resource isolation. For example, only members of the single team can access the Cloud Storage buckets that are used for persisting the cluster's data. In this approach, application teams have dedicated projects, so the single-team clusters aren't Kerberized.

We recommend that you analyze your particular requirements and choose the best dimension combinations for your situation.

Submitting jobs

Users can submit jobs to both types of clusters using various tools, including the following (two of these methods are sketched after this list):

  • The Dataproc API, using REST calls or client libraries.
  • The Google Cloud CLI (gcloud command-line tool) in a local terminal window, or in Cloud Shell opened from the Google Cloud console in a browser.
  • A Dataproc Workflow Template, which is a reusable workflow configuration that defines a graph of jobs with information about where to run those jobs. If the workflow uses the managed cluster option, it uses an ephemeral cluster.
  • Cloud Composer using the Dataproc Operator. Composer directed acyclic graphs (DAGs) can also be used to orchestrate Dataproc Workflow Templates.
  • Opening an SSH session into the cluster's master node and submitting a job directly or by using tools like Apache Beeline. This method is usually reserved for administrators and power users. An example of a power user is a developer who wants to troubleshoot the configuration parameters for a service and verify them by running test jobs directly on the master node.
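
As a hedged sketch of two of these methods, the following commands submit a Spark job directly with the gcloud CLI, and then define and run a workflow template that uses a managed (ephemeral) cluster. Cluster, template, and region names are placeholders; the SparkPi example JAR ships with Dataproc images:

    # Submit a Spark job to an existing cluster.
    gcloud dataproc jobs submit spark \
        --cluster=multi-team-cluster \
        --region=us-central1 \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        -- 1000

    # Define a workflow template that runs the same job on an ephemeral
    # managed cluster, then instantiate it.
    gcloud dataproc workflow-templates create sparkpi-template --region=us-central1
    gcloud dataproc workflow-templates set-managed-cluster sparkpi-template \
        --region=us-central1 \
        --cluster-name=sparkpi-ephemeral
    gcloud dataproc workflow-templates add-job spark \
        --workflow-template=sparkpi-template \
        --region=us-central1 \
        --step-id=compute-pi \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        -- 1000
    gcloud dataproc workflow-templates instantiate sparkpi-template --region=us-central1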

Networking

The following diagram highlights the networking concepts of the architecture:

A networking architecture using a hybrid mesh pattern.

In the preceding diagram, the networking architecture uses a meshed hybrid pattern, in which some resources are located on Google Cloud, and some are located on-premises. The meshed hybrid pattern uses a Shared VPC, with a common host project and separate projects for each Dataproc cluster type and team. The architecture is described in detail in the following On Google Cloud and On-premises sections.

On Google Cloud

On Google Cloud, the architecture is structured using a Shared VPC. A Shared VPC lets resources from multiple projects connect to a common VPC network. Using a common VPC network lets resources communicate with each other securely and efficiently using internal IP addresses from that network. To set up a Shared VPC, you create the following projects:

  • Host project: The host project contains one or more Shared VPC networks used by all the service projects.
  • Service projects: A service project contains related Google Cloud resources. A Shared VPC Admin attaches the service projects to the host project to let them use subnets and resources in the Shared VPC network. This attachment is essential for the single-team clusters to be able to access the centralized Hive metastore.

    As shown in the networking diagram, we recommend creating separate service projects for the Hive metastore cluster, the multi-team clusters, and clusters for each individual team. Members of each team in your organization can then create single-team clusters within their respective projects, as sketched in the following commands.
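
The following commands are a minimal sketch of this Shared VPC setup, run by a Shared VPC Admin; the project IDs are placeholders:

    # Designate the host project, then attach a team's service project to it.
    gcloud compute shared-vpc enable data-lake-host-project
    gcloud compute shared-vpc associated-projects add team-a-project \
        --host-project=data-lake-host-project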

To allow the components within the hybrid network to communicate, you must configure firewall rules that allow the following traffic (see the sketch after this list):

  • Internal cluster traffic for Hadoop services, including traffic from the HDFS NameNode to the HDFS DataNodes and from the YARN ResourceManager to the YARN NodeManagers. We recommend filtering these rules with the cluster service account.
  • External cluster traffic on specific ports to communicate with the Hive metastore (ports tcp:9083 and tcp:8020), the on-premises KDC (port tcp:88), LDAP (port tcp:636), and any other centralized external services that you use in your particular scenario, for example, Kafka on Google Kubernetes Engine (GKE).
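
The following rules are a sketch of both cases, assuming a placeholder VPC network and placeholder cluster and metastore service accounts. Filtering on service accounts keeps the rules independent of IP ranges:

    # Internal cluster traffic between nodes running as the cluster service account.
    gcloud compute firewall-rules create allow-dataproc-internal \
        --network=data-lake-vpc \
        --allow=tcp:0-65535,udp:0-65535,icmp \
        --source-service-accounts=dataproc-sa@example-project.iam.gserviceaccount.com \
        --target-service-accounts=dataproc-sa@example-project.iam.gserviceaccount.com

    # Cluster traffic to the shared Hive metastore cluster.
    gcloud compute firewall-rules create allow-hive-metastore \
        --network=data-lake-vpc \
        --allow=tcp:9083,tcp:8020 \
        --source-service-accounts=dataproc-sa@example-project.iam.gserviceaccount.com \
        --target-service-accounts=metastore-sa@example-project.iam.gserviceaccount.com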

All Dataproc clusters in this architecture are created with internal IP addresses only. To allow cluster nodes to access Google APIs and services, you must enable Private Google Access for the cluster subnets. To allow administrators and power users to access the private VM instances, use IAP TCP forwarding to forward SSH, RDP, and other traffic over an encrypted tunnel.
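
As a sketch with placeholder subnet, cluster, and zone names, the following commands enable Private Google Access on a cluster subnet and open an IAP-tunneled SSH session to a private master node:

    # Enable Private Google Access for the subnet that hosts the clusters.
    gcloud compute networks subnets update dataproc-subnet \
        --region=us-central1 \
        --enable-private-ip-google-access

    # SSH to the internal-IP-only master node through an IAP tunnel.
    gcloud compute ssh multi-team-cluster-m \
        --zone=us-central1-a \
        --tunnel-through-iap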

The cluster web interfaces of the cluster applications and optional components (for example Spark, Hadoop, Jupyter, and Zeppelin) are securely accessed through the Dataproc Component Gateway. The Dataproc Component Gateway is an HTTP-inverting proxy that is based on Apache Knox.

On-premises

This document assumes that the resources located on-premises are the corporate LDAP directory service and the corporate Kerberos Key Distribution Center (KDC) where the user and team service principals are defined. If you don't need to use an on-premises identity provider, you can simplify the setup by using Cloud Identity and a KDC on a Dataproc cluster or on a virtual machine.

To communicate with the on-premises LDAP server and KDC, you use either Cloud Interconnect or Cloud VPN. This setup helps ensure that communication between environments uses private IP addresses, provided that the RFC 1918 subnetwork ranges don't overlap. For more information about the different connection options, see Choosing a Network Connectivity product.

In a hybrid scenario, your authentication requests must be handled with minimal latency. To achieve this goal, you can use the following techniques:

  • Serve all authentication requests for service identities from the on-cluster KDC, and only use an identity provider external to the cluster for user identities. Most of the authentication traffic is expected to be requests from service identities.
  • If you're using AD as your identity provider, User Principal Names (UPNs) represent the human users and AD service accounts. We recommend that you replicate the UPNs from your on-premises AD into a Google Cloud region that is close to your data lake clusters. This AD replica handles authentication requests for UPNs, so the requests never transit to your on-premises AD. Each on-cluster KDC handles the Service Principal Names (SPNs) using the first technique.

The following diagram shows an architecture that uses both techniques:

An on-premises AD synchronizes UPNs to an AD replica, while an on-cluster KDC authenticates UPNs and SPNs.

In the preceding diagram, an on-premises AD synchronizes UPNs to an AD replica in a Google Cloud region. The AD replica authenticates UPNs, and an on-cluster KDC authenticates SPNs.

For information about deploying AD on Google Cloud, see Deploying a fault-tolerant Microsoft Active Directory environment. For information about how to size the number of instances for domain controllers, see Integrating MIT Kerberos and Active Directory.

Authentication

The following diagram shows the components that are used to authenticate users in the different Hadoop clusters. Authentication lets users use services such as Apache Hive or Apache Spark.

Components authenticate users in different Hadoop clusters.

In the preceding diagram, clusters in Kerberos realms can set up cross-realm trust to use services on other clusters, such as the Hive metastore. Non-Kerberized clusters can use a Kerberos client and an account keytab to use services on other clusters.

Shared and secured Hive metastore

The centralized Hive metastore allows multiple clusters that are running different open source query engines—such as Apache Spark, Apache Hive/Beeline, and Presto—to share metadata.

You deploy the Hive metastore server on a Kerberized Dataproc cluster and deploy the Hive metastore database on a remote RDBMS, such as a MySQL instance on Cloud SQL. As a shared service, a Hive metastore cluster only serves authenticated requests. For more information about configuring the Hive metastore, see Using Apache Hive on Dataproc.
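
For example, a client cluster can be pointed at the shared metastore by setting the hive.metastore.uris property at creation time. This is a sketch with placeholder project, cluster, and metastore host names; the metastore Thrift endpoint listens on port 9083 by default:

    gcloud dataproc clusters create team-a-cluster \
        --project=team-a-project \
        --region=us-central1 \
        --properties="hive:hive.metastore.uris=thrift://hive-metastore-m:9083"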

Instead of the Hive metastore, you can use Dataproc Metastore, which is a fully managed, highly available (within a region), autohealing, serverless Apache Hive metastore. You can also enable Kerberos for the Dataproc Metastore service, as explained in Configuring Kerberos for a service.
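
A minimal sketch of that alternative, with placeholder names, creates the managed service and attaches a cluster to it with the --dataproc-metastore flag:

    # Create the managed metastore service.
    gcloud metastore services create shared-metastore \
        --location=us-central1

    # Create a cluster that uses the managed metastore.
    gcloud dataproc clusters create team-a-cluster \
        --region=us-central1 \
        --dataproc-metastore=projects/example-project/locations/us-central1/services/shared-metastore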

Kerberos

In this architecture, the multi-team clusters are used for analytics purposes and are Kerberized by following the guide to Dataproc security configuration. The Dataproc secure mode creates an on-cluster KDC and manages the cluster's service principals and keytabs, as required by the Hadoop secure mode specification.

A keytab is a file that contains one or more pairs of Kerberos principals and encrypted copies of their keys. A keytab allows programmatic Kerberos authentication when interactive login with the kinit command is infeasible.

Access to a keytab means the ability to impersonate the principals that are contained in it. Therefore, a keytab is a highly sensitive file that needs to be securely transferred and stored. We recommend using Secret Manager to store the contents of keytabs before they are transferred to their respective clusters. For an example of how to store the contents of a keytab, see Configuring Kerberos for a service. After a keytab is downloaded to the cluster master node, the file must have restricted file access permissions.
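
The following commands are a sketch of that flow with placeholder secret, file, and keytab names: store the keytab as a secret, then retrieve it on the cluster master node and restrict its permissions:

    # Store the keytab in Secret Manager.
    gcloud secrets create team-a-keytab --replication-policy=automatic
    gcloud secrets versions add team-a-keytab --data-file=team-a.keytab

    # On the cluster master node: retrieve the keytab and lock down access.
    gcloud secrets versions access latest --secret=team-a-keytab \
        > /etc/security/keytab/team-a.service.keytab
    chmod 400 /etc/security/keytab/team-a.service.keytab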

The on-cluster KDC handles the authentication requests for all services within that cluster. Most authentication requests are expected to be this type of request. To minimize latency, it is important for the KDC to resolve those requests without them leaving the cluster.

The remaining requests are from human users and AD service accounts. The AD replica on Google Cloud or the central ID provider on-premises handles these requests, as explained in the preceding On-premises section.

In this architecture, the single-team clusters aren't Kerberized, so there is no KDC present. To allow these clusters to access the shared Hive metastore, you only need to install a Kerberos client. To automate access, you can use the team's keytab. For more information, see the Identity mapping section later in this document.
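
For example, after the Kerberos client packages are installed and the team keytab is in place, a job wrapper can obtain a ticket non-interactively. The principal and realm names are placeholders:

    # Obtain a ticket for the team's principal without an interactive login.
    kinit -kt /etc/security/keytab/team-a.service.keytab team-a@METASTORE.EXAMPLE.COM

    # Verify that the ticket-granting ticket was issued.
    klist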

Kerberos cross-realm trust in multi-team clusters

Cross-realm trust is highly relevant when your workloads are hybrid or multi-cloud. Cross-realm trust lets you integrate central corporate identities into shared services available in your Google Cloud data lake.

In Kerberos, a realm defines a group of systems under a common KDC. Cross-realm authentication enables a user principal from one realm to authenticate in another realm.
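
For example, on Dataproc you can declare the trust at cluster creation time through a Kerberos configuration file. The following is a sketch with placeholder values; the field names follow the Dataproc security configuration guide, and the shared trust password must also be configured on the corporate KDC side:

    # kerberos-config.yaml, written as a here-document for illustration.
    cat > kerberos-config.yaml <<'EOF'
    root_principal_password_uri: gs://example-secrets/root-password.encrypted
    kms_key_uri: projects/example-project/locations/global/keyRings/example-keyring/cryptoKeys/example-key
    cross_realm_trust:
      realm: CORP.EXAMPLE.COM
      kdc: kdc.corp.example.com
      admin_server: kdc.corp.example.com
      shared_password_uri: gs://example-secrets/shared-trust-password.encrypted
    EOF

    gcloud dataproc clusters create multi-team-cluster \
        --region=us-central1 \
        --kerberos-config-file=kerberos-config.yaml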