Cross-silo and cross-device federated learning on Google Cloud

Last reviewed 2024-04-12 UTC

This document describes two reference architectures that help you create a federated learning platform on Google Cloud. The reference architectures and associated resources that are described in this document support the following:

  • Cross-silo federated learning
  • Cross-device federated learning, building upon the cross-silo architecture

The intended audience for this document includes cloud architects and AI and ML engineers who want to implement federated learning use cases on Google Cloud. This document is also intended for decision-makers who are evaluating whether to implement federated learning on Google Cloud.

Architecture

The diagrams in this section show a cross-silo architecture and a cross-device architecture for federated learning.

Cross-silo architecture

The following diagram shows an architecture that supports cross-silo federated learning:

[Diagram: cross-silo architecture. The components are described in the following text.]

The preceding architecture includes the following components:

  • A Virtual Private Cloud (VPC) network and subnet.
  • A private GKE cluster that helps you to do the following:
    • Isolate cluster nodes from the internet.
    • Limit the exposure of your cluster nodes and control plane to the internet by using authorized networks.
    • Use shielded cluster nodes that use a hardened operating system image.
    • Enable Dataplane V2 for optimized Kubernetes networking.
    • Encrypt cluster secrets at the application layer.
  • Dedicated GKE node pools: You create a dedicated node pool to exclusively host tenant apps and resources. The nodes have taints to ensure that only tenant workloads are scheduled onto the tenant nodes. Other cluster resources are hosted in the main node pool.
  • VPC firewall rules, which include the following:
    • Baseline rules that apply to all nodes in the cluster.
    • Additional rules that only apply to nodes in the tenant node pool. These firewall rules limit ingress to and egress from tenant nodes.
  • Cloud NAT to allow egress to the internet.
  • Cloud DNS records that enable Private Google Access so that apps within the cluster can access Google APIs without going over the internet.
  • Service accounts, which are as follows:
    • A dedicated service account for the nodes in the tenant node pool.
    • A dedicated service account for tenant apps to use with Workload Identity Federation.
  • Support for using Google Groups for Kubernetes role-based access control (RBAC).
  • A Cloud Source Repositories repository to store configuration descriptors.
  • An Artifact Registry repository to store container images.

Cross-device architecture

The following diagram shows an architecture that supports cross-device federated learning:

[Diagram: cross-device architecture. The components are described in the following text.]

The preceding cross-device architecture builds upon the cross-silo architecture with the addition of the following components:

  • A Cloud Run service that simulates devices connecting to the server.
  • A Certificate Authority Service instance that issues the private certificates that the server and the clients need to communicate securely.
  • A Vertex AI TensorBoard instance to visualize the training results.
  • A Cloud Storage bucket to store the consolidated model.
  • Confidential GKE Nodes in the private GKE cluster's primary node pool, to help protect data in use.

The cross-device architecture uses components from the open source Federated Compute Platform (FCP) project. This project includes the following:

  • Client code for communicating with a server and executing tasks on the devices
  • A protocol for client-server communication
  • Connection points with TensorFlow Federated to make it easier to define your federated computations

The FCP components shown in the preceding diagram can be deployed as a set of microservices. These components do the following:

  • Aggregator: This job reads device gradients and calculates the aggregated result with differential privacy (see the sketch after this list).
  • Collector: This job runs periodically to query active tasks and encrypted gradients. This information determines when aggregation starts.
  • Model uploader: This job listens to events and publishes results so that devices can download updated models.
  • Task-assignment: This front-end service distributes training tasks to devices.
  • Task-management: This job manages tasks.
  • Task-scheduler: This job either runs periodically or is triggered by specific events.
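
To make the aggregator's role concrete, the following minimal sketch shows one common approach to differentially private aggregation: clip each client update to a bounded L2 norm, average the clipped updates, and add Gaussian noise calibrated to the clipping bound. The function name, clipping bound, and noise multiplier are illustrative assumptions, not the FCP implementation:

```python
import numpy as np

def dp_aggregate(client_updates, clip_norm=1.0, noise_multiplier=0.5, rng=None):
    """Clip, average, and noise client updates (Gaussian mechanism sketch)."""
    rng = rng or np.random.default_rng()
    clipped = []
    for update in client_updates:
        norm = np.linalg.norm(update)
        clipped.append(update * min(1.0, clip_norm / (norm + 1e-12)))
    mean_update = np.mean(clipped, axis=0)
    # The sensitivity of the mean is clip_norm / n for n clipped contributions.
    stddev = noise_multiplier * clip_norm / len(client_updates)
    return mean_update + rng.normal(0.0, stddev, size=mean_update.shape)

# Example: aggregate three simulated client gradient vectors.
updates = [np.random.default_rng(seed).normal(size=4) for seed in range(3)]
print(dp_aggregate(updates))
```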

Products used

The reference architectures for both federated learning use cases build on Google Kubernetes Engine (GKE) and TensorFlow Federated (TFF), along with the other Google Cloud products that are described in the preceding Architecture section.

GKE also provides the following capabilities to your federated learning platform:

  • Hosting the federated learning coordinator: The federated learning coordinator is responsible for managing the federated learning process. This management includes tasks such as distributing the global model to participants, aggregating updates from participants, and updating the global model. GKE can be used to host the federated learning coordinator in a highly available and scalable way.
  • Hosting federated learning participants: Federated learning participants are responsible for training the global model on their local data. GKE can be used to host federated learning participants in a secure and isolated way. This approach can help ensure that participants' data is kept local.
  • Providing a secure and scalable communication channel: Federated learning participants need to be able to communicate with the federated learning coordinator in a secure and scalable way. GKE can be used to provide a secure and scalable communication channel between participants and the coordinator.
  • Managing the lifecycle of federated learning deployments: GKE can be used to manage the lifecycle of federated learning deployments. This management includes tasks such as provisioning resources, deploying the federated learning platform, and monitoring the performance of the federated learning platform.

In addition to these benefits, GKE also provides a number of features that can be useful for federated learning deployments, such as the following:

  • Regional clusters: GKE lets you create regional clusters, helping you to improve the performance of federated learning deployments by reducing latency between participants and the coordinator.
  • Network policies: GKE lets you create network policies, helping to improve the security of federated learning deployments by controlling the flow of traffic between participants and the coordinator.
  • Load balancing: GKE provides a number of load balancing options, helping to improve the scalability of federated learning deployments by distributing traffic between participants and the coordinator.

TFF provides the following features to facilitate the implementation of federated learning use cases:

  • The ability to declaratively express federated computations, which are a set of processing steps that run on a server and a set of clients. These computations can be deployed to diverse runtime environments (see the example after this list).
  • The ability to build custom aggregators by using TFF open source.
  • Support for a variety of federated learning algorithms, including the following algorithms:
    • Federated averaging: A simple algorithm that averages the model parameters of participating clients. It's particularly well-suited for use cases where the data is relatively homogeneous and the model is not too complex. Typical use cases are as follows:
      • Personalized recommendations: A company can use federated averaging to train a model that recommends products to users based on their purchase history.
      • Fraud detection: A consortium of banks can use federated averaging to train a model that detects fraudulent transactions.
      • Medical diagnosis: A group of hospitals can use federated averaging to train a model that diagnoses cancer.
    • Federated stochastic gradient descent (FedSGD): An algorithm that uses stochastic gradient descent to update the model parameters. It's well-suited for use cases where the data is heterogeneous and the model is complex. Typical use cases are as follows:
      • Natural language processing: A company can use FedSGD to train a model that improves the accuracy of speech recognition.
      • Image recognition: A company can use FedSGD to train a model that can identify objects in images.
      • Predictive maintenance: A company can use FedSGD to train a model that predicts when a machine is likely to fail.
    • Federated Adam: An algorithm that uses the Adam optimizer to update the model parameters. Typical use cases are as follows:
      • Recommender systems: A company can use federated Adam to train a model that recommends products to users based on their purchase history.
      • Ranking: A company can use federated Adam to train a model that ranks search results.
      • Click-through rate prediction: A company can use federated Adam to train a model that predicts the likelihood that a user clicks an advertisement.
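
To illustrate the first capability in the preceding list, the following minimal example declaratively expresses a federated computation that averages a value placed on the clients. It assumes that the tensorflow and tensorflow_federated packages are installed; depending on your TFF version, you might need to construct the type as tff.FederatedType(tf.float32, tff.CLIENTS) instead:

```python
import tensorflow as tf
import tensorflow_federated as tff

# A federated computation that averages one float32 value from each client.
@tff.federated_computation(tff.type_at_clients(tf.float32))
def get_average(client_values):
    return tff.federated_mean(client_values)

# Simulate three clients, each contributing one local value.
print(get_average([1.0, 2.0, 3.0]))  # -> 2.0
```

Recent TFF releases also provide higher-level entry points, such as tff.learning.algorithms.build_weighted_fed_avg, that construct a complete federated averaging process like the algorithms described in the preceding list.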

Use cases

This section describes use cases for which the cross-silo and cross-device architectures are appropriate choices for your federated learning platform.

Federated learning is a machine learning setting where many clients collaboratively train a model. This process is led by a central coordinator, and the training data remains decentralized.

In the federated learning paradigm, clients download a global model and improve the model by training locally on their data. Then, each client sends its calculated model updates back to the central server, where the model updates are aggregated and a new iteration of the global model is generated. In these reference architectures, the model training workloads run on GKE.
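
The following framework-free sketch illustrates this cycle with federated averaging. The least-squares gradient step and the example-count weighting are simplified stand-ins for a real training workload:

```python
import numpy as np

def local_training_step(global_weights, local_data, learning_rate=0.1):
    """One simplified local update: a gradient step on a least-squares
    objective over the client's private (x, y) pairs."""
    x, y = local_data
    gradient = x.T @ (x @ global_weights - y) / len(y)
    return global_weights - learning_rate * gradient

def federated_averaging_round(global_weights, clients):
    """Each client trains locally; the server averages the resulting
    models, weighted by the number of local examples."""
    updates = [local_training_step(global_weights, data) for data in clients]
    counts = [len(data[1]) for data in clients]
    return np.average(updates, axis=0, weights=counts)

# Simulate five clients, each holding a private dataset.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(5)]
model = np.zeros(3)
for _ in range(10):
    model = federated_averaging_round(model, clients)
print(model)
```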

Federated learning embodies the privacy principle of data minimization by restricting what data is collected at each stage of computation, limiting access to data, and processing and then discarding data as early as possible. Additionally, the problem setting of federated learning is compatible with additional privacy-preserving techniques, such as using differential privacy (DP) to improve model anonymization and help ensure that the final model doesn't memorize any individual user's data.

Depending on the use case, training models with federated learning can have additional benefits:

  • Compliance: In some cases, regulations might constrain how data can be used or shared. Federated learning might be used to comply with these regulations.
  • Communication efficiency: In some cases, it's more efficient to train a model on distributed data than to centralize the data. For example, the datasets that the model needs to train on might be too large to move to a central location.
  • Making data accessible: Federated learning allows organizations to keep the training data decentralized in per-user or per-organization data silos.
  • Higher model accuracy: Training on real user data (while preserving privacy) rather than on synthetic data (sometimes referred to as proxy data) often results in higher model accuracy.

There are different kinds of federated learning, which are characterized by where the data originates and where the local computations occur. The architectures in this document focus on two types of federated learning: cross-silo and cross-device. Other types of federated learning are out of scope for this document.

Federated learning is further categorized by how the datasets are partitioned, which can be as follows (the sketch after this list illustrates the first two cases):

  • Horizontal federated learning (HFL): Datasets with the same features (columns) but different samples (rows). For example, multiple hospitals might have patient records with the same medical parameters but different patient populations.
  • Vertical federated learning (VFL): Datasets with the same samples (rows) but different features (columns). For example, a bank and an ecommerce company might have customer data with overlapping individuals but different financial and purchasing information.
  • Federated Transfer Learning (FTL): Partial overlap in both samples and features among the datasets. For example, two hospitals might have patient records with some overlapping individuals and some shared medical parameters, but also unique features in each dataset.
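
The following sketch makes the horizontal and vertical cases concrete. The datasets and column names are hypothetical, and in federated learning the combined views in the comments are never actually materialized in one place:

```python
import pandas as pd

# Horizontal partitioning (HFL): same columns, different rows.
hospital_a = pd.DataFrame({"patient_id": [1, 2], "blood_pressure": [120, 135]})
hospital_b = pd.DataFrame({"patient_id": [3, 4], "blood_pressure": [110, 128]})
# A combined view would stack rows: pd.concat([hospital_a, hospital_b])

# Vertical partitioning (VFL): same rows (shared IDs), different columns.
bank = pd.DataFrame({"customer_id": [1, 2], "credit_score": [700, 650]})
retailer = pd.DataFrame({"customer_id": [1, 2], "monthly_spend": [340, 125]})
# A combined view would join on the shared key:
# bank.merge(retailer, on="customer_id")
```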

Cross-silo federated computation is where the participating members are organizations or companies. In practice, the number of members is usually small (for example, within one hundred members). Cross-silo computation is typically used in scenarios where the participating organizations have different datasets, but they want to train a shared model or analyze aggregated results without sharing their raw data with each other. To help you isolate workloads that belong to different participant organizations, the cross-silo reference architecture implements security controls, such as dedicated namespaces and GKE node pools. Cross-namespace communication and cluster inbound and outbound traffic are forbidden by default, unless you explicitly override this setting.
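
The following sketch shows how that default-deny posture might be applied for one participant organization by using the open source Kubernetes Python client. The namespace name is hypothetical, and the reference architecture provisions these controls for you; the sketch only illustrates the underlying Kubernetes objects:

```python
from kubernetes import client, config

# Assumes kubeconfig access to the GKE cluster.
config.load_kube_config()
core = client.CoreV1Api()
networking = client.NetworkingV1Api()

# An isolated namespace for one participant organization (hypothetical name).
namespace = "tenant-org-a"
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name=namespace))
)

# Default-deny policy: an empty pod selector matches all pods in the
# namespace, and listing both policy types with no allow rules denies all
# ingress and egress traffic unless other policies explicitly allow it.
deny_all = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="default-deny-all", namespace=namespace),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(),
        policy_types=["Ingress", "Egress"],
    ),
)
networking.create_namespaced_network_policy(namespace=namespace, body=deny_all)
```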

Example use cases for cross-silo federated learning are as follows:

  • Fraud detection: Federated learning can be used to train a fraud detection model on data that is distributed across multiple organizations. For example, a consortium of banks could use federated learning to train a model that detects fraudulent transactions.
  • Medical diagnosis: Federated learning can be used to train a medical diagnosis model on data that is distributed across multiple hospitals. For example, a group of hospitals could use federated learning to train a model that diagnoses cancer.

Cross-device federated learning is a type of federated computation where the participating members are end-user devices such as mobile phones, vehicles, or IoT devices. The number of members can reach up to a scale of millions or even tens of millions.

The process for cross-device federated learning is similar to the cross-silo process. However, it also requires you to adapt the reference architecture to accommodate some of the extra factors that you must consider when you deal with thousands to millions of devices. You must deploy administrative workloads to handle scenarios that are encountered in cross-device federated learning use cases. For example, you must coordinate the subset of clients that take part in each round of training. The cross-device architecture provides this capability by letting you deploy the FCP services. These services have connection points with TFF, which you use to write the code that manages this coordination.

Example use cases for cross-device federated learning are as follows:

  • Personalized recommendations: You can use cross-device federated learning to train a personalized recommendation model on data that's distributed across multiple devices. For example, a company could use federated learning to train a model that recommends products to users based on their purchase history.
  • Natural language processing: Federated learning can be used to train a natural language processing model on data that is distributed across multiple devices. For example, a company could use federated learning to train a model that improves the accuracy of speech recognition.
  • Predicting vehicle maintenance needs: Federated learning can be used to train a model that predicts when a vehicle is likely to need maintenance. This model could be trained on data that is collected from multiple vehicles. This approach lets the model learn from the experiences of all the vehicles, without compromising the privacy of any individual vehicle.

The following table summarizes the features of the cross-silo and cross-device architectures, and shows you how to categorize the type of federated learning scenario that is applicable for your use case.

| Feature | Cross-silo federated computations | Cross-device federated computations |
| --- | --- | --- |
| Population size | Usually small (for example, within one hundred organizations) | Scalable to thousands, millions, or hundreds of millions of devices |
| Participating members | Organizations or companies | Mobile devices, edge devices, vehicles |
| Most common data partitioning | HFL, VFL, FTL | HFL |
| Data sensitivity | Sensitive data that participants don't want to share with each other in raw format | Data that's too sensitive to share with a central server |
| Data availability | Participants are almost always available | Only a fraction of participants are available at any time |
| Example use cases | Fraud detection, medical diagnosis, financial forecasting | Fitness tracking, voice recognition, image classification |

Design considerations

This section provides guidance to help you use this reference architecture to develop one or more architectures that meet your specific requirements for security, reliability, operational efficiency, cost, and performance.

Cross-silo architecture design considerations

To implement a cross-silo federated learning architecture on Google Cloud, you must meet the following minimum prerequisites, which are explained in more detail in the following sections:

  1. Establish a federated learning consortium.
  2. Determine the collaboration model for the federated learning consortium to implement.
  3. Determine the responsibilities of the participant organizations.

In addition to these prerequisites, there are other actions that the federation owner must take which are outside the scope of this document, such as the following:

  • Manage the federated learning consortium.
  • Design and implement a collaboration model.
  • Prepare, manage, and operate the model training data and the model that the federation owner intends to train.
  • Create, containerize, and orchestrate federated learning workflows.
  • Deploy and manage federated learning workloads.
  • Set up the communication channels for the participant organizations to securely transfer data.

Establish a federated learning consortium

A federated learning consortium is the group of organizations that participate in a cross-silo federated learning effort. Organizations in the consortium share only the parameters of the ML models, and you can encrypt these parameters to increase privacy. If the federated learning consortium allows the practice, organizations can also aggregate data that doesn't contain personally identifiable information (PII).
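
For illustration, the following sketch encrypts serialized model parameters with a symmetric key by using the open source cryptography package. The serialization format and inline key generation are assumptions for illustration only; in production, you would provision and exchange keys through a secure mechanism such as Cloud KMS:

```python
import pickle

import numpy as np
from cryptography.fernet import Fernet

# Generating the key inline is for illustration only.
key = Fernet.generate_key()
cipher = Fernet(key)

# Only model parameters are shared, never raw training data.
model_parameters = {"layer_1": np.ones((4, 4)), "bias": np.zeros(4)}
ciphertext = cipher.encrypt(pickle.dumps(model_parameters))

# The receiving organization decrypts with the shared key.
restored = pickle.loads(cipher.decrypt(ciphertext))
print(restored["bias"])
```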

Determine a collaboration model for the federated learning consortium

The federated learning consortium can implement different collaboration models, such as the following:

  • A centralized model that consists of a single coordinating organization, called the federation owner or orchestrator, and a set of participant organizations or data owners.
  • A decentralized model that consists of organizations that coordinate as a group.
  • A heterogeneous model that consists of a consortium of diverse participating organizations, all of which bring different resources to the consortium.

This document assumes that the collaboration model is a centralized model.

Determine the responsibilities of the participant organizations

After choosing a collaboration model for the federated learning consortium, the federation owner must determine the responsibilities for the participant organizations.

The federation owner must also do the following when they begin to build a federated learning consortium:

  • Coordinate the federated learning effort.
  • Design and implement the global ML model and the ML models to share with the participant organizations.
  • Define the federated learning rounds: the approach for iterating the ML training process.
  • Select the participant organizations that contribute to any given federated learning round. The selected group is called a cohort.
  • Design and implement a consortium membership verification procedure for the participant organizations.
  • Update the global ML model and the ML models to share with the participant organizations.
  • Provide the participant organizations with the tools to validate that the federated learning consortium meets their privacy, security, and regulatory requirements.
  • Provide the participant organizations with secure and encrypted communication channels.
  • Provide the participant organizations with all the necessary non-confidential, aggregated data that they need to complete each federated learning round.

The participant organizations have the following responsibilities:

  • Provide and maintain a secure, isolated environment (a silo). The silo is where participant organizations store their own data, and where ML model training is implemented.
  • Train the models supplied by the federation owner using their own computing infrastructure and their own local data.
  • Share model training results with the federation owner in the form of aggregated data, after removing any PII.

The federation owner and the participant organizations refine the ML model training until the model meets their requirements.

Implement federated learning on Google Cloud

After you establish the federated learning consortium and determine how it will collaborate, we recommend that the participant organizations do the following:

  1. Provision and configure the necessary infrastructure for the federated learning consortium.
  2. Implement the collaboration model.
  3. Start the federated learning effort.

Provision and configure the infrastructure for the federated learning consortium

When provisioning and configuring the infrastructure for the federated learning consortium, the federation owner is responsible for creating the workloads that train the federated ML models and distributing those workloads to the participant organizations. Because a third party (the federation owner) created and provided the workloads, the participant organizations must take precautions when they deploy those workloads in their runtime environments.

Participant organizations must configure their environments according to their individual security best practices, and apply controls that limit the scope and the permissions granted to each workload. In addition to following their individual security best practices, we recommend that the federation owner and the participant organizations consider threat vectors that are specific to federated learning.

Implement the collaboration model

After the federated learning consortium infrastructure is prepared, the federation owner designs and implements the mechanisms that let the participant organizations interact with each other. The approach follows the collaboration model that the federation owner chose for the federated learning consortium.

Start the federated learning effort

After implementing the collaboration model, the federation owner implements the global ML model to train and the ML models to share with the participant organizations. After those ML models are ready, the federation owner starts the first round of the federated learning effort.

During each round of the federated learning effort, the federation owner does the following (the sketch after this list illustrates the control flow):

  1. Distributes the ML models to share with the participant organizations.
  2. Waits for the participant organizations to deliver the results of the training of the ML models that the federation owner shared.
  3. Collects and processes the training results that the participant organizations produced.
  4. Updates the global ML model when they receive appropriate training results from the participant organizations.
  5. Updates the ML models to share with the other members of the consortium when applicable.
  6. Prepares the training data for the next round of federated learning.
  7. Starts the next round of federated learning.
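
The following skeleton sketches the control flow of these steps, including cohort selection. The train_locally and aggregate callables are hypothetical placeholders for the distribution, training, and aggregation mechanisms that your collaboration model defines:

```python
import random

def run_federated_rounds(global_model, participants, train_locally, aggregate,
                         rounds=5, cohort_size=3):
    """Orchestration skeleton for the round structure described above."""
    for round_number in range(rounds):
        # Step 1: select a cohort and distribute the shared model.
        cohort = random.sample(participants, k=min(cohort_size, len(participants)))
        # Steps 2-3: wait for and collect each member's training result.
        results = [train_locally(member, global_model) for member in cohort]
        # Steps 4-5: update the global model from the collected results.
        global_model = aggregate(global_model, results)
    return global_model

# Trivial stand-ins: each "participant" nudges the model toward its value.
final_model = run_federated_rounds(
    global_model=0.0,
    participants=[1.0, 2.0, 3.0, 4.0],
    train_locally=lambda member, model: model + 0.5 * (member - model),
    aggregate=lambda model, results: sum(results) / len(results),
)
print(final_model)
```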

Security, privacy, and compliance

This section describes factors that you should consider when you use this reference architecture to design and build a federated learning platform on Google Cloud. This guidance applies to both of the architectures that this document describes.

The federated learning workloads that you deploy in your environments might expose you, your data, your federated learning models, and your infrastructure to threats that might impact your business.

To help you increase the security of your federated learning environments, these reference architectures configure GKE security controls that focus on the infrastructure of your environments. These controls might not be enough to protect you from threats that are specific to your federated learning workloads and use cases. Given the specificity of each federated learning workload and use case, security controls aimed at securing your federated learning implementation are out of the scope of this document. For more information and examples about these threats, see Federated Learning security considerations.

GKE security controls

This section discusses the controls that you apply with these architectures to help you secure your GKE cluster.

Enhanced security of GKE clusters

These reference architectures help you create a GKE cluster that implements the following security settings: