How Google protects the physical-to-logical space in a data center

This content was last updated in May 2023, and represents the status quo as of the time it was written. Google's security policies and systems may change going forward, as we continually improve protection for our customers.

Each Google data center is a large and diverse environment of machines, networking devices, and control systems. Data centers are designed as industrial complexes that require a wide range of roles and skills to manage, maintain, and operate.

In these complex environments, the security of your data is our top priority. Google implements six layers of physical controls (video) and many logical controls on the machines themselves. We also continuously model threat scenarios in which certain controls fail or aren't applied.

Some threat scenarios model insider risk and assume that an attacker already has legitimate access to the data center floor. These scenarios reveal a space between physical and logical controls that also requires defense in depth. That space, defined as the span from arm's length of a machine in a rack to the machine's runtime environment, is known as the physical-to-logical space.

The physical-to-logical space is similar to the physical environment around your smartphone. Even though your phone is locked, you only give physical access to people who have a valid reason for access. Google takes the same approach to the machines that hold your data.

Physical-to-logical controls summary

Within the physical-to-logical space, Google uses four controls that work together:

  • Hardware hardening: Reduce each machine's physical access paths, known as the attack surface, in the following ways:
    • Minimize physical access vectors, like ports.
    • Lock down remaining paths at the firmware level, including the basic input/output system (BIOS), any management controllers, and peripheral devices.
  • Task-based access control: Provide access to secure rack enclosures only to personnel who have a valid, time-bound business justification.
  • Anomalous event detection: Generate alerts when physical-to-logical controls detect anomalous events.
  • System self-defense: Recognize a change in the physical environment and respond to threats with defensive actions.

Together, these controls provide a defense-in-depth response to security events that occur in the physical-to-logical space. The following diagram shows all four controls that are active on a secure rack enclosure.

The four controls that are active on a secure rack enclosure.

Hardware hardening

Hardware hardening helps to reduce the physical attack surface to minimize residual risks.

A conventional enterprise data center has an open floor plan and rows of racks with no barriers between the front panel and people on the data center floor. Such a data center might have machines with many external ports—such as USB-A, Micro-USB, or RJ-45—that increase the risk of an attack. Anyone with physical access to the data center floor can quickly and easily access removable storage or plug a USB stick with malware into an exposed front panel port. Google data centers use hardware hardening as a foundational control to help mitigate these risks.

Hardware hardening is a suite of preventative measures on the rack and its machines that helps reduce the physical attack surface as much as possible. Hardening on machines includes the following:

  • Remove or disable exposed ports and lock down remaining ports at the firmware level.
  • Monitor storage media with high-fidelity tamper-detection signals.
  • Encrypt data at rest.
  • Where supported by the hardware, use device attestation to help prevent unauthorized devices from deploying in the runtime environment.

In certain scenarios, to help ensure that no personnel have physical access to machines, Google also installs secure rack enclosures that help to prevent or deter tampering. The secure rack enclosures provide an immediate physical barrier to passersby and can also trigger alarms and notifications for security personnel. Enclosures, combined with the machine remediations discussed earlier, provide a powerful layer of protection for the physical-to-logical space.

The following images illustrate the progression from fully open racks to secure rack enclosures with full hardware hardening.

  • The following image shows a rack with no hardware hardening:

    A rack with no hardware hardening.

  • The following image shows a rack with some hardware hardening:

    A rack with some hardware hardening.

  • The following image shows the front and back of a rack with full hardware hardening:

    The front and back of a rack with full hardware hardening.

Task-based access control

Task-based access controls (TBAC) help ensure that only personnel with a valid business need can access sensitive machines.

Secure rack enclosures must balance physical security with access for valid reasons. To maintain our complex infrastructure for our customers, Google must be able to grant quick, reliable access for valid business needs, like machine repairs. Also, unauthorized access attempts must be logged and flagged for investigation.

TBAC enables both capabilities. Data center personnel receive time-bound access to an individual secure rack enclosure based on specific business tasks, and TBAC systems enforce that access. TBAC logs all access attempts and alerts security staff when potential security events are detected.

For example, after receiving a work request, a supervisor can generate a task for a machine that is housed in a rack named Secure Rack Enclosure 123. The supervisor then sets a timeframe for the work (for example, two hours). When a technician claims the work ticket, TBAC allows access to Secure Rack Enclosure 123 for that person and starts a two-hour timer when the enclosure door opens. TBAC revokes access to Secure Rack Enclosure 123 when two hours have passed or when the technician closes the task, which marks the work complete.
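
The workflow in this example can be sketched in a few lines of Python. This is only an illustration of time-bound, task-scoped access, not a description of how Google's TBAC systems are built. The `Task` class, its fields, and the `may_open` method are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Task:
    """A hypothetical work task tied to one enclosure and one technician."""
    enclosure: str
    duration: timedelta
    technician: Optional[str] = None            # set when the ticket is claimed
    door_opened_at: Optional[datetime] = None   # timer starts at first door open
    closed: bool = False

    def claim(self, technician: str) -> None:
        """Record which technician claimed the work ticket."""
        self.technician = technician

    def close(self) -> None:
        """Mark the work complete, which revokes access."""
        self.closed = True

    def may_open(self, person: str, enclosure: str, now: datetime) -> bool:
        """Return True if `person` may open `enclosure` at time `now`."""
        if self.closed or person != self.technician or enclosure != self.enclosure:
            return False
        if self.door_opened_at is None:
            # The first authorized door open starts the timer.
            self.door_opened_at = now
            return True
        # Access lapses once the allotted time has passed.
        return now - self.door_opened_at < self.duration
```

In this sketch, a task created with `duration=timedelta(hours=2)` allows the claiming technician to open the named enclosure until two hours after the first door open, or until `close()` is called, whichever comes first. Every other person and every other enclosure is denied.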

Secure rack enclosures have various authentication and authorization mechanisms. The most basic enclosure uses a physical key, which grants authentication and authorization together, and therefore only provides a coarse-grained security control. For additional security value, some enclosures use keypads that have individually assigned and rotating PINs.

In some cases, Google employs two-factor authentication that is paired with a separate authorization mechanism. Authentication begins with an individual swiping their assigned badge, and the second factor can be a user-assigned PIN or a more sophisticated factor, like biometrics.
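
The distinction between authentication and authorization can be made concrete with a small sketch. It assumes a badge-plus-PIN scheme and an in-memory grant table. The registries `BADGES` and `GRANTS`, the fixed salt, and the function names are all hypothetical and are not taken from any Google system.

```python
import hashlib
import hmac
from typing import Dict, Optional, Set, Tuple

SALT = b"demo-salt"  # illustrative only; a real system would use per-user random salts


def pin_digest(pin: str) -> bytes:
    """Derive a digest from a PIN so the PIN itself is never stored."""
    return hashlib.pbkdf2_hmac("sha256", pin.encode(), SALT, 100_000)


# Hypothetical registries: badge ID -> (person, PIN digest), and active grants.
BADGES: Dict[str, Tuple[str, bytes]] = {
    "badge-001": ("technician-a", pin_digest("4821")),
}
GRANTS: Set[Tuple[str, str]] = {("technician-a", "Secure Rack Enclosure 123")}


def authenticate(badge_id: str, pin: str) -> Optional[str]:
    """Factor 1 is the badge, factor 2 is the PIN. Return the person or None."""
    record = BADGES.get(badge_id)
    if record is None:
        return None
    person, expected = record
    # Constant-time comparison avoids leaking how many bytes matched.
    return person if hmac.compare_digest(pin_digest(pin), expected) else None


def may_open(badge_id: str, pin: str, enclosure: str) -> bool:
    """Authentication proves identity; authorization is checked separately."""
    person = authenticate(badge_id, pin)
    return person is not None and (person, enclosure) in GRANTS
```

Keeping the two checks separate means that a person who authenticates successfully still cannot open an enclosure for which no grant exists. A physical key cannot express that distinction.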

Anomalous event detection

Anomalous event detection lets security staff know when machines experience unexpected events.

Industry-wide, organizations can take months or years to discover security breaches, and often only after significant damage or loss has occurred. The critical indicator of compromise (IoC) might be lost in a high volume of logging and telemetry data from millions of production machines. Google, however, uses TBAC and multiple data streams to help identify potential physical-to-logical security events in real time. This control is called anomalous event detection.

Modern machines monitor and record their physical state as well as events that occur in the physical-to-logical space. Machines receive this information through ever-present automated system software. This software may run on miniature computers inside the machine, called baseboard management controllers (BMCs), or as part of an operating system daemon. This software reports important events such as login attempts, insertion of physical devices, and sensor alarms such as an enclosure tamper sensor.

With anomalous event detection, Google combines context from system-reported events with work tracking from TBAC to detect unusual activity. For example, if a machine in Secure Rack Enclosure 123 reports that a hard drive was removed, our systems check to see if that machine was recently authorized for a hard drive swap. If no authorization exists, the reported event, combined with the task-based authorization data, triggers an alert for security staff to investigate further.
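
The hard drive example amounts to joining a machine-reported event against the set of current work authorizations. The following minimal sketch shows that join under simplifying assumptions: events and authorizations are matched on machine name, action, and time window only. The `Authorization` and `MachineEvent` types and the `check_event` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Authorization:
    """A hypothetical TBAC record: who may do what to which machine, and when."""
    machine: str
    action: str
    valid_from: datetime
    valid_until: datetime


@dataclass(frozen=True)
class MachineEvent:
    """A hypothetical event reported by system software on a machine."""
    machine: str
    action: str
    at: datetime


def check_event(event: MachineEvent, authorizations: List[Authorization]) -> Optional[str]:
    """Return an alert message if no authorization covers the event, else None."""
    for auth in authorizations:
        if (auth.machine == event.machine
                and auth.action == event.action
                and auth.valid_from <= event.at <= auth.valid_until):
            return None  # the event is explained by authorized work
    return f"ALERT: unauthorized {event.action} on {event.machine} at {event.at.isoformat()}"
```

A drive removal that falls inside an authorized window for the same machine produces no alert. The same event with no matching authorization returns an alert string for security staff to investigate.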

For machines with hardware root-of-trust, anomalous event detection signals become even stronger. Hardware root-of-trust allows system software, such as BMC firmware, to attest that it booted safely. Google detection systems, therefore, have an even higher degree of confidence that reported events are valid. For more information about independent roots of trust, see Remote attestation of disaggregated machines.

System self-defense

System self-defense lets systems respond to potential compromises with immediate defensive action.

Some threat scenarios assume that an attacker in the physical-to-logical space can defeat the physical access measures discussed in Hardware hardening. Such an attacker might be targeting user data or a sensitive process that is running on a machine.

To mitigate this risk, Google implements system self-defense: a control that provides an immediate and decisive response to any potential compromise. This control uses the telemetry from the physical environment to act in the logical environment.

Most large-scale production environments have multiple physical machines in one rack. Each physical machine runs multiple workloads, like virtual machines (VMs) or Kubernetes containers. Each VM runs its own operating system using dedicated memory and storage.

To determine which workloads are exposed to security events, Google aggregates the telemetry data from the hardware-hardening controls, TBAC, and anomalous event detection. We then correlate the data to generate a small set of events that are high-risk and require immediate action. For example, the combination of a secure rack door alarm, a machine chassis opening signal, and the lack of a valid work authorization might constitute a high-risk event.

When Google detects these events, systems can take immediate action:

  • Exposed workloads can immediately terminate sensitive services and wipe any sensitive data.
  • The networking fabric can isolate the affected rack.
  • The affected workloads can be rescheduled on other machines or even data centers, depending on the situation.
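
The correlation and response steps described above can be sketched as a rule that combines several signals into a single high-risk decision, followed by a list of planned actions. The rule below encodes only the example combination given in the text. The `RackSignals` type, the function names, and the action strings are hypothetical, and a production system would weigh many more signals.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RackSignals:
    """Hypothetical aggregated telemetry for one rack."""
    door_alarm: bool
    chassis_opened: bool
    valid_work_authorization: bool


def is_high_risk(signals: RackSignals) -> bool:
    """Door alarm plus chassis open, with no valid work authorization."""
    return (signals.door_alarm
            and signals.chassis_opened
            and not signals.valid_work_authorization)


def plan_response(rack: str, workloads: List[str], signals: RackSignals) -> List[str]:
    """Return the defensive actions to take, or an empty list if risk is low."""
    if not is_high_risk(signals):
        return []
    actions = [f"terminate sensitive services and wipe sensitive data: {w}" for w in workloads]
    actions.append(f"isolate rack on networking fabric: {rack}")
    actions += [f"reschedule on another machine: {w}" for w in workloads]
    return actions
```

Requiring several independent signals before acting keeps routine, authorized maintenance from triggering disruptive responses, while an unexplained combination leads directly to termination, isolation, and rescheduling.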

Because of the system self-defense control, even if an attacker succeeds in getting physical access to a machine, the attacker can't extract any data and can't move laterally in the environment.

Authors: Thomas Koh and Kevin Plybon