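Automating responses to integrity validation failures

Learn how to use a Cloud Run functions trigger to automatically act on Shielded VM integrity monitoring events.

Overview

Integrity monitoring collects measurements from Shielded VM instances and surfaces them in Cloud Logging. If the integrity measurements change between boots of a Shielded VM instance, integrity validation fails. The failure is captured as a logged event and is also raised in Cloud Monitoring.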
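
As a rough illustration of what a failed validation means, the sketch below compares the baseline and observed measurements of one report event and lists the PCRs that differ. The field names `policyMeasurements`, `actualMeasurements`, `pcrNum`, and `value` are the ones used later in this tutorial; the helper name `changed_pcrs` and the hash values are hypothetical placeholders, not real log data.

```python
def changed_pcrs(report_event):
    """Return the PCR names whose observed value differs from the baseline."""
    policy = {m["pcrNum"]: m["value"]
              for m in report_event.get("policyMeasurements", [])}
    actual = {m["pcrNum"]: m["value"]
              for m in report_event.get("actualMeasurements", [])}
    # A PCR counts as changed if it is missing on either side or its value differs.
    return sorted(name for name in policy.keys() | actual.keys()
                  if policy.get(name) != actual.get(name))


# Hypothetical report event with made-up hash values.
sample_event = {
    "policyEvaluationPassed": False,
    "policyMeasurements": [
        {"pcrNum": "PCR_0", "hashAlgo": "SHA1", "value": "aaaa"},
        {"pcrNum": "PCR_4", "hashAlgo": "SHA1", "value": "bbbb"},
        {"pcrNum": "PCR_7", "hashAlgo": "SHA1", "value": "cccc"},
    ],
    "actualMeasurements": [
        {"pcrNum": "PCR_0", "hashAlgo": "SHA1", "value": "aaaa"},
        {"pcrNum": "PCR_4", "hashAlgo": "SHA1", "value": "ffff"},
        {"pcrNum": "PCR_7", "hashAlgo": "SHA1", "value": "cccc"},
    ],
}

print(changed_pcrs(sample_event))  # ['PCR_4']
```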

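Sometimes Shielded VM integrity measurements change for a legitimate reason. For example, a system update can cause expected changes to the operating system kernel. For this reason, integrity monitoring lets you prompt a Shielded VM instance to learn a new integrity policy baseline when an integrity validation failure is expected.

In this tutorial, you first create a simple automated system that shuts down Shielded VM instances that fail integrity validation:

Export all integrity monitoring events to a Pub/Sub topic.
Create a Cloud Run functions trigger that uses the events in that topic to identify and shut down Shielded VM instances that fail integrity validation.

Next, you can optionally extend the system so that it prompts a Shielded VM instance that fails integrity validation to learn the new baseline if its measurements match a known good measurement, or to shut down otherwise:

Create a Firestore database to maintain a set of known good integrity baseline measurements.
Update the Cloud Run functions trigger so that it prompts a Shielded VM instance that fails integrity validation to learn the new baseline if that baseline is in the database, or to shut down otherwise.

If you implement the extended solution, use it as follows:

Each time there is an update that is expected to cause a validation failure for a legitimate reason, run that update on a single Shielded VM instance in the instance group.
Using the late boot event from the updated VM instance as the source, add the new policy baseline measurements to the database by creating a new document in the known_good_measurements collection. For details, see Creating a database of known good baseline measurements.
Update the remaining Shielded VM instances. The trigger prompts the remaining instances to learn the new baseline, because they can be verified as known good. For details, see Updating the Cloud Run functions trigger to learn known good baselines.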

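Prerequisites

Use a project that has Firestore in Native mode selected as its database service. You make this choice when you create the project, and it cannot be changed afterward. If your project does not use Firestore in Native mode, you see the message "This project uses another database service" when you open the Firestore console.
Have a Compute Engine Shielded VM instance in the project to serve as the source of integrity baseline measurements. The Shielded VM instance must have been restarted at least once.
Have the gcloud command-line tool installed.
Enable the Cloud Logging and Cloud Run functions APIs by following these steps:
1. In the Google Cloud console, go to the APIs & Services page.
  
  Go to APIs & Services
2. Check whether Cloud Functions API and Stackdriver Logging API appear in the list of enabled APIs and services.
3. If either API is not listed, click Add APIs and Services.
4. Search for and enable the APIs as needed.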

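Exporting integrity monitoring log entries to a Pub/Sub topic

Use Logging to export all integrity monitoring log entries generated by Shielded VM instances to a Pub/Sub topic. You use this topic as the data source for a Cloud Run functions trigger that automates responses to integrity monitoring events.

Logs Explorer

In the Google Cloud console, go to the Logs Explorer page.

Go to Cloud Logging

In the query builder, enter the following values.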

resource.type="gce_instance"
AND logName:  "projects/YOUR_PROJECT_ID/logs/compute.googleapis.com/shielded_vm_integrity"

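Click Run filter.
Click More actions, then select Create sink.
On the Create logs routing sink page:
1. In Sink details, enter integrity-monitoring for Sink name, then click Next.
2. In Sink destination, expand Sink service, then select Cloud Pub/Sub.
3. Expand Select a Cloud Pub/Sub topic, then select Create a topic.
4. In the Create a topic dialog, enter integrity-monitoring for Topic ID, then click Create topic.
5. Click Next, then click Create sink.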

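Legacy Logs Viewer

In the Google Cloud console, go to the Logs Explorer page.

Go to Cloud Logging
Click Options, then select Go back to the legacy Logs Viewer.
Expand Filter by label or text search, then click Convert to advanced filter.

Enter the following advanced filter: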

resource.type="gce_instance"
AND logName:  "projects/YOUR_PROJECT_ID/logs/compute.googleapis.com/shielded_vm_integrity"

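Note that there are two spaces after logName:.

Click Submit Filter.
Click Create Export.
For Sink Name, enter integrity-monitoring.
For Sink Service, select Cloud Pub/Sub.
Expand Sink Destination, then click Create new Cloud Pub/Sub topic.
For Name, enter integrity-monitoring, then click Create.
Click Create Sink.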
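
If you script this setup instead of using the console, the two strings you need are the log filter and the sink destination. The following is a minimal sketch that only builds those strings. The helper names are hypothetical, and the `pubsub.googleapis.com/projects/.../topics/...` destination format is an assumption about how Cloud Logging identifies a Pub/Sub sink destination, so check it against the Logging documentation before relying on it.

```python
# Log ID taken from the filter shown in this section.
LOG_ID = "compute.googleapis.com/shielded_vm_integrity"


def integrity_log_filter(project_id):
    """Build the filter from this section for the given project ID."""
    # The two spaces after "logName:" are kept exactly as the tutorial shows them.
    return ('resource.type="gce_instance"\n'
            'AND logName:  "projects/%s/logs/%s"' % (project_id, LOG_ID))


def pubsub_sink_destination(project_id, topic_id="integrity-monitoring"):
    """Build a sink destination string for a Pub/Sub topic (assumed format)."""
    return "pubsub.googleapis.com/projects/%s/topics/%s" % (project_id, topic_id)


# "example-project" is a placeholder project ID.
print(integrity_log_filter("example-project"))
print(pubsub_sink_destination("example-project"))
```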

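Creating a Cloud Run functions trigger to respond to integrity validation failures

Create a Cloud Run functions trigger that reads the data in the Pub/Sub topic and stops any Shielded VM instance that fails integrity validation.

The following code defines the Cloud Run functions trigger. Copy it into a file named main.py.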

import base64
import json
import googleapiclient.discovery

def shutdown_vm(data, context):
    """A Cloud Function that shuts down a VM on failed integrity check."""
    log_entry = json.loads(base64.b64decode(data['data']).decode('utf-8'))
    payload = log_entry.get('jsonPayload', {})
    entry_type = payload.get('@type')
    if entry_type != 'type.googleapis.com/cloud_integrity.IntegrityEvent':
      raise TypeError("Unexpected log entry type: %s" % entry_type)

    report_event = (payload.get('earlyBootReportEvent')
        or payload.get('lateBootReportEvent'))

    if report_event is None:
      # We received a different event type, ignore.
      return

    policy_passed = report_event['policyEvaluationPassed']
    if not policy_passed:
      print('Integrity evaluation failed: %s' % report_event)
      print('Shutting down the VM')

      instance_id = log_entry['resource']['labels']['instance_id']
      project_id = log_entry['resource']['labels']['project_id']
      zone = log_entry['resource']['labels']['zone']

      # Shut down the instance.
      compute = googleapiclient.discovery.build(
          'compute', 'v1', cache_discovery=False)

      # Get the instance name from instance id.
      list_result = compute.instances().list(
          project=project_id,
          zone=zone,
          filter='id eq %s' % instance_id).execute()
      if len(list_result['items']) != 1:
        raise KeyError('unexpected number of items: %d'
            % len(list_result['items']))
      instance_name = list_result['items'][0]['name']

      result = compute.instances().stop(project=project_id,
          zone=zone,
          instance=instance_name).execute()
      print('Instance %s in project %s has been scheduled for shut down.'
          % (instance_name, project_id))
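
Before deploying, it can help to exercise the message-decoding part of the function locally, without calling the Compute Engine API. The sketch below wraps a log entry the same way the function above expects to receive it (base64-encoded JSON under the `data` key) and mirrors the parsing steps of `shutdown_vm`. The helper names `make_pubsub_event` and `failed_report` and the sample log entry are hypothetical; a real log entry contains more fields.

```python
import base64
import json


def make_pubsub_event(log_entry):
    """Wrap a log entry dict as base64-encoded JSON under the 'data' key."""
    encoded = base64.b64encode(json.dumps(log_entry).encode("utf-8"))
    return {"data": encoded.decode("ascii")}


def failed_report(event):
    """Mirror the parsing in shutdown_vm; return the failing report or None."""
    log_entry = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    payload = log_entry.get("jsonPayload", {})
    report = (payload.get("earlyBootReportEvent")
              or payload.get("lateBootReportEvent"))
    if report is None or report["policyEvaluationPassed"]:
        return None
    return report


# Hypothetical, heavily trimmed log entry for local testing.
sample_log_entry = {
    "jsonPayload": {
        "@type": "type.googleapis.com/cloud_integrity.IntegrityEvent",
        "lateBootReportEvent": {"policyEvaluationPassed": False},
    },
    "resource": {"labels": {"instance_id": "1234567890",
                            "project_id": "example-project",
                            "zone": "us-central1-a"}},
}

print(failed_report(make_pubsub_event(sample_log_entry)))
```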

In the same location as main.py, create a file named requirements.txt and copy the following dependencies into it:
```
google-api-python-client==1.6.6
google-auth==1.4.1
google-auth-httplib2==0.0.3
```
Open a terminal window and navigate to the directory that contains main.py and requirements.txt.

Run the gcloud beta functions deploy command to deploy the trigger:

gcloud beta functions deploy shutdown_vm \
    --project PROJECT_ID \
    --runtime python37 \
    --trigger-resource integrity-monitoring \
    --trigger-event google.pubsub.topic.publish

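Creating a database of known good baseline measurements

Create a Firestore database to provide a source of known good integrity policy baseline measurements. You must add baseline measurements manually to keep this database up to date.

In the Google Cloud console, go to the VM instances page.

Go to VM instances
Click the Shielded VM instance ID to open the VM instance details page.
Under Logs, click Stackdriver Logging.
Locate the most recent lateBootReportEvent log entry.
Expand the log entry > jsonPayload > lateBootReportEvent > policyMeasurements.
Note the values of the elements contained in lateBootReportEvent > policyMeasurements.
In the Google Cloud console, go to the Firestore page.

Go to the Firestore console
Select Start collection.
For Collection ID, type known_good_measurements.
For Document ID, type baseline1.
For Field name, type the value of the pcrNum field of element 0 in lateBootReportEvent > policyMeasurements.
For Field type, select map.
Add three string fields to the map field, named hashAlgo, pcrNum, and value. Set them to the values of the corresponding fields of element 0 in lateBootReportEvent > policyMeasurements.
Create more map fields, one for each additional element in lateBootReportEvent > policyMeasurements. Give them the same subfields as the first map field, with values taken from the corresponding additional element.

For example, for a Linux VM the finished baseline1 document should contain three map fields, PCR_0, PCR_4, and PCR_7, each with hashAlgo, pcrNum, and value subfields.

A Windows VM reports more measurements, so its document should contain six map fields: PCR_0, PCR_4, PCR_7, PCR_11, PCR_13, and PCR_14.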
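
The document layout described in the steps above can be sketched as a plain dictionary: one map field per measurement, keyed by its pcrNum value. This is only an illustration of the shape of the data. The helper name `measurements_to_document` and the hash values are hypothetical, and the sketch does not write anything to Firestore.

```python
def measurements_to_document(policy_measurements):
    """Turn a policyMeasurements list into the baseline document layout."""
    document = {}
    for element in policy_measurements:
        # Each map field is named after pcrNum and holds three string subfields.
        document[element["pcrNum"]] = {
            "hashAlgo": element["hashAlgo"],
            "pcrNum": element["pcrNum"],
            "value": element["value"],
        }
    return document


# Hypothetical Linux-style measurements with made-up hash values.
sample_policy_measurements = [
    {"pcrNum": "PCR_0", "hashAlgo": "SHA1", "value": "aaaa"},
    {"pcrNum": "PCR_4", "hashAlgo": "SHA1", "value": "bbbb"},
    {"pcrNum": "PCR_7", "hashAlgo": "SHA1", "value": "cccc"},
]

baseline1 = measurements_to_document(sample_policy_measurements)
print(sorted(baseline1))  # ['PCR_0', 'PCR_4', 'PCR_7']
```

Keeping exactly these three subfields matters if the trigger compares each stored map with the corresponding log element for equality, as the function later in this tutorial does.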

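Updating the Cloud Run functions trigger to learn known good baselines

The following code creates a Cloud Run functions trigger that causes any Shielded VM instance that fails integrity validation to learn the new baseline if it is in the database of known good measurements, or to shut down otherwise. Copy this code and use it to overwrite the existing code in main.py.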

import base64
import json
import googleapiclient.discovery

import firebase_admin
from firebase_admin import credentials
from firebase_admin import firestore

PROJECT_ID = 'PROJECT_ID'

firebase_admin.initialize_app(credentials.ApplicationDefault(), {
    'projectId': PROJECT_ID,
})

def pcr_values_to_dict(pcr_values):
  """Converts a list of PCR values to a dict, keyed by PCR num"""
  result = {}
  for value in pcr_values:
    result[value['pcrNum']] = value
  return result

def instance_id_to_instance_name(compute, zone, project_id, instance_id):
  list_result = compute.instances().list(
      project=project_id,
      zone=zone,
      filter='id eq %s' % instance_id).execute()
  if len(list_result['items']) != 1:
    raise KeyError('unexpected number of items: %d'
        % len(list_result['items']))
  return list_result['items'][0]['name']

def relearn_if_known_good(data, context):
    """A Cloud Function that, on a failed integrity check, relearns the
    baseline if the new measurements are known good and otherwise shuts
    down the VM.
    """
    log_entry = json.loads(base64.b64decode(data['data']).decode('utf-8'))
    payload = log_entry.get('jsonPayload', {})
    entry_type = payload.get('@type')
    if entry_type != 'type.googleapis.com/cloud_integrity.IntegrityEvent':
      raise TypeError("Unexpected log entry type: %s" % entry_type)

    # We only send relearn signal upon receiving late boot report event: if
    # early boot measurements are in a known good database, but late boot
    # measurements aren't, and we send relearn signal upon receiving early boot
    # report event, the VM will also relearn late boot policy baseline, which we
    # don't want, because they aren't known good.
    report_event = payload.get('lateBootReportEvent')
    if report_event is None:
      return

    evaluation_passed = report_event['policyEvaluationPassed']
    if evaluation_passed:
      # Policy evaluation passed, nothing to do.
      return

    # See if the new measurement is known good, and if it is, relearn.
    measurements = pcr_values_to_dict(report_event['actualMeasurements'])

    db = firestore.Client()
    kg_ref = db.collection('known_good_measurements')

    # Check current measurements against known good database.
    relearn = False
    for kg in kg_ref.get():

      kg_map = kg.to_dict()

      # Check PCR values for lateBootReportEvent measurements against the known good
      # measurements stored in the Firestore table

      if ('PCR_0' in kg_map and kg_map['PCR_0'] == measurements['PCR_0'] and
          'PCR_4' in kg_map and kg_map['PCR_4'] == measurements['PCR_4'] and
          'PCR_7' in kg_map and kg_map['PCR_7'] == measurements['PCR_7']):

        # Linux VM (3 measurements), only need to check above 3 measurements
        if len(kg_map) == 3:
          relearn = True

        # Windows VM (6 measurements), need to check 3 additional measurements
        elif len(kg_map) == 6:
          if ('PCR_11' in kg_map and kg_map['PCR_11'] == measurements['PCR_11'] and
              'PCR_13' in kg_map and kg_map['PCR_13'] == measurements['PCR_13'] and
              'PCR_14' in kg_map and kg_map['PCR_14'] == measurements['PCR_14']):
            relearn = True

    compute = googleapiclient.discovery.build('compute', 'beta',
        cache_discovery=False)

    instance_id = log_entry['resource']['labels']['instance_id']
    project_id = log_entry['resource']['labels']['project_id']
    zone = log_entry['resource']['labels']['zone']

    instance_name = instance_id_to_instance_name(compute, zone, project_id, instance_id)

    if not relearn:
      # Issue shutdown API call.
      print('New measurement is not known good. Shutting down a VM.')

      result = compute.instances().stop(project=project_id,
          zone=zone,
          instance=instance_name).execute()

      print('Instance %s in project %s has been scheduled for shut down.'
            % (instance_name, project_id))

    else:
      # Issue relearn API call.
      print('New measurement is known good. Relearning...')

      result = compute.instances().setShieldedInstanceIntegrityPolicy(
          project=project_id,
          zone=zone,
          instance=instance_name,
          body={'updateAutoLearnPolicy':True}).execute()

      print('Instance %s in project %s has been scheduled for relearning.'
        % (instance_name, project_id))
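
The comparison at the heart of the function above can also be written as a pure function, which makes it easy to unit test without Firestore or Compute Engine. The following is a sketch of that idea rather than a drop-in replacement. The names `matches_baseline` and `is_known_good` are hypothetical, and unlike the code above it uses `dict.get` for the observed measurements, so a missing PCR counts as a mismatch instead of raising KeyError.

```python
LINUX_PCRS = ("PCR_0", "PCR_4", "PCR_7")
WINDOWS_EXTRA_PCRS = ("PCR_11", "PCR_13", "PCR_14")


def matches_baseline(baseline, measurements):
    """Return True if the measurements match one known good baseline document."""
    if len(baseline) == 3:
        required = LINUX_PCRS  # Linux VM: 3 measurements
    elif len(baseline) == 6:
        required = LINUX_PCRS + WINDOWS_EXTRA_PCRS  # Windows VM: 6 measurements
    else:
        return False
    return all(name in baseline and baseline[name] == measurements.get(name)
               for name in required)


def is_known_good(baselines, measurements):
    """Return True if any baseline document matches the measurements."""
    return any(matches_baseline(b, measurements) for b in baselines)


# Hypothetical data with made-up hash values.
good = {name: {"hashAlgo": "SHA1", "pcrNum": name, "value": "aaaa"}
        for name in LINUX_PCRS}
tampered = dict(good, PCR_4={"hashAlgo": "SHA1", "pcrNum": "PCR_4",
                             "value": "ffff"})

print(is_known_good([good], good))      # True
print(is_known_good([good], tampered))  # False
```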

Copy the following dependencies and use them to overwrite the existing contents of requirements.txt:

google-api-python-client==1.6.6
google-auth==1.4.1
google-auth-httplib2==0.0.3
google-cloud-firestore==0.29.0
firebase-admin==2.13.0

Open a terminal window and navigate to the directory that contains main.py and requirements.txt.

Run the gcloud beta functions deploy command to deploy the trigger:

gcloud beta functions deploy relearn_if_known_good \
    --project PROJECT_ID \
    --runtime python37 \
    --trigger-resource integrity-monitoring \
    --trigger-event google.pubsub.topic.publish

Manually delete the previous shutdown_vm function in the Cloud Functions console.
In the Google Cloud console, go to the Cloud Functions page.

Go to Cloud Functions
Select the shutdown_vm function, then click Delete.

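Verifying the automated responses to integrity validation failures

First, check whether you have a running instance with Secure Boot (a Shielded VM option) turned on. If not, you can create a new instance from a Shielded VM image (Ubuntu 18.04 LTS) and turn on the Secure Boot option. The instance might cost a few cents; this step can be completed within an hour.
Now suppose that, for some reason, you want to upgrade the kernel manually.
Connect to the instance over SSH, then check the current kernel with the following command.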
```
uname -sr
```
You should see something like Linux 4.15.0-1028-gcp.
Download a generic kernel from https://kernel.ubuntu.com/~kernel-ppa/mainline/
Install it with the following command.
```
sudo dpkg -i *.deb
```
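Restart the VM.
You should see that the VM fails to boot (you cannot connect to the machine over SSH). This is the expected result, because the signature of the new kernel is not in our Secure Boot allowlist. It also demonstrates how Secure Boot can prevent unauthorized or malicious modifications to the kernel.
Because we know that this kernel upgrade is not malicious and that we performed it ourselves, we can turn off Secure Boot to boot the new kernel.
Shut down the VM, clear the Secure Boot option, then restart the VM.
Booting the machine should fail again. This time, however, the machine is shut down automatically by the Cloud Functions function we created, because changing the Secure Boot option (along with the new kernel image) makes the measurements differ from the baseline. (We can confirm this in the function's Stackdriver logs.)
Because we know this modification is not malicious and we know its root cause, we can add the current measurements in lateBootReportEvent to the Firestore database of known good measurements. (Note that two things changed: the Secure Boot option and the kernel image.)

Follow the earlier step, Creating a database of known good baseline measurements, to append a new baseline to the Firestore database using the actual measurements in the latest lateBootReportEvent.
Now restart the machine. In the Stackdriver logs, the lateBootReportEvent still shows false, but the machine should now boot successfully, because the Cloud Functions function trusted and relearned the new measurements. We can verify this by checking the function's Stackdriver logs.
With Secure Boot turned off, we can now boot into the new kernel. Connect to the machine over SSH and check the kernel again; you should see the new kernel version.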
```
uname -sr
```
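Finally, clean up the resources and data used in this step.
If you created a VM for this step, shut it down to avoid additional charges.
In the Google Cloud console, go to the VM instances page.

Go to VM instances
Remove the known good measurements that you added in this step.
In the Google Cloud console, go to the Firestore page.

Go to Firestore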