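Quickstart: Create a lake

This document shows you how to get started with Dataplex Universal Catalog in the Google Cloud console. It walks you through creating a lake, adding a zone, and attaching an asset.

Before you begin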

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Dataplex Universal Catalog, Dataproc, Dataproc Metastore, BigQuery, and Cloud Storage APIs.

    Enable the APIs

  5. Make sure that you have the following role or roles on the project: roles/dataplex.admin, roles/dataplex.editor

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.

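      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.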
  6. Create a Cloud Storage bucket:
    1. In the Google Cloud console, go to the Cloud Storage Buckets page.

      Go to Buckets

    2. Click Create.
    3. On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
      1. For Name your bucket, enter a unique bucket name. Don't include sensitive information in the bucket name, because the bucket namespace is global and publicly visible.
      2. In the Choose where to store your data section, do the following:
        1. Select a Location type.
        2. Choose a location where your bucket's data is permanently stored from the Location type drop-down menu.
        3. To set up cross-bucket replication, select Add cross-bucket replication via Storage Transfer Service and follow these steps:

          Set up cross-bucket replication

          1. In the Bucket menu, select a bucket.
          2. In the Replication settings section, click Configure to configure settings for the replication job.

            The Configure cross-bucket replication pane appears.

            • To filter objects to replicate by object name prefix, enter a prefix that you want to include or exclude objects from, then click Add a prefix.
            • To set a storage class for the replicated objects, select a storage class from the Storage class menu. If you skip this step, the replicated objects will use the destination bucket's storage class by default.
            • Click Done.
      3. In the Choose how to store your data section, do the following:
        1. In the Set a default class section, select the following: Standard.
        2. To enable hierarchical namespace, in the Optimize storage for data-intensive workloads section, select Enable hierarchical namespace on this bucket.
      4. In the Choose how to control access to objects section, select whether or not your bucket enforces public access prevention, and select an access control method for your bucket's objects.
      5. In the Choose how to protect object data section, do the following:
        • Select any of the options under Data protection that you want to set for your bucket.
          • To enable soft delete, click the Soft delete policy (For data recovery) checkbox, and specify the number of days you want to retain objects after deletion.
          • To set Object Versioning, click the Object versioning (For version control) checkbox, and specify the maximum number of versions per object and the number of days after which the noncurrent versions expire.
          • To enable the retention policy on objects and buckets, click the Retention (For compliance) checkbox, and then do the following:
            • To enable Object Retention Lock, click the Enable object retention checkbox.
            • To enable Bucket Lock, click the Set bucket retention policy checkbox, and choose a unit of time and a length of time for your retention period.
        • To choose how your object data will be encrypted, expand the Data encryption section, and select a Data encryption method.
    4. Click Create.
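
The prerequisites above are described for the console. For readers who prefer to script them, the following is a minimal sketch that only assembles the roughly equivalent gcloud commands as argument lists and prints them; it does not run anything. The service names, the gcloud flags, and the placeholder values (my-project, my-unique-bucket, user@example.com) are assumptions that are not part of this quickstart, so check them against the gcloud reference before use.

```python
import shlex
from typing import List

# Assumed service names for the APIs listed in the prerequisites.
# Verify them with `gcloud services list --available`.
REQUIRED_SERVICES = [
    "dataplex.googleapis.com",
    "dataproc.googleapis.com",
    "metastore.googleapis.com",
    "bigquery.googleapis.com",
    "storage.googleapis.com",
]


def enable_apis_command(project_id: str) -> List[str]:
    """Build the command that enables the required APIs on a project."""
    return ["gcloud", "services", "enable", *REQUIRED_SERVICES,
            f"--project={project_id}"]


def grant_role_command(project_id: str, user_email: str, role: str) -> List[str]:
    """Build the command that grants one IAM role to a user on a project."""
    return ["gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member=user:{user_email}", f"--role={role}"]


def create_bucket_command(project_id: str, bucket_name: str, location: str) -> List[str]:
    """Build the command that creates a Standard-class bucket."""
    return ["gcloud", "storage", "buckets", "create", f"gs://{bucket_name}",
            f"--project={project_id}", f"--location={location}",
            "--default-storage-class=STANDARD"]


if __name__ == "__main__":
    # Hypothetical placeholder values; replace them with your own.
    commands = [
        enable_apis_command("my-project"),
        grant_role_command("my-project", "user@example.com", "roles/dataplex.admin"),
        create_bucket_command("my-project", "my-unique-bucket", "us-central1"),
    ]
    for command in commands:
        print(shlex.join(command))
```

Printing the commands instead of executing them lets you review each one before running it in a shell.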

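Create a lake

A lake is a logical construct that represents a data domain or business unit. For example, to organize data based on group usage, you can set up one lake for each department, such as Retail, Sales, and Finance.

The following steps show how to create a lake in the Google Cloud console.

  1. In the Google Cloud console, go to Dataplex Universal Catalog.

    Go to Dataplex Universal Catalog

  2. Navigate to the Manage view.

  3. Click Create.

  4. Enter a Display name.

  5. The lake ID is generated automatically for you.

  6. Specify the Region in which to create the lake.

    For lakes created in a given region (for example, us-central1), you can attach both single-region (us-central1) data and multi-region (us multi-region) data, depending on the zone settings.

  7. Click Create.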
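
The console steps above can also be approximated from a script. The sketch below assembles a gcloud dataplex lakes create command without running it. The flag names, the lake ID naming rule encoded in the pattern (lowercase letters, digits, and hyphens; starts with a letter; ends with a letter or digit; 1 to 63 characters), and the placeholder values are assumptions taken from memory of the Dataplex reference, not from this quickstart. Unlike the console, this sketch does not generate the lake ID for you.

```python
import re
import shlex
from typing import List

# Assumed naming rule for lake IDs; confirm it in the Dataplex API reference.
LAKE_ID_PATTERN = re.compile(r"[a-z]([a-z0-9-]{0,61}[a-z0-9])?")


def is_valid_lake_id(lake_id: str) -> bool:
    """Return True if lake_id satisfies the assumed naming rule."""
    return LAKE_ID_PATTERN.fullmatch(lake_id) is not None


def create_lake_command(project_id: str, region: str, lake_id: str,
                        display_name: str) -> List[str]:
    """Build (but do not run) the command that creates a lake in a region."""
    if not is_valid_lake_id(lake_id):
        raise ValueError(f"invalid lake ID: {lake_id!r}")
    return ["gcloud", "dataplex", "lakes", "create", lake_id,
            f"--project={project_id}",
            f"--location={region}",
            f"--display-name={display_name}"]


if __name__ == "__main__":
    # Hypothetical placeholder values; replace them with your own.
    print(shlex.join(create_lake_command(
        "my-project", "us-central1", "sales-lake", "Sales lake")))
```

Validating the ID before building the command surfaces naming mistakes early instead of at request time.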

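Add a zone to the lake

After the lake is created, you can add zones to it. Zones are logical groupings within a lake that you can use to categorize structured and unstructured data.

  1. In the Manage view, click the name of the lake to which you want to add a zone.

  2. Click Add zone.

  3. Enter a Display name for the zone.

  4. Click the Type drop-down list. Select Raw zone or Curated zone. Learn more about zone types.

  5. Under Data locations, select Regional or Multi-regional. You cannot change this selection after it is set. Regional and multi-regional data cannot be mixed in the same zone.

  6. Click Create.

Creating the zone might take a few minutes.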
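
As a scripted counterpart to the steps above, the following sketch assembles a gcloud dataplex zones create command and checks the two choices the console asks for: the zone type and the data location type. The flag names and the enum spellings (RAW, CURATED, SINGLE_REGION, MULTI_REGION) are assumptions based on the gcloud reference, and all resource names are hypothetical placeholders. The command is only printed, not executed.

```python
import shlex
from typing import List

# Assumed enum values for the --type and --resource-location-type flags.
ZONE_TYPES = {"RAW", "CURATED"}
LOCATION_TYPES = {"SINGLE_REGION", "MULTI_REGION"}


def create_zone_command(project_id: str, region: str, lake_id: str,
                        zone_id: str, zone_type: str,
                        location_type: str) -> List[str]:
    """Build (but do not run) the command that adds a zone to a lake."""
    if zone_type not in ZONE_TYPES:
        raise ValueError(f"zone_type must be one of {sorted(ZONE_TYPES)}")
    if location_type not in LOCATION_TYPES:
        raise ValueError(
            f"location_type must be one of {sorted(LOCATION_TYPES)}")
    return ["gcloud", "dataplex", "zones", "create", zone_id,
            f"--project={project_id}",
            f"--location={region}",
            f"--lake={lake_id}",
            f"--type={zone_type}",
            f"--resource-location-type={location_type}"]


if __name__ == "__main__":
    # Hypothetical placeholder values; replace them with your own.
    print(shlex.join(create_zone_command(
        "my-project", "us-central1", "sales-lake", "raw-zone",
        "RAW", "SINGLE_REGION")))
```

Because the data location type cannot be changed after the zone is created, it is worth validating that value in the script before anything is submitted.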

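Attach an asset

Data stored in Cloud Storage buckets or BigQuery datasets can be attached as assets to zones within a Dataplex Universal Catalog lake.

To attach a Cloud Storage bucket as an asset, follow these steps:

  1. In the Manage view, click the name of the lake to which you want to attach the Cloud Storage bucket.

  2. On the Zones tab, click the zone to which you want to add the asset.

  3. On the Assets tab, click Add assets.

  4. Click Add an asset.

  5. Under Type, select Storage bucket.

  6. Under Display name, enter a name for the asset.

  7. In the Bucket field, click Browse. If you have a Cloud Storage bucket, locate it and click Select. If you don't have one, click the button to create a bucket, and then do the following:

    1. Enter a unique name for the bucket. Click Continue.

    2. Select a Location type. Click Continue.

    3. Choose a default storage class for your data. Click Continue.

    4. Select an Access control level. Click Continue.

    5. Select a Data protection option. Click Continue.

    6. Click Create.

    7. Click Select.

  8. Click Done.

  9. Click Continue.

  10. Under Discovery settings, select Inherit to inherit the discovery settings from the zone level.

  11. Click Continue.

  12. Under Add assets, click Submit.

Wait for the asset creation to complete.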
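
The attach steps above can be sketched as a script that assembles a gcloud dataplex assets create command for a Cloud Storage bucket. The flag names, the STORAGE_BUCKET resource type, and the projects/PROJECT/buckets/BUCKET resource name format are assumptions based on the gcloud reference; all identifiers are hypothetical placeholders. The sketch leaves out discovery flags entirely and makes no claim that doing so matches the console's Inherit option.

```python
import shlex
from typing import List


def bucket_resource_name(project_id: str, bucket_name: str) -> str:
    """Return the assumed relative resource name of a Cloud Storage bucket."""
    return f"projects/{project_id}/buckets/{bucket_name}"


def create_bucket_asset_command(project_id: str, region: str, lake_id: str,
                                zone_id: str, asset_id: str,
                                bucket_name: str) -> List[str]:
    """Build (but do not run) the command that attaches a bucket to a zone."""
    return ["gcloud", "dataplex", "assets", "create", asset_id,
            f"--project={project_id}",
            f"--location={region}",
            f"--lake={lake_id}",
            f"--zone={zone_id}",
            "--resource-type=STORAGE_BUCKET",
            f"--resource-name={bucket_resource_name(project_id, bucket_name)}"]


if __name__ == "__main__":
    # Hypothetical placeholder values; replace them with your own.
    print(shlex.join(create_bucket_asset_command(
        "my-project", "us-central1", "sales-lake", "raw-zone",
        "bucket-asset", "my-unique-bucket")))
```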

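To work with your lake, see the What's next section. Otherwise, follow the steps in the Clean up section to delete the resources that you created.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.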

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. If the project that you plan to delete is attached to an organization, expand the Organization list in the Name column.
  3. In the project list, select the project that you want to delete, and then click Delete.
  4. In the dialog, type the project ID, and then click Shut down to delete the project.

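Alternatively, you can delete the individual resources used in this tutorial. A lake cannot be deleted until all of its zones are deleted. Likewise, a zone cannot be deleted until all of its assets are deleted.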
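
The dependency described above (assets before zones, zones before the lake) can be illustrated with a small sketch that computes a safe deletion order. The function name, the dictionary layout, and the resource path format are hypothetical and chosen only for illustration; the sketch does not call any Google Cloud API.

```python
from typing import Dict, List


def deletion_order(lake: str, zones: Dict[str, List[str]]) -> List[str]:
    """Return resource paths in an order that is safe to delete.

    Each zone's assets come first, then the zone itself, and the lake
    comes last, because a parent cannot be deleted while it has children.
    """
    order: List[str] = []
    for zone, assets in zones.items():
        for asset in assets:
            order.append(f"{lake}/zones/{zone}/assets/{asset}")
        order.append(f"{lake}/zones/{zone}")
    order.append(lake)
    return order


if __name__ == "__main__":
    # Hypothetical resource names matching what this quickstart creates.
    for path in deletion_order("lakes/sales-lake",
                               {"raw-zone": ["bucket-asset"]}):
        print(path)
```

The sections that follow delete the resources in this same order: first the asset, then the zone, then the lake.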

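Detach the bucket

To detach the Dataplex Universal Catalog asset that you created, follow these steps:

  1. In the Google Cloud console, go to Dataplex Universal Catalog.

    Go to Dataplex Universal Catalog

  2. In the Manage view, click the name of the lake that you created.

  3. On the Zones tab, click the name of the zone that you created.

  4. On the Assets tab, select the asset to detach by selecting the checkbox to the left of the bucket name.

  5. Click Delete asset.

  6. Click Delete to confirm the detachment.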

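Delete the zone

To delete the Dataplex Universal Catalog zone that you created, follow these steps:

  1. In the Google Cloud console, go to Dataplex Universal Catalog.

    Go to Dataplex Universal Catalog

  2. In the Manage view, click the lake that you created.

  3. On the Zones tab, select the zone to delete by selecting the checkbox to the left of the zone name.

  4. Click Delete zone.

  5. Click Delete again to confirm that you want to delete the zone.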

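Delete the lake

The following steps show how to delete the Dataplex Universal Catalog lake that you created.

  1. In the Google Cloud console, go to Dataplex Universal Catalog.

    Go to Dataplex Universal Catalog

  2. In the Manage view, click the lake that you created.

  3. Click Delete at the top of the page.

  4. In the field, enter "delete" to confirm the deletion.

  5. Click Delete lake to confirm.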

What's next