建立叢集

建立 Dataproc 叢集

需求條件:

  • 名稱:叢集名稱開頭須為小寫英文字母,其後最多可接 51 個小寫英文字母、數字和連字號,但結尾不得為連字號。

  • 叢集區域:您必須為叢集指定 Compute Engine 區域 (例如 us-east1europe-west1),才能在該區域內隔離叢集資源,例如儲存在 Cloud Storage 中的 VM 執行個體和叢集中繼資料。

    • 如要進一步瞭解地區端點,請參閱「地區端點」。
    • 如要瞭解如何選取地區,請參閱「可用地區與區域」一節。您也可以執行 gcloud compute regions list 指令,查看可用地區清單。
  • 連線:在由主要 VM 和工作站 VM 組成的 Dataproc 叢集中,Compute Engine 虛擬機器執行個體 (VM) 需要彼此的完整內部 IP 網路交叉連線。default 虛擬私有雲網路會提供這類連線 (請參閱「Dataproc 叢集網路設定」)。

gcloud

如要在指令列建立 Dataproc 叢集,請在本機的終端機視窗或 Cloud Shell 中執行 gcloud dataproc clusters create 指令。

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION

這項指令建立的叢集會具有主要虛擬機器執行個體和工作站虛擬機器執行個體的預設 Dataproc 服務設定、磁碟大小和類型、網路類型、部署叢集的地區和區域,以及其他叢集設定。要瞭解如何使用指令列標記來自訂叢集設定,請參閱 gcloud dataproc clusters create 指令說明。

使用 YAML 檔案建立叢集

  1. 執行下列 gcloud 指令,將現有 Dataproc 叢集的設定匯出到 cluster.yaml 檔案。
    gcloud dataproc clusters export EXISTING_CLUSTER_NAME \
        --region=REGION \
        --destination=cluster.yaml
    
  2. 匯入 YAML 檔案設定,建立新叢集。
    gcloud dataproc clusters import NEW_CLUSTER_NAME \
        --region=REGION \
        --source=cluster.yaml
    

注意:在匯出作業期間,系統會篩選叢集專用欄位 (例如叢集名稱)、僅限輸出欄位,以及自動套用的標籤。用來建立叢集的匯入 YAML 檔案不允許有這些欄位。

REST

本節說明如何使用必要值和預設設定 (1 個主要節點、2 個工作站節點) 建立叢集。

使用任何要求資料之前,請先替換以下項目:

  • CLUSTER_NAME:叢集名稱
  • PROJECT: Google Cloud 專案 ID
  • REGION:要建立叢集的 Compute Engine 地區
  • ZONE:在所選區域中,可選的可用區 (叢集建立位置)。

HTTP 方法和網址:

POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters

JSON 要求主體:

{
  "project_id":"PROJECT",
  "cluster_name":"CLUSTER_NAME",
  "config":{
    "master_config":{
      "num_instances":1,
      "machine_type_uri":"n1-standard-2",
      "image_uri":""
    },
    "softwareConfig": {
      "imageVersion": "",
      "properties": {},
      "optionalComponents": []
    },
    "worker_config":{
      "num_instances":2,
      "machine_type_uri":"n1-standard-2",
      "image_uri":""
    },
    "gce_cluster_config":{
      "zone_uri":"ZONE"
    }
  }
}

如要傳送要求,請展開以下其中一個選項:

您應該會收到如下的 JSON 回應:

{
"name": "projects/PROJECT/regions/REGION/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "CLUSTER_NAME",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ...""
    ]
  }
}

控制台

在瀏覽器中開啟 Google Cloud 控制台的 Dataproc「Create a cluster」(建立叢集) 頁面,然後在「Create a Dataproc cluster on Compute Engine」(在 Compute Engine 上建立 Dataproc 叢集) 頁面的「Compute engine」(運算引擎) 列叢集中,按一下「Create」(建立)選取「設定叢集」面板,並在欄位中填入預設值。您可以選取各個面板,確認或變更預設值,以自訂分群。

按一下 [建立],建立叢集。叢集名稱會出現在「Clusters」(叢集) 頁面中,且在叢集佈建完成後,其狀態會更新為「Running」(執行中)。按一下叢集名稱,即可開啟叢集詳細資料頁面,查看叢集的工作、執行個體和配置設定,以及連線至叢集中執行的網頁介面。

Go

  1. 安裝用戶端程式庫
  2. 設定應用程式預設憑證。
  3. 執行程式碼。
    import (
    	"context"
    	"fmt"
    	"io"
    
    	dataproc "cloud.google.com/go/dataproc/apiv1"
    	"cloud.google.com/go/dataproc/apiv1/dataprocpb"
    	"google.golang.org/api/option"
    )
    
    func createCluster(w io.Writer, projectID, region, clusterName string) error {
    	// projectID := "your-project-id"
    	// region := "us-central1"
    	// clusterName := "your-cluster"
    	ctx := context.Background()
    
    	// Create the cluster client.
    	endpoint := region + "-dataproc.googleapis.com:443"
    	clusterClient, err := dataproc.NewClusterControllerClient(ctx, option.WithEndpoint(endpoint))
    	if err != nil {
    		return fmt.Errorf("dataproc.NewClusterControllerClient: %w", err)
    	}
    	defer clusterClient.Close()
    
    	// Create the cluster config.
    	req := &dataprocpb.CreateClusterRequest{
    		ProjectId: projectID,
    		Region:    region,
    		Cluster: &dataprocpb.Cluster{
    			ProjectId:   projectID,
    			ClusterName: clusterName,
    			Config: &dataprocpb.ClusterConfig{
    				MasterConfig: &dataprocpb.InstanceGroupConfig{
    					NumInstances:   1,
    					MachineTypeUri: "n1-standard-2",
    				},
    				WorkerConfig: &dataprocpb.InstanceGroupConfig{
    					NumInstances:   2,
    					MachineTypeUri: "n1-standard-2",
    				},
    			},
    		},
    	}
    
    	// Create the cluster.
    	op, err := clusterClient.CreateCluster(ctx, req)
    	if err != nil {
    		return fmt.Errorf("CreateCluster: %w", err)
    	}
    
    	resp, err := op.Wait(ctx)
    	if err != nil {
    		return fmt.Errorf("CreateCluster.Wait: %w", err)
    	}
    
    	// Output a success message.
    	fmt.Fprintf(w, "Cluster created successfully: %s", resp.ClusterName)
    	return nil
    }
    

Java

  1. 安裝用戶端程式庫
  2. 設定應用程式預設憑證。
  3. 執行程式碼。
    import com.google.api.gax.longrunning.OperationFuture;
    import com.google.cloud.dataproc.v1.Cluster;
    import com.google.cloud.dataproc.v1.ClusterConfig;
    import com.google.cloud.dataproc.v1.ClusterControllerClient;
    import com.google.cloud.dataproc.v1.ClusterControllerSettings;
    import com.google.cloud.dataproc.v1.ClusterOperationMetadata;
    import com.google.cloud.dataproc.v1.InstanceGroupConfig;
    import java.io.IOException;
    import java.util.concurrent.ExecutionException;
    
    public class CreateCluster {
    
      public static void createCluster() throws IOException, InterruptedException {
        // TODO(developer): Replace these variables before running the sample.
        String projectId = "your-project-id";
        String region = "your-project-region";
        String clusterName = "your-cluster-name";
        createCluster(projectId, region, clusterName);
      }
    
      public static void createCluster(String projectId, String region, String clusterName)
          throws IOException, InterruptedException {
        String myEndpoint = String.format("%s-dataproc.googleapis.com:443", region);
    
        // Configure the settings for the cluster controller client.
        ClusterControllerSettings clusterControllerSettings =
            ClusterControllerSettings.newBuilder().setEndpoint(myEndpoint).build();
    
        // Create a cluster controller client with the configured settings. The client only needs to be
        // created once and can be reused for multiple requests. Using a try-with-resources
        // closes the client, but this can also be done manually with the .close() method.
        try (ClusterControllerClient clusterControllerClient =
            ClusterControllerClient.create(clusterControllerSettings)) {
          // Configure the settings for our cluster.
          InstanceGroupConfig masterConfig =
              InstanceGroupConfig.newBuilder()
                  .setMachineTypeUri("n1-standard-2")
                  .setNumInstances(1)
                  .build();
          InstanceGroupConfig workerConfig =
              InstanceGroupConfig.newBuilder()
                  .setMachineTypeUri("n1-standard-2")
                  .setNumInstances(2)
                  .build();
          ClusterConfig clusterConfig =
              ClusterConfig.newBuilder()
                  .setMasterConfig(masterConfig)
                  .setWorkerConfig(workerConfig)
                  .build();
          // Create the cluster object with the desired cluster config.
          Cluster cluster =
              Cluster.newBuilder().setClusterName(clusterName).setConfig(clusterConfig).build();
    
          // Create the Cloud Dataproc cluster.
          OperationFuture<Cluster, ClusterOperationMetadata> createClusterAsyncRequest =
              clusterControllerClient.createClusterAsync(projectId, region, cluster);
          Cluster response = createClusterAsyncRequest.get();
    
          // Print out a success message.
          System.out.printf("Cluster created successfully: %s", response.getClusterName());
    
        } catch (ExecutionException e) {
          System.err.println(String.format("Error executing createCluster: %s ", e.getMessage()));
        }
      }
    }

Node.js

  1. 安裝用戶端程式庫
  2. 設定應用程式預設憑證。
  3. 執行程式碼。
    const dataproc = require('@google-cloud/dataproc');
    
    // TODO(developer): Uncomment and set the following variables
    // projectId = 'YOUR_PROJECT_ID'
    // region = 'YOUR_CLUSTER_REGION'
    // clusterName = 'YOUR_CLUSTER_NAME'
    
    // Create a client with the endpoint set to the desired cluster region
    const client = new dataproc.v1.ClusterControllerClient({
      apiEndpoint: `${region}-dataproc.googleapis.com`,
      projectId: projectId,
    });
    
    async function createCluster() {
      // Create the cluster config
      const request = {
        projectId: projectId,
        region: region,
        cluster: {
          clusterName: clusterName,
          config: {
            masterConfig: {
              numInstances: 1,
              machineTypeUri: 'n1-standard-2',
            },
            workerConfig: {
              numInstances: 2,
              machineTypeUri: 'n1-standard-2',
            },
          },
        },
      };
    
      // Create the cluster
      const [operation] = await client.createCluster(request);
      const [response] = await operation.promise();
    
      // Output a success message
      console.log(`Cluster created successfully: ${response.clusterName}`);

Python

  1. 安裝用戶端程式庫
  2. 設定應用程式預設憑證。
  3. 執行程式碼。
    from google.cloud import dataproc_v1 as dataproc
    
    
    def create_cluster(project_id, region, cluster_name):
        """This sample walks a user through creating a Cloud Dataproc cluster
        using the Python client library.
    
        Args:
            project_id (string): Project to use for creating resources.
            region (string): Region where the resources should live.
            cluster_name (string): Name to use for creating a cluster.
        """
    
        # Create a client with the endpoint set to the desired cluster region.
        cluster_client = dataproc.ClusterControllerClient(
            client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        )
    
        # Create the cluster config.
        cluster = {
            "project_id": project_id,
            "cluster_name": cluster_name,
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
                "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
            },
        }
    
        # Create the cluster.
        operation = cluster_client.create_cluster(
            request={"project_id": project_id, "region": region, "cluster": cluster}
        )
        result = operation.result()
    
        # Output a success message.
        print(f"Cluster created successfully: {result.cluster_name}")