Workflow using Cloud Run functions

Before you begin

If you haven't already done so, set up a Google Cloud project and two Cloud Storage buckets.

Set up your project

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Make sure that billing is enabled for your Google Cloud project.

Enable the Dataproc, Compute Engine, Cloud Storage, and Cloud Run functions APIs.

Install the Google Cloud CLI.

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command:

gcloud init

Create or use two Cloud Storage buckets in your project

You need two Cloud Storage buckets in your project: one for input files and one for output.

In the Google Cloud console, go to the Cloud Storage Buckets page.
Click Create.
On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
1. In the Get started section, do the following:
  - Enter a globally unique name that meets the bucket naming requirements.
  - To add a bucket label, expand the Labels section, click Add label, and specify a key and a value for your label.
2. In the Choose where to store your data section, do the following:
  1. Select a Location type.
  2. Choose a location where your bucket's data is permanently stored from the Location type drop-down menu.
    - If you select the dual-region location type, you can also choose to enable turbo replication by using the relevant checkbox.
  3. To set up cross-bucket replication, select Add cross-bucket replication via Storage Transfer Service and follow these steps:
    Set up cross-bucket replication
    
    In the Bucket menu, select a bucket.
    
    In the Replication settings section, click Configure to configure settings for the replication job.
    
    The Configure cross-bucket replication pane appears.
    
    To filter the objects to replicate by object name prefix, enter the prefix of the objects that you want to include or exclude, and then click Add a prefix.
    
    To set a storage class for the replicated objects, select a storage class from the Storage class menu. If you skip this step, the replicated objects will use the destination bucket's storage class by default.
    
    Click Done.
3. In the Choose how to store your data section, do the following:
  1. Select a default storage class for the bucket or Autoclass for automatic storage class management of your bucket's data.
  2. To enable hierarchical namespace, in the Optimize storage for data-intensive workloads section, select Enable hierarchical namespace on this bucket.
    Note: You cannot enable hierarchical namespace in existing buckets.
4. In the Choose how to control access to objects section, select whether or not your bucket enforces public access prevention, and select an access control method for your bucket's objects.
  Note: You cannot change the Prevent public access setting if this setting is enforced by an organization policy.
5. In the Choose how to protect object data section, do the following:
  - Select any of the options under Data protection that you want to set for your bucket.
    - To enable soft delete, click the Soft delete policy (For data recovery) checkbox, and specify the number of days you want to retain objects after deletion.
    - To set Object Versioning, click the Object versioning (For version control) checkbox, and specify the maximum number of versions per object and the number of days after which the noncurrent versions expire.
    - To enable the retention policy on objects and buckets, click the Retention (For compliance) checkbox, and then do the following:
      - To enable Object Retention Lock, click the Enable object retention checkbox.
      - To enable Bucket Lock, click the Set bucket retention policy checkbox, and choose a unit of time and a length of time for your retention period.
  - To choose how your object data will be encrypted, expand the Data encryption section, and select a Data encryption method.
Click Create.

Create a workflow template

To create and define a workflow template, copy and run the following commands in a local terminal window or in Cloud Shell.

Create the workflow template.

  gcloud dataproc workflow-templates create wordcount-template \
      --region=us-central1

Add the wordcount job to the workflow template.
1. Specify your output-bucket-name before running the command (the function supplies the input bucket). After you insert the output bucket name, the output bucket argument should read as follows: gs://your-output-bucket/wordcount-output.
2. The "count" step ID is required; it identifies the added Hadoop job.
```
gcloud dataproc workflow-templates add-job hadoop \
    --workflow-template=wordcount-template \
    --step-id=count \
    --jar=file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    --region=us-central1 \
    -- wordcount gs://input-bucket gs://output-bucket-name/wordcount-output
```
Use a managed, single-node cluster to run the workflow. Dataproc creates the cluster, runs the workflow on it, and then deletes the cluster when the workflow completes.
```
gcloud dataproc workflow-templates set-managed-cluster wordcount-template \
    --cluster-name=wordcount \
    --single-node \
    --region=us-central1
```
Click the wordcount-template name on the Dataproc Workflows page in the Google Cloud console to open the Workflow template details page. Confirm the wordcount-template attributes.

Parameterize the workflow template

Parameterize the input bucket variable that is passed to the workflow template.

Export the workflow template to a wordcount.yaml text file for parameterization.

gcloud dataproc workflow-templates export wordcount-template \
    --destination=wordcount.yaml \
    --region=us-central1

Using a text editor, open wordcount.yaml, and then add a parameters block to the end of the YAML file so that the Cloud Storage INPUT_BUCKET_URI can be passed as args[1] to the wordcount binary when the workflow is triggered.
A sample exported YAML file is shown below. You can take either of two approaches to update your template:
1. Copy and paste the entire file to replace your exported wordcount.yaml, after replacing your-output-bucket with your output bucket name, OR
2. Copy and paste only the parameters section to the end of your exported wordcount.yaml file.
```
jobs:
- hadoopJob:
    args:
    - wordcount
    - gs://input-bucket
    - gs://your-output-bucket/wordcount-output
    mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
  stepId: count
placement:
  managedCluster:
    clusterName: wordcount
    config:
      softwareConfig:
        properties:
          dataproc:dataproc.allow.zero.workers: 'true'
parameters:
- name: INPUT_BUCKET_URI
  description: wordcount input bucket URI
  fields:
  - jobs['count'].hadoopJob.args[1]
```
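
To make the role of the fields entry concrete, the sketch below models in plain JavaScript how a parameter value could be substituted into the job arguments when the template is instantiated. It is a simplified illustration, not Dataproc's implementation: the applyParameter helper and its regular expression are hypothetical, and the helper only handles field paths of the form jobs['STEP_ID'].hadoopJob.args[INDEX].

```javascript
// Simplified model of workflow template parameter substitution.
// NOTE: applyParameter is a hypothetical helper written for illustration;
// Dataproc performs the real substitution on the server side.
function applyParameter(template, fieldPath, value) {
  // Only handles paths such as jobs['count'].hadoopJob.args[1].
  const match = /^jobs\['([^']+)'\]\.hadoopJob\.args\[(\d+)\]$/.exec(fieldPath);
  if (!match) {
    throw new Error(`Unsupported field path: ${fieldPath}`);
  }
  const [, stepId, index] = match;
  const job = template.jobs.find((j) => j.stepId === stepId);
  if (!job) {
    throw new Error(`No job with step ID: ${stepId}`);
  }
  // Copy the array so the original template object is left untouched.
  const args = [...job.hadoopJob.args];
  args[Number(index)] = value;
  return args;
}

// Mirrors the jobs section of the sample wordcount.yaml.
const template = {
  jobs: [
    {
      stepId: 'count',
      hadoopJob: {
        args: ['wordcount', 'gs://input-bucket', 'gs://your-output-bucket/wordcount-output'],
      },
    },
  ],
};

const resolvedArgs = applyParameter(
  template,
  "jobs['count'].hadoopJob.args[1]",
  'gs://your-input-bucket-name/rose.txt'
);
console.log(resolvedArgs);
```

Because args[0] is the wordcount program name, index 1 is the input path; the output path at index 2 stays fixed in the template.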

Import the parameterized wordcount.yaml text file. Type "Y" when prompted to overwrite the template.

gcloud dataproc workflow-templates import wordcount-template \
    --source=wordcount.yaml \
    --region=us-central1

Create a Cloud Run function

Open the Cloud Run functions page in the Google Cloud console, and then click CREATE FUNCTION.

On the Create function page, enter or select the following information:

Name: wordcount
Memory allocated: keep the default selection.
Trigger:
- Cloud Storage
- Event type: Finalize/Create
- Bucket: select your input bucket (see the Cloud Storage bucket setup steps earlier on this page). When a file is added to this bucket, the function triggers the workflow. The workflow runs the wordcount application, which processes all text files in the bucket.

Source code:

Inline editor
Runtime: Node.js 8
INDEX.JS tab: replace the default code snippet with the following code, and then edit the const projectId line to supply -your-project-id- (without a leading or trailing "-").

const dataproc = require('@google-cloud/dataproc').v1;

// Triggered by a Cloud Storage event; starts the parameterized workflow.
exports.startWorkflow = (data) => {
  const projectId = '-your-project-id-';
  const region = 'us-central1';
  const workflowTemplate = 'wordcount-template';

  // Use the regional Dataproc endpoint that matches the template's region.
  const client = new dataproc.WorkflowTemplateServiceClient({
    apiEndpoint: `${region}-dataproc.googleapis.com`,
  });

  const file = data;
  console.log('Event: ', file);

  // URI of the object that triggered the event.
  const inputBucketUri = `gs://${file.bucket}/${file.name}`;

  const request = {
    name: client.projectRegionWorkflowTemplatePath(projectId, region, workflowTemplate),
    parameters: {INPUT_BUCKET_URI: inputBucketUri},
  };

  // Return the promise so the function is not ended before the call settles.
  return client.instantiateWorkflowTemplate(request)
    .then((responses) => {
      console.log('Launched Dataproc Workflow:', responses[1]);
    })
    .catch((err) => {
      console.error(err);
    });
};
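
The request that the function assembles can be checked locally without calling Dataproc. The sketch below rebuilds it with plain string formatting. The buildWorkflowRequest helper is hypothetical, and the resource name format projects/PROJECT/regions/REGION/workflowTemplates/TEMPLATE is assumed to match what client.projectRegionWorkflowTemplatePath returns. The sample event contains only the bucket and name fields that the function reads.

```javascript
// Hypothetical helper: builds the instantiate request from a Cloud Storage
// event without needing the Dataproc client library.
function buildWorkflowRequest(projectId, region, workflowTemplate, file) {
  if (!file || !file.bucket || !file.name) {
    throw new Error('Event is missing the bucket or name field');
  }
  return {
    // Assumed to equal client.projectRegionWorkflowTemplatePath(...).
    name: `projects/${projectId}/regions/${region}/workflowTemplates/${workflowTemplate}`,
    // The key must match the parameter name declared in wordcount.yaml.
    parameters: { INPUT_BUCKET_URI: `gs://${file.bucket}/${file.name}` },
  };
}

// Sample event limited to the fields the function uses.
const sampleEvent = { bucket: 'your-input-bucket-name', name: 'rose.txt' };
const sampleRequest = buildWorkflowRequest(
  'your-project-id',
  'us-central1',
  'wordcount-template',
  sampleEvent
);
console.log(JSON.stringify(sampleRequest, null, 2));
```

Keeping this logic in a pure function makes it possible to validate the event shape and the parameter name before deploying the function.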

PACKAGE.JSON tab: replace the default code snippet with the following code.

{
  "name": "dataproc-workflow",
  "version": "1.0.0",
  "dependencies":{ "@google-cloud/dataproc": ">=1.0.0"}
}

Function to execute: enter "startWorkflow".

Click CREATE.

Test the function

Copy the public file rose.txt to your bucket to trigger the function. Insert your-input-bucket-name (the bucket used to trigger the function) in the command.
```
gcloud storage cp gs://pub/shakespeare/rose.txt gs://your-input-bucket-name
```

Wait 30 seconds, and then run the following command to verify that the function completed.

gcloud functions logs read wordcount

...
Function execution took 1348 ms, finished with status: 'ok'
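
If you want to check the result from a script rather than by eye, the completion line can be parsed. The following is a minimal sketch that assumes the log line keeps the format shown above; parseExecutionLine is a hypothetical helper, not part of any Google Cloud library.

```javascript
// Hypothetical helper: extracts the duration and status from a completion line.
// Assumes the format:
//   Function execution took <N> ms, finished with status: '<status>'
function parseExecutionLine(line) {
  const match = /Function execution took (\d+) ms, finished with status: '([^']+)'/.exec(line);
  if (!match) {
    return null; // Not a completion line.
  }
  return { durationMs: Number(match[1]), status: match[2] };
}

const result = parseExecutionLine(
  "Function execution took 1348 ms, finished with status: 'ok'"
);
console.log(result); // { durationMs: 1348, status: 'ok' }
```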

To view the function logs from the Functions list page in the Google Cloud console, click the wordcount function name, and then click VIEW LOGS on the Function details page.
You can view the wordcount-output folder in your output bucket from the Storage browser page in the Google Cloud console.

Note: the wordcount job fails if the wordcount-output folder already exists. Before rerunning the workflow by triggering the function again, delete the wordcount-output folder from your output bucket.
After the workflow completes, the job details persist in the Google Cloud console. Click the count... job listed on the Dataproc Jobs page to view the workflow job details.
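
Clearing the wordcount-output folder before each rerun can also be scripted. The sketch below assumes a Bucket object from the @google-cloud/storage Node.js client, which this tutorial does not otherwise use, and relies on its deleteFiles method with a prefix filter. The clearOutputFolder helper is hypothetical. A stand-in bucket object is used here so that the example runs without credentials.

```javascript
// Hypothetical helper: deletes every object under the wordcount-output/ prefix.
// `bucket` is assumed to be a Bucket from @google-cloud/storage, for example:
//   const {Storage} = require('@google-cloud/storage');
//   const bucket = new Storage().bucket('your-output-bucket');
async function clearOutputFolder(bucket, prefix = 'wordcount-output/') {
  await bucket.deleteFiles({ prefix });
  return prefix;
}

// Stand-in bucket that records calls instead of contacting Cloud Storage.
const deleteCalls = [];
const fakeBucket = {
  deleteFiles: async (query) => {
    deleteCalls.push(query);
  },
};

const cleared = clearOutputFolder(fakeBucket).then((prefix) => {
  console.log(`Requested deletion of objects under ${prefix}`);
  return prefix;
});
```

To run this against a real bucket, replace fakeBucket with the Bucket object described in the comment.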

Clean up

The workflow in this tutorial deletes its managed cluster when the workflow completes. To avoid recurring costs, delete the other resources associated with this tutorial.

Delete a project

Caution: deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

Delete Cloud Storage buckets

In the Google Cloud console, go to the Cloud Storage Buckets page.
Click the checkbox for the bucket that you want to delete.
To delete the bucket, click Delete, and then follow the instructions.

Delete the workflow template

gcloud dataproc workflow-templates delete wordcount-template \
    --region=us-central1

Delete the Cloud Run function

Open the Cloud Run functions page in the Google Cloud console, select the checkbox to the left of the wordcount function, and then click Delete.

What's next

See Overview of Dataproc workflow templates.
See Workflow scheduling solutions.