In this tutorial, you create a pipeline that uses custom containers with C++ libraries to run a Dataflow HPC highly parallel workflow. Use this tutorial to learn how to use Dataflow and Apache Beam to run grid computing applications that require data to be distributed to functions running on many cores.
The tutorial demonstrates how to run the pipeline first by using the Direct Runner and then by using the Dataflow Runner. By running the pipeline locally, you can test the pipeline before deploying it.
This example uses Cython bindings and functions from the GMP library. Regardless of the library or binding tool that you use, you can apply the same principles to your pipeline.
The example code is available on GitHub.
Objectives
Create a pipeline that uses custom containers with C++ libraries.
Build a Docker container image using a Dockerfile.
Package the code and dependencies into a Docker container.
Run the pipeline locally to test it.
Run the pipeline in a distributed environment.
Costs
In this document, you use the following billable components of Google Cloud:
- Artifact Registry
- Cloud Build
- Cloud Storage
- Compute Engine
- Dataflow
To generate a cost estimate based on your projected usage, use the pricing calculator.
When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:
  gcloud init
- Create or select a Google Cloud project.
  - Create a Google Cloud project:
    gcloud projects create PROJECT_ID
    Replace PROJECT_ID with a name for the Google Cloud project you are creating.
  - Select the Google Cloud project that you created:
    gcloud config set project PROJECT_ID
    Replace PROJECT_ID with your Google Cloud project name.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Cloud Storage, Cloud Storage JSON, Compute Engine, Dataflow, Resource Manager, Artifact Registry, and Cloud Build APIs:
  gcloud services enable compute.googleapis.com dataflow.googleapis.com storage_component storage_api cloudresourcemanager.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com
- Create local authentication credentials for your user account:
  gcloud auth application-default login
- Grant roles to your user account. Run the following command once for each of the following IAM roles:
  roles/iam.serviceAccountUser
  gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
  - Replace PROJECT_ID with your project ID.
  - Replace USER_IDENTIFIER with the identifier for your user account. For example, myemail@example.com. The command already supplies the user: prefix.
  - Replace ROLE with each individual role.
Create a user-managed worker service account for your new pipeline and grant the necessary roles to the service account.
To create the service account, run the gcloud iam service-accounts create command:
gcloud iam service-accounts create parallelpipeline \
  --description="Highly parallel pipeline worker service account" \
  --display-name="Highly parallel data pipeline access"
Grant roles to the service account. Run the following command once for each of the following IAM roles:
roles/dataflow.admin
roles/dataflow.worker
roles/storage.objectAdmin
roles/artifactregistry.reader
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:parallelpipeline@PROJECT_ID.iam.gserviceaccount.com" --role=SERVICE_ACCOUNT_ROLE
Replace SERVICE_ACCOUNT_ROLE with each individual role.
Grant your Google Account a role that lets you create access tokens for the service account:
gcloud iam service-accounts add-iam-policy-binding parallelpipeline@PROJECT_ID.iam.gserviceaccount.com --member="user:EMAIL_ADDRESS" --role=roles/iam.serviceAccountTokenCreator
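Because the role-binding command has to be run once per role, it can help to generate the four invocations rather than edit the command by hand. The following is a minimal sketch, not part of the sample repository: the helper name binding_commands and the project ID my-project are hypothetical, and the script only builds the command strings without running them.

```python
import shlex

# Roles listed in this tutorial for the worker service account.
WORKER_ROLES = [
    "roles/dataflow.admin",
    "roles/dataflow.worker",
    "roles/storage.objectAdmin",
    "roles/artifactregistry.reader",
]


def binding_commands(project_id, account="parallelpipeline"):
    """Return one gcloud command string per role (hypothetical helper)."""
    member = f"serviceAccount:{account}@{project_id}.iam.gserviceaccount.com"
    commands = []
    for role in WORKER_ROLES:
        argv = [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}", f"--role={role}",
        ]
        # shlex.join quotes any argument that would need it in a POSIX shell.
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # "my-project" is a placeholder; substitute your own project ID.
    for command in binding_commands("my-project"):
        print(command)
```

Review the printed commands before pasting them into a shell.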
Download the code sample and change directories
Download the code sample and then change directories. The code samples in the GitHub repository provide all the code that you need to run this pipeline. When you are ready to build your own pipeline, you can use this sample code as a template.
Clone the beam-cpp-example repository.
Use the git clone command to clone the GitHub repository:
git clone https://github.com/GoogleCloudPlatform/dataflow-sample-applications.git
Switch to the application directory:
cd dataflow-sample-applications/beam-cpp-example
Pipeline code
You can customize the pipeline code from this tutorial. This pipeline completes the following tasks:
- Dynamically produces all integers in an input range.
- Runs the integers through a C++ function and filters bad values.
- Writes the bad values to a side channel.
- Counts the occurrence of each stopping time and normalizes the results.
- Prints the output, formatting and writing the results to a text file.
- Creates a PCollection with a single element.
- Processes the single element with a map function and passes the frequency PCollection as a side input.
- Processes the PCollection and produces a single output.
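The data flow in the list above can be sketched in plain Python, without Apache Beam or the C++ binding. This is only an illustration under stated assumptions: the stand-in function stopping_time computes a Collatz-style step count, which is a guess at what the native function does, and the names run, stopping_time, and max_steps are hypothetical. In the real pipeline, each stage would be a Beam transform and the bad values would go to a separate output.

```python
from collections import Counter


def stopping_time(n, max_steps=1000):
    """Stand-in for the C++ function (assumption: Collatz step count).

    Returns None for invalid input or when the step budget runs out;
    this sketch treats None as a "bad value".
    """
    if n < 1:
        return None
    steps = 0
    while n != 1:
        if steps >= max_steps:
            return None
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def run(start, stop):
    """Mirror the pipeline stages on an in-memory range."""
    good, bad = [], []
    for n in range(start, stop):           # produce all integers in the range
        t = stopping_time(n)               # call the stand-in native function
        if t is None:
            bad.append(n)                  # side channel for bad values
        else:
            good.append(t)
    counts = Counter(good)                 # occurrences of each stopping time
    total = sum(counts.values())
    # Normalize counts so the frequencies sum to 1.
    frequencies = {t: c / total for t, c in counts.items()} if total else {}
    return frequencies, bad


if __name__ == "__main__":
    frequencies, bad_values = run(0, 9)
    print(frequencies)
    print(bad_values)
```

Keeping the per-element function free of shared state is what lets a runner distribute the range across many workers.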
The starter file looks like the following:
Set up your development environment
Use the Apache Beam SDK for Python.
Install the GMP library:
apt-get install libgmp3-dev
To install the dependencies, use the requirements.txt file.
pip install -r requirements.txt
To build the Python bindings, run the following command.
python setup.py build_ext --inplace
You can customize the requirements.txt file from this tutorial. The starter file includes the following dependencies:
Run the pipeline locally
Running the pipeline locally is useful for testing. By running the pipeline locally, you can confirm that the pipeline runs and behaves as expected before you deploy the pipeline to a distributed environment.
You can run the pipeline locally by using the following command. This command outputs an image named out.png.
python pipeline.py
Create the Google Cloud resources
This section explains how to create the following resources:
- A Cloud Storage bucket to use as a temporary storage location and an output location.
- A Docker container to package the pipeline code and dependencies.
Create a Cloud Storage bucket
Begin by creating a Cloud Storage bucket using the Google Cloud CLI. The Dataflow pipeline uses this bucket as a temporary storage location.
To create the bucket, use the gcloud storage buckets create command:
gcloud storage buckets create gs://BUCKET_NAME --location=LOCATION
Replace the following:
- BUCKET_NAME: a name for your Cloud Storage bucket that meets the bucket naming requirements. Cloud Storage bucket names must be globally unique.
- LOCATION: the location for the bucket.
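Before running the create command, you might pre-check a candidate bucket name locally. The sketch below covers only a subset of the naming requirements for names without dots (3 to 63 characters, lowercase letters, digits, dashes, and underscores, starting and ending with a letter or digit, no goog prefix, no "google" substring). The function name is hypothetical, and the check cannot tell you whether the name is globally unique.

```python
import re

# Subset of the bucket naming rules, for names that contain no dots:
# 3-63 characters, lowercase letters, digits, dashes, and underscores,
# beginning and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Partial, local-only check (hypothetical helper)."""
    if not _BUCKET_RE.fullmatch(name):
        return False
    # Names can't begin with "goog" or contain "google".
    return not name.startswith("goog") and "google" not in name


if __name__ == "__main__":
    for candidate in ["my-dataflow-bucket-123", "My_Bucket", "ab"]:
        print(candidate, looks_like_valid_bucket_name(candidate))
```

The service remains the authority: a name that passes this check can still be rejected, for example because another project already owns it.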
Create and build a container image
You can customize the Dockerfile from this tutorial. The starter file looks like the following:
This Dockerfile contains the FROM, COPY, and RUN commands, which you can read about in the Dockerfile reference.
To upload artifacts, create an Artifact Registry repository. Each repository can contain artifacts for a single supported format.
All repository content is encrypted using either Google-owned and Google-managed encryption keys or customer-managed encryption keys. Artifact Registry uses Google-owned and Google-managed encryption keys by default and no configuration is required for this option.
You must have at least Artifact Registry Writer access to the repository.
Run the following command to create a new repository. The command uses the --async flag and returns immediately, without waiting for the operation in progress to complete.
gcloud artifacts repositories create REPOSITORY \
  --repository-format=docker \
  --location=LOCATION \
  --async
Replace REPOSITORY with a name for your repository. For each repository location in a project, repository names must be unique.
Create the Dockerfile.
For packages to be part of the Apache Beam container, you must specify them as part of the requirements.txt file. Ensure that you don't specify apache-beam as part of the requirements.txt file. The Apache Beam container already has apache-beam.
Before you can push or pull images, configure Docker to authenticate requests for Artifact Registry. To set up authentication to Docker repositories, run the following command:
gcloud auth configure-docker LOCATION-docker.pkg.dev
The command updates your Docker configuration. You can now connect with Artifact Registry in your Google Cloud project to push images.
Build the Docker image using your Dockerfile with Cloud Build.
Update the path in the following command to match the Dockerfile that you created. This command builds the image and pushes it to your Artifact Registry repository.
gcloud builds submit --tag LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/cpp_beam_container:latest .
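The rule that requirements.txt must not list apache-beam is easy to break when you customize the file, so a small check before you build the image can help. The following sketch is not part of the sample repository: the function name is hypothetical, and the sample file contents in the usage section are invented for illustration rather than copied from the starter file.

```python
import re


def beam_requirement_lines(requirements_text):
    """Return the lines that pin apache-beam (hypothetical helper)."""
    offending = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # The package name ends at the first specifier, extra, or marker.
        name = re.split(r"[<>=!~\[;\s]", line, maxsplit=1)[0]
        if name.lower().replace("_", "-") == "apache-beam":
            offending.append(raw.strip())
    return offending


if __name__ == "__main__":
    # Invented sample contents, not the tutorial's actual requirements.txt.
    sample = "numpy\napache-beam[gcp]==2.50.0\n"
    for line in beam_requirement_lines(sample):
        print("Remove this line:", line)
```

To check your own file, you could read it with open("requirements.txt").read() and pass the text to the function.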
Package the code and dependencies in a Docker container
To run this pipeline in a distributed environment, package the code and dependencies into a Docker container.
docker build . -t cpp_beam_container
After you package the code and dependencies, you can run the pipeline locally to test it.
python pipeline.py \
  --runner=PortableRunner \
  --job_endpoint=embed \
  --environment_type=DOCKER \
  --environment_config="docker.io/library/cpp_beam_container"
This command writes the output inside the Docker image. To view the output, run the pipeline with the --output flag and write the output to a Cloud Storage bucket. For example, run the following command.
python pipeline.py \
  --runner=PortableRunner \
  --job_endpoint=embed \
  --environment_type=DOCKER \
  --environment_config="docker.io/library/cpp_beam_container" \
  --output=gs://BUCKET_NAME/out.png
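The --output flag sits alongside runner flags such as --runner and --environment_type on the same command line. One common way for a Python script to separate its own flags from the runner's is argparse with parse_known_args. The sketch below shows that pattern; it is an assumption about how a script like pipeline.py could handle the flag, not a copy of the sample's code, and the function name split_args is hypothetical.

```python
import argparse


def split_args(argv):
    """Separate the script's own flags from runner flags (hypothetical)."""
    parser = argparse.ArgumentParser()
    # Default mirrors the local run, which writes out.png.
    parser.add_argument("--output", default="out.png",
                        help="Local path or gs:// URI for the result.")
    # parse_known_args leaves unrecognized flags untouched, in order.
    known, runner_args = parser.parse_known_args(argv)
    return known.output, runner_args


if __name__ == "__main__":
    output, runner_args = split_args([
        "--runner=PortableRunner",
        "--output=gs://my-bucket/out.png",  # placeholder bucket name
        "--environment_type=DOCKER",
    ])
    print(output)
    print(runner_args)
```

The leftover list is what a script would typically hand to the runner's own option parser.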
Run the pipeline
You can now run the Apache Beam pipeline in Dataflow by referring to the file with the pipeline code and passing the parameters required by the pipeline.
In your shell or terminal, run the pipeline with the Dataflow Runner.
python pipeline.py \
--runner=DataflowRunner \
--project=PROJECT_ID \
--region=REGION \
--temp_location=gs://BUCKET_NAME/tmp \
--sdk_container_image="LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/dataflow/cpp_beam_container:latest" \
--experiment=use_runner_v2 \
--output=gs://BUCKET_NAME/out.png
After you execute the command to run the pipeline, Dataflow returns a job ID with the job status Queued. It might take several minutes before the job status reaches Running and you can access the job graph.
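The Dataflow command repeats the same placeholders in several flags, and the image path must match the tag used when the image was built. As a sketch, the flags can be assembled from one set of values so they stay consistent. The helper names image_uri and dataflow_args and the example values are hypothetical; the flag names and the path layout are taken from the commands in this tutorial.

```python
def image_uri(location, project_id, repository,
              image="dataflow/cpp_beam_container", tag="latest"):
    """Build the Artifact Registry path used for the container image."""
    return f"{location}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


def dataflow_args(project_id, region, bucket, location, repository):
    """Return the flags from the Dataflow Runner command as a list."""
    return [
        "--runner=DataflowRunner",
        f"--project={project_id}",
        f"--region={region}",
        f"--temp_location=gs://{bucket}/tmp",
        f"--sdk_container_image={image_uri(location, project_id, repository)}",
        "--experiment=use_runner_v2",
        f"--output=gs://{bucket}/out.png",
    ]


if __name__ == "__main__":
    # All values below are placeholders; substitute your own.
    args = dataflow_args("my-project", "us-central1", "my-bucket",
                         "us-central1", "my-repo")
    print("python pipeline.py " + " ".join(args))
```

Printing the command first lets you confirm the image path before you submit the job.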
View your results
View data written to your Cloud Storage bucket. Use the gcloud storage ls command to list the contents at the top level of your bucket:
gcloud storage ls gs://BUCKET_NAME
If successful, the command returns a message similar to:
gs://BUCKET_NAME/out.png
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
The easiest way to eliminate billing is to delete the Google Cloud project that you created for the tutorial.
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the individual resources
If you want to reuse the project, then delete the resources that you created for the tutorial.
Clean up Google Cloud project resources
Delete the Artifact Registry repository.
gcloud artifacts repositories delete REPOSITORY \
  --location=LOCATION --async
Delete the Cloud Storage bucket. This bucket alone does not incur any charges.
gcloud storage rm gs://BUCKET_NAME --recursive
Revoke credentials
Revoke the roles that you granted to the user-managed worker service account. Run the following command once for each of the following IAM roles:
roles/dataflow.admin
roles/dataflow.worker
roles/storage.objectAdmin
roles/artifactregistry.reader
gcloud projects remove-iam-policy-binding PROJECT_ID \
  --member=serviceAccount:parallelpipeline@PROJECT_ID.iam.gserviceaccount.com \
  --role=SERVICE_ACCOUNT_ROLE
- Optional: Revoke the authentication credentials that you created, and delete the local credential file.
  gcloud auth application-default revoke
- Optional: Revoke credentials from the gcloud CLI.
  gcloud auth revoke
What's next
- View the sample application on GitHub.
- Use custom containers in Dataflow.
- Learn more about using container environments with Apache Beam.
- Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.