Package and import transforms

Apache Beam YAML lets you package and reuse transforms through Beam YAML providers. A provider encapsulates one or more transforms as a reusable unit that you can import into your Beam YAML pipelines. You can package YAML, Python, and Java Apache Beam transforms this way.

With the job builder, you can load providers from Cloud Storage to use them in your job.

Writing providers

Beam YAML providers are defined in YAML files. These files specify the implementation and configuration of the provided transforms. Each provider listing is a YAML list item with type and transforms keys. Java and Python providers also have a config key that specifies where the transform implementation comes from, such as Python packages or a JAR file. YAML-defined provider implementations are expressed inline.

YAML providers

YAML providers define new YAML transforms as a map of names to transform definitions. For example, this provider defines a transform that squares a field from its input:

- type: yaml
  transforms:
    SquareElement:
      body:
        type: chain
        transforms:
          - type: MapToFields
            config:
              language: python
              append: true
              fields:
                power: "element ** 2"
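
As a rough sketch of how this provider could be used, the following pipeline file lists the provider inline under a top-level providers key and then calls SquareElement like any built-in transform. The Create input values and the LogForTesting step are illustrative assumptions and aren't part of the provider itself:

pipeline:
  type: chain
  transforms:
    # Illustrative input: each row has a single field named element.
    - type: Create
      config:
        elements:
          - {element: 2}
          - {element: 3}
    # Provided transform, resolved through the providers list below.
    - type: SquareElement
    # Logs each output row so that you can inspect the appended field.
    - type: LogForTesting

providers:
  - type: yaml
    transforms:
      SquareElement:
        body:
          type: chain
          transforms:
            - type: MapToFields
              config:
                language: python
                append: true
                fields:
                  power: "element ** 2"

Because append is set to true, each output row should keep its element field and gain a power field.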

YAML providers can also declare transform parameters with a config_schema key in the transform definition and reference those parameters in the body through Jinja2 templating:

- type: yaml
  transforms:
    RaiseElementToPower:
      config_schema:
        properties:
          n: {type: integer}
      body:
        type: chain
        transforms:
          - type: MapToFields
            config:
              language: python
              append: true
              fields:
                power: "element ** {{n}}"
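
To illustrate how the parameter is supplied, a pipeline step could pass n through the transform's config block. With the illustrative value 3, the templated expression would render as element ** 3:

- type: RaiseElementToPower
  config:
    # Illustrative value; substituted for {{n}} in the provider body.
    n: 3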

If a provided transform functions as a source, it must set requires_inputs: false:

- type: yaml
  transforms:
    CreateTestElements:
      requires_inputs: false
      body: |
        type: Create
        config:
          elements: [1,2,3,4]

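Providers can also define composite transforms, whose bodies chain other transforms. The following example uses the RaiseElementToPower transform defined earlier and assumes that a Range source transform is also available from a provider: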

- type: yaml
  transforms:
    ConsecutivePowers:
      config_schema:
        properties:
          end: {type: integer}
          n: {type: integer}
      requires_inputs: false
      body: |
        type: chain
        transforms:
          - type: Range
            config:
              end: {{end}}
          - type: RaiseElementToPower
            config:
              n: {{n}}
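
The example above references a Range transform. One way to supply it, shown here only as a sketch and not necessarily how your providers define it, is another YAML provider that uses a Jinja2 loop to generate the elements of a Create source:

- type: yaml
  transforms:
    Range:
      config_schema:
        properties:
          end: {type: integer}
      # Range produces elements itself, so it takes no inputs.
      requires_inputs: false
      body: |
        type: Create
        config:
          elements:
            {% for ix in range(end) %}
            - {{ix}}
            {% endfor %}

With both transforms available, a pipeline could then start from the composite, for example with end set to 5 and n set to 2:

- type: ConsecutivePowers
  config:
    # Illustrative values for the two declared parameters.
    end: 5
    n: 2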

Python providers

Python transforms can be provided using the following syntax:

- type: pythonPackage
  config:
    packages:
      - pypi_package>=version
  transforms:
    MyCustomTransform: "pkg.module.PTransformClassOrCallable"

For an in-depth example, see the Python provider starter project on GitHub.

Java providers

Java transforms can be provided using the following syntax:

- type: javaJar
  config:
    jar: gs://your-bucket/your-java-transform.jar
  transforms:
    MyCustomTransform: "urn:registered:in:transform"

For an in-depth example, see the Java provider starter project on GitHub.

Using providers in the job builder

Transforms defined in providers can be imported from Cloud Storage and used in the job builder. To use a provider in the job builder:

  1. Save a provider as a YAML file in Cloud Storage.

  2. Go to the Jobs page in the Google Cloud console.

  3. Click Create job from builder.

  4. Locate the YAML Providers section. You might need to scroll.

  5. In the YAML provider path box, enter the Cloud Storage location of the provider file.

  6. Wait for the provider to load. If the provider is valid, the transforms that it defines appear in the Loaded transforms section.

  7. Locate your transform's name in the Loaded transforms section and click the button to insert the transform in your job.

  8. If your transform requires parameters, define them in the YAML transform configuration editor for your transform. Enter the parameters as a YAML object that maps parameter names to parameter values.
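
For example, for the ConsecutivePowers transform defined earlier, the configuration editor contents might look like the following. The values are illustrative, and this sketch assumes that the editor expects only the parameter mapping itself:

end: 5
n: 2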

What's next