Apache Beam YAML lets you package and reuse transforms through Beam YAML providers. Providers allow you to encapsulate transforms into a reusable unit that you can then import in your Beam YAML pipelines. YAML, Python, and Java Apache Beam transforms can all be packaged in this way.
With the job builder, you can load providers from Cloud Storage to use them in your job.
Writing providers
Beam YAML providers are defined in YAML files. These files specify the implementation and configuration of the provided transforms. Individual provider listings are expressed as YAML list items with type and transforms keys. Java and Python providers also have a config key that specifies the transform implementation. YAML-defined provider implementations are expressed inline.
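As a rough sketch of the overall shape, a provider file is a YAML list in which each item is one provider listing. The CopyElement transform below is a hypothetical illustration; the jar path and URN are the same placeholders used later on this page.

```yaml
# Each list item is one provider listing.
- type: yaml                # implementation is written inline under `transforms`
  transforms:
    CopyElement:            # hypothetical transform name
      body:
        type: chain
        transforms:
          - type: MapToFields
            config:
              language: python
              append: true
              fields:
                copy: "element"

- type: javaJar             # implementation is located through `config`
  config:
    jar: gs://your-bucket/your-java-transform.jar
  transforms:
    MyCustomTransform: "urn:registered:in:transform"
```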
YAML providers
YAML providers define new YAML transforms as a map of names to transform definitions. For example, this provider defines a transform that squares a field from its input:
- type: yaml
  transforms:
    SquareElement:
      body:
        type: chain
        transforms:
          - type: MapToFields
            config:
              language: python
              append: true
              fields:
                power: "element ** 2"
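Outside the job builder, a provider like this can also be listed directly in a pipeline file. The following is a minimal sketch, assuming the top-level providers key of Beam YAML pipelines and the built-in Create and LogForTesting transforms; the input values are arbitrary.

```yaml
pipeline:
  type: chain
  transforms:
    # Create wraps each scalar in a row with a single field named `element`.
    - type: Create
      config:
        elements: [1, 2, 3]
    # Provided transform: appends a `power` field holding element ** 2.
    - type: SquareElement
    - type: LogForTesting

providers:
  - type: yaml
    transforms:
      SquareElement:
        body:
          type: chain
          transforms:
            - type: MapToFields
              config:
                language: python
                append: true
                fields:
                  power: "element ** 2"
```

Because append is set to true, each output row should keep its original element field alongside the new power field.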
YAML providers can also declare transform parameters with a config_schema key in the transform definition and reference those parameters in the body using Jinja2 templating:
- type: yaml
  transforms:
    RaiseElementToPower:
      config_schema:
        properties:
          n: {type: integer}
      body:
        type: chain
        transforms:
          - type: MapToFields
            config:
              language: python
              append: true
              fields:
                power: "element ** {{n}}"
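When the transform is used, its parameters are supplied through the usual config key. A small sketch of one entry in a pipeline's transforms list, with the value 3 chosen arbitrarily:

```yaml
# Fragment of a pipeline's `transforms` list.
- type: RaiseElementToPower
  config:
    n: 3    # substituted for {{n}}, so the expression becomes "element ** 3"
```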
If a provided transform functions as a source, it must set requires_inputs: false:
- type: yaml
  transforms:
    CreateTestElements:
      requires_inputs: false
      body: |
        type: Create
        config:
          elements: [1,2,3,4]
It is also possible to define composite transforms:
- type: yaml
  transforms:
    ConsecutivePowers:
      config_schema:
        properties:
          end: {type: integer}
          n: {type: integer}
      requires_inputs: false
      body: |
        type: chain
        transforms:
          - type: Range
            config:
              end: {{end}}
          - type: RaiseElementToPower
            config:
              n: {{n}}
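Because ConsecutivePowers sets requires_inputs: false, it can sit at the start of a chain. The sketch below assumes that the provider listings have been saved to a file at a hypothetical Cloud Storage path, that the file also provides the Range and RaiseElementToPower transforms that the composite refers to, and that your Beam version supports the include form of a provider listing.

```yaml
pipeline:
  type: chain
  transforms:
    # Acts as a source, so no upstream transform is needed.
    - type: ConsecutivePowers
      config:
        end: 5   # forwarded to Range as {{end}}
        n: 2     # forwarded to RaiseElementToPower as {{n}}
    - type: LogForTesting

providers:
  # Hypothetical path; the file must define ConsecutivePowers and
  # every transform its body uses.
  - include: "gs://your-bucket/providers.yaml"
```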
Python providers
Python transforms can be provided using the following syntax:
- type: pythonPackage
  config:
    packages:
      - pypi_package>=version
  transforms:
    MyCustomTransform: "pkg.module.PTransformClassOrCallable"
For an in-depth example, see the Python provider starter project on GitHub.
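To make the placeholders concrete, here is a sketch with the syntax filled in and the transform used in a pipeline. The package name, class path, and case parameter are all hypothetical; the sketch also assumes that values under the transform's config are passed to the Python class or callable as keyword arguments.

```yaml
providers:
  - type: pythonPackage
    config:
      packages:
        - my_text_transforms>=1.0.0         # hypothetical PyPI package
    transforms:
      ToCase: "my_text_transforms.ToCase"   # hypothetical PTransform class

pipeline:
  type: chain
  transforms:
    - type: Create
      config:
        elements: ["a", "b"]
    - type: ToCase
      config:
        case: upper    # hypothetical constructor argument
    - type: LogForTesting
```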
Java providers
Java transforms can be provided using the following syntax:
- type: javaJar
  config:
    jar: gs://your-bucket/your-java-transform.jar
  transforms:
    MyCustomTransform: "urn:registered:in:transform"
For an in-depth example, see the Java provider starter project on GitHub.
Using providers in the job builder
Transforms defined in providers can be imported from Cloud Storage and used in the job builder. To use a provider in the job builder:
Save a provider as a YAML file in Cloud Storage.
Go to the Jobs page in the Google Cloud console.
Click Create job from builder.
Locate the YAML Providers section. You might need to scroll.
In the YAML provider path box, enter the Cloud Storage location of the provider file.
Wait for the provider to load. If the provider is valid, the transforms that it defines appear in the Loaded transforms section.
Locate your transform's name in the Loaded transforms section and click the button to insert the transform into your job.
If your transform requires parameters, define them in the YAML transform configuration editor for your transform. Define the parameters as a YAML object that maps parameter names to parameter values.
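As an illustration of the last step, the parameter object for the ConsecutivePowers transform defined earlier on this page, whose config_schema declares the integer parameters end and n, might look like the following sketch. The values are arbitrary.

```yaml
# Keys are parameter names from config_schema; values are the arguments.
end: 5
n: 2
```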
What's next
- Learn more about Beam YAML providers.