Most streaming data pipelines require data transformations. Some users prefer transforming data after it reaches its destination in an extract, load, transform (ELT) pipeline, while others opt for transforming data before its ingestion in an extract, transform, and load (ETL) pipeline. Traditionally, this architecture required complex pipelines with tools like Dataflow or Apache Flink to perform data transformations.
Pub/Sub offers Single Message Transforms (SMTs) to simplify data transformations for streaming pipelines. SMTs enable lightweight modifications to message data and attributes directly within Pub/Sub. SMTs eliminate the need for additional data processing steps or separate data transformation products.
SMTs use cases
Consider designing an online store that wants to give customers personalized product recommendations as they browse the website. To do this, you can use Pub/Sub to collect real-time data about customer activity on the site. This includes data about the products viewed, the products added to the cart, and the ratings given to products.
However, this raw data often needs some adjustments before it can be used to generate recommendations. For example, the raw data might contain extraneous details which are irrelevant for your use case. Examples of such details are the customer's browser type or the time they visited the site. The data might also not be in the required format for the recommendation system. For example, timestamps might be in different formats, or product IDs might need to be converted to a different type.
You can use Pub/Sub SMTs to make data transformations such as the following:
Remove personally identifiable information (PII), such as full names and addresses, to protect customer privacy.
Retain only recommendation-relevant events, such as product views and purchases, and discard others, such as customer profile changes.
Ensure all timestamps, currency values, and product IDs adhere to a consistent format and type compatible with the recommendation system.
Generate new data fields from raw data, such as shopping cart total value or product page dwell time.
In summary, SMTs enable a wide range of use cases, including the following:
Data masking and redaction: Protect sensitive data by masking or redacting fields like credit card numbers or PII, aiding compliance with data privacy regulations.
Data format conversion: Transform data between different formats to ensure compatibility with downstream systems.
Message filtering: Process only relevant messages by filtering out unwanted messages based on content or attributes. SMTs allow for more complex filtering conditions than Pub/Sub's built-in filters.
Simple data transformations: Perform basic data manipulation tasks, such as string manipulation, date formatting, or mathematical operations.
Sample message flow for SMTs
The image shows an example Pub/Sub system with SMTs applied at both the
topic and subscription levels.
The following procedure shows how the messages flow in the Pub/Sub system:
The publisher applications Publisher 1 and Publisher 2 publish messages A and B respectively to the Pub/Sub topic.
The topic's SMTs transform messages A and B into messages A' and B', respectively.
If a schema is attached to the topic, the transformed messages A' and B' are validated against the schema. If, for example, A' does not match the schema, the publish of message A fails with an error.
The transformed messages A' and B' are written to Pub/Sub storage.
Pub/Sub delivers messages A' and B' to all attached subscriptions, which are Subscription 1 and Subscription 2 as shown in the image.
If Subscription 1 has a filter configured, messages A' and B' are evaluated against the filter. Only messages matching the filter proceed to the next step. Other messages are automatically acknowledged by Pub/Sub.
If Subscription 2 has a filter configured, messages A' and B' are evaluated against the filter. Only messages matching the filter proceed to the next step. Other messages are automatically acknowledged by Pub/Sub.
Subscription 1's SMTs transform messages A' and B'. A' becomes A'' and B' becomes B''.
Subscription 2's SMTs transform messages A' and B'. A' remains as A' and B' is filtered out.
If Subscription 1 is a push subscription with payload unwrapping enabled, messages A'' and B'' are unwrapped. If Subscription 2 is a push subscription with payload unwrapping enabled, A' is unwrapped.
Subscriber 1 receives message B'', Subscriber 2 receives message A'', and Subscriber 3 receives message A'.
Subscribers acknowledge the received messages.
Pub/Sub deletes the acknowledged messages from storage.
Important information about SMTs
SMTs are integrated into the Pub/Sub API, allowing you to manage them as part of your topic or subscription configurations.
Up to 5 SMTs can be enabled on a topic or subscription.
SMTs operate on a single Pub/Sub message. They cannot aggregate multiple Pub/Sub messages.
When a SMT is run, it takes as input the Pub/Sub message, including its data and attributes. The output is a transformed Pub/Sub message, with modifications to its data or attributes.
If you have an SMT defined on a subscription that has ordering enabled and executing the SMT on any message throws an error, the subsequent messages for the same ordering key are not delivered to the subscriber. Set up a dead-letter topic on the subscription to remove such a message that throws an error from the backlog of messages so subsequent messages can be delivered.