Data sinks configurations
This guide describes how to configure data sinks in Manufacturing Data Engine (MDE).
MDE supports five data sinks for records:
- BigQuery
- Cloud Storage
- Bigtable and Federation API
- Pub/Sub (JSON format)
- Pub/Sub (Protobuf format)
Record persistence
The following example shows how to configure individual sinks when you create a type:
REST
POST /configuration/v1/types
{
  "archetype": "<ARCHETYPE>",
  "name": "<TYPE_NAME>",
  "metadataBuckets": [
    {
      "bucketName": "<BUCKET_NAME>",
      "version": "<BUCKET_VERSION>"
    }
  ],
  "storageSpecs": [
    {
      "sink": "BIG_QUERY",
      "disabled": "false",
      "materializeCloudMetadata": "false"
    },
    {
      "sink": "BIG_TABLE",
      "disabled": "false",
      "materializeCloudMetadata": "false"
    },
    {
      "sink": "GCS",
      "disabled": "false",
      "materializeCloudMetadata": "false"
    },
    {
      "sink": "PUBSUB_PROTO",
      "disabled": "false",
      "materializeCloudMetadata": "false"
    },
    {
      "sink": "PUBSUB_JSON",
      "disabled": "false",
      "materializeCloudMetadata": "false"
    }
  ]
}
Replace the following:
- ARCHETYPE: the name of the archetype. One of DISCRETE_DATA_SERIES, CLUSTERED_DISCRETE_DATA_SERIES, NUMERIC_DATA_SERIES, CLUSTERED_NUMERIC_DATA_SERIES, CONTINUOUS_DATA_SERIES, CLUSTERED_CONTINUOUS_DATA_SERIES.
- TYPE_NAME: the name of the type to be created.
- BUCKET_NAME: the name of the bucket to be associated with this type.
- BUCKET_VERSION: the version of the bucket to be associated with this type.
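As a minimal sketch, the request body above can also be assembled programmatically. The helper names below (flag, build_storage_specs, build_type_body) and the sample values are hypothetical and not part of MDE; the sketch only builds the JSON body and does not send it. It assumes that a sink can be switched off by listing it with "disabled" set to "true", and it mirrors this guide's examples by writing flags as the strings "true" and "false".

```python
import json

# Sink identifiers as they appear in the request bodies in this guide.
SINKS = ("BIG_QUERY", "BIG_TABLE", "GCS", "PUBSUB_PROTO", "PUBSUB_JSON")


def flag(value):
    """Render a boolean the way this guide's examples do ("true"/"false")."""
    return "true" if value else "false"


def build_storage_specs(enabled, materialize=()):
    """Return one storageSpecs entry per known sink.

    enabled: sinks that should receive records; all others are marked disabled
             (assumption: a disabled sink may be listed with "disabled": "true").
    materialize: sinks whose records should embed Cloud metadata.
    """
    unknown = (set(enabled) | set(materialize)) - set(SINKS)
    if unknown:
        raise ValueError(f"unknown sink(s): {sorted(unknown)}")
    return [
        {
            "sink": sink,
            "disabled": flag(sink not in enabled),
            "materializeCloudMetadata": flag(sink in materialize),
        }
        for sink in SINKS
    ]


def build_type_body(archetype, name, bucket_name, bucket_version, specs):
    """Assemble the JSON body for POST /configuration/v1/types."""
    return {
        "archetype": archetype,
        "name": name,
        "metadataBuckets": [
            {"bucketName": bucket_name, "version": bucket_version}
        ],
        "storageSpecs": specs,
    }


if __name__ == "__main__":
    # Hypothetical type, bucket and version values, for illustration only.
    specs = build_storage_specs(enabled={"BIG_QUERY", "GCS"}, materialize={"GCS"})
    body = build_type_body(
        "NUMERIC_DATA_SERIES", "example-type", "example-bucket", "1", specs
    )
    print(json.dumps(body, indent=2))
```

Listing every sink explicitly, as the REST example does, keeps the intended state of each sink visible in the request instead of relying on defaults.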
Console
Open the 'CLOUD TAGS' section in the web interface to modify the Record persistence setting for a given Tag:
Click the 'Actions' icon for the Tag whose persistence settings you want to change.
Choose the 'View / Edit' option. A side panel appears.
Expand the 'Storage Settings' section. It lists all available persistence sinks, each with a toggle button next to it.
Select the sinks that are active for the Tag. All records ingested for this Tag are persisted to every selected sink.
Similarly, you can configure sinks when you create a new version of an existing type, as in the following example:
REST
POST /configuration/v1/types/TYPE_NAME/versions
{
  "metadataBuckets": [
    {
      "bucketName": "EXISTING_BUCKET_NAME",
      "version": "EXISTING_BUCKET_VERSION"
    },
    {
      "bucketName": "NEW_BUCKET_NAME",
      "version": "NEW_BUCKET_VERSION"
    }
  ],
  "storageSpecs": [
    {
      "sink": "BIG_QUERY",
      "disabled": "false",
      "materializeCloudMetadata": "true"
    },
    {
      "sink": "BIG_TABLE",
      "disabled": "false",
      "materializeCloudMetadata": "false"
    },
    {
      "sink": "GCS",
      "disabled": "false",
      "materializeCloudMetadata": "false"
    },
    {
      "sink": "PUBSUB_PROTO",
      "disabled": "false",
      "materializeCloudMetadata": "false"
    },
    {
      "sink": "PUBSUB_JSON",
      "disabled": "false",
      "materializeCloudMetadata": "false"
    }
  ]
}
Replace the following:
- TYPE_NAME: the name of the type for which a new version is created.
- EXISTING_BUCKET_NAME: the name of the existing bucket already associated with this type.
- EXISTING_BUCKET_VERSION: the version of the existing bucket already associated with this type.
- NEW_BUCKET_NAME: the name of the new bucket.
- NEW_BUCKET_VERSION: the version of the new bucket.
Console
Open the 'TYPES' section of the top menu to change the Storage Settings at Type level:
Click the 'Actions' icon of the Type you want to edit and select the 'View / Edit' option. The 'Edit Type Version' side menu is displayed.
Expand the 'Storage Settings' panel to access the persistence settings for the Type:
Select each sink where the Records of the Tags belonging to this Type will be persisted by default. The Storage Specification is inherited by the Tags when they are created. The Storage Specification can be modified at the Tag level at any time.
Metadata instance materialization
Metadata instance materialization can be configured for each sink individually. For more information on metadata instance materialization, see instance materialization.
Overriding default type version storage specifications for individual tags
You can also override a type version's default storage specifications for individual tags. The storage settings configured on a tag take precedence over the defaults of its type version.
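The precedence rule can be illustrated with a short sketch. This is not MDE's implementation: the function name effective_specs is hypothetical, and the sketch assumes that overrides are applied per sink, so that a tag-level entry for one sink replaces the type version's entry for that sink and leaves the others untouched.

```python
def effective_specs(type_defaults, tag_overrides):
    """Merge per-sink tag settings over the type version's defaults.

    Both arguments are lists of storageSpecs entries as shown in this guide.
    Assumption: overrides are resolved sink by sink.
    """
    # Copy each default entry so the caller's data is not modified.
    merged = {spec["sink"]: dict(spec) for spec in type_defaults}
    for spec in tag_overrides:
        # Start from the default for this sink, if any, then apply the tag's values.
        merged.setdefault(spec["sink"], {"sink": spec["sink"]}).update(spec)
    return list(merged.values())


if __name__ == "__main__":
    type_defaults = [
        {"sink": "BIG_QUERY", "disabled": "false", "materializeCloudMetadata": "false"},
        {"sink": "PUBSUB_JSON", "disabled": "true", "materializeCloudMetadata": "false"},
    ]
    # This tag additionally streams to Pub/Sub (JSON) with materialized metadata.
    tag_overrides = [
        {"sink": "PUBSUB_JSON", "disabled": "false", "materializeCloudMetadata": "true"},
    ]
    for spec in effective_specs(type_defaults, tag_overrides):
        print(spec)
```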
Sink-specific considerations
The following sections outline some sink-specific considerations:
BigQuery
When you create a data type, MDE automatically creates a new table for it in the mde_data dataset. You can configure whether records are persisted to this table by enabling the BigQuery sink on a type version or on a tag for a type version.
Bigtable and Federation API
If you have provisioned a Bigtable cluster, you can configure whether records are persisted to Bigtable so that you can access them using the Federation API.
Cloud metadata instances are not persisted standalone in Bigtable. If you need Cloud metadata in the records that you retrieve from Bigtable using the Federation API, set materializeCloudMetadata to true for the Bigtable sink.
Cloud Storage
Records are stored in a Cloud Storage bucket named <project_id>-gcs-ingestion as Avro files, using Hive partitioning with a ten-minute window and ten partitions per window. Records are grouped in folders by type. Cloud metadata instances are not sent to the Cloud Storage sink standalone. If you need Cloud metadata in the records stored in Cloud Storage, set materializeCloudMetadata to true for the Cloud Storage sink.
Pub/Sub: Proto and JSON
MDE provides two flavors of the Pub/Sub sink: one delivers records in JSON format and the other delivers records in Protobuf format. See the reference section for the respective data schemas.
The JSON-formatted record stream is sent to the mde-tag-stream-json Pub/Sub topic, and the Protobuf-formatted record stream is sent to the mde-tag-stream-proto Pub/Sub topic. Cloud metadata instances are not sent to these topics. If you need Cloud metadata in the records sent to Pub/Sub, set materializeCloudMetadata to true for the Pub/Sub sinks.
The Pub/Sub sinks are disabled by default unless you explicitly enable them on a type.
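The Bigtable, Cloud Storage and Pub/Sub notes above share one rule: these sinks do not receive Cloud metadata instances on their own, so records carry Cloud metadata only when materializeCloudMetadata is true. The following sketch checks a storageSpecs list against that rule before it is submitted. The function names are hypothetical, and the sketch assumes that a missing "disabled" or "materializeCloudMetadata" key means false.

```python
# Sinks that, per this guide, do not receive Cloud metadata instances standalone.
NEEDS_MATERIALIZATION = {"BIG_TABLE", "GCS", "PUBSUB_PROTO", "PUBSUB_JSON"}


def as_bool(value):
    """Accept JSON booleans or the "true"/"false" strings used in this guide."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def sinks_missing_cloud_metadata(storage_specs):
    """Return the enabled sinks whose records will not carry Cloud metadata."""
    return [
        spec["sink"]
        for spec in storage_specs
        if spec["sink"] in NEEDS_MATERIALIZATION
        and not as_bool(spec.get("disabled", False))
        and not as_bool(spec.get("materializeCloudMetadata", False))
    ]


if __name__ == "__main__":
    specs = [
        {"sink": "BIG_QUERY", "disabled": "false", "materializeCloudMetadata": "false"},
        {"sink": "BIG_TABLE", "disabled": "false", "materializeCloudMetadata": "true"},
        {"sink": "GCS", "disabled": "false", "materializeCloudMetadata": "false"},
        {"sink": "PUBSUB_JSON", "disabled": "true", "materializeCloudMetadata": "false"},
    ]
    for sink in sinks_missing_cloud_metadata(specs):
        print(f"{sink}: enabled without materializeCloudMetadata")
```

A sink reported by this check is not necessarily misconfigured; it only means that consumers of that sink will see records without Cloud metadata.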
Persistence of metadata, logs and configuration data
MDE always persists metadata instances, logs and configuration data in the mde_dimension and mde_system datasets in BigQuery. Additionally, MDE always persists raw source messages in Cloud Storage. The persistence of this data can't be disabled.