Query in Cloud Storage

This guide explains how data is stored in Cloud Storage, in both raw and processed formats.

Raw data

Manufacturing Data Engine (MDE) stores all received messages in raw format, before any processing. This is useful both for archiving and for reprocessing data after configuration or system errors.

Raw data is stored in Avro format using the following schema:

{
  "type" : "record",
  "name" : "AvroPubsubMessageRecord",
  "namespace" : "com.google.cloud.industry.manufacturing.sfp.datalake.core",
  "fields" : [ {
    "name" : "attributes",
    "type" : {
      "type" : "map",
      "values" : "string"
    }
  }, {
    "name" : "message",
    "type" : {
      "type" : "bytes",
      "java-class" : "[B"
    }
  }, {
    "name" : "messageId",
    "type" : "string"
  }, {
    "name" : "timestamp",
    "type" : "long"
  } ]
}

The schema is composed of the following fields:

attributes: Stores the Pub/Sub message attribute map.
message: Stores the raw message as received in the Pub/Sub topic input-messages.
messageId: ID set by Pub/Sub when it receives the message. It's written to all the sinks to provide lineage of where the data came from.
timestamp: When the message was received by Pub/Sub.

By default, the data is stored in a Cloud Storage bucket named <project-id>-raw, under a folder for the version, in this case v1.3.

Date partitioning is then used to separate files by day with the format dt=YYYY-MM-DD.
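The bucket, version folder, and date partition can be combined into a prefix for one day of raw data. The following is a minimal sketch; the function name raw_day_prefix and the project ID my-project are hypothetical, and the default version folder assumes the v1.3 layout described in this guide.

```python
from datetime import date


def raw_day_prefix(project_id: str, day: date, version: str = "v1.3") -> str:
    """Return the Cloud Storage prefix that holds one day of raw files.

    Assumes the default bucket name <project-id>-raw and the
    dt=YYYY-MM-DD partition format described in this guide.
    """
    return f"gs://{project_id}-raw/{version}/dt={day.isoformat()}/"


if __name__ == "__main__":
    # "my-project" is a placeholder project ID.
    print(raw_day_prefix("my-project", date(2023, 8, 8)))
```

A prefix built this way can be passed to any tool that lists objects by prefix.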

Messages are grouped into 10-minute windows, and 10 files are written for each window. The window boundaries and the file index are used to produce the filenames. The following listing shows an example:

gs://<project-id>-raw/v1.3/dt=2023-08-08/gcsoutput2023-08-08T09:20:00.000Z-2023-08-08T09:30:00.000Z-pane-0-last-00-of-10avro
gs://<project-id>-raw/v1.3/dt=2023-08-08/gcsoutput2023-08-08T09:20:00.000Z-2023-08-08T09:30:00.000Z-pane-0-last-01-of-10avro
gs://<project-id>-raw/v1.3/dt=2023-08-08/gcsoutput2023-08-08T09:20:00.000Z-2023-08-08T09:30:00.000Z-pane-0-last-02-of-10avro
gs://<project-id>-raw/v1.3/dt=2023-08-08/gcsoutput2023-08-08T09:20:00.000Z-2023-08-08T09:30:00.000Z-pane-0-last-03-of-10avro
gs://<project-id>-raw/v1.3/dt=2023-08-08/gcsoutput2023-08-08T09:20:00.000Z-2023-08-08T09:30:00.000Z-pane-0-last-04-of-10avro
gs://<project-id>-raw/v1.3/dt=2023-08-08/gcsoutput2023-08-08T09:20:00.000Z-2023-08-08T09:30:00.000Z-pane-0-last-05-of-10avro
gs://<project-id>-raw/v1.3/dt=2023-08-08/gcsoutput2023-08-08T09:20:00.000Z-2023-08-08T09:30:00.000Z-pane-0-last-06-of-10avro
gs://<project-id>-raw/v1.3/dt=2023-08-08/gcsoutput2023-08-08T09:20:00.000Z-2023-08-08T09:30:00.000Z-pane-0-last-07-of-10avro
gs://<project-id>-raw/v1.3/dt=2023-08-08/gcsoutput2023-08-08T09:20:00.000Z-2023-08-08T09:30:00.000Z-pane-0-last-08-of-10avro
gs://<project-id>-raw/v1.3/dt=2023-08-08/gcsoutput2023-08-08T09:20:00.000Z-2023-08-08T09:30:00.000Z-pane-0-last-09-of-10avro
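The window boundaries and the file index can be recovered from a raw object name. The following sketch infers the pattern only from the example names above, so it may not cover every name that MDE can produce; parse_raw_name and the bucket name example-raw are hypothetical.

```python
import re
from datetime import datetime, timezone

_TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z"

# Pattern inferred from the example object names in this guide.
RAW_NAME = re.compile(
    rf"gcsoutput(?P<start>{_TS})-(?P<end>{_TS})"
    r"-pane-\d+-last-(?P<shard>\d+)-of-(?P<shards>\d+)"
)


def _parse_ts(text: str) -> datetime:
    """Parse a timestamp such as 2023-08-08T09:20:00.000Z as UTC."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_raw_name(uri: str) -> dict:
    """Extract the window start, window end, and file index from a raw object name."""
    match = RAW_NAME.search(uri)
    if match is None:
        raise ValueError(f"unrecognized raw object name: {uri}")
    return {
        "window_start": _parse_ts(match["start"]),
        "window_end": _parse_ts(match["end"]),
        "shard": int(match["shard"]),
        "shards": int(match["shards"]),
    }


if __name__ == "__main__":
    name = (
        "gs://example-raw/v1.3/dt=2023-08-08/"
        "gcsoutput2023-08-08T09:20:00.000Z-2023-08-08T09:30:00.000Z"
        "-pane-0-last-03-of-10avro"
    )
    print(parse_raw_name(name))
```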

Reading from BigQuery

The following sections describe the data reading procedure from BigQuery.

Processed data

MDE can be configured to store processed data in Cloud Storage.

By default, the data is saved in a bucket named <project-id>-gcs-ingestion. Within the bucket, data is stored under a folder for the version (in this case v1.3), and each type is stored in its own folder (default-discrete-records, default-numeric-records, and so on). Finally, date partitioning separates files by day with the format dt=YYYY-MM-DD.

As with raw data, messages are grouped into 10-minute windows and 10 files are written for each window. This information is used to produce the filenames, for example:

gs://<project-id>-gcs-ingestion/v1.3/default-discrete-records/dt=2023-08-08/gcsoutput2023-08-08T09:40:00.000Z-2023-08-08T09:50:00.000Z-00000-of-00010.avro
gs://<project-id>-gcs-ingestion/v1.3/default-discrete-records/dt=2023-08-08/gcsoutput2023-08-08T09:40:00.000Z-2023-08-08T09:50:00.000Z-00001-of-00010.avro
gs://<project-id>-gcs-ingestion/v1.3/default-discrete-records/dt=2023-08-08/gcsoutput2023-08-08T09:40:00.000Z-2023-08-08T09:50:00.000Z-00002-of-00010.avro
gs://<project-id>-gcs-ingestion/v1.3/default-discrete-records/dt=2023-08-08/gcsoutput2023-08-08T09:40:00.000Z-2023-08-08T09:50:00.000Z-00003-of-00010.avro
gs://<project-id>-gcs-ingestion/v1.3/default-discrete-records/dt=2023-08-08/gcsoutput2023-08-08T09:40:00.000Z-2023-08-08T09:50:00.000Z-00004-of-00010.avro
gs://<project-id>-gcs-ingestion/v1.3/default-discrete-records/dt=2023-08-08/gcsoutput2023-08-08T09:40:00.000Z-2023-08-08T09:50:00.000Z-00005-of-00010.avro
gs://<project-id>-gcs-ingestion/v1.3/default-discrete-records/dt=2023-08-08/gcsoutput2023-08-08T09:40:00.000Z-2023-08-08T09:50:00.000Z-00006-of-00010.avro
gs://<project-id>-gcs-ingestion/v1.3/default-discrete-records/dt=2023-08-08/gcsoutput2023-08-08T09:40:00.000Z-2023-08-08T09:50:00.000Z-00007-of-00010.avro
gs://<project-id>-gcs-ingestion/v1.3/default-discrete-records/dt=2023-08-08/gcsoutput2023-08-08T09:40:00.000Z-2023-08-08T09:50:00.000Z-00008-of-00010.avro
gs://<project-id>-gcs-ingestion/v1.3/default-discrete-records/dt=2023-08-08/gcsoutput2023-08-08T09:40:00.000Z-2023-08-08T09:50:00.000Z-00009-of-00010.avro
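To locate the files that might contain a given event, the expected object names for a window can be generated from a timestamp. This is a sketch under assumptions that the guide does not state explicitly: windows are aligned to 10-minute boundaries in UTC, and the dt partition uses the date of the window start. The names processed_uris and window_start and the project ID example are hypothetical.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=10)
SHARDS = 10  # files written for each window, per this guide


def window_start(ts: datetime) -> datetime:
    """Round a timezone-aware timestamp down to a 10-minute boundary in UTC (assumed alignment)."""
    ts = ts.astimezone(timezone.utc)
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)


def _fmt(ts: datetime) -> str:
    """Format a timestamp as it appears in the example names, e.g. 2023-08-08T09:40:00.000Z."""
    return ts.strftime("%Y-%m-%dT%H:%M:%S.") + f"{ts.microsecond // 1000:03d}Z"


def processed_uris(project_id: str, type_folder: str, ts: datetime,
                   version: str = "v1.3") -> list[str]:
    """List the object names expected for the window that contains ts."""
    start = window_start(ts)
    end = start + WINDOW
    prefix = (
        f"gs://{project_id}-gcs-ingestion/{version}/{type_folder}/"
        f"dt={start:%Y-%m-%d}/"
    )
    return [
        f"{prefix}gcsoutput{_fmt(start)}-{_fmt(end)}-{i:05d}-of-{SHARDS:05d}.avro"
        for i in range(SHARDS)
    ]


if __name__ == "__main__":
    event = datetime(2023, 8, 8, 9, 43, 12, tzinfo=timezone.utc)
    for uri in processed_uris("example", "default-discrete-records", event):
        print(uri)
```

Pass a timezone-aware datetime; a naive value would be interpreted in the local time zone of the machine.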

Each archetype has its own schema, which varies only slightly from the others, and each type uses the schema of its archetype. See the following examples:

NumericDataSeries archetype schema:

{
  "type": "record",
  "namespace": "com.google.cloud.industry.manufacturing.sfp.datalake.storage.gcs",
  "name": "NumericDataSeriesGCSObject",
  "fields": [
    {
      "name": "id",
      "type": "string",
      "doc": "Unique record id"
    },
    {
      "name": "tag_name",
      "type": "string",
      "doc": "Name of the tag"
    },
    {
      "name": "type_name",
      "type": "string"
    },
    {
      "name": "type_version",
      "type": "int"
    },
    {
      "name": "embedded_metadata",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "materialized_cloud_metadata",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "cloud_metadata_ref",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "source_message_id",
      "type": "string"
    },
    {
      "name": "event_timestamp",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "value",
      "type": "double"
    }
  ]
}
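The event_timestamp field uses the Avro timestamp-millis logical type, which counts milliseconds since the Unix epoch in UTC. Some Avro readers convert this type automatically; when a reader returns the plain long value instead, a conversion like the following sketch can be applied. The record shown is an invented, partial example, not real MDE output.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_datetime(millis: int) -> datetime:
    """Convert an Avro timestamp-millis value to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


if __name__ == "__main__":
    # Invented record with only some of the schema's fields, for illustration.
    record = {
        "tag_name": "example-tag",
        "event_timestamp": 1691487600000,
        "value": 21.5,
    }
    print(millis_to_datetime(record["event_timestamp"]).isoformat())
```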

DiscreteDataSeries archetype schema:

{
  "type": "record",
  "name": "DiscreteDataSeriesGCSObject",
  "namespace": "com.google.cloud.industry.manufacturing.sfp.datalake.storage.gcs",
  "fields": [
    {
      "name": "id",
      "type": {
        "type": "string",
        "avro.java.string": "String"
      },
      "doc": "Unique record id"
    },
    {
      "name": "tag_name",
      "type": {
        "type": "string",
        "avro.java.string": "String"
      },
      "doc": "Name of the tag"
    },
    {
      "name": "type_name",
      "type": {
        "type": "string",
        "avro.java.string": "String"
      }
    },
    {
      "name": "type_version",
      "type": "int"
    },
    {
      "name": "embedded_metadata",
      "type": [
        "null",
        {
          "type": "string",
          "avro.java.string": "String"
        }
      ],
      "default": null
    },
    {
      "name": "materialized_cloud_metadata",
      "type": [
        "null",
        {
          "type": "string",
          "avro.java.string": "String"
        }
      ],
      "default": null
    },
    {
      "name": "cloud_metadata_ref",
      "type": [
        "null",
        {
          "type": "string",
          "avro.java.string": "String"
        }
      ],
      "default": null
    },
    {
      "name": "source_message_id",
      "type": {
        "type": "string",
        "avro.java.string": "String"
      }
    },
    {
      "name": "event_timestamp",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "data",
      "type": {
        "type": "string",
        "avro.java.string": "String"
      }
    }
  ]
}

ContinuousDataSeries archetype schema:

{
  "type": "record",
  "namespace": "com.google.cloud.industry.manufacturing.sfp.datalake.storage.gcs",
  "name": "ContinuousDataSeriesGCSObject",
  "fields": [
    {
      "name": "id",
      "type": "string",
      "doc": "Unique record id"
    },
    {
      "name": "tag_name",
      "type": "string",
      "doc": "Name of the tag"
    },
    {
      "name": "type_name",
      "type": "string"
    },
    {
      "name": "type_version",
      "type": "int"
    },
    {
      "name": "embedded_metadata",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "materialized_cloud_metadata",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "cloud_metadata_ref",
      "type": ["null", "string"],
      "default": null
    },
    {
      "name": "source_message_id",
      "type": "string"
    },
    {
      "name": "event_timestamp_start",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "event_timestamp_end",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "data",
      "type": "string"
    },
    {
      "name": "duration",
      "type": "long"
    }
  ]
}
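In all three schemas, the metadata fields are unions with "null" and carry a null default, while the remaining fields are required. The following sketch uses only the standard json module to separate the two groups and to check a record for missing required fields. The schema text is a shortened excerpt of the ContinuousDataSeries schema, and the helper names and the sample record are hypothetical.

```python
import json

# Shortened excerpt of the ContinuousDataSeries schema shown in this guide.
SCHEMA_EXCERPT = """
{
  "type": "record",
  "name": "ContinuousDataSeriesGCSObject",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "embedded_metadata", "type": ["null", "string"], "default": null},
    {"name": "event_timestamp_start",
     "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "duration", "type": "long"}
  ]
}
"""


def split_fields(schema: dict) -> tuple[list[str], list[str]]:
    """Return (required, optional) field names; optional means a union that includes "null"."""
    required, optional = [], []
    for field in schema["fields"]:
        field_type = field["type"]
        nullable = isinstance(field_type, list) and "null" in field_type
        (optional if nullable else required).append(field["name"])
    return required, optional


def missing_required(record: dict, schema: dict) -> list[str]:
    """List the required fields that are absent from a record."""
    required, _ = split_fields(schema)
    return [name for name in required if name not in record]


if __name__ == "__main__":
    schema = json.loads(SCHEMA_EXCERPT)
    print(split_fields(schema))
    print(missing_required({"id": "abc", "duration": 5}, schema))
```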