The ML.TFDV_VALIDATE function
This document describes the ML.TFDV_VALIDATE function, which you can use to
compare the statistics for training and serving data, or for two sets of
serving data, to identify anomalous differences between the two datasets.
Calling this function provides the same behavior as calling the TensorFlow
validate_statistics API. You can use the data output by this function for
model monitoring.
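As a rough sketch of the monitoring use mentioned above, the result of each comparison could be appended to a log table with a timestamp. The table `myproject.mydataset.anomaly_log`, its column names, and the two source tables are hypothetical placeholders. Storing the result in a JSON column is an assumption, based on the function returning its report in JSON format.

```sql
-- Hypothetical log table; the JSON column type is an assumption.
CREATE TABLE IF NOT EXISTS `myproject.mydataset.anomaly_log` (
  t TIMESTAMP,
  anomalies JSON
);

-- Append one skew report per run, stamped with the current time.
INSERT `myproject.mydataset.anomaly_log` (t, anomalies)
SELECT
  CURRENT_TIMESTAMP() AS t,
  ML.TFDV_VALIDATE(
    (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`)),
    (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)),
    'SKEW'
  ) AS anomalies;
```

Running a query like this on a schedule would build a history of reports that can be inspected over time.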
Arguments

ML.TFDV_VALIDATE takes the following arguments:

- base_statistics: the statistics of the training or serving data that you want to use as the baseline for comparison. This must be a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format. You can generate a protocol buffer in the correct format by running the ML.TFDV_DESCRIBE function, or you can load it from outside of BigQuery.
- study_statistics: the statistics of the training or serving data that you want to compare to the baseline. This must be a TensorFlow DatasetFeatureStatisticsList protocol buffer in JSON format. You can generate a protocol buffer in the correct format by running the ML.TFDV_DESCRIBE function, or you can load it from outside of BigQuery.
- detection_type: a STRING value that specifies the type of comparison that you want to make. Valid values are as follows:
  - SKEW: returns the data skew, which represents the statistical variation between training and serving data.
  - DRIFT: returns the data drift, which represents the statistical variation between two different sets of serving data.
- categorical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for categorical and ARRAY<categorical> features. The value must be in the range [0, 1). The default value is 0.3.
- categorical_metric_type: a STRING value that specifies the metric used to compare statistics for categorical and ARRAY<categorical> features. Valid values are as follows:
  - L_INFTY: use L-infinity distance. This value is the default.
  - JENSEN_SHANNON_DIVERGENCE: use Jensen-Shannon divergence.
- numerical_default_threshold: a FLOAT64 value that specifies the custom threshold to use for anomaly detection for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The value must be in the range [0, 1). The default value is 0.3.
- numerical_metric_type: a STRING value that specifies the metric used to compare statistics for numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> features. The only valid value is JENSEN_SHANNON_DIVERGENCE.
- thresholds: an ARRAY<STRUCT<STRING, FLOAT64>> value that specifies the anomaly detection thresholds for one or more columns for which you don't want to use the default threshold. The STRING value in the struct specifies the column name, and the FLOAT64 value specifies the threshold. The FLOAT64 value must be in the range [0, 1). For example, [('col_a', 0.1), ('col_b', 0.8)].
ML.TFDV_VALIDATE uses positional arguments, so if you specify an
optional argument, you must also specify all arguments prior to that argument.
For more information on argument types, see
Named arguments.
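To illustrate the positional rule above, here is a minimal sketch that customizes only numerical_default_threshold. Because that argument is sixth in the list, the three optional arguments before it must be written out too, here with their documented defaults. The table names and the value 0.1 are hypothetical.

```sql
DECLARE base_stats JSON;
DECLARE study_stats JSON;

-- Hypothetical tables holding two different windows of serving data.
SET base_stats = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving_last_week`));
SET study_stats = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving_this_week`));

SELECT ML.TFDV_VALIDATE(
  base_stats,
  study_stats,
  'DRIFT',    -- detection_type
  0.3,        -- categorical_default_threshold: documented default, repeated to hold its position
  'L_INFTY',  -- categorical_metric_type: documented default, repeated to hold its position
  0.1         -- numerical_default_threshold: the only value being customized
) AS anomalies;
```

The trailing arguments numerical_metric_type and thresholds are omitted, which is allowed because nothing after them is specified.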
Limitations

The ML.TFDV_VALIDATE function doesn't conduct schema validation.

ML.TFDV_VALIDATE handles type mismatch as follows:

- If you specify JENSEN_SHANNON_DIVERGENCE for the categorical_metric_type or numerical_metric_type argument, the feature with the mismatched type isn't included in the final anomaly report.
- If you specify L_INFTY for the categorical_metric_type argument, the function outputs the computed feature distance as expected.
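One way to see how the metric choice affects a report is to run the same comparison under both categorical metrics and inspect the two results side by side. The following is a sketch under stated assumptions: the table names and the output column aliases are hypothetical, and the metric is selected through the fifth positional argument, categorical_metric_type.

```sql
DECLARE base_stats JSON;
DECLARE study_stats JSON;

-- Hypothetical training and serving tables.
SET base_stats = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`));
SET study_stats = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`));

SELECT
  -- Categorical features compared with L-infinity distance (the default metric).
  ML.TFDV_VALIDATE(base_stats, study_stats, 'SKEW', 0.3, 'L_INFTY') AS anomalies_l_infty,
  -- Same comparison with Jensen-Shannon divergence for categorical features.
  ML.TFDV_VALIDATE(base_stats, study_stats, 'SKEW', 0.3, 'JENSEN_SHANNON_DIVERGENCE') AS anomalies_jsd;
```

Features that appear in one report but not the other are candidates for the type-mismatch behavior described above.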
Syntax

```sql
ML.TFDV_VALIDATE(
  base_statistics,
  study_statistics
  [, detection_type]
  [, categorical_default_threshold]
  [, categorical_metric_type]
  [, numerical_default_threshold]
  [, numerical_metric_type]
  [, thresholds]
)
```

Output

ML.TFDV_VALIDATE returns a TensorFlow Anomalies protocol buffer in JSON format.

Examples

The following example returns the skew between training and serving data
and also sets custom anomaly detection thresholds for two of the feature
columns:

```sql
DECLARE stats1 JSON;
DECLARE stats2 JSON;

SET stats1 = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.training`));

SET stats2 = (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`));

SELECT ML.TFDV_VALIDATE(
  stats1, stats2, 'SKEW', .3, 'L_INFTY', .3, 'JENSEN_SHANNON_DIVERGENCE', [('feature1', 0.2), ('feature2', 0.5)]
);

INSERT `myproject.mydataset.serve_stats`
  (t, dataset_feature_statistics_list)
SELECT CURRENT_TIMESTAMP() AS t, stats1;
```

The following example returns the drift between two sets of serving data:

```sql
SELECT ML.TFDV_VALIDATE(
  (SELECT dataset_feature_statistics_list FROM `myproject.mydataset.servingJan24`),
  (SELECT * FROM ML.TFDV_DESCRIBE(TABLE `myproject.mydataset.serving`)),
  'DRIFT'
);
```

Last updated 2025-08-25 UTC.