The ML.TFDV_DESCRIBE function
This document describes the ML.TFDV_DESCRIBE function, which you can use
to generate fine-grained statistics for the columns in a table. For example, you
might want to know statistics for a table of training or serving data
that you plan to use with a machine learning (ML) model. Calling
this function provides the same behavior as calling the
TensorFlow tfdv.generate_statistics_from_csv API.
You can use the data output by this function for such purposes as
feature preprocessing or
model monitoring.
ML.TFDV_DESCRIBE takes the following arguments:
PROJECT_ID: your project ID.
DATASET: the BigQuery dataset that contains
the table.
TABLE_NAME: the name of the input table that contains
the training or serving data to calculate statistics for.
QUERY_STATEMENT: a query that generates the training
or serving data to calculate statistics for. For the supported SQL syntax of
the QUERY_STATEMENT clause, see GoogleSQL query
syntax.
NUM_HISTOGRAM_BUCKETS: an INT64 value that specifies
the number of buckets to use for a histogram with equal-width buckets. Only
applies to numerical, ARRAY<numerical>, and
ARRAY<STRUCT<INT64, numerical>> columns. The num_histogram_buckets value
must be in the range [1, 1,000]. The default value is 10.
NUM_QUANTILES_HISTOGRAM_BUCKETS: an INT64 value that
specifies the number of buckets to use for a
quantiles histogram. Only applies to
numerical, ARRAY<numerical>, and ARRAY<STRUCT<INT64, numerical>> columns.
The num_quantiles_histogram_buckets value must be in the range [1, 1,000].
The default value is 10.
NUM_VALUES_HISTOGRAM_BUCKETS: an INT64 value that
specifies the number of buckets to use for a quantiles histogram. Only applies
to ARRAY columns. The num_values_histogram_buckets value must be in the
range [1, 1,000]. The default value is 10.
NUM_RANK_HISTOGRAM_BUCKETS: an INT64 value that
specifies the number of buckets to use for a
rank histogram. Only
applies to categorical and ARRAY<categorical> columns. The
num_rank_histogram_buckets value must be in the range [1, 10,000]. The
default value is 50.
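As a minimal sketch of how these arguments combine, the following query passes a query statement instead of a table and sets three of the bucket options at once. It assumes the bigquery-public-data.ml_datasets.penguins public table and its species, island, and body_mass_g columns; the bucket counts are arbitrary illustrative values chosen within the ranges listed above.

```sql
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    -- Query statement input: only the selected columns are described.
    SELECT species, island, body_mass_g
    FROM `bigquery-public-data.ml_datasets.penguins`
    WHERE body_mass_g IS NOT NULL
  ),
  STRUCT(
    20 AS num_histogram_buckets,           -- equal-width buckets for body_mass_g
    4 AS num_quantiles_histogram_buckets,  -- quantile buckets for body_mass_g
    5 AS num_rank_histogram_buckets        -- rank buckets for species and island
  )
);
```

Options left out of the STRUCT, here num_values_histogram_buckets, keep their default values.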
Last updated 2025-08-25 UTC.

Syntax
------

```sql
ML.TFDV_DESCRIBE(
  { TABLE `PROJECT_ID.DATASET.TABLE_NAME` | (QUERY_STATEMENT) },
  STRUCT(
    [NUM_HISTOGRAM_BUCKETS AS num_histogram_buckets]
    [, NUM_QUANTILES_HISTOGRAM_BUCKETS AS num_quantiles_histogram_buckets]
    [, NUM_VALUES_HISTOGRAM_BUCKETS AS num_values_histogram_buckets]
    [, NUM_RANK_HISTOGRAM_BUCKETS AS num_rank_histogram_buckets])
)
```

Output
------

`ML.TFDV_DESCRIBE` returns a column named `dataset_feature_statistics_list`
that contains a TensorFlow
[`DatasetFeatureStatisticsList` protocol buffer](https://www.tensorflow.org/tfx/tf_metadata/api_docs/python/tfmd/proto/statistics_pb2/DatasetFeatureStatisticsList)
in JSON format.

Example
-------

The following example returns statistics for the `penguins` public dataset and
uses 20 buckets for rank histograms for string values:

```sql
SELECT * FROM ML.TFDV_DESCRIBE(
  TABLE `bigquery-public-data.ml_datasets.penguins`,
  STRUCT(20 AS num_rank_histogram_buckets)
);
```

Limitations
-----------

Input data for the `ML.TFDV_DESCRIBE` function can only contain columns of the
following data types:

- Numeric types
- `STRING`
- `BOOL`
- `BYTE`
- `DATE`
- `DATETIME`
- `TIME`
- `TIMESTAMP`
- `ARRAY<STRUCT<INT64, FLOAT64>>` (a sparse tensor)
- `STRUCT` columns that contain any of the following types:
  - Numeric types
  - `STRING`
  - `BOOL`
  - `BYTE`
  - `DATE`
  - `DATETIME`
  - `TIME`
  - `TIMESTAMP`
- `ARRAY` columns that contain any of the following types:
  - Numeric types
  - `STRING`
  - `BOOL`
  - `BYTE`
  - `DATE`
  - `DATETIME`
  - `TIME`
  - `TIMESTAMP`
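One possible way to work within the type limitation is to pass a query statement that drops or converts columns of other types before the statistics are computed. The sketch below is illustrative only: the table `myproject.mydataset.events`, its `GEOGRAPHY` column `location`, and its `JSON` column `payload` are hypothetical names that do not come from this document.

```sql
-- Hypothetical table and column names; adjust them to your own schema.
SELECT *
FROM ML.TFDV_DESCRIBE(
  (
    SELECT
      -- Drop the columns whose types are not in the supported list.
      * EXCEPT (location, payload),
      -- Re-add them as STRING values, which is a supported type.
      ST_ASTEXT(location) AS location_wkt,
      TO_JSON_STRING(payload) AS payload_json
    FROM `myproject.mydataset.events`
  ),
  STRUCT(10 AS num_histogram_buckets)
);
```

Converting to `STRING` means the converted columns are described as categorical values, so whether to convert or simply exclude a column depends on whether its string form is meaningful for your analysis.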