Stay organized with collections
Save and categorize content based on your preferences.
The ML.ENTITY_FEATURES_AT_TIME function
This document describes the ML.ENTITY_FEATURES_AT_TIME function, which lets
you use multiple point-in-time cutoffs for multiple entities when retrieving
features, because features can have time dependencies if they include
time-sensitive data. To avoid
data leakage, use
point-in-time features when training models and running inference.
Use this function to retrieve features from multiple entities for multiple
points in time. For example, you could retrieve features created at or before
three different points in time for entity 1, and features created at or before
yet another different point in time for entity 2. Use the
ML.FEATURES_AT_TIME function
to use the same point-in-time cutoff for all entities when retrieving features.
ML.ENTITY_FEATURES_AT_TIME takes the following arguments:
PROJECT_ID: the project that contains the
resource.
DATASET: the BigQuery dataset that
contains the resource.
FEATURE_TABLE_NAME: a STRING value that specifies
the name of the BigQuery table that contains the feature data.
The feature table must contain the following columns:
entity_id: a STRING column that contains the ID of the entity
related to the features.
One or more feature columns.
feature_timestamp: a TIMESTAMP column that identifies when the
feature data was last updated.
The column names are case-insensitive. For example, you can use a column
named Entity_ID instead of entity_id.
The feature table must be in
wide format,
with one column for each feature.
FEATURE_QUERY_STATEMENT: a STRING value that
specifies a GoogleSQL query that returns the feature data. This
query must return the same columns as feature_table. See
GoogleSQL query syntax
for the supported SQL syntax of the FEATURE_QUERY_STATEMENT clause.
ENTITY_TIME_TABLE_NAME: a STRING value that
specifies the name of the BigQuery table that maps entity IDs
to feature lookup times. The entity time table must contain the
following columns:
entity_id: a STRING column that contains the entity ID.
time: a TIMESTAMP column that identifies a point in time to use as a
cutoff time when selecting features for the entity represented by the
entity ID.
The column names are case-insensitive. For example, you can use a column
named Entity_ID instead of entity_id.
The table identified by the ENTITY_TIME_TABLE_NAME value must be no
larger than 100 MB.
ENTITY_TIME_QUERY_STATEMENT: a STRING value that
specifies a GoogleSQL query that returns the entity time data. This
query must return the same columns as the ENTITY_TIME_TABLE_NAME argument.
See
GoogleSQL query syntax
for the supported SQL syntax of the ENTITY_TIME_QUERY_STATEMENT clause.
NUM_ROWS: an INT64 value that specifies the number
of rows to return for each row in ENTITY_TIME_TABLE_NAME. Defaults to 1.
IGNORE_FEATURE_NULLS: a BOOL value that indicates
whether to replace a NULL value in a feature column with the feature column
value from the row for the same entity that immediately precedes it in time.
For example, for the following feature table and entity time table:
Results in the following output, where the f2 value from the
row for entity ID 2 that is timestamped '2022-06-10 12:00:00+00' is
substituted for the NULL value in the row timestamped
'2022-06-11 10:00:00+00':
If there is no available replacement value, for example, where there is no
earlier row for that entity ID, a NULL value is returned.
Defaults to FALSE.
Output
ML.ENTITY_FEATURES_AT_TIME returns the input table rows that meet the
point-in-time cutoff criteria, with the feature_timestamp column showing the
timestamp from the time column of the entity time table.
Because you can specify multiple points in time from which to retrieve features
for the same entity, it is possible to return duplicate rows, depending on the
timestamps in the feature and entity time tables, and the num_rows value
you specify. For example, if the only row in the feature table for entity ID 1
is timestamped 2022-06-11 10:00:00+00, and you have two rows for entity ID 1
in the entity time table that both have later timestamps, the function output
has 2 rows with the same feature data for entity ID 1.
If either of the following conditions are true:
No entity ids from the entity time table are found in the feature table.
The rows in the feature table whose entity ids match those in
the entity time table don't meet the point-in-time criteria.
Then the function doesn't return any output for that entity
time table row.
Examples
Example 1
This example shows a how to retrain a model using only features that were
created or updated before the timestamps identified in
mydataset.entity_time_table:
This example shows a how to get predictions from a model based on features
that were created or updated before the timestamps identified in
mydataset.entity_time_table:
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-25 UTC."],[[["\u003cp\u003eThe \u003ccode\u003eML.ENTITY_FEATURES_AT_TIME\u003c/code\u003e function retrieves features for multiple entities at different points in time, which is crucial for preventing data leakage in time-sensitive machine learning tasks.\u003c/p\u003e\n"],["\u003cp\u003eThis function accepts a feature table (or query) and an entity time table (or query) to gather features updated at or before specific timestamps for each entity.\u003c/p\u003e\n"],["\u003cp\u003eThe function requires \u003ccode\u003eentity_id\u003c/code\u003e, feature columns, and \u003ccode\u003efeature_timestamp\u003c/code\u003e in the feature table, and \u003ccode\u003eentity_id\u003c/code\u003e and \u003ccode\u003etime\u003c/code\u003e columns in the entity time table, all of which are case-insensitive.\u003c/p\u003e\n"],["\u003cp\u003eYou can use \u003ccode\u003enum_rows\u003c/code\u003e to specify how many rows to return for each entry in the \u003ccode\u003eentity_time_table\u003c/code\u003e, and \u003ccode\u003eignore_feature_nulls\u003c/code\u003e to replace null feature values with previous non-null values for the same entity.\u003c/p\u003e\n"],["\u003cp\u003eThe output of \u003ccode\u003eML.ENTITY_FEATURES_AT_TIME\u003c/code\u003e may contain duplicate rows if multiple timestamps in the entity time table match the same feature data, and can be used to retrain models or obtain predictions based on point-in-time correct features.\u003c/p\u003e\n"]]],[],null,["# The ML.ENTITY_FEATURES_AT_TIME function\n=======================================\n\nThis document describes the `ML.ENTITY_FEATURES_AT_TIME` function, which lets\nyou use multiple point-in-time cutoffs for multiple entities when retrieving\nfeatures, because features can have time dependencies if they include\ntime-sensitive data. To avoid\n[data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)), use\npoint-in-time features when training models and running inference.\n\nUse this function to retrieve features from multiple entities for multiple\npoints in time. For example, you could retrieve features created at or before\nthree different points in time for entity 1, and features created at or before\nyet another different point in time for entity 2. Use the\n[`ML.FEATURES_AT_TIME` function](/bigquery/docs/reference/standard-sql/bigqueryml-syntax-feature-time)\nto use the same point-in-time cutoff for all entities when retrieving features.\n\nSyntax\n------\n\n```sql\nML.ENTITY_FEATURES_AT_TIME(\n { TABLE `PROJECT_ID.DATASET.FEATURE_TABLE_NAME` | (FEATURE_QUERY_STATEMENT) },\n { TABLE `PROJECT_ID.DATASET.ENTITY_TIME_TABLE_NAME` | (ENTITY_TIME_QUERY_STATEMENT) },\n [, NUM_ROWS =\u003e INT64][, IGNORE_FEATURE_NULLS =\u003e BOOL])\n```\n\n### Arguments\n\n`ML.ENTITY_FEATURES_AT_TIME` takes the following arguments:\n\n- \u003cvar translate=\"no\"\u003ePROJECT_ID\u003c/var\u003e: the project that contains the resource.\n- \u003cvar translate=\"no\"\u003eDATASET\u003c/var\u003e: the BigQuery dataset that contains the resource.\n- \u003cvar translate=\"no\"\u003eFEATURE_TABLE_NAME\u003c/var\u003e: a `STRING` value that specifies\n the name of the BigQuery table that contains the feature data.\n The feature table must contain the following columns:\n\n - `entity_id`: a `STRING` column that contains the ID of the entity related to the features.\n - One or more feature columns.\n - `feature_timestamp`: a `TIMESTAMP` column that identifies when the feature data was last updated.\n\n The column names are case-insensitive. For example, you can use a column\n named `Entity_ID` instead of `entity_id`.\n\n The feature table must be in\n [wide](https://en.wikipedia.org/wiki/Wide_and_narrow_data#Wide) format,\n with one column for each feature.\n- \u003cvar translate=\"no\"\u003eFEATURE_QUERY_STATEMENT\u003c/var\u003e: a `STRING` value that\n specifies a GoogleSQL query that returns the feature data. This\n query must return the same columns as `feature_table`. See\n [GoogleSQL query syntax](/bigquery/docs/reference/standard-sql/query-syntax#sql_syntax)\n for the supported SQL syntax of the `FEATURE_QUERY_STATEMENT` clause.\n\n- \u003cvar translate=\"no\"\u003eENTITY_TIME_TABLE_NAME\u003c/var\u003e: a `STRING` value that\n specifies the name of the BigQuery table that maps entity IDs\n to feature lookup times. The entity time table must contain the\n following columns:\n\n - `entity_id`: a `STRING` column that contains the entity ID.\n - `time`: a `TIMESTAMP` column that identifies a point in time to use as a cutoff time when selecting features for the entity represented by the entity ID.\n\n The column names are case-insensitive. For example, you can use a column\n named `Entity_ID` instead of `entity_id`.\n\n The table identified by the `ENTITY_TIME_TABLE_NAME` value must be no\n larger than 100 MB.\n- \u003cvar translate=\"no\"\u003eENTITY_TIME_QUERY_STATEMENT\u003c/var\u003e: a `STRING` value that\n specifies a GoogleSQL query that returns the entity time data. This\n query must return the same columns as the `ENTITY_TIME_TABLE_NAME` argument.\n See\n [GoogleSQL query syntax](/bigquery/docs/reference/standard-sql/query-syntax#sql_syntax)\n for the supported SQL syntax of the `ENTITY_TIME_QUERY_STATEMENT` clause.\n\n- \u003cvar translate=\"no\"\u003eNUM_ROWS\u003c/var\u003e: an `INT64` value that specifies the number\n of rows to return for each row in `ENTITY_TIME_TABLE_NAME`. Defaults to `1`.\n\n- \u003cvar translate=\"no\"\u003eIGNORE_FEATURE_NULLS\u003c/var\u003e: a `BOOL` value that indicates\n whether to replace a `NULL` value in a feature column with the feature column\n value from the row for the same entity that immediately precedes it in time.\n For example, for the following feature table and entity time table:\n\n **Feature table** \n\n ```\n +-----------+------+------+--------------------------+\n | entity_id | f1 | f2 | feature_timestamp |\n +-----------+------+------+--------------------------+\n | '2' | 5.0 | 8.0 | '2022-06-10 09:00:00+00' |\n +-----------+------+------+--------------------------+\n | '2' | 2.0 | 4.0 | '2022-06-10 12:00:00+00' |\n +-----------+------+------+--------------------------+\n | '2' | 7.0 | NULL | '2022-06-11 10:00:00+00' |\n +-----------+------+------+--------------------------+\n ```\n\n \u003cbr /\u003e\n\n **Entity time table** \n\n ```\n +-----------+--------------------------+\n | entity_id | time |\n +-----------+--------------------------+\n | '2' | '2022-06-11 10:00:00+00' |\n +-----------+--------------------------+\n ```\n\n \u003cbr /\u003e\n\n Running this query: \n\n ```sql\n SELECT *\n FROM\n ML.ENTITY_FEATURES_AT_TIME(\n TABLE mydataset.feature_table,\n TABLE mydataset.entity_time_table,\n num_rows =\u003e 1,\n ignore_feature_nulls =\u003e TRUE);\n ```\n\n Results in the following output, where the `f2` value from the\n row for entity ID 2 that is timestamped `'2022-06-10 12:00:00+00'` is\n substituted for the `NULL` value in the row timestamped\n `'2022-06-11 10:00:00+00'`: \n\n ```\n +-----------+------+------+--------------------------+\n | entity_id | f1 | f2 | feature_timestamp |\n +-----------+------+------+--------------------------+\n | '2' | 7.0 | 4.0 | '2022-06-11 10:00:00+00' |\n +-----------+------+------+--------------------------+\n ```\n\n If there is no available replacement value, for example, where there is no\n earlier row for that entity ID, a `NULL` value is returned.\n\n Defaults to `FALSE`.\n\nOutput\n------\n\n`ML.ENTITY_FEATURES_AT_TIME` returns the input table rows that meet the\npoint-in-time cutoff criteria, with the `feature_timestamp` column showing the\ntimestamp from the `time` column of the entity time table.\n\nBecause you can specify multiple points in time from which to retrieve features\nfor the same entity, it is possible to return duplicate rows, depending on the\ntimestamps in the feature and entity time tables, and the `num_rows` value\nyou specify. For example, if the only row in the feature table for entity ID 1\nis timestamped `2022-06-11 10:00:00+00`, and you have two rows for entity ID 1\nin the entity time table that both have later timestamps, the function output\nhas 2 rows with the same feature data for entity ID 1.\n\nIf either of the following conditions are true:\n\n- No entity ids from the entity time table are found in the feature table.\n- The rows in the feature table whose entity ids match those in the entity time table don't meet the point-in-time criteria.\n\nThen the function doesn't return any output for that entity\ntime table row.\n\nExamples\n--------\n\n**Example 1**\n\nThis example shows a how to retrain a model using only features that were\ncreated or updated before the timestamps identified in\n`mydataset.entity_time_table`: \n\n```sql\nCREATE OR REPLACE\n `mydataset.mymodel` OPTIONS (WARM_START = TRUE)\nAS\nSELECT * EXCEPT (feature_timestamp, entity_id)\nFROM\n ML.ENTITY_FEATURES_AT_TIME(\n TABLE `mydataset.feature_table`,\n TABLE `mydataset.entity_time_table`,\n num_rows =\u003e 1,\n ignore_feature_nulls =\u003e TRUE);\n```\n\n**Example 2**\n\nThis example shows a how to get predictions from a model based on features\nthat were created or updated before the timestamps identified in\n`mydataset.entity_time_table`: \n\n```sql\nSELECT\n *\nFROM\n ML.PREDICT(\n MODEL `mydataset.mymodel`,\n (\n SELECT * EXCEPT (feature_timestamp, entity_id)\n FROM\n ML.ENTITY_FEATURES_AT_TIME(\n TABLE `mydataset.feature_table`,\n TABLE `mydataset.entity_time_table`,\n num_rows =\u003e 1,\n ignore_feature_nulls =\u003e TRUE)\n )\n );\n```\n\n**Example 3**\n\nThis is a contrived example that you can use to see the output of the\nfunction: \n\n```sql\nWITH\n feature_table AS (\n SELECT * FROM UNNEST(\n ARRAY\u003cSTRUCT\u003centity_id STRING, f_1 FLOAT64, f_2 FLOAT64, feature_timestamp TIMESTAMP\u003e\u003e[\n ('id1', 1.0, 1.0, TIMESTAMP '2022-06-10 12:00:00+00'),\n ('id2', 12.0, 24.0, TIMESTAMP '2022-06-11 12:00:00+00'),\n ('id1', 11.0, NULL, TIMESTAMP '2022-06-11 12:00:00+00'),\n ('id1', 6.0, 12.0, TIMESTAMP '2022-06-11 10:00:00+00'),\n ('id2', 2.0, 4.0, TIMESTAMP '2022-06-10 12:00:00+00'),\n ('id2', 7.0, NULL, TIMESTAMP '2022-06-11 10:00:00+00')])\n ),\n entity_time_table AS (\n SELECT * FROM UNNEST(\n ARRAY\u003cSTRUCT\u003centity_id STRING, time TIMESTAMP\u003e\u003e[\n ('id1', TIMESTAMP '2022-06-12 12:00:00+00'),\n ('id2', TIMESTAMP '2022-06-11 10:00:00+00'),\n ('id1', TIMESTAMP '2022-06-10 13:00:00+00')])\n )\nSELECT *\nFROM\n ML.ENTITY_FEATURES_AT_TIME(\n TABLE feature_table, TABLE entity_time_table, num_rows =\u003e 1, ignore_feature_nulls =\u003e TRUE);\n```"]]