The ML.ENTITY_FEATURES_AT_TIME function
This document describes the ML.ENTITY_FEATURES_AT_TIME function, which lets
you use multiple point-in-time cutoffs for multiple entities when retrieving
features, because features can have time dependencies if they include
time-sensitive data. To avoid
data leakage, use
point-in-time features when training models and running inference.
Use this function to retrieve features from multiple entities for multiple
points in time. For example, you could retrieve features created at or before
three different points in time for entity 1, and features created at or before
yet another different point in time for entity 2. Use the
ML.FEATURES_AT_TIME function
to use the same point-in-time cutoff for all entities when retrieving features.
Syntax
ML.ENTITY_FEATURES_AT_TIME(
   { TABLE `PROJECT_ID.DATASET.FEATURE_TABLE_NAME` | (FEATURE_QUERY_STATEMENT) },
   { TABLE `PROJECT_ID.DATASET.ENTITY_TIME_TABLE_NAME` | (ENTITY_TIME_QUERY_STATEMENT) },
   [, NUM_ROWS => INT64][, IGNORE_FEATURE_NULLS => BOOL])
Arguments
ML.ENTITY_FEATURES_AT_TIME takes the following arguments:
- PROJECT_ID: the project that contains the resource.
- DATASET: the BigQuery dataset that contains the resource.
- FEATURE_TABLE_NAME: a- STRINGvalue that specifies the name of the BigQuery table that contains the feature data. The feature table must contain the following columns:- entity_id: a- STRINGcolumn that contains the ID of the entity related to the features.
- One or more feature columns.
- feature_timestamp: a- TIMESTAMPcolumn that identifies when the feature data was last updated.
 - The column names are case-insensitive. For example, you can use a column named - Entity_IDinstead of- entity_id.- The feature table must be in wide format, with one column for each feature. 
- FEATURE_QUERY_STATEMENT: a- STRINGvalue that specifies a GoogleSQL query that returns the feature data. This query must return the same columns as- feature_table. See GoogleSQL query syntax for the supported SQL syntax of the- FEATURE_QUERY_STATEMENTclause.
- ENTITY_TIME_TABLE_NAME: a- STRINGvalue that specifies the name of the BigQuery table that maps entity IDs to feature lookup times. The entity time table must contain the following columns:- entity_id: a- STRINGcolumn that contains the entity ID.
- time: a- TIMESTAMPcolumn that identifies a point in time to use as a cutoff time when selecting features for the entity represented by the entity ID.
 - The column names are case-insensitive. For example, you can use a column named - Entity_IDinstead of- entity_id.- The table identified by the - ENTITY_TIME_TABLE_NAMEvalue must be no larger than 100 MB.
- ENTITY_TIME_QUERY_STATEMENT: a- STRINGvalue that specifies a GoogleSQL query that returns the entity time data. This query must return the same columns as the- ENTITY_TIME_TABLE_NAMEargument. See GoogleSQL query syntax for the supported SQL syntax of the- ENTITY_TIME_QUERY_STATEMENTclause.
- NUM_ROWS: an- INT64value that specifies the number of rows to return for each row in- ENTITY_TIME_TABLE_NAME. Defaults to- 1.
- IGNORE_FEATURE_NULLS: a- BOOLvalue that indicates whether to replace a- NULLvalue in a feature column with the feature column value from the row for the same entity that immediately precedes it in time. For example, for the following feature table and entity time table:- Feature table - +-----------+------+------+--------------------------+ | entity_id | f1 | f2 | feature_timestamp | +-----------+------+------+--------------------------+ | '2' | 5.0 | 8.0 | '2022-06-10 09:00:00+00' | +-----------+------+------+--------------------------+ | '2' | 2.0 | 4.0 | '2022-06-10 12:00:00+00' | +-----------+------+------+--------------------------+ | '2' | 7.0 | NULL | '2022-06-11 10:00:00+00' | +-----------+------+------+--------------------------+ - Entity time table - +-----------+--------------------------+ | entity_id | time | +-----------+--------------------------+ | '2' | '2022-06-11 10:00:00+00' | +-----------+--------------------------+ - Running this query: - SELECT * FROM ML.ENTITY_FEATURES_AT_TIME( TABLE mydataset.feature_table, TABLE mydataset.entity_time_table, num_rows => 1, ignore_feature_nulls => TRUE); - Results in the following output, where the - f2value from the row for entity ID 2 that is timestamped- '2022-06-10 12:00:00+00'is substituted for the- NULLvalue in the row timestamped- '2022-06-11 10:00:00+00':- +-----------+------+------+--------------------------+ | entity_id | f1 | f2 | feature_timestamp | +-----------+------+------+--------------------------+ | '2' | 7.0 | 4.0 | '2022-06-11 10:00:00+00' | +-----------+------+------+--------------------------+ - If there is no available replacement value, for example, where there is no earlier row for that entity ID, a - NULLvalue is returned.- Defaults to - FALSE.
Output
ML.ENTITY_FEATURES_AT_TIME returns the input table rows that meet the
point-in-time cutoff criteria, with the feature_timestamp column showing the
timestamp from the time column of the entity time table.
Because you can specify multiple points in time from which to retrieve features
for the same entity, it is possible to return duplicate rows, depending on the
timestamps in the feature and entity time tables, and the num_rows value
you specify. For example, if the only row in the feature table for entity ID 1
is timestamped 2022-06-11 10:00:00+00, and you have two rows for entity ID 1
in the entity time table that both have later timestamps, the function output
has 2 rows with the same feature data for entity ID 1.
If either of the following conditions are true:
- No entity ids from the entity time table are found in the feature table.
- The rows in the feature table whose entity ids match those in the entity time table don't meet the point-in-time criteria.
Then the function doesn't return any output for that entity time table row.
Examples
Example 1
This example shows a how to retrain a model using only features that were
created or updated before the timestamps identified in
mydataset.entity_time_table:
CREATE OR REPLACE `mydataset.mymodel` OPTIONS (WARM_START = TRUE) AS SELECT * EXCEPT (feature_timestamp, entity_id) FROM ML.ENTITY_FEATURES_AT_TIME( TABLE `mydataset.feature_table`, TABLE `mydataset.entity_time_table`, num_rows => 1, ignore_feature_nulls => TRUE);
Example 2
This example shows a how to get predictions from a model based on features
that were created or updated before the timestamps identified in
mydataset.entity_time_table:
SELECT * FROM ML.PREDICT( MODEL `mydataset.mymodel`, ( SELECT * EXCEPT (feature_timestamp, entity_id) FROM ML.ENTITY_FEATURES_AT_TIME( TABLE `mydataset.feature_table`, TABLE `mydataset.entity_time_table`, num_rows => 1, ignore_feature_nulls => TRUE) ) );
Example 3
This is a contrived example that you can use to see the output of the function:
WITH feature_table AS ( SELECT * FROM UNNEST( ARRAY<STRUCT<entity_id STRING, f_1 FLOAT64, f_2 FLOAT64, feature_timestamp TIMESTAMP>>[ ('id1', 1.0, 1.0, TIMESTAMP '2022-06-10 12:00:00+00'), ('id2', 12.0, 24.0, TIMESTAMP '2022-06-11 12:00:00+00'), ('id1', 11.0, NULL, TIMESTAMP '2022-06-11 12:00:00+00'), ('id1', 6.0, 12.0, TIMESTAMP '2022-06-11 10:00:00+00'), ('id2', 2.0, 4.0, TIMESTAMP '2022-06-10 12:00:00+00'), ('id2', 7.0, NULL, TIMESTAMP '2022-06-11 10:00:00+00')]) ), entity_time_table AS ( SELECT * FROM UNNEST( ARRAY<STRUCT<entity_id STRING, time TIMESTAMP>>[ ('id1', TIMESTAMP '2022-06-12 12:00:00+00'), ('id2', TIMESTAMP '2022-06-11 10:00:00+00'), ('id1', TIMESTAMP '2022-06-10 13:00:00+00')]) ) SELECT * FROM ML.ENTITY_FEATURES_AT_TIME( TABLE feature_table, TABLE entity_time_table, num_rows => 1, ignore_feature_nulls => TRUE);