Feature engineering

This document describes how Feature Transform Engine performs feature engineering. Feature Transform Engine performs feature selection and feature transformations. If feature selection is enabled, Feature Transform Engine creates a ranked set of important features. If feature transformations are enabled, Feature Transform Engine processes the features to ensure that the input for model training and model serving is consistent. Feature Transform Engine can be used on its own or together with any of the tabular training workflows. It supports both TensorFlow and non-TensorFlow frameworks.

Inputs

Provide the following inputs to Feature Transform Engine:

Raw data (BigQuery or CSV dataset).
Data split configuration.
Feature selection configuration.
Feature transformation configuration.

Outputs

Feature Transform Engine generates the following outputs:

dataset_stats: Statistics that describe the raw dataset. For example, dataset_stats gives the number of rows in the dataset.
feature_importance: The importance score of the features. This output is generated if feature selection is enabled.
materialized_data: The transformed version of a data split group that contains the training split, the evaluation split, and the test split.
training_schema: Training data schema in OpenAPI specification, which describes the data types of the training data.
instance_schema: Instance schema in OpenAPI specification, which describes the data types of the inference data.
transform_output: Metadata of the transformation. If you use TensorFlow for transformation, the metadata includes the TensorFlow graph.

Processing steps

Feature Transform Engine performs the following steps:

Generate dataset splits for training, evaluation, and testing.
Generate input dataset statistics (dataset_stats) that describe the raw dataset.
Perform feature selection.
Process the transform configuration using the dataset statistics, resolving automatic transformation parameters into manual transformation parameters.
Transform raw features into engineered features. The transformation that is applied depends on the feature type.
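The step that resolves automatic transformation parameters into manual ones can be pictured with a small sketch. Everything in it is hypothetical and only illustrates the idea: the configuration keys ("feature", "transformation"), the "auto", "z_scale", and "vocabulary" names, and the shape of the statistics dictionary are assumptions, not the format that Feature Transform Engine actually uses.

```python
from statistics import mean, pstdev


def compute_stats(rows):
    """Collect per-column statistics from a list of dict rows.

    This is a stand-in for dataset_stats; the real output format is not shown here.
    """
    stats = {"row_count": len(rows), "columns": {}}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            stats["columns"][name] = {
                "type": "numeric",
                "mean": mean(values),
                "stddev": pstdev(values),
            }
        else:
            stats["columns"][name] = {
                "type": "categorical",
                "vocabulary": sorted(set(values)),
            }
    return stats


def resolve_transforms(config, stats):
    """Replace every hypothetical 'auto' entry with explicit parameters."""
    resolved = []
    for entry in config:
        if entry["transformation"] != "auto":
            resolved.append(dict(entry))  # already manual: keep as-is
            continue
        column = stats["columns"][entry["feature"]]
        if column["type"] == "numeric":
            resolved.append({
                "feature": entry["feature"],
                "transformation": "z_scale",
                "mean": column["mean"],
                "stddev": column["stddev"],
            })
        else:
            resolved.append({
                "feature": entry["feature"],
                "transformation": "vocabulary",
                "vocabulary": column["vocabulary"],
            })
    return resolved


# Toy data, invented for illustration.
rows = [
    {"age": 20, "city": "Oslo"},
    {"age": 30, "city": "Lima"},
    {"age": 40, "city": "Oslo"},
]
stats = compute_stats(rows)
resolved = resolve_transforms(
    [
        {"feature": "age", "transformation": "auto"},
        {"feature": "city", "transformation": "auto"},
    ],
    stats,
)
```

Because the resolved configuration carries concrete parameters (a mean, a standard deviation, a vocabulary), the same values can be reused at serving time, which is what keeps training and serving inputs consistent.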

Feature selection

The main purpose of feature selection is to reduce the number of features used in the model. The reduced feature set captures most of the label's information in a more compact manner. Feature selection allows you to reduce the cost of training and serving models without significantly impacting model quality.

If you enable feature selection, Feature Transform Engine assigns an importance score to each feature. You can choose to output the importance scores of the full set of features or of a reduced subset of the most important features.

Vertex AI offers the following feature selection algorithms:

Adjusted Mutual Information (AMI)
Conditional Mutual Information Maximization (CMIM)
Joint Mutual Information Maximization (JMIM)
Maximum Relevance Minimum Redundancy (MRMR)

No single feature selection algorithm works best on all datasets and for all purposes. If possible, run all the algorithms and combine the results.
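This document does not prescribe a way to combine the results of several algorithms. One simple option, shown here only as a sketch, is to average each feature's rank position across the algorithms. The feature names and rankings are invented for illustration, and the sketch assumes that every algorithm ranks the same set of features.

```python
from collections import defaultdict


def combine_rankings(rankings):
    """Order features by their mean rank position across algorithms.

    rankings maps an algorithm name to a list of features, best first.
    A lower mean position is better; ties are broken by feature name.
    """
    totals = defaultdict(float)
    for ranked in rankings.values():
        for position, feature in enumerate(ranked):
            totals[feature] += position
    count = len(rankings)
    return sorted(totals, key=lambda feature: (totals[feature] / count, feature))


# Hypothetical rankings, best feature first.
rankings = {
    "AMI": ["income", "age", "zip", "tenure"],
    "CMIM": ["income", "tenure", "age", "zip"],
    "JMIM": ["age", "income", "tenure", "zip"],
    "MRMR": ["income", "tenure", "age", "zip"],
}
combined = combine_rankings(rankings)
print(combined)
```

Rank averaging ignores the magnitude of the importance scores, which is convenient when the scores of different algorithms are not on a comparable scale.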

Adjusted Mutual Information (AMI)

AMI is an adjustment of the Mutual Information (MI) score to account for chance. It accounts for the fact that the MI is generally higher for two clusterings with a larger number of clusters, regardless of whether there is actually more information shared.

AMI is good at detecting the relevance of features to the label, but it is insensitive to feature redundancy. Consider AMI if there are many features (for example, more than 2000) and not much feature redundancy. It is faster than the other algorithms described here, but it might select redundant features.
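To make the chance adjustment concrete, the following sketch scores discrete features against a label with the textbook AMI formula: (MI - E[MI]) / (mean(H(X), H(Y)) - E[MI]), where E[MI] is the expected MI of random labelings with the same value counts. The arithmetic-mean normalization, the handling of constant columns, and the toy data are assumptions; the variant that Feature Transform Engine uses internally is not documented here.

```python
from collections import Counter
from math import exp, lgamma, log


def entropy(values):
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def expected_mutual_information(xs, ys):
    """Expected MI when the value counts are fixed but the pairing is random."""
    n = len(xs)
    emi = 0.0
    for a in Counter(xs).values():
        for b in Counter(ys).values():
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                # Hypergeometric probability of an overlap of size nij, in log space.
                log_prob = (
                    lgamma(a + 1) + lgamma(b + 1)
                    + lgamma(n - a + 1) + lgamma(n - b + 1)
                    - lgamma(n + 1) - lgamma(nij + 1)
                    - lgamma(a - nij + 1) - lgamma(b - nij + 1)
                    - lgamma(n - a - b + nij + 1)
                )
                emi += (nij / n) * log(n * nij / (a * b)) * exp(log_prob)
    return emi


def adjusted_mutual_information(xs, ys):
    mi = mutual_information(xs, ys)
    emi = expected_mutual_information(xs, ys)
    normalizer = (entropy(xs) + entropy(ys)) / 2 - emi
    # Assumption of this sketch: a degenerate (constant) column scores 0.
    return (mi - emi) / normalizer if abs(normalizer) > 1e-12 else 0.0


# Toy data, invented for illustration.
label = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "copy": [0, 0, 0, 0, 1, 1, 1, 1],     # identical to the label
    "partial": [0, 0, 0, 1, 1, 1, 1, 1],  # agrees with the label on 7 of 8 rows
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],    # independent of the label
}
scores = {name: adjusted_mutual_information(col, label) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Each feature is scored independently of the others, which is why this approach is fast but cannot notice that two features carry the same information.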

Conditional Mutual Information Maximization (CMIM)

CMIM is a greedy algorithm that chooses features iteratively based on the conditional mutual information of candidate features with respect to the selected features. In each iteration, it selects the feature that maximizes the minimum mutual information with the label that the selected features have not yet captured.

CMIM is robust in dealing with feature redundancy, and it works well in typical cases.
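The greedy loop described above can be sketched for discrete features as follows. The sketch uses the common formulation in which a candidate's score is the minimum, over the features already selected, of its conditional mutual information with the label given that selected feature. The helper names and the toy data are assumptions for illustration, not the Feature Transform Engine implementation.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def conditional_mutual_information(xs, ys, zs):
    """I(X; Y | Z), computed as the p(z)-weighted MI inside each group Z = z."""
    n = len(zs)
    total = 0.0
    for z, count in Counter(zs).items():
        rows = [i for i in range(n) if zs[i] == z]
        total += (count / n) * mutual_information(
            [xs[i] for i in rows], [ys[i] for i in rows]
        )
    return total


def cmim(features, label, k):
    """Greedily select up to k feature names."""
    # Before anything is selected, a candidate's score is its plain MI with the label.
    score = {name: mutual_information(col, label) for name, col in features.items()}
    selected = []
    while score and len(selected) < k:
        best = max(score, key=score.get)
        selected.append(best)
        del score[best]
        # Keep, for each candidate, the minimum of I(candidate; label | s)
        # over all selected features s.
        for name in score:
            cmi = conditional_mutual_information(features[name], label, features[best])
            score[name] = min(score[name], cmi)
    return selected


# Toy data, invented for illustration.
label = [0, 0, 0, 1, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],       # strongly related to the label
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # redundant duplicate of "a"
    "b": [0, 1, 0, 1, 0, 1, 0, 1],       # weaker, but complementary to "a"
}
print(cmim(features, label, 2))
```

Once "a" is selected, the score of "a_copy" drops to zero because it adds nothing about the label beyond "a", so the weaker but complementary "b" is chosen next.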

Joint Mutual Information Maximization (JMIM)

JMIM is a greedy algorithm that is similar to CMIM. JMIM selects the feature that maximizes the joint mutual information of the candidate feature and the previously selected features with the label, whereas CMIM takes redundancy more into account.

JMIM is a high-quality feature selection algorithm.
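As a rough illustration of the difference from CMIM, the following sketch scores each candidate by the minimum, over the selected features, of the mutual information between the pair (candidate, selected feature) and the label. This max-min form is the usual published formulation of JMIM; treating it as what Feature Transform Engine runs is an assumption, and the toy data is invented.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def jmim(features, label, k):
    """Greedily select up to k feature names."""
    if k <= 0 or not features:
        return []
    # The first pick is the feature with the highest MI with the label.
    relevance = {name: mutual_information(col, label) for name, col in features.items()}
    last = max(relevance, key=relevance.get)
    selected = [last]
    score = {name: float("inf") for name in features if name != last}
    while score and len(selected) < k:
        for name in score:
            # Treat (candidate, last selected feature) as one joint variable.
            pair = list(zip(features[name], features[last]))
            score[name] = min(score[name], mutual_information(pair, label))
        last = max(score, key=score.get)
        selected.append(last)
        del score[last]
    return selected


# Toy data, invented for illustration.
label = [0, 0, 0, 1, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],       # strongly related to the label
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # redundant duplicate of "a"
    "b": [0, 1, 0, 1, 0, 1, 0, 1],       # weaker, but complementary to "a"
}
print(jmim(features, label, 2))
```

Pairing "a_copy" with "a" tells the model nothing new, so its joint score equals the score of "a" alone, while the pair ("b", "a") carries more information about the label and wins the second slot.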

Maximum Relevance Minimum Redundancy (MRMR)

MRMR is a greedy algorithm that works iteratively. It is similar to CMIM. Each iteration chooses the feature that maximizes relevance with respect to the label while minimizing pairwise redundancy with respect to the features selected in previous iterations.

MRMR is a high-quality feature selection algorithm.
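The trade-off between relevance and redundancy can be sketched with the difference form of the MRMR criterion: a candidate's mutual information with the label minus its mean mutual information with the features already selected. Other forms exist (for example, a quotient instead of a difference), and which one Feature Transform Engine uses is not stated here, so treat the scoring rule and the toy data as assumptions.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr(features, label, k):
    """Greedily select up to k feature names."""
    relevance = {name: mutual_information(col, label) for name, col in features.items()}
    # Running sum of MI between each candidate and every selected feature.
    redundancy_sum = {name: 0.0 for name in features}
    candidates = list(features)
    selected = []
    while candidates and len(selected) < k:
        divisor = max(len(selected), 1)  # avoids dividing by zero on the first pick
        best = max(
            candidates,
            key=lambda name: relevance[name] - redundancy_sum[name] / divisor,
        )
        selected.append(best)
        candidates.remove(best)
        for name in candidates:
            redundancy_sum[name] += mutual_information(features[name], features[best])
    return selected


# Toy data, invented for illustration.
label = [0, 0, 0, 1, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],       # strongly related to the label
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # redundant duplicate of "a"
    "b": [0, 1, 0, 1, 0, 1, 0, 1],       # weaker, but independent of "a"
}
print(mrmr(features, label, 2))
```

Unlike CMIM and JMIM, the redundancy term here never looks at the label: it only measures how much a candidate overlaps with the selected features, which makes each iteration cheaper to compute.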

What's next

After performing feature engineering, you can train a model for classification or regression:

Train a model with End-to-End AutoML.
Train a model with TabNet.
Train a model with Wide & Deep.