This tutorial teaches you how to use a boosted tree classifier model to predict the income range of individuals based on their demographic data. The model predicts whether a value falls into one of two categories, in this case whether an individual's annual income falls above or below $50,000.
This tutorial uses the
bigquery-public-data.ml_datasets.census_adult_income
dataset. This dataset contains the demographic and income information of US
residents from 2000 and 2010.
Objectives
This tutorial guides you through completing the following tasks:
- Creating a boosted tree model to predict census respondents' income bracket
by using the
CREATE MODEL
statement. - Evaluating the model by using the
ML.EVALUATE
function. - Getting predictions from the model by using the
ML.PREDICT
function.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
For more information about BigQuery ML costs, see BigQuery ML pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects.
To activate BigQuery in a pre-existing project, go to
Enable the BigQuery API.
Required Permissions
- To create the dataset, you need the
bigquery.datasets.create
IAM permission. To create the connection resource, you need the following permissions:
bigquery.connections.create
bigquery.connections.get
To create the model, you need the following permissions:
bigquery.jobs.create
bigquery.models.create
bigquery.models.getData
bigquery.models.updateData
bigquery.connections.delegate
To run inference, you need the following permissions:
bigquery.models.getData
bigquery.jobs.create
For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.
Create a dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click
View actions > Create dataset.On the Create dataset page, do the following:
For Dataset ID, enter
bqml_tutorial
.For Location type, select Multi-region, and then select US (multiple regions in United States).
The public datasets are stored in the
US
multi-region. For simplicity, store your dataset in the same location.Leave the remaining default settings as they are, and click Create dataset.
Prepare the sample data
The model you create in this tutorial predicts the income bracket for census respondents, based on the following features:
- Age
- Type of work performed
- Marital status
- Level of education
- Occupation
- Hours worked per week
The education
column isn't included in the training data, because
the education
and education_num
columns both express the respondent's level
of education in different formats.
You separate the data into training, evaluation, and prediction sets by creating
a new dataframe
column that is derived from the functional_weight
column.
Eighty percent of the data is used for training the model, and the remaining
twenty percent of the data is used for evaluation and prediction.
SQL
To prepare your sample data, create a view to
contain the training data. This view is used by the CREATE MODEL
statement
later in this tutorial.
Run the query that prepares the sample data:
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following query:
CREATE OR REPLACE VIEW `bqml_tutorial.input_data` AS SELECT age, workclass, marital_status, education_num, occupation, hours_per_week, income_bracket, CASE WHEN MOD(functional_weight, 10) < 8 THEN 'training' WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation' WHEN MOD(functional_weight, 10) = 9 THEN 'prediction' END AS dataframe FROM `bigquery-public-data.ml_datasets.census_adult_income`;
In the Explorer pane, expand the
bqml_tutorial
dataset and locate theinput_data
view.Click the view name to open the information pane. The view schema appears in the Schema tab.
BigQuery DataFrames
Create a DataFrame called input_data
. You use input_data
later in this tutorial to use to train the model, evaluate it, and make predictions.
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Create the boosted tree model
Create a boosted tree model to predict census respondents' income bracket, and train it on the census data. The query takes about 30 minutes to complete.
SQL
Follow these steps to create the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
CREATE MODEL `bqml_tutorial.tree_model` OPTIONS(MODEL_TYPE='BOOSTED_TREE_CLASSIFIER', BOOSTER_TYPE = 'GBTREE', NUM_PARALLEL_TREE = 1, MAX_ITERATIONS = 50, TREE_METHOD = 'HIST', EARLY_STOP = FALSE, SUBSAMPLE = 0.85, INPUT_LABEL_COLS = ['income_bracket']) AS SELECT * EXCEPT(dataframe) FROM `bqml_tutorial.input_data` WHERE dataframe = 'training';
After the query completes, the
tree_model
model appears in the Explorer pane. Because the query uses aCREATE MODEL
statement to create a model, you don't see query results.
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Evaluate the model
SQL
Follow these steps to evaluate the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.EVALUATE (MODEL `bqml_tutorial.tree_model`, ( SELECT * FROM `bqml_tutorial.input_data` WHERE dataframe = 'evaluation' ) );
The results should look similar to the following:
+---------------------+---------------------+---------------------+-------------------+---------------------+---------------------+ | precision | recall | accuracy | f1_score | log_loss | roc_auc | +---------------------+---------------------+---------------------+-------------------+-------------------------------------------+ | 0.67192429022082023 | 0.57880434782608692 | 0.83942963422194672 | 0.621897810218978 | 0.34405456040833338 | 0.88733566433566435 | +---------------------+---------------------+ --------------------+-------------------+---------------------+---------------------+
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
The evaluation metrics indicate good model performance, in particular,
the fact that the
roc_auc
score is greater than 0.8
.
For more information about the evaluation metrics, see Classification models.
Use the model to predict classifications
SQL
Follow these steps to forecast data with the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.PREDICT (MODEL `bqml_tutorial.tree_model`, ( SELECT * FROM `bqml_tutorial.input_data` WHERE dataframe = 'prediction' ) );
The first few columns of the results should look similar to the following:
+---------------------------+--------------------------------------+-------------------------------------+ | predicted_income_bracket | predicted_income_bracket_probs.label | predicted_income_bracket_probs.prob | +---------------------------+--------------------------------------+-------------------------------------+ | <=50K | >50K | 0.05183430016040802 | +---------------------------+--------------------------------------+-------------------------------------+ | | <50K | 0.94816571474075317 | +---------------------------+--------------------------------------+-------------------------------------+ | <=50K | >50K | 0.00365859130397439 | +---------------------------+--------------------------------------+-------------------------------------+ | | <50K | 0.99634140729904175 | +---------------------------+--------------------------------------+-------------------------------------+ | <=50K | >50K | 0.037775970995426178 | +---------------------------+--------------------------------------+-------------------------------------+ | | <50K | 0.96222406625747681 | +---------------------------+--------------------------------------+-------------------------------------+
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
The predicted_income_bracket
contains the predicted value from the model.
The predicted_income_bracket_probs.label
shows the two labels that the
model had to choose between, and the predicted_income_bracket_probs.prob
column shows the probability of the given label being the
correct one.
For more information about the output columns, see Classification models.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (
bqml_tutorial
) and then click Delete.
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Learn how to create a logistic regression classification model.
- For an overview of BigQuery ML, see Introduction to AI and ML in BigQuery.