Get batch predictions from a custom trained model

This page shows you how to get batch predictions from your custom trained models using the Google Cloud console or the Vertex AI API.

To make a batch prediction request, you specify an input source and an output location, either Cloud Storage or BigQuery, where Vertex AI stores prediction results.

To minimize processing time, your input and output locations must be in the same region or multi-region. For example, if your input is in us-central1, then your output can be in us-central1 or US, but not europe-west4. To learn more, see Cloud Storage locations and BigQuery locations.

Your input and output must also be in the same region or multi-region as your model.

Input data requirements

The input for batch requests specifies the items to send to your model for prediction. We support the following input formats:

JSON Lines

Use a JSON Lines file to specify a list of input instances to make predictions about. Store the JSON Lines file in a Cloud Storage bucket.
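
As a minimal sketch of producing such a file, the following Python function writes one JSON value per line. The function name write_jsonl and the local file path are illustrative assumptions, not part of Vertex AI; uploading the resulting file to a Cloud Storage bucket is a separate step that is not shown.

```python
import json
from pathlib import Path
from typing import Any, Iterable


def write_jsonl(instances: Iterable[Any], path: str) -> int:
    """Write each instance as one JSON value per line and return the line count."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for instance in instances:
            # json.dumps escapes newlines inside strings, so each instance
            # always occupies exactly one line of the file.
            f.write(json.dumps(instance) + "\n")
            count += 1
    return count
```

For example, write_jsonl([[1, 2, 3, 4], [5, 6, 7, 8]], "instances.jsonl") produces a two-line file in which each line is an array.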

Example 1

The following example shows a JSON Lines file where each line contains an array:

[1, 2, 3, 4]
[5, 6, 7, 8]

Here is what is sent to the prediction container in the HTTP request body:

All other containers

{"instances": [ [1, 2, 3, 4], [5, 6, 7, 8] ]}

PyTorch containers

{"instances": [
{ "data": [1, 2, 3, 4] },
{ "data": [5, 6, 7, 8] } ]}

Example 2

The following example shows a JSON Lines file where each line contains an object.

{ "values": [1, 2, 3, 4], "key": 1 }
{ "values": [5, 6, 7, 8], "key": 2 }

Here is what is sent to the prediction container in the HTTP request body. Note that the same request body is sent to all containers.

{"instances": [
  { "values": [1, 2, 3, 4], "key": 1 },
  { "values": [5, 6, 7, 8], "key": 2 }
]}

Example 3

For PyTorch prebuilt containers, make sure that you wrap each instance in a data field as required by TorchServe's default handler; Vertex AI will not wrap your instances for you. For example:

{ "data": { "values": [1, 2, 3, 4], "key": 1 } }
{ "data": { "values": [5, 6, 7, 8], "key": 2 } }

Here is what is sent to the prediction container in the HTTP request body:

{"instances": [
  { "data": { "values": [1, 2, 3, 4], "key": 1 } },
  { "data": { "values": [5, 6, 7, 8], "key": 2 } }
]}


TFRecord

Save input instances in the TFRecord format. You can optionally compress the TFRecord files with Gzip. Store the TFRecord files in a Cloud Storage bucket.

Vertex AI reads each instance in your TFRecord files as binary, then base64-encodes the instance as a JSON object with a single key named b64.

Here is what is sent to the prediction container in the HTTP request body:

All other containers

{"instances": [
{ "b64": "b64EncodedASCIIString" },
{ "b64": "b64EncodedASCIIString" } ]}

PyTorch containers

{"instances": [
{ "data": {"b64": "b64EncodedASCIIString" } },
{ "data": {"b64": "b64EncodedASCIIString" } } ]}

Make sure your prediction container knows how to decode the instance.
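
One way a custom container might decode such an instance is sketched below in Python. The helper name decode_b64_instance is hypothetical, and the handling of the data wrapper simply follows the PyTorch request shape shown on this page; what you do with the decoded bytes depends on your model.

```python
import base64
from typing import Any, Dict


def decode_b64_instance(instance: Dict[str, Any]) -> bytes:
    """Return the raw bytes carried by a {"b64": ...} instance.

    Also accepts the {"data": {"b64": ...}} shape shown for PyTorch containers.
    """
    # Fall back to the instance itself when there is no "data" wrapper.
    payload = instance.get("data", instance)
    return base64.b64decode(payload["b64"])


if __name__ == "__main__":
    encoded = base64.b64encode(b"raw record bytes").decode("ascii")
    print(decode_b64_instance({"b64": encoded}))
    print(decode_b64_instance({"data": {"b64": encoded}}))
```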


CSV

Specify one input instance per row in a CSV file. The first row must be a header row. You must enclose all strings in double quotation marks ("). Cell values that contain newline characters are not accepted. Non-quoted values are read as floating point numbers.
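
To check locally how a file will be interpreted under these rules, you can use a sketch like the following. It relies on the csv.QUOTE_NONNUMERIC mode of Python's standard csv module, which applies the same convention (quoted cells stay strings, non-quoted cells become floats). It is an approximation for local testing, not the parser that Vertex AI uses, and the column names in the sample are hypothetical.

```python
import csv
import io
from typing import List, Tuple, Union

Cell = Union[float, str]


def csv_to_instances(text: str) -> Tuple[List[str], List[List[Cell]]]:
    """Parse CSV text into (header, rows); quoted cells are str, others float."""
    stream = io.StringIO(text, newline="")
    # Read the header with default quoting so unquoted column names are accepted.
    header = next(csv.reader(stream))
    # QUOTE_NONNUMERIC makes the reader convert every non-quoted field to float.
    reader = csv.reader(stream, quoting=csv.QUOTE_NONNUMERIC)
    rows = [row for row in reader if row]  # Skip blank lines.
    return header, rows


if __name__ == "__main__":
    # Hypothetical column names, used only for illustration.
    sample = 'feature_1,feature_2,category\n0.1,1.2,"cat1"\n4.0,5.0,"cat2"\n'
    header, rows = csv_to_instances(sample)
    print(header)
    print(rows)
```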

The following example shows a CSV file with two input instances:


Here is what is sent to the prediction container in the HTTP request body:

All other containers

{"instances": [ [0.1,1.2,"cat1"], [4.0,5.0,"cat2"] ]}

PyTorch containers

{"instances": [
{ "data": [0.1,1.2,"cat1"] },
{ "data": [4.0,5.0,"cat2"] } ]}

File list

Create a text file where each row is the Cloud Storage URI to a file. Vertex AI reads the contents of each file as binary, then base64-encodes the instance as a JSON object with a single key named b64.

If you plan to use the Google Cloud console to get batch predictions, paste your file list directly into the Google Cloud console. Otherwise save your file list in a Cloud Storage bucket.

The following example shows a file list with two input instances:


Here is what is sent to the prediction container in the HTTP request body:

All other containers

{ "instances": [
{ "b64": "b64EncodedASCIIString" },
{ "b64": "b64EncodedASCIIString" } ]}

PyTorch containers

{ "instances": [
{ "data": { "b64": "b64EncodedASCIIString" } },
{ "data": { "b64": "b64EncodedASCIIString" } } ]}

Make sure your prediction container knows how to decode the instance.


BigQuery

Specify a BigQuery table as projectId.datasetId.tableId. Vertex AI transforms each row from the table to a JSON instance.

For example, if your table contains the following:

Column 1    Column 2    Column 3
1.0         3.0         "cat1"
2.0         4.0         "cat2"

Here is what is sent to the prediction container in the HTTP request body:

All other containers

{"instances": [ [1.0,3.0,"cat1"], [2.0,4.0,"cat2"] ]}

PyTorch containers

{"instances": [
{ "data": [1.0,3.0,"cat1"] },
{ "data": [2.0,4.0,"cat2"] } ]}

Here is how BigQuery data types are converted to JSON:

BigQuery Type    JSON Type      Example value
String           String         "abc"
Integer          Integer        1
Float            Float          1.2
Numeric          Float          4925.000000000
Boolean          Boolean        true
TimeStamp        String         "2019-01-01 23:59:59.999999+00:00"
Date             String         "2018-12-31"
Time             String         "23:59:59.999999"
DateTime         String         "2019-01-01T00:00:00"
Record           Object         { "A": 1,"B": 2}
Repeated Type    Array[Type]    [1, 2]
Nested Record    Object         {"A": {"a": 0}, "B": 1}
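
To make the mapping concrete, here is a rough Python sketch that converts Python values into the JSON shapes listed in the table. It only illustrates the table and is not how Vertex AI performs the conversion. The function name to_json_value is hypothetical, and treating timezone-aware datetimes as TimeStamp and naive ones as DateTime is an assumption made for this example.

```python
import datetime as dt
from decimal import Decimal
from typing import Any


def to_json_value(value: Any) -> Any:
    """Map a Python value to the JSON shape listed in the table (illustrative)."""
    if isinstance(value, Decimal):
        return float(value)  # Numeric -> Float
    if isinstance(value, dt.datetime):
        # Assumption: aware values stand in for TimeStamp, naive ones for DateTime.
        separator = " " if value.tzinfo is not None else "T"
        return value.isoformat(sep=separator)
    if isinstance(value, (dt.date, dt.time)):
        return value.isoformat()  # Date / Time -> String
    if isinstance(value, dict):
        return {key: to_json_value(item) for key, item in value.items()}  # Record
    if isinstance(value, (list, tuple)):
        return [to_json_value(item) for item in value]  # Repeated type
    return value  # String, Integer, Float, Boolean pass through unchanged.


if __name__ == "__main__":
    row = {
        "amount": Decimal("4925.000000000"),
        "created": dt.datetime(2019, 1, 1, 23, 59, 59, 999999, tzinfo=dt.timezone.utc),
        "day": dt.date(2018, 12, 31),
        "tags": [1, 2],
    }
    print(to_json_value(row))
```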

Partition data

Batch prediction uses MapReduce to shard the input to each replica. To make use of the MapReduce features, the input should be partitionable.

Vertex AI automatically partitions BigQuery, file list, and JSON Lines input.

Vertex AI does not automatically partition CSV files because they are not naturally partition-friendly. Rows in CSV files are not self-descriptive or typed, and they may contain newlines. We recommend against using CSV input for throughput-sensitive applications.

For TFRecord input, make sure you manually partition the data by splitting the instances into smaller files and passing the files to the job with a wildcard (for example, gs://my-bucket/*.tfrecord). The number of files should be at least the number of replicas specified.
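
The following Python sketch shows one possible way to plan such a split: it distributes records round-robin over a fixed number of shards and generates file names that a single wildcard matches. Both function names and the naming scheme are assumptions for illustration. Writing the shards in TFRecord format (for example, with TensorFlow) is left out so that the sketch stays dependency-free.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_round_robin(records: Sequence[T], num_shards: int) -> List[List[T]]:
    """Distribute records over num_shards lists, one list per output file."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    shards: List[List[T]] = [[] for _ in range(num_shards)]
    for index, record in enumerate(records):
        shards[index % num_shards].append(record)
    return shards


def shard_file_names(prefix: str, num_shards: int) -> List[str]:
    """Build file names that a wildcard such as prefix*.tfrecord matches."""
    return [f"{prefix}-{i:05d}-of-{num_shards:05d}.tfrecord" for i in range(num_shards)]


if __name__ == "__main__":
    replicas = 4  # Assumed replica count for this illustration.
    shards = shard_round_robin(list(range(10)), replicas)
    for name, shard in zip(shard_file_names("gs://my-bucket/part", replicas), shards):
        print(name, shard)
```

Choosing the number of shards to be at least the replica count follows the guidance above.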

Filter and transform input data

You can filter and/or transform your batch input by specifying instanceConfig in your BatchPredictionJob request.

Filtering lets you either exclude certain fields that are in the input data from your prediction request, or include only a subset of fields from the input data in your prediction request, without having to do any custom pre/post-processing in the prediction container. This is useful when your input data file has extra columns that the model doesn't need, such as keys or additional data.

Transforming lets you send the instances to your prediction container in either a JSON array or object format. See instanceType for more information.

For example, if your input table contains the following:

customerId    col1    col2
1001          1       2
1002          5       6

and you specify the following instanceConfig:

{
  "name": "batchJob1",
  "instanceConfig": {
    "excludedFields": ["customerId"],
    "instanceType": "object"
  }
}

Then, the instances in your prediction request are sent as JSON objects, and the customerId column is excluded:

{"col1": 1, "col2": 2}
{"col1": 5, "col2": 6}

Note that specifying the following instanceConfig would yield the same result:

{
  "name": "batchJob1",
  "instanceConfig": {
    "includedFields": ["col1","col2"],
    "instanceType": "object"
  }
}

For a demonstration on how to use feature filters, see the Custom model batch prediction with feature filtering notebook.
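
To preview locally what filtering and transforming would do to a row, you could use a sketch like the one below. It imitates the behavior described in this section and is not the Vertex AI implementation. The function and parameter names are hypothetical, and rejecting a request that sets both field lists, as well as the field order used for arrays, are assumptions of this sketch.

```python
from typing import Any, Dict, List, Optional, Sequence, Union

Instance = Union[Dict[str, Any], List[Any]]


def apply_instance_config(
    row: Dict[str, Any],
    included_fields: Optional[Sequence[str]] = None,
    excluded_fields: Optional[Sequence[str]] = None,
    instance_type: str = "object",
) -> Instance:
    """Filter a row's fields, then shape it as a JSON object or array."""
    if included_fields and excluded_fields:
        # Assumption: the two filters are treated as mutually exclusive.
        raise ValueError("specify included_fields or excluded_fields, not both")
    if included_fields:
        kept = {name: row[name] for name in included_fields}
    else:
        dropped = set(excluded_fields or ())
        kept = {name: value for name, value in row.items() if name not in dropped}
    if instance_type == "object":
        return kept
    if instance_type == "array":
        return list(kept.values())
    raise ValueError(f"unsupported instance_type: {instance_type}")


if __name__ == "__main__":
    table = [
        {"customerId": 1001, "col1": 1, "col2": 2},
        {"customerId": 1002, "col1": 5, "col2": 6},
    ]
    for row in table:
        print(apply_instance_config(row, excluded_fields=["customerId"]))
```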

Request a batch prediction

For batch prediction requests, you can use the Google Cloud console or the Vertex AI API. Depending on the number of input items that you've submitted, a batch prediction task can take some time to complete.

When you request a batch prediction, the prediction container runs as the user-provided custom service account. The read/write operations, such as reading the prediction instances from the data source or writing the prediction results, are done using the Vertex AI service agent, which, by default, has access to BigQuery and Cloud Storage.

Google Cloud console

Use the Google Cloud console to request a batch prediction.

  1. In the Google Cloud console, in the Vertex AI section, go to the Batch predictions page.

  2. Click Create to open the New batch prediction window.

  3. For Define your batch prediction, complete the following steps:

    1. Enter a name for the batch prediction.

    2. For Model name, select the name of the model to use for this batch prediction.

    3. For Select source, select the source that applies to your input data:

      • If you have formatted your input as JSON Lines, CSV, or TFRecord, select File on Cloud Storage (JSON Lines, CSV, TFRecord, TFRecord Gzip). Then specify your input file in the Source path field.
      • If you are using a file list as input, select Files on Cloud Storage (other) and paste your file list into the following text box.
      • For BigQuery input, select BigQuery path. If you select BigQuery as input, you must also select BigQuery as output and Google-managed encryption key. Customer-managed encryption keys (CMEK) are not supported with BigQuery as input or output.
    4. In the Destination path field, specify the Cloud Storage directory where you want Vertex AI to store batch prediction output.

    5. Optional: To get feature attributions as part of the batch prediction response, select Enable feature attributions for this model. Then click Edit to configure explanation settings. (Editing the explanation settings is optional if you previously configured explanation settings for the model, and required otherwise.)

    6. Specify compute options for the batch prediction job: Number of compute nodes, Machine type, and (optionally) Accelerator type and Accelerator count.

  4. Optional: Model Monitoring analysis for batch predictions is available in Preview. See the Prerequisites for adding skew detection configuration to your batch prediction job.

    1. Click to toggle on Enable model monitoring for this batch prediction.

    2. Select a Training data source. Enter the data path or location for the training data source that you selected.

    3. Optional: Under Alert thresholds, specify thresholds at which to trigger alerts.

    4. For Notification emails, enter one or more comma-separated email addresses to receive alerts when a model exceeds an alerting threshold.

    5. Optional: For Notification channels, add Cloud Monitoring channels to receive alerts when a model exceeds an alerting threshold. You can select existing Cloud Monitoring channels or create a new one by clicking Manage notification channels. The console supports PagerDuty, Slack, and Pub/Sub notification channels.

  5. Click Create.


API

Use the Vertex AI API to send batch prediction requests. Select a tab depending on which tool you are using to get batch predictions.


REST & CMD LINE

Before using any of the request data, make the following replacements:

  • LOCATION_ID: Region where the model is stored and the batch prediction job runs. For example, us-central1.

  • PROJECT_ID: Your project ID.

  • BATCH_JOB_NAME: Display name for the batch prediction job.

  • MODEL_ID: The ID for the model to use for making predictions.

  • INPUT_FORMAT: The format of your input data: jsonl, csv, tf-record, tf-record-gzip, or file-list.

  • INPUT_URI: Cloud Storage URI of your input data. May contain wildcards.

  • OUTPUT_DIRECTORY: Cloud Storage URI of a directory where you want Vertex AI to save output.

  • MACHINE_TYPE: The machine resources to be used for this batch prediction job.

    You can optionally configure the machineSpec field to use accelerators, but the following example does not demonstrate this.

  • BATCH_SIZE: The number of instances to send in each prediction request; the default is 64. Increasing the batch size can lead to higher throughput, but it can also cause request timeouts.

  • STARTING_REPLICA_COUNT: The number of nodes for this batch prediction job.
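
If you prefer to generate the request body programmatically, the following Python sketch assembles it from the values listed above. The function name build_request_body and the sample argument values are hypothetical; the field names are the ones used in the request JSON body in this section.

```python
import json
from typing import Any, Dict


def build_request_body(
    display_name: str,
    project_id: str,
    location_id: str,
    model_id: str,
    input_format: str,
    input_uri: str,
    output_directory: str,
    machine_type: str,
    starting_replica_count: int,
    batch_size: int = 64,
) -> Dict[str, Any]:
    """Assemble the batch prediction request body from the placeholder values."""
    return {
        "displayName": display_name,
        "model": f"projects/{project_id}/locations/{location_id}/models/{model_id}",
        "inputConfig": {
            "instancesFormat": input_format,
            "gcsSource": {"uris": [input_uri]},
        },
        "outputConfig": {
            "predictionsFormat": "jsonl",
            "gcsDestination": {"outputUriPrefix": output_directory},
        },
        "dedicatedResources": {
            "machineSpec": {"machineType": machine_type},
            "startingReplicaCount": starting_replica_count,
        },
        "manualBatchTuningParameters": {"batch_size": batch_size},
    }


if __name__ == "__main__":
    # All argument values below are made up for illustration.
    body = build_request_body(
        "my-batch-job", "my-project", "us-central1", "1234567890", "jsonl",
        "gs://my-bucket/input/*.jsonl", "gs://my-bucket/output", "n1-standard-2", 2,
    )
    print(json.dumps(body, indent=2))
```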

HTTP method and URL:


Request JSON body:

{
  "displayName": "BATCH_JOB_NAME",
  "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",
  "inputConfig": {
    "instancesFormat": "INPUT_FORMAT",
    "gcsSource": {
      "uris": ["INPUT_URI"]
    }
  },
  "outputConfig": {
    "predictionsFormat": "jsonl",
    "gcsDestination": {
      "outputUriPrefix": "OUTPUT_DIRECTORY"
    }
  },
  "dedicatedResources": {
    "machineSpec": {
      "machineType": "MACHINE_TYPE"
    },
    "startingReplicaCount": STARTING_REPLICA_COUNT
  },
  "manualBatchTuningParameters": {
    "batch_size": BATCH_SIZE
  }
}

To send your request, choose one of these options:


curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \


PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/batchPredictionJobs/BATCH_JOB_ID",
  "displayName": "BATCH_JOB_NAME 202005291958",
  "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID",
  "inputConfig": {
    "instancesFormat": "jsonl",
    "gcsSource": {
      "uris": [
        "INPUT_URI"
      ]
    }
  },
  "outputConfig": {
    "predictionsFormat": "jsonl",
    "gcsDestination": {
      "outputUriPrefix": "OUTPUT_DIRECTORY"
    }
  },
  "state": "JOB_STATE_PENDING",
  "createTime": "2020-05-30T02:58:44.341643Z",
  "updateTime": "2020-05-30T02:58:44.341643Z"
}


Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

In the following sample, replace PREDICTIONS_FORMAT with jsonl. To learn how to replace the other placeholders, see the REST & CMD LINE tab of this section.


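import com.google.cloud.aiplatform.util.ValueConverter;
import com.google.cloud.aiplatform.v1.BatchDedicatedResources;
import com.google.cloud.aiplatform.v1.BatchPredictionJob;
import com.google.cloud.aiplatform.v1.GcsDestination;
import com.google.cloud.aiplatform.v1.GcsSource;
import com.google.cloud.aiplatform.v1.JobServiceClient;
import com.google.cloud.aiplatform.v1.JobServiceSettings;
import com.google.cloud.aiplatform.v1.LocationName;
import com.google.cloud.aiplatform.v1.MachineSpec;
import com.google.cloud.aiplatform.v1.ModelName;
import com.google.protobuf.Value;
import java.io.IOException;

public class CreateBatchPredictionJobSample {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String project = "PROJECT";
    String displayName = "DISPLAY_NAME";
    String modelName = "MODEL_NAME";
    String instancesFormat = "INSTANCES_FORMAT";
    String gcsSourceUri = "GCS_SOURCE_URI";
    String predictionsFormat = "PREDICTIONS_FORMAT";
    String gcsDestinationOutputUriPrefix = "GCS_DESTINATION_OUTPUT_URI_PREFIX";
    createBatchPredictionJobSample(
        project, displayName, modelName, instancesFormat, gcsSourceUri, predictionsFormat,
        gcsDestinationOutputUriPrefix);
  }

  static void createBatchPredictionJobSample(
      String project,
      String displayName,
      String model,
      String instancesFormat,
      String gcsSourceUri,
      String predictionsFormat,
      String gcsDestinationOutputUriPrefix)
      throws IOException {
    JobServiceSettings settings =
        JobServiceSettings.newBuilder().setEndpoint("us-central1-aiplatform.googleapis.com:443").build();
    String location = "us-central1";

    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests. After completing all of your requests, call
    // the "close" method on the client to safely clean up any remaining background resources.
    try (JobServiceClient client = JobServiceClient.create(settings)) {

      // Passing in an empty Value object for model parameters
      Value modelParameters = ValueConverter.EMPTY_VALUE;

      GcsSource gcsSource = GcsSource.newBuilder().addUris(gcsSourceUri).build();
      BatchPredictionJob.InputConfig inputConfig =
          BatchPredictionJob.InputConfig.newBuilder()
              .setInstancesFormat(instancesFormat)
              .setGcsSource(gcsSource)
              .build();
      GcsDestination gcsDestination =
          GcsDestination.newBuilder().setOutputUriPrefix(gcsDestinationOutputUriPrefix).build();
      BatchPredictionJob.OutputConfig outputConfig =
          BatchPredictionJob.OutputConfig.newBuilder()
              .setPredictionsFormat(predictionsFormat)
              .setGcsDestination(gcsDestination)
              .build();
      // Illustrative machine type and replica count; choose values that suit your model.
      MachineSpec machineSpec =
          MachineSpec.newBuilder().setMachineType("n1-standard-2").build();
      BatchDedicatedResources dedicatedResources =
          BatchDedicatedResources.newBuilder().setMachineSpec(machineSpec).setStartingReplicaCount(1).build();
      String modelName = ModelName.of(project, location, model).toString();
      BatchPredictionJob batchPredictionJob =
          BatchPredictionJob.newBuilder()
              .setDisplayName(displayName)
              .setModel(modelName)
              .setModelParameters(modelParameters)
              .setInputConfig(inputConfig)
              .setOutputConfig(outputConfig)
              .setDedicatedResources(dedicatedResources)
              .build();
      LocationName parent = LocationName.of(project, location);
      BatchPredictionJob response = client.createBatchPredictionJob(parent, batchPredictionJob);
      System.out.format("response: %s\n", response);
      System.out.format("\tName: %s\n", response.getName());
    }
  }
}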


Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

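from typing import Sequence, Union

from google.cloud import aiplatform, aiplatform_v1


def create_batch_prediction_job_dedicated_resources_sample(
    project: str,
    location: str,
    model_resource_name: str,
    job_display_name: str,
    gcs_source: Union[str, Sequence[str]],
    gcs_destination: str,
    instances_format: str = "jsonl",
    machine_type: str = "n1-standard-2",
    accelerator_count: int = 1,
    accelerator_type: Union[str, aiplatform_v1.AcceleratorType] = "NVIDIA_TESLA_K80",
    starting_replica_count: int = 1,
    max_replica_count: int = 1,
    sync: bool = True,
):
    aiplatform.init(project=project, location=location)

    my_model = aiplatform.Model(model_resource_name)

    batch_prediction_job = my_model.batch_predict(
        job_display_name=job_display_name,
        instances_format=instances_format,
        gcs_source=gcs_source,
        gcs_destination_prefix=gcs_destination,
        machine_type=machine_type,
        accelerator_count=accelerator_count,
        accelerator_type=accelerator_type,
        starting_replica_count=starting_replica_count,
        max_replica_count=max_replica_count,
        sync=sync,
    )

    # Block until the job finishes; this matters when sync=False.
    batch_prediction_job.wait()

    return batch_prediction_job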


The preceding REST example uses Cloud Storage for the source and destination. To use BigQuery instead, make the following changes:

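  • Change the inputConfig field to the following:

    "inputConfig": {
       "instancesFormat": "bigquery",
       "bigquerySource": {
          "inputUri": "bq://SOURCE_PROJECT_ID.SOURCE_DATASET_NAME.SOURCE_TABLE_NAME"
       }
    }

  • Change the outputConfig field to the following:

    "outputConfig": {
       "predictionsFormat": "bigquery",
       "bigqueryDestination": {
          "outputUri": "bq://DESTINATION_PROJECT_ID.DESTINATION_DATASET_NAME.DESTINATION_TABLE_NAME"
       }
    }
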
  • Replace the following:

    • SOURCE_PROJECT_ID: ID of the source Google Cloud project
    • SOURCE_DATASET_NAME: name of the source BigQuery dataset
    • SOURCE_TABLE_NAME: name of the BigQuery source table
    • DESTINATION_PROJECT_ID: ID of the destination Google Cloud project
    • DESTINATION_DATASET_NAME: name of the destination BigQuery dataset
    • DESTINATION_TABLE_NAME: name of the BigQuery destination table
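
As a rough sketch of how these six values fit together, the following Python function builds the two BigQuery config fragments. The function name bigquery_io_config is hypothetical. The inputUri and outputUri field names and the bq:// URI form reflect our understanding of the Vertex AI API; verify them against the BatchPredictionJob reference before relying on them.

```python
from typing import Any, Dict


def bigquery_io_config(
    source_project: str,
    source_dataset: str,
    source_table: str,
    destination_project: str,
    destination_dataset: str,
    destination_table: str,
) -> Dict[str, Any]:
    """Build inputConfig and outputConfig fragments for BigQuery I/O."""
    source_uri = f"bq://{source_project}.{source_dataset}.{source_table}"
    destination_uri = (
        f"bq://{destination_project}.{destination_dataset}.{destination_table}"
    )
    return {
        "inputConfig": {
            "instancesFormat": "bigquery",
            "bigquerySource": {"inputUri": source_uri},
        },
        "outputConfig": {
            "predictionsFormat": "bigquery",
            "bigqueryDestination": {"outputUri": destination_uri},
        },
    }


if __name__ == "__main__":
    # Made-up project, dataset, and table names.
    print(bigquery_io_config("src-proj", "sales", "inputs", "dst-proj", "sales", "preds"))
```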

Feature importance

If you want feature importance values returned for your predictions, set the generateExplanation property to true. Note that forecasting models don't support feature importance, so you can't include it in your batch prediction requests.

Feature importance, sometimes called feature attributions, is part of Vertex Explainable AI.

You can only set generateExplanation to true if you have configured your Model for explanations or if you specify the BatchPredictionJob's explanationSpec field.

Choose machine type and replica count

Scaling horizontally by increasing the number of replicas improves throughput more linearly and predictably than by using larger machine types.

In general, we recommend that you specify the smallest machine type possible for your job and increase the number of replicas.

For cost-effectiveness, we recommend that you choose the replica count such that your batch prediction job runs for at least 10 minutes. This is because you are billed per replica node hour, which includes the approximately 5 minutes it takes for each replica to start up. It is not cost-effective to process for only a few seconds and then shut down.

As general guidance, for thousands of instances, we recommend a starting_replica_count in the tens. For millions of instances, we recommend a starting_replica_count in the hundreds. You can also use the following formula to estimate the number of replicas:

N / (T * (60 / Tb))

Where:

  • N: The number of batches in the job. For example, 1 million instances / 100 batch size = 10,000 batches.
  • T: The desired time for the batch prediction job, in minutes. For example, 10 minutes.
  • Tb: The time in seconds it takes for a replica to process a single batch. For example, 1 second per batch on a 2-core machine type.

In our example, 10,000 batches / (10 minutes * (60 / 1s)) = 10,000 / 600 ≈ 16.7, which rounds up to 17 replicas.
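
The same estimate can be written as a small Python helper. The function name estimate_replicas and its parameter names are hypothetical; the arithmetic is the formula above, rounded up to a whole replica.

```python
import math


def estimate_replicas(
    num_instances: int,
    batch_size: int,
    target_minutes: float,
    seconds_per_batch: float,
) -> int:
    """Estimate the replica count as N / (T * (60 / Tb)), rounded up."""
    num_batches = math.ceil(num_instances / batch_size)  # N
    # Batches one replica can process within the target time: T * (60 / Tb).
    batches_per_replica = target_minutes * (60 / seconds_per_batch)
    return max(1, math.ceil(num_batches / batches_per_replica))


if __name__ == "__main__":
    # 1 million instances, batch size 100, 10-minute target, 1 second per batch.
    print(estimate_replicas(1_000_000, 100, 10, 1))  # prints 17
```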

Unlike online prediction, batch prediction jobs do not autoscale. Because all of the input data is known up front, the system partitions the data to each replica when the job starts. The system uses the starting_replica_count parameter; the max_replica_count parameter is ignored.

These recommendations are all approximate guidelines. They are not guaranteed to give optimal throughput for every model. They do not provide exact predictions of processing time and cost. And they do not necessarily capture the best cost/throughput tradeoffs for each scenario. Use them as a reasonable starting point and adjust them as necessary. To measure characteristics such as throughput for your model, run the Finding ideal machine type notebook.

For GPU or TPU accelerated machines

Follow the guidelines for CPU-only models with the following additional considerations:

  • You might need more CPUs and GPUs (for example, for data preprocessing).
  • GPU machine types take more time to start up (10 minutes), so you may want to target longer times (for example, at least 20 minutes instead of 10 minutes) for the batch prediction job so that a reasonable proportion of the time and cost is spent on generating predictions.

Retrieve batch prediction results

When a batch prediction task is complete, the output of the prediction is stored in the Cloud Storage bucket or BigQuery location that you specified in your request.

Example batch prediction result

The output folder contains a set of JSON Lines files.

The files are named {gcs_path}/prediction.results-{file_number}-of-{number_of_files_generated}. The number of files is not deterministic because of the distributed nature of batch prediction.

Each line in the file corresponds to an instance from the input and has the following key/value pairs:

  • prediction: Contains the value returned by the prediction container.
  • instance: For FileList, it contains the Cloud Storage URI. For all other input formats, it contains the value that was sent to the prediction container in the HTTP request body.
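
After you download a results file, a short Python sketch like the following can pair each instance with its prediction. The function name iter_results is hypothetical, and the sketch assumes every non-empty line carries both keys described above.

```python
import json
from typing import Any, Iterable, Iterator, Tuple


def iter_results(lines: Iterable[str]) -> Iterator[Tuple[Any, Any]]:
    """Yield (instance, prediction) pairs from JSON Lines output."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Ignore blank lines.
        record = json.loads(line)
        yield record["instance"], record["prediction"]


if __name__ == "__main__":
    sample = [
        '{ "instance": [1, 2, 3, 4], "prediction": [0.1,0.9]}',
        '{ "instance": [5, 6, 7, 8], "prediction": [0.7,0.3]}',
    ]
    for instance, prediction in iter_results(sample):
        print(instance, "->", prediction)
```

Because iter_results accepts any iterable of strings, you can pass it an open file object to process a downloaded results file line by line.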

Example 1

If the HTTP request contains:

{
  "instances": [
    [1, 2, 3, 4],
    [5, 6, 7, 8]
  ]
}

And the prediction container returns:

{
  "predictions": [
    [0.1,0.9],
    [0.7,0.3]
  ]
}

Then, the jsonl output file is:

{ "instance": [1, 2, 3, 4], "prediction": [0.1,0.9]}
{ "instance": [5, 6, 7, 8], "prediction": [0.7,0.3]}

Example 2

If the HTTP request contains:

{
  "instances": [
    {"values": [1, 2, 3, 4], "key": 1},
    {"values": [5, 6, 7, 8], "key": 2}
  ]
}

And the prediction container returns:

{
  "predictions": [
    {"result":1},
    {"result":0}
  ]
}

Then, the jsonl output file is:

{ "instance": {"values": [1, 2, 3, 4], "key": 1}, "prediction": {"result":1}}
{ "instance": {"values": [5, 6, 7, 8], "key": 2}, "prediction": {"result":0}}

Use Explainable AI

We don't recommend running feature-based explanations on a large amount of data. Each input can potentially fan out to thousands of requests, based on the set of possible feature values, which can massively increase processing time and cost. In general, a small dataset is enough to understand feature importance.

Batch prediction does not support example-based explanations.

What's next