Prepare video training data for object tracking

This page describes how to prepare video training data for use in a Vertex AI dataset to train a video object tracking model.

The following sections describe the data requirements, the schema file, and the format of the data import files (JSONL and CSV) that the schema defines.

Alternatively, you can import videos that have not been annotated and annotate them later using the Google Cloud console (see Labeling using the Google Cloud console).

Data requirements

The following requirements apply to datasets used to train AutoML or custom-trained models.

  • Vertex AI supports the following video formats for training your model or requesting a prediction (annotating a video):

    • .MOV
    • .MPEG4
    • .MP4
    • .AVI
  • To view the video content in the web console or to annotate a video, the video must be in a format that your browser natively supports. Because not all browsers handle .MOV or .AVI content natively, we recommend using the .MPEG4 or .MP4 video format.

  • Maximum file size is 50 GB (up to 3 hours in duration). Individual video files with malformed or empty timestamps in the container aren't supported.

  • The maximum number of labels in each dataset is limited to 1,000.

  • You may assign "ML_USE" labels to the videos in the import files. At training time, you may choose to use those labels to split the videos and their corresponding annotations into "training" or "test" sets. For video object tracking, note the following:

    • The maximum number of labeled video frames in each dataset is limited to 150,000.
    • The maximum number of total annotated bounding boxes in each dataset is limited to 1,000,000.
    • The maximum number of labels in each annotation set is limited to 1,000.
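The limits above can be checked before import. The following is a minimal sketch, not an official validator: the function name, the tuple layout of `annotations`, and the interpretation of a "labeled frame" as a distinct (video, time offset) pair are assumptions made for illustration. It does not check file size or duration.

```python
import posixpath

# Limits taken from the data requirements listed above.
SUPPORTED_EXTENSIONS = {".mov", ".mpeg4", ".mp4", ".avi"}
MAX_LABELS = 1_000
MAX_LABELED_FRAMES = 150_000
MAX_BOUNDING_BOXES = 1_000_000


def check_dataset_limits(video_uris, annotations):
    """Return a list of problems; an empty list means no limit was exceeded.

    annotations: iterable of (video_uri, label, time_offset_seconds) tuples,
    one tuple per bounding box (a hypothetical in-memory layout).
    """
    problems = []
    for uri in video_uris:
        extension = posixpath.splitext(uri)[1].lower()
        if extension not in SUPPORTED_EXTENSIONS:
            problems.append(f"unsupported video format: {uri}")

    annotations = list(annotations)
    labels = {label for _, label, _ in annotations}
    # Assumption: a labeled frame is a distinct (video, time offset) pair.
    frames = {(uri, offset) for uri, _, offset in annotations}

    if len(labels) > MAX_LABELS:
        problems.append(f"too many labels: {len(labels)} > {MAX_LABELS}")
    if len(frames) > MAX_LABELED_FRAMES:
        problems.append(f"too many labeled frames: {len(frames)} > {MAX_LABELED_FRAMES}")
    if len(annotations) > MAX_BOUNDING_BOXES:
        problems.append(f"too many bounding boxes: {len(annotations)} > {MAX_BOUNDING_BOXES}")
    return problems
```

For example, `check_dataset_limits(["gs://folder/video1.avi"], [("gs://folder/video1.avi", "car", 12.9)])` returns an empty list.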

Best practices for video data used to train AutoML models

The following practices apply to datasets used to train AutoML models.

  • The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry and low-resolution videos (such as from a security camera), your training data should be composed of blurry, low-resolution videos. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training videos.

  • Vertex AI models can't generally predict labels that humans can't assign. If a human can't be trained to assign labels by looking at the video for 1-2 seconds, the model likely can't be trained to do it either.

  • The model works best when there are at most 100 times more videos for the most common label than for the least common label. We recommend removing low-frequency labels. For object tracking:

    • Minimum bounding box size is 10 px by 10 px.
    • For video frame resolution much larger than 1024 pixels by 1024 pixels, some image quality can be lost during the frame normalization process used by AutoML object tracking.
    • Each unique label must be present in at least three distinct video frames. In addition, each label must also have a minimum of ten annotations.
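The per-label minimums and the 100:1 balance guideline above can be checked with a short script. This is a sketch under assumptions: the function names and the `(video_uri, label, time_offset)` tuple layout are hypothetical, and the pixel check requires you to supply the frame dimensions yourself.

```python
from collections import defaultdict


def check_label_coverage(annotations, min_frames=3, min_annotations=10, max_ratio=100):
    """Return warnings for labels that miss the minimums described above.

    annotations: iterable of (video_uri, label, time_offset_seconds) tuples,
    one tuple per bounding box (a hypothetical in-memory layout).
    """
    frames = defaultdict(set)   # label -> distinct (video, offset) pairs
    videos = defaultdict(set)   # label -> videos in which the label appears
    counts = defaultdict(int)   # label -> number of annotations
    for uri, label, offset in annotations:
        frames[label].add((uri, offset))
        videos[label].add(uri)
        counts[label] += 1

    warnings = []
    for label in sorted(counts):
        if len(frames[label]) < min_frames:
            warnings.append(f"{label}: only {len(frames[label])} distinct frame(s)")
        if counts[label] < min_annotations:
            warnings.append(f"{label}: only {counts[label]} annotation(s)")
    if videos:
        video_counts = [len(uris) for uris in videos.values()]
        if max(video_counts) > max_ratio * min(video_counts):
            warnings.append(f"label imbalance exceeds {max_ratio}:1")
    return warnings


def box_meets_minimum_size(x_min, y_min, x_max, y_max,
                           frame_width, frame_height, min_pixels=10):
    """Convert relative coordinates to pixels and compare with the 10 px minimum."""
    width_px = (x_max - x_min) * frame_width
    height_px = (y_max - y_min) * frame_height
    return width_px >= min_pixels and height_px >= min_pixels
```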

Schema files

  • Use the following publicly accessible schema file when creating the JSONL file for importing annotations. This schema file dictates the format of the data input files. The structure of the file follows the OpenAPI Schema format.

    Object tracking schema file:


    Full schema file

    title: VideoObjectTracking
    version: 1.0.0
    description: >
      Import and export format for importing/exporting videos together with
      temporal bounding box annotations.
    type: object
    required:
    - videoGcsUri
    properties:
      videoGcsUri:
        type: string
        description: >
          A Cloud Storage URI pointing to a video. Up to 50 GB in size and
          up to 3 hours in duration. Supported file mime types: `video/mp4`,
          `video/avi`, `video/quicktime`.
      temporalBoundingBoxAnnotations:
        type: array
        description: Multiple temporal bounding box annotations. Each on a frame of the video.
        items:
          type: object
          description: >
            Temporal bounding box annotation on video. `xMin`, `xMax`, `yMin`, and
            `yMax` are relative to the video frame size, and the point 0,0 is in the
            top left of the frame.
          properties:
            displayName:
              type: string
              description: >
                It will be imported as/exported from AnnotationSpec's display name,
                i.e., the name of the label/class.
            xMin:
              description: The leftmost coordinate of the bounding box.
              type: number
              format: double
            xMax:
              description: The rightmost coordinate of the bounding box.
              type: number
              format: double
            yMin:
              description: The topmost coordinate of the bounding box.
              type: number
              format: double
            yMax:
              description: The bottommost coordinate of the bounding box.
              type: number
              format: double
            timeOffset:
              type: string
              description: >
                A time offset of a video in which the object has been detected.
                Expressed as a number of seconds as measured from the
                start of the video, with fractions up to a microsecond precision, and
                with "s" appended at the end.
            instanceId:
              type: number
              format: integer
              description: >
                The instance of the object, expressed as a positive integer. Used to
                tell apart objects of the same type when multiple are present on a
                single video.
            annotationResourceLabels:
              description: Resource labels on the Annotation.
              type: object
              additionalProperties:
                type: string
      dataItemResourceLabels:
        description: Resource labels on the DataItem.
        type: object
        additionalProperties:
          type: string
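The schema describes the time offset as a string: a number of seconds from the start of the video, with up to microsecond precision, followed by "s". The following sketch shows one way to produce and check such strings; the helper names are hypothetical and the regular expression is an interpretation of that description, not part of the schema.

```python
import re

# Digits, an optional fraction of up to six digits (microseconds), then "s".
_TIME_OFFSET = re.compile(r"\d+(\.\d{1,6})?s")


def format_time_offset(seconds):
    """Format a non-negative number of seconds, for example 4 -> '4.000000s'."""
    if seconds < 0:
        raise ValueError("time offset must not be negative")
    return f"{seconds:.6f}s"


def parse_time_offset(text):
    """Parse a string such as '71.000000s' into a float number of seconds."""
    if not _TIME_OFFSET.fullmatch(text):
        raise ValueError(f"not a valid time offset: {text!r}")
    return float(text[:-1])
```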

Input files

The format of your training data for video object tracking is as follows.

To import your data, create either a JSONL or CSV file.


JSON on each line:
See Object tracking YAML file for details.

{
	"videoGcsUri": "gs://bucket/filename.ext",
	"temporalBoundingBoxAnnotations": [{
		"displayName": "LABEL",
		"xMin": "leftmost_coordinate_of_the_bounding box",
		"xMax": "rightmost_coordinate_of_the_bounding box",
		"yMin": "topmost_coordinate_of_the_bounding box",
		"yMax": "bottommost_coordinate_of_the_bounding box",
		"timeOffset": "timeframe_object-detected",
		"instanceId": "instance_of_object",
		"annotationResourceLabels": "resource_labels"
	}],
	"dataItemResourceLabels": {
		"": "train|test"
	}
}

Example JSONL - Video object tracking:

{"videoGcsUri": "gs://demo-data/video1.mp4", "temporalBoundingBoxAnnotations": [{"displayName": "horse", "instanceId": "-1", "timeOffset": "4.000000s", "xMin": "0.668912", "yMin": "0.560642", "xMax": "1.000000", "yMax": "1.000000"}], "dataItemResourceLabels": {"": "training"}}
{"videoGcsUri": "gs://demo-data/video2.mp4", "temporalBoundingBoxAnnotations": [{"displayName": "horse", "instanceId": "-1", "timeOffset": "71.000000s", "xMin": "0.679056", "yMin": "0.070957", "xMax": "0.801716", "yMax": "0.290358"}], "dataItemResourceLabels": {"": "test"}}
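Writing JSONL by hand is error-prone, because each line must be a complete, valid JSON object. The following sketch builds one line with the standard `json` module. It uses the field names from the JSON template above, with the array key written in lower camelCase (an assumption). The function name and the layout of the `boxes` dictionaries are hypothetical, and the sketch emits coordinates as numbers, following the types in the schema.

```python
import json


def make_jsonl_line(video_uri, boxes, data_item_labels=None):
    """Serialize one video and its bounding boxes as a single JSONL line.

    boxes: list of dicts with the hypothetical keys label, instance_id,
    time_offset_s, x_min, y_min, x_max, and y_max.
    """
    annotations = []
    for box in boxes:
        annotations.append({
            "displayName": box["label"],
            "xMin": box["x_min"],
            "xMax": box["x_max"],
            "yMin": box["y_min"],
            "yMax": box["y_max"],
            # Seconds with microsecond precision and a trailing "s".
            "timeOffset": f"{box['time_offset_s']:.6f}s",
            "instanceId": box["instance_id"],
        })
    record = {
        "videoGcsUri": video_uri,
        "temporalBoundingBoxAnnotations": annotations,
    }
    if data_item_labels:
        # Passed through unchanged; the caller supplies the label key and value.
        record["dataItemResourceLabels"] = dict(data_item_labels)
    # json.dumps escapes newlines inside strings, so the result is one line.
    return json.dumps(record)
```

To create an import file, write the result of each call to the file followed by a newline character.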


Format of a row in the CSV file:

[ML_USE,]VIDEO_URI,LABEL,[INSTANCE_ID],TIME_OFFSET,BOUNDING_BOX

List of columns

  • ML_USE (Optional). For data split purposes when training a model. Use TRAINING or TEST.
  • VIDEO_URI. This field contains the Cloud Storage URI for the video. Cloud Storage URIs are case-sensitive.
  • LABEL. Labels must start with a letter and contain only letters, numbers, and underscores. You can specify multiple labels for a video by adding multiple rows in the CSV file that each identify the same video segment, with a different label for each row.
  • INSTANCE_ID (Optional). An instance ID that identifies the object instance across video frames in a video. If it's provided, AutoML object tracking uses it for object tracking tuning, training, and evaluation. Bounding boxes of the same object instance in different video frames are labeled with the same instance ID. The instance ID is unique only within a video, not across the dataset. For example, if two objects from two different videos have the same instance ID, it does not mean they are the same object instance.
  • TIME_OFFSET. The offset of the video frame from the beginning of the video. The time offset is a floating-point number, in seconds.
  • BOUNDING_BOX. A bounding box for an object in the video frame. Specifying a bounding box involves more than one column.
    A. x_relative_min,y_relative_min
    B. x_relative_max,y_relative_min
    C. x_relative_max,y_relative_max
    D. x_relative_min,y_relative_max

    Each vertex is specified by x, y coordinate values. Each coordinate value must be a float in the range 0 to 1, where 0 represents the minimum x or y value, and 1 represents the greatest x or y value.
    For example, (0,0) represents the top-left corner, and (1,1) represents the bottom-right corner; a bounding box for the entire image is expressed as (0,0,,,1,1,,), or (0,0,1,0,1,1,0,1).
    AutoML object tracking does not require a specific vertex ordering. Also, if the four specified vertices don't form a rectangle parallel to the image edges, Vertex AI specifies vertices that do form such a rectangle.
    The bounding box for an object can be specified in one of two ways:
    1. Two vertices, each a pair of x,y coordinates, that are diagonally opposite corners of the rectangle:
      A. x_relative_min,y_relative_min
      C. x_relative_max,y_relative_max
      as shown in this example:
      x_relative_min,y_relative_min,,,x_relative_max,y_relative_max,,
    2. All four vertices, as shown in this example:
      x_relative_min,y_relative_min,x_relative_max,y_relative_min,x_relative_max,y_relative_max,x_relative_min,y_relative_max
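The two ways of writing the bounding box columns can be sketched as follows. The function names are hypothetical. `bounding_rectangle` uses the smallest enclosing axis-aligned rectangle; this is only one plausible way to obtain a rectangle parallel to the image edges, and the page does not state how Vertex AI derives its rectangle.

```python
def bounding_rectangle(vertices):
    """Return (x_min, y_min, x_max, y_max) enclosing the given (x, y) vertices.

    Assumption: the smallest enclosing rectangle is used; Vertex AI's own
    adjustment of non-rectangular input is not documented here.
    """
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    for value in xs + ys:
        if not 0 <= value <= 1:
            raise ValueError(f"coordinate out of the 0 to 1 range: {value}")
    return min(xs), min(ys), max(xs), max(ys)


def two_vertex_columns(x_min, y_min, x_max, y_max):
    """Eight bounding box columns with only vertices A and C filled in."""
    return [str(x_min), str(y_min), "", "", str(x_max), str(y_max), "", ""]


def four_vertex_columns(x_min, y_min, x_max, y_max):
    """Eight bounding box columns with vertices A, B, C, and D filled in."""
    corners = [(x_min, y_min), (x_max, y_min), (x_max, y_max), (x_min, y_max)]
    return [str(value) for corner in corners for value in corner]
```

Joining the result with commas gives the bounding box part of a CSV row. For example, `",".join(two_vertex_columns(0, 0, 1, 1))` produces `0,0,,,1,1,,`, the whole-image box mentioned above.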

Examples of rows in dataset files

The following row demonstrates how to specify data in a dataset. The example includes a path to a video on Cloud Storage, a label for the object, a time offset to begin tracking, and two diagonal vertices. The columns are:

VIDEO_URI,LABEL,INSTANCE_ID,TIME_OFFSET,x_relative_min,y_relative_min,x_relative_max,y_relative_min,x_relative_max,y_relative_max,x_relative_min,y_relative_max

gs://folder/video1.avi,car,,12.90,0.8,0.2,,,0.9,0.3,,

where:

  • VIDEO_URI is gs://folder/video1.avi
  • LABEL is car
  • INSTANCE_ID is not specified
  • TIME_OFFSET is 12.90
  • x_relative_min,y_relative_min are 0.8,0.2
  • x_relative_max,y_relative_min are not specified
  • x_relative_max,y_relative_max are 0.9,0.3
  • x_relative_min,y_relative_max are not specified

As stated previously, you can also specify your bounding boxes by providing all four vertices, as shown in the following examples.

gs://folder/video1.avi,car,,12.10,0.8,0.8,0.9,0.8,0.9,0.9,0.8,0.9
gs://folder/video1.avi,car,,12.90,0.4,0.8,0.5,0.8,0.5,0.9,0.4,0.9
gs://folder/video1.avi,car,,12.10,0.4,0.2,0.5,0.2,0.5,0.3,0.4,0.3
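Rows like these can be read back with Python's `csv` module. The following sketch assumes that a row starts with ML_USE only when its first field is TRAINING or TEST, and that twelve columns follow; the function name and the returned dictionary keys are hypothetical.

```python
import csv
import io

ML_USE_VALUES = {"TRAINING", "TEST"}


def parse_csv_row(line):
    """Parse one non-empty CSV row into a dict; unspecified values become None."""
    fields = next(csv.reader(io.StringIO(line)))
    ml_use = None
    if fields[0] in ML_USE_VALUES:
        ml_use, fields = fields[0], fields[1:]
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns after ML_USE, got {len(fields)}")

    video_uri, label, instance_id, time_offset = fields[:4]
    coords = fields[4:]
    # Keep only the vertices whose x and y columns are both filled in.
    vertices = [
        (float(coords[i]), float(coords[i + 1]))
        for i in range(0, 8, 2)
        if coords[i] != "" and coords[i + 1] != ""
    ]
    return {
        "ml_use": ml_use,
        "video_uri": video_uri,
        "label": label or None,
        "instance_id": int(instance_id) if instance_id else None,
        "time_offset": float(time_offset) if time_offset else None,
        "vertices": vertices,
    }
```

A row with two diagonal vertices yields two entries in `vertices`, and a row with all four vertices yields four.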

Example CSV - no labels:

You can also provide videos in the data file without specifying any labels. You must then use the Google Cloud console to apply labels to your data before you train your model. To do so, you only need to provide the Cloud Storage URI for the video followed by eleven commas, as shown in the following example.

Example without assigned ml_use:

gs://folder/video1.avi,,,,,,,,,,,

Example with ml_use assigned:

TRAINING,gs://folder/video1.avi,,,,,,,,,,,
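Counting commas by hand is easy to get wrong, so an unlabeled row can be generated instead. This is a minimal sketch with a hypothetical helper name; it appends the eleven commas described above and optionally prefixes an ML_USE value.

```python
def unlabeled_row(video_uri, ml_use=None):
    """Build a CSV row for a video that has no labels yet."""
    row = video_uri + "," * 11  # eleven empty columns after the URI
    return f"{ml_use},{row}" if ml_use else row
```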