Prepare video training data for classification

This page describes how to prepare video training data for use in a Vertex AI dataset to train a video classification model.

The following sections describe the data requirements, the schema file, and the format of the data import files (JSONL and CSV) that the schema defines.

Alternatively, you can import videos that have not been annotated and annotate them later using the Google Cloud console (see Labeling using the Google Cloud console).

Data requirements

The following requirements apply to datasets used to train AutoML or custom-trained models.

  • Vertex AI supports the following video formats for training your model or requesting a prediction (annotating a video):

    • .MOV
    • .MPEG4
    • .MP4
    • .AVI
  • To view the video content in the web console or to annotate a video, the video must be in a format that your browser natively supports. Because not all browsers handle .MOV or .AVI content natively, we recommend using either the .MPEG4 or .MP4 video format.

  • The maximum file size is 50 GB, and the maximum duration is 3 hours. Individual video files with malformed or empty timestamps in the container aren't supported.

  • The maximum number of labels in each dataset is limited to 1,000.

  • You may assign "ML_USE" labels to the videos in the import files. At training time, you may choose to use those labels to split the videos and their corresponding annotations into "training" or "test" sets. For video classification, note the following:

    • At least two different classes are required for model training. For example, "news" and "MTV", or "game" and "others".
    • Consider including a "None_of_the_above" class and video segments that do not match any of your defined classes.
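The limits above can be checked before import. The following is a minimal sketch, not part of any Vertex AI tooling: the function names `check_video` and `check_labels` are hypothetical, the caller is assumed to already know each file's size and duration, and "50 GB" is read as 50 GiB, which is an assumption.

```python
import os

# Limits taken from the requirements listed above.
SUPPORTED_EXTENSIONS = {".mov", ".mpeg4", ".mp4", ".avi"}
MAX_SIZE_BYTES = 50 * 1024**3        # assumption: "50 GB" read as 50 GiB
MAX_DURATION_SECONDS = 3 * 60 * 60   # 3 hours
MAX_LABELS = 1000
MIN_CLASSES = 2


def check_video(uri, size_bytes, duration_seconds):
    """Return a list of problems for one video; an empty list means it passes."""
    problems = []
    extension = os.path.splitext(uri)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file is larger than 50 GB")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("video is longer than 3 hours")
    return problems


def check_labels(labels):
    """Return a list of problems for the set of distinct labels in a dataset."""
    distinct = set(labels)
    problems = []
    if len(distinct) < MIN_CLASSES:
        problems.append("at least two different classes are required")
    if len(distinct) > MAX_LABELS:
        problems.append("a dataset is limited to 1,000 labels")
    return problems
```

For example, `check_video("gs://demo/video1.mp4", 10**9, 600)` returns an empty list, while a `.mkv` URI yields one problem entry.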

Best practices for video data used to train AutoML models

The following practices apply to datasets used to train AutoML models.

  • The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry and low-resolution videos (such as from a security camera), your training data should be composed of blurry, low-resolution videos. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training videos.

  • Vertex AI models generally can't predict labels that humans can't assign. If a human can't be trained to assign labels by looking at the video for 1-2 seconds, the model likely can't be trained to do it either.

  • The model works best when there are at most 100 times more videos for the most common label than for the least common label. We recommend removing low-frequency labels. For video classification, the recommended number of training videos per label is about 1,000. The minimum per label is 10, or 50 for advanced models. In general, models with multiple labels per video need more examples per label to train, and the resulting scores are harder to interpret.
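The label-balance guidance above can be checked with a simple count. The following is a minimal sketch that assumes you have one label entry per training video; `label_balance_report` and its parameters are hypothetical names, with defaults taken from the numbers above (a 100:1 ratio and a minimum of 10 videos per label).

```python
from collections import Counter


def label_balance_report(labels, min_per_label=10, max_ratio=100):
    """Summarize how evenly videos are spread across labels.

    `labels` holds one entry per training video (repeat a label once per video).
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels provided")
    most_common = max(counts.values())
    least_common = min(counts.values())
    return {
        "counts": dict(counts),
        # Ratio of the most common label to the least common label.
        "ratio": most_common / least_common,
        "balanced": most_common <= max_ratio * least_common,
        # Labels below the minimum; candidates for removal or more data.
        "too_few": sorted(name for name, n in counts.items() if n < min_per_label),
    }
```

Pass `min_per_label=50` to apply the stricter minimum mentioned for advanced models.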

Schema files

  • Use the following publicly accessible schema file when creating the JSONL file for importing annotations. This schema file dictates the format of the data input files. The structure of the file follows the OpenAPI Schema format.

    Video classification schema file:


    Full schema file

    title: VideoClassification
    description: >
      Import and export format for importing/exporting videos together with
      classification annotations with time segment. Can be used in
      Dataset.import_schema_uri field.
    type: object
    required:
    - videoGcsUri
    properties:
      videoGcsUri:
        type: string
        description: >
          A Cloud Storage URI pointing to a video. Up to 50 GB in size and
          up to 3 hours in duration. Supported file mime types: `video/mp4`,
          `video/avi`, `video/quicktime`.
      timeSegmentAnnotations:
        type: array
        description: >
          Multiple classification annotations. Each on a time segment of the video.
        items:
          type: object
          description: Annotation with a time segment on media (e.g., video).
          properties:
            displayName:
              type: string
              description: >
                It will be imported as/exported from AnnotationSpec's display name.
            startTime:
              type: string
              description: >
                The start of the time segment. Expressed as a number of seconds as
                measured from the start of the video, with "s" appended at the end.
                Fractions are allowed, up to a microsecond precision.
              default: 0s
            endTime:
              type: string
              description: >
                The end of the time segment. Expressed as a number of seconds as
                measured from the start of the video, with "s" appended at the end.
                Fractions are allowed, up to a microsecond precision, and "Infinity"
                is allowed, which corresponds to the end of the video.
              default: Infinity
            annotationResourceLabels:
              description: Resource labels on the Annotation.
              type: object
              additionalProperties:
                type: string
      dataItemResourceLabels:
        description: Resource labels on the DataItem.
        type: object
        additionalProperties:
          type: string
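The schema describes time offsets as strings such as "1.5s", with "Infinity" allowed for the end of a segment. The following is a minimal sketch of a parser for that format; `parse_time_offset` and its `allow_infinity` flag are hypothetical names, and the strict pattern (digits, an optional fraction of up to six digits, then "s") is one reading of the description above, not a confirmed validation rule.

```python
import math
import re

# Whole seconds, an optional fraction of up to 6 digits (microseconds), then "s".
_OFFSET_PATTERN = re.compile(r"\d+(?:\.\d{1,6})?s")


def parse_time_offset(value, allow_infinity=False):
    """Convert a schema time string such as "1.5s" into seconds as a float.

    Set allow_infinity=True for endTime, where "Infinity" means the end of the video.
    """
    if value == "Infinity":
        if not allow_infinity:
            raise ValueError('"Infinity" is only valid for endTime')
        return math.inf
    if not _OFFSET_PATTERN.fullmatch(value):
        raise ValueError(f"not a valid time offset: {value!r}")
    return float(value[:-1])  # drop the trailing "s"
```

The schema defaults correspond to `parse_time_offset("0s")` for startTime and `parse_time_offset("Infinity", allow_infinity=True)` for endTime.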

Input files

The format of your training data for video classification is as follows.

To import your data, create either a JSONL or CSV file.


JSON on each line:
See Classification schema (global) file for details.

{
	"videoGcsUri": "gs://bucket/filename.ext",
	"timeSegmentAnnotations": [{
		"displayName": "LABEL",
		"startTime": "start_time_of_segment",
		"endTime": "end_time_of_segment"
	}],
	"dataItemResourceLabels": {
		"aiplatform.googleapis.com/ml_use": "training|test"
	}
}

Example JSONL - Video classification:

{"videoGcsUri": "gs://demo/video1.mp4", "timeSegmentAnnotations": [{"displayName": "cartwheel", "startTime": "1.0s", "endTime": "12.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}
{"videoGcsUri": "gs://demo/video2.mp4", "timeSegmentAnnotations": [{"displayName": "swing", "startTime": "4.0s", "endTime": "9.0s"}], "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "test"}}
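Records like these can be generated rather than written by hand. The following is a minimal sketch using only the standard `json` module; `make_record` is a hypothetical helper, and the resource-label key `aiplatform.googleapis.com/ml_use` is stated here as an assumption about the key Vertex AI uses for data splits.

```python
import json

# Assumption: the resource-label key that marks a data item's ML use.
ML_USE_KEY = "aiplatform.googleapis.com/ml_use"


def make_record(video_uri, segments, ml_use=None):
    """Build one JSONL line for a video.

    `segments` is a list of (label, start_seconds, end_seconds) tuples;
    an end of None means the segment runs to the end of the video.
    """
    record = {
        "videoGcsUri": video_uri,
        "timeSegmentAnnotations": [
            {
                "displayName": label,
                "startTime": f"{start}s",
                "endTime": "Infinity" if end is None else f"{end}s",
            }
            for label, start, end in segments
        ],
    }
    if ml_use is not None:
        record["dataItemResourceLabels"] = {ML_USE_KEY: ml_use}
    return json.dumps(record)


# Each call produces one line of the import file.
lines = [
    make_record("gs://demo/video1.mp4", [("cartwheel", 1.0, 12.0)], "training"),
    make_record("gs://demo/video2.mp4", [("swing", 4.0, 9.0)], "test"),
]
jsonl_text = "\n".join(lines) + "\n"
```

Using `json.dumps` for each record keeps every line valid JSON and avoids hand-escaping label names.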


Format of a row in the CSV:

[ML_USE,]VIDEO_URI,LABEL,START,END

List of columns

  1. ML_USE (Optional). For data split purposes when training a model. Use TRAINING or TEST.
  2. VIDEO_URI. This field contains the Cloud Storage URI for the video. Cloud Storage URIs are case-sensitive.
  3. LABEL. Labels must start with a letter and only contain letters, numbers, and underscores. You can specify multiple labels for a video by adding multiple rows in the CSV file that each identify the same video segment, with a different label for each row.
  4. START,END. These two columns, START and END, identify the start and end time, in seconds, of the video segment to analyze. The start time must be less than the end time. Both values must be non-negative and within the time range of the video. For example, 0.09845,1.36005. To use the entire content of the video, specify a start time of 0 and an end time of either the full length of the video or "inf". For example, 0,inf.
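The column rules above can be checked row by row before import. The following is a minimal sketch for labeled rows only; `parse_row` is a hypothetical helper, "letters" is assumed to mean ASCII letters, and the check that both times fall within the video's duration is left out because it requires the video's length.

```python
import math
import re

# Assumption: "letters" means ASCII letters.
LABEL_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
ML_USE_VALUES = {"TRAINING", "TEST"}


def parse_row(fields):
    """Validate one labeled CSV row, given as a list of 4 or 5 strings."""
    if len(fields) == 5:
        ml_use, video_uri, label, start_text, end_text = fields
        if ml_use not in ML_USE_VALUES:
            raise ValueError(f"ML_USE must be TRAINING or TEST, got {ml_use!r}")
    elif len(fields) == 4:
        ml_use = None  # the optional ML_USE column was omitted
        video_uri, label, start_text, end_text = fields
    else:
        raise ValueError(f"expected 4 or 5 columns, got {len(fields)}")

    if not video_uri.startswith("gs://"):
        raise ValueError(f"not a Cloud Storage URI: {video_uri!r}")
    if not LABEL_PATTERN.fullmatch(label):
        raise ValueError(f"invalid label: {label!r}")

    start = float(start_text)
    end = float(end_text)  # float() accepts "inf" for the end of the video
    if math.isnan(start) or math.isnan(end) or start < 0 or start >= end:
        raise ValueError(f"invalid segment: {start_text},{end_text}")

    return {"ml_use": ml_use, "video_uri": video_uri, "label": label,
            "start": start, "end": end}
```

Rows read with `csv.reader` can be passed to `parse_row` directly, since it yields each row as a list of strings.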

Example CSV - Classification using single label

Single-label on the same video segment:


Example CSV - multiple labels:

Multi-label on the same video segment:


Example CSV - no labels:

You can also provide videos in the data file without specifying any labels. You must then use the Google Cloud console to apply labels to your data before you train your model. To do so, provide only the Cloud Storage URI for the video followed by three commas, in the form VIDEO_URI,,,.
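The three kinds of rows described in this section (single label, multiple labels on the same segment, and no labels) can be produced with the standard `csv` module. The following is a minimal sketch; the bucket, file names, labels, and times are hypothetical placeholders, and `rows_to_csv` is an illustrative helper.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to CSV text with one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Single label on one segment, with the optional ML_USE column.
single_label = [["TRAINING", "gs://demo/video1.mp4", "cartwheel", 0, 5.4]]

# Multiple labels: one row per label, each naming the same segment.
multi_label = [
    ["gs://demo/video2.mp4", label, 0, 8.285]
    for label in ("swing", "jump", "spin")
]

# No labels: the URI followed by three empty columns.
unlabeled = [["gs://demo/video3.mp4", "", "", ""]]

csv_text = rows_to_csv(single_label + multi_label + unlabeled)
```

The unlabeled row serializes as `gs://demo/video3.mp4,,,`, which matches the URI-plus-three-commas form described above.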