Downloading, preprocessing, and uploading the COCO dataset

COCO is a large-scale object detection, segmentation, and captioning dataset. Machine learning models that use the COCO dataset include:

  • Mask-RCNN
  • RetinaNet
  • ShapeMask

Before you can train a model on a Cloud TPU, you must prepare the training data.

This topic describes how to prepare the COCO dataset for models that run on Cloud TPU. You can prepare the COCO dataset only after you have created a Compute Engine VM. The script used to prepare the data is installed on the VM and must be run on the VM.

After preparing the data by running the script, you can bring up the Cloud TPU and run the training.

Downloading, preprocessing, and uploading the COCO dataset to a Cloud Storage bucket takes approximately 2 hours in total.

  1. In your Cloud Shell, configure gcloud with your project ID.

    export PROJECT_ID=project-id
    gcloud config set project ${PROJECT_ID}
  2. In your Cloud Shell, create a Cloud Storage bucket using the following command:

    gsutil mb -p ${PROJECT_ID} -c standard -l europe-west4 gs://bucket-name
  3. Launch a Compute Engine VM instance.

    This VM instance is used only to download and preprocess the COCO dataset. Replace instance-name with a name of your choosing.

    $ gcloud compute tpus execution-groups create \
     --vm-only \
     --name=instance-name \
     --zone=europe-west4-a \
     --disk-size=300 \
     --machine-type=n1-standard-16

    Command flag descriptions

    --vm-only
        Create a VM only. By default, the gcloud compute tpus execution-groups command creates both a VM and a Cloud TPU.
    --name
        The name of the Cloud TPU to create; with --vm-only, it names the VM instance.
    --zone
        The zone where you plan to create your Cloud TPU.
    --disk-size
        The size, in GB, of the hard disk of the VM created by the gcloud compute tpus execution-groups command.
    --machine-type
        The machine type of the Compute Engine VM to create.
    --tf-version (not set in the command above)
        The version of TensorFlow that gcloud compute tpus execution-groups installs on the VM.
  4. If you are not automatically logged in to the Compute Engine instance, log in by running the following ssh command. When you are logged into the VM, your shell prompt changes from username@projectname to username@vm-name:

      $ gcloud compute ssh instance-name --zone=europe-west4-a

  5. Set up two variables, one for the storage bucket you created earlier and one for the directory that holds the training data (DATA_DIR) on the storage bucket.

    (vm)$ export STORAGE_BUCKET=gs://bucket-name
    (vm)$ export DATA_DIR=${STORAGE_BUCKET}/coco
  6. Install the packages needed to pre-process the data.

    (vm)$ sudo apt-get install -y python3-tk && \
      pip3 install --user Cython matplotlib opencv-python-headless pyyaml Pillow && \
      pip3 install --user "git+"
  7. Run the script to convert the COCO dataset into a set of TFRecords (*.tfrecord) that the training application expects.

    (vm)$ git clone
    (vm)$ sudo bash tpu/tools/datasets/ ./data/dir/coco

    This installs the required libraries and then runs the preprocessing script. It outputs a number of *.tfrecord files in your local data directory. The COCO download and conversion script takes approximately 1 hour to complete.

  8. Copy the data to your Cloud Storage bucket.

    After you convert the data into TFRecords, copy them from local storage to your Cloud Storage bucket using the gsutil command. You must also copy the annotation files. These files help validate the model's performance.

    (vm)$ gsutil -m cp ./data/dir/coco/*.tfrecord ${DATA_DIR}
    (vm)$ gsutil cp ./data/dir/coco/raw-data/annotations/*.json ${DATA_DIR}
  9. Clean up the VM resources.

    Once the COCO dataset has been converted to TFRecords and copied to the DATA_DIR on your Cloud Storage bucket, you can delete the Compute Engine instance.

    Disconnect from the Compute Engine instance:

    (vm)$ exit

    Your prompt should now be username@projectname, showing you are in the Cloud Shell.

  10. Delete your Compute Engine instance.

      $ gcloud compute instances delete instance-name --zone=europe-west4-a
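
Because deleting the VM also discards the local copy of the converted data, it can be worth confirming the upload before step 9. The following is a minimal sketch, not part of the original procedure: the helper names count_local, count_remote, and verify_upload are hypothetical, and it assumes the TFRecords sit directly in the local output directory and directly under DATA_DIR, as in steps 7 and 8. It only compares file counts; it does not check file contents.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check the converted dataset before deleting the VM.
# All function names here are illustrative assumptions, not tutorial tooling.

# Count regular files in a directory (non-recursive) whose names match a glob.
# Prints 0 if the directory does not exist.
count_local() {
  local dir="$1" pattern="$2"
  find "$dir" -maxdepth 1 -type f -name "$pattern" 2>/dev/null | wc -l | tr -d ' '
}

# Count objects under a Cloud Storage prefix that match a wildcard pattern.
# Prints 0 when gsutil reports no matching objects.
count_remote() {
  local prefix="$1" pattern="$2"
  gsutil ls "${prefix}/${pattern}" 2>/dev/null | wc -l | tr -d ' '
}

# Compare local and remote TFRecord counts.
# Returns 0 on a match, 1 on a problem, 2 if gsutil is unavailable.
verify_upload() {
  local local_dir="$1" data_dir="$2"
  local n_local n_remote
  n_local=$(count_local "$local_dir" '*.tfrecord')
  if [ "$n_local" -eq 0 ]; then
    echo "No *.tfrecord files found in ${local_dir}" >&2
    return 1
  fi
  if ! command -v gsutil >/dev/null 2>&1; then
    echo "gsutil not found; only local files were counted (${n_local})" >&2
    return 2
  fi
  n_remote=$(count_remote "$data_dir" '*.tfrecord')
  if [ "$n_local" -ne "$n_remote" ]; then
    echo "Mismatch: ${n_local} local vs ${n_remote} in ${data_dir}" >&2
    return 1
  fi
  echo "OK: ${n_local} TFRecord files present in ${data_dir}"
}
```

On the VM, after step 8, the check could be run as verify_upload ./data/dir/coco "${DATA_DIR}". A non-zero exit status suggests re-running the gsutil copy before you delete the instance.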