Downloading, preprocessing, and uploading the COCO dataset
COCO is a large-scale object detection, segmentation, and captioning dataset. Machine learning models that use the COCO dataset include:
- Mask-RCNN
- RetinaNet
- ShapeMask
Before you can train a model on a Cloud TPU, you must prepare the training data.
This document describes how to prepare the COCO dataset for models that run on Cloud TPU. The COCO dataset can only be prepared after you have created a Compute Engine VM. The script used to prepare the data, download_and_preprocess_coco.sh, is installed on the VM and must be run on the VM.
After preparing the data by running the download_and_preprocess_coco.sh script, you can bring up the Cloud TPU and run the training.
Downloading, preprocessing, and uploading the COCO dataset to a Cloud Storage bucket takes approximately 2 hours.
In your Cloud Shell, configure gcloud with your project ID and zone, and set a variable for the name of your TPU:
export PROJECT_ID=project-id
gcloud config set project ${PROJECT_ID}
export ZONE=zone
gcloud config set compute/zone ${ZONE}
export TPU_NAME=tpu-name
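Optionally, you can confirm the active configuration before creating any resources. These are standard gcloud commands that print the currently set project and zone:
gcloud config list
gcloud config get-value project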
In your Cloud Shell, create a Cloud Storage bucket using the following command:
gcloud storage buckets create gs://bucket-name --project=${PROJECT_ID} --location=us-central2
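To confirm that the bucket exists in the expected location, you can optionally describe it. This is a standard gcloud storage command; bucket-name is the same placeholder used above:
gcloud storage buckets describe gs://bucket-name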
Create a Compute Engine VM to download and preprocess the dataset. For more information, see Create and start a Compute Engine instance.
$ gcloud compute instances create vm-name \
    --zone=us-central2-b \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud \
    --machine-type=n1-standard-16 \
    --boot-disk-size=300GB \
    --scopes=https://www.googleapis.com/auth/cloud-platform
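Before connecting, you can optionally check that the VM has started. This uses the standard gcloud compute instances describe command with the same vm-name and zone as above; a running VM reports the status RUNNING:
$ gcloud compute instances describe vm-name \
    --zone=us-central2-b \
    --format="value(status)"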
Connect to the Compute Engine VM using SSH:
$ gcloud compute ssh vm-name --zone=us-central2-b
When you connect to the VM, your shell prompt changes from username@projectname to username@vm-name.
Set up two variables, one for the storage bucket you created earlier and one for the directory that holds the training data (DATA_DIR) on the storage bucket.
(vm)$ export STORAGE_BUCKET=gs://bucket-name
(vm)$ export DATA_DIR=${STORAGE_BUCKET}/coco
Install the packages needed to pre-process the data.
(vm)$ sudo apt-get update && \
  sudo apt-get install python3-pip && \
  sudo apt-get install -y python3-tk && \
  pip3 install --user Cython matplotlib opencv-python-headless pyyaml Pillow numpy absl-py tensorflow && \
  pip3 install --user "git+https://github.com/cocodataset/cocoapi#egg=pycocotools&subdirectory=PythonAPI" && \
  pip3 install protobuf==3.20.0 tensorflow==2.11.0
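Optionally, you can confirm that the main Python dependencies import cleanly before running the preprocessing script. This is an informal sanity check, not part of the official procedure:
(vm)$ python3 -c "import tensorflow as tf; print(tf.__version__)"
(vm)$ python3 -c "import pycocotools; print('pycocotools OK')"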
Run the download_and_preprocess_coco.sh script to convert the COCO dataset into a set of TFRecord files (*.tfrecord) that the training application expects.
(vm)$ sudo chown $USER /home/$USER/.config
(vm)$ git clone https://github.com/tensorflow/tpu.git
(vm)$ sudo -E bash tpu/tools/datasets/download_and_preprocess_coco.sh ./data/dir/coco
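When the script finishes, you can optionally spot-check the generated files before uploading them. The ./data/dir/coco path matches the argument passed to the script above; the exact file names depend on the script version:
(vm)$ ls -lh ./data/dir/coco/*.tfrecord | head
(vm)$ ls ./data/dir/coco/raw-data/annotations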
This installs the required libraries and then runs the preprocessing script. It outputs *.tfrecord files in your local data directory. The COCO download and conversion script takes approximately one hour to complete.
Copy the data to your Cloud Storage bucket.
After you convert the data into the TFRecord format, copy the data from local storage to your Cloud Storage bucket using the gcloud CLI. You must also copy the annotation files. These files help validate the model's performance.
(vm)$ gcloud storage cp ./data/dir/coco/*.tfrecord ${DATA_DIR}
(vm)$ gcloud storage cp ./data/dir/coco/raw-data/annotations/*.json ${DATA_DIR}
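To verify that the upload completed, you can list the contents of the data directory in your bucket. gcloud storage ls is a standard command; DATA_DIR is the variable you set earlier:
(vm)$ gcloud storage ls ${DATA_DIR}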
Clean up
When you are done with your TPU VM, follow these steps to clean up your Cloud TPU resources.
Disconnect from the Compute Engine VM:
(vm)$ exit
Delete your Cloud TPU and Compute Engine resources.
$ gcloud compute tpus tpu-vm delete ${TPU_NAME} \
    --zone=${ZONE} \
    --project=${PROJECT_ID}
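The previous command deletes the Cloud TPU. To also delete the Compute Engine VM you created for preprocessing, you can run a separate delete command with the same vm-name and zone used when the VM was created:
$ gcloud compute instances delete vm-name \
    --zone=us-central2-b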
Verify that the resources have been deleted by listing your TPUs with gcloud compute tpus tpu-vm list. The deletion might take several minutes. The output of the command shouldn't include any of the resources created in this tutorial.
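For example, assuming the TPU was created as a TPU VM (matching the delete command above), the following listing should return no results for your zone once deletion is complete:
$ gcloud compute tpus tpu-vm list --zone=${ZONE}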