Training NCF on Cloud TPU (TF 2.x)


Overview

This tutorial uses an implementation of the Neural Collaborative Filtering (NCF) framework with a Neural Matrix Factorization (NeuMF) model, as described in the Neural Collaborative Filtering paper. The implementation is based on the authors' NCF code and the Stanford implementation in the MLPerf repository.

NCF is a general framework for collaborative filtering in which a neural network models user-item interactions. Instead of the fixed inner product used by matrix factorization, NCF uses a multi-layer perceptron that can learn an arbitrary interaction function from data.

Two instantiations of NCF are Generalized Matrix Factorization (GMF) and Multi-Layer Perceptron (MLP). GMF applies a linear kernel to model the latent feature interactions, and MLP uses a nonlinear kernel to learn the interaction function from data. NeuMF fuses GMF and MLP to better model complex user-item interactions, combining the linearity of MF with the non-linearity of MLP when modeling the user-item latent structures. NeuMF lets GMF and MLP learn separate embeddings, and combines the two models by concatenating their last hidden layers. neumf_model.py defines the architecture details.
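
The fusion described above can be sketched in plain Python. This is only an illustrative forward pass, not the code in neumf_model.py: the embedding size, layer widths, random weights, and helper names (dense_relu, neumf_score) are all assumptions made for the example.

```python
import math
import random


def dense_relu(vec, weights, bias):
    """One fully connected layer followed by ReLU; weights has one row per output unit."""
    return [max(0.0, sum(w * x for w, x in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]


def neumf_score(gmf_user, gmf_item, mlp_user, mlp_item,
                mlp_layers, out_weights, out_bias):
    # GMF branch: element-wise product of its own user and item embeddings.
    gmf_vec = [u * i for u, i in zip(gmf_user, gmf_item)]
    # MLP branch: concatenate its separate embeddings, then apply ReLU layers.
    hidden = mlp_user + mlp_item
    for weights, bias in mlp_layers:
        hidden = dense_relu(hidden, weights, bias)
    # Fusion: concatenate the last hidden layer of each branch, then one logit.
    fused = gmf_vec + hidden
    logit = sum(w * x for w, x in zip(out_weights, fused)) + out_bias
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def rand_vec(n):
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]


def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]


EMBED_DIM = 4  # hypothetical embedding size, chosen only for illustration
mlp_layers = [
    (rand_mat(8, 2 * EMBED_DIM), [0.0] * 8),
    (rand_mat(4, 8), [0.0] * 4),
]
score = neumf_score(rand_vec(EMBED_DIM), rand_vec(EMBED_DIM),
                    rand_vec(EMBED_DIM), rand_vec(EMBED_DIM),
                    mlp_layers, rand_vec(EMBED_DIM + 4), 0.0)
print(f"predicted interaction probability: {score:.4f}")
```

Because each branch has its own embeddings, the GMF and MLP parts can use different embedding sizes; only the concatenated vector has to match the output layer.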

The following instructions assume you are familiar with training a model on Cloud TPU. If you are new to Cloud TPU, refer to Get started for a basic introduction.

Dataset

The MovieLens datasets are used for model training and evaluation. We use two datasets: ml-1m (short for MovieLens 1 million) and ml-20m (short for MovieLens 20 million).

ml-1m

The ml-1m dataset contains 1,000,209 anonymous ratings of approximately 3,706 movies made by 6,040 users who joined MovieLens in 2000. All ratings are contained in the file "ratings.dat" without a header row, and are in the following format:

UserID::MovieID::Rating::Timestamp

  • UserIDs range between 1 and 6040.
  • MovieIDs range between 1 and 3952.
  • Ratings are made on a 5-star scale (whole-star ratings only).
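
As a minimal sketch of reading this format, the following Python splits one line on the "::" delimiter and checks the ranges listed above. The Rating type, the parse_ml1m_line helper, and the sample line are illustrative assumptions, not part of the tutorial's preprocessing code.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int      # whole stars only in ml-1m
    timestamp: int   # seconds since the Unix epoch


def parse_ml1m_line(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line from ratings.dat."""
    user_id, movie_id, rating, timestamp = (int(f) for f in line.strip().split("::"))
    # Range checks taken from the dataset description above.
    if not 1 <= user_id <= 6040:
        raise ValueError(f"UserID out of range: {user_id}")
    if not 1 <= movie_id <= 3952:
        raise ValueError(f"MovieID out of range: {movie_id}")
    if not 1 <= rating <= 5:
        raise ValueError(f"Rating out of range: {rating}")
    return Rating(user_id, movie_id, rating, timestamp)


# Illustrative sample line in the documented format.
parsed = parse_ml1m_line("1::1193::5::978300760\n")
print(parsed)
```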

ml-20m

The ml-20m dataset contains 20,000,263 ratings of 26,744 movies by 138,493 users. All ratings are contained in the file "ratings.csv". Each line of this file after the header row represents a single user's rating of a movie, and has the following format:

userId,movieId,rating,timestamp

The lines within this file are ordered first by userId, then, within user, by movieId. Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars). In both datasets, the timestamp is represented in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970. Each user has at least 20 ratings.
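
The ml-20m layout can be read with Python's standard csv module, as in the hedged sketch below. The read_ml20m helper and the two sample rows are assumptions for illustration; the timestamp conversion follows the "seconds since January 1, 1970 UTC" definition given above.

```python
import csv
import io
from datetime import datetime, timezone

# Two made-up rows in the documented header/column layout.
sample_csv = (
    "userId,movieId,rating,timestamp\n"
    "1,2,3.5,1112486027\n"
    "1,29,3.5,1112484676\n"
)


def read_ml20m(fileobj):
    """Yield one dict per rating row, with typed fields and a UTC datetime."""
    for row in csv.DictReader(fileobj):  # DictReader consumes the header row
        yield {
            "userId": int(row["userId"]),
            "movieId": int(row["movieId"]),
            "rating": float(row["rating"]),  # half-star increments, 0.5 to 5.0
            "time": datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc),
        }


rows = list(read_ml20m(io.StringIO(sample_csv)))
for r in rows:
    print(r["userId"], r["movieId"], r["rating"], r["time"].isoformat())
```

For the real file, pass an open handle to ratings.csv instead of the in-memory sample.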

Objectives

  • Create a Cloud Storage bucket to hold your dataset and model output
  • Prepare the MovieLens dataset
  • Set up a Compute Engine VM and Cloud TPU node for training and evaluation
  • Run training and evaluation

Costs

In this document, you use the following billable components of Google Cloud:

  • Compute Engine
  • Cloud TPU
  • Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

Before starting this tutorial, check that your Google Cloud project is correctly set up.

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. This walkthrough uses billable components of Google Cloud. Check the Cloud TPU pricing page to estimate your costs. Be sure to clean up resources you create when you've finished with them to avoid unnecessary charges.

Set up your resources

This section provides information on setting up Cloud Storage, VM, and Cloud TPU resources for this tutorial.

  1. Open a Cloud Shell window.

    Open Cloud Shell

  2. Create an environment variable for your project's ID.

    export PROJECT_ID=project-id
  3. Configure Google Cloud CLI to use the project where you want to create the Cloud TPU.

    gcloud config set project ${PROJECT_ID}
    

    The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make API calls with your credentials.

  4. Create a Service Account for the Cloud TPU project.

    gcloud beta services identity create --service tpu.googleapis.com --project $PROJECT_ID
    

    The command returns a Cloud TPU Service Account with the following format:

    service-PROJECT_NUMBER@cloud-tpu.iam.gserviceaccount.com
    
  5. Create a Cloud Storage bucket using the following command:

    gcloud storage buckets create gs://bucket-name --project=${PROJECT_ID} --location=europe-west4
    

    This Cloud Storage bucket stores the data you use to train your model and the training results. The gcloud command used in this tutorial to set up the TPU also sets up default permissions for the Cloud TPU Service Account you set up in the previous step. If you want finer-grained permissions, review the access level permissions.

    The bucket location must be in the same region as your TPU VM. TPU VMs are located in specific zones, which are subdivisions within a region.

  6. Create a Cloud TPU VM.

    $ gcloud compute tpus tpu-vm create ncf-tutorial \
      --zone=europe-west4-a \
      --accelerator-type=v3-8 \
      --version=tpu-vm-tf-2.17.0-pjrt
    

    Command flag descriptions

    zone
    The zone where you plan to create your Cloud TPU.
    accelerator-type
    The accelerator type specifies the version and size of the Cloud TPU you want to create. For more information about supported accelerator types for each TPU version, see TPU versions.
    version
    The Cloud TPU software version.

    For more information on the gcloud command, see the gcloud Reference.

  7. Connect to the Compute Engine instance using SSH. When you are connected to the VM, your shell prompt changes from username@projectname to username@vm-name:

    gcloud compute tpus tpu-vm ssh ncf-tutorial --zone=europe-west4-a
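
The steps above require the bucket location to be in the same region as the TPU VM's zone. As a small sketch of that relationship, assuming zone names follow the "region plus a final letter suffix" pattern seen in europe-west4-a, a check might look like this (region_of and bucket_matches_tpu are hypothetical helper names):

```python
def region_of(zone: str) -> str:
    """Return the region part of a zone name, e.g. 'europe-west4-a' -> 'europe-west4'."""
    region, _, _suffix = zone.rpartition("-")
    return region


def bucket_matches_tpu(bucket_location: str, tpu_zone: str) -> bool:
    """True when a regional bucket location equals the region of the TPU zone."""
    return bucket_location.lower() == region_of(tpu_zone).lower()


print(region_of("europe-west4-a"))
print(bucket_matches_tpu("europe-west4", "europe-west4-a"))
```

This only compares names; it does not query Google Cloud.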
    

Prepare the data

  1. Add an environment variable for your storage bucket. Replace bucket-name with your bucket name.

    (vm)$ export STORAGE_BUCKET=gs://bucket-name
    
  2. Add an environment variable for the data directory.

    (vm)$ export DATA_DIR=${STORAGE_BUCKET}/ncf_data
    
  3. Set up the model location and set the PYTHONPATH environment variable.

    (vm)$ git clone https://github.com/tensorflow/models.git
    (vm)$ pip3 install -r models/official/requirements.txt
    
    (vm)$ export PYTHONPATH="${PWD}/models:${PYTHONPATH}"
    
  4. Change to the directory that stores the model processing files:

      (vm)$ cd ~/models/official/recommendation
    
  5. Generate training and evaluation data for the ml-20m dataset in DATA_DIR:

    (vm)$ python3 create_ncf_data.py \
        --dataset ml-20m \
        --num_train_epochs 4 \
        --meta_data_file_path ${DATA_DIR}/metadata \
        --eval_prebatch_size 160000 \
        --data_dir ${DATA_DIR}
    

This script generates and preprocesses the dataset on your VM. Preprocessing converts the data into the TFRecord format required by the model. The download and preprocessing take approximately 25 minutes and generate output similar to the following:

I0804 23:03:02.370002 139664166737728 movielens.py:124] Successfully downloaded /tmp/tmpicajrlfc/ml-20m.zip 198702078 bytes
I0804 23:04:42.665195 139664166737728 data_preprocessing.py:223] Beginning data preprocessing.
I0804 23:04:59.084554 139664166737728 data_preprocessing.py:84] Generating user_map and item_map...
I0804 23:05:20.934210 139664166737728 data_preprocessing.py:103] Sorting by user, timestamp...
I0804 23:06:39.859857 139664166737728 data_preprocessing.py:194] Writing raw data cache.
I0804 23:06:42.375952 139664166737728 data_preprocessing.py:262] Data preprocessing complete. Time: 119.7 sec.
<BisectionDataConstructor(Thread-1, initial daemon)>
General:
  Num users: 138493
  Num items: 26744

Training:
  Positive count:          19861770
  Batch size:              99000
  Batch count per epoch:   1004

Eval:
  Positive count:          138493
  Batch size:              160000
  Batch count per epoch:   866

I0804 23:07:14.137242 139664166737728 data_pipeline.py:887] Negative total vector built. Time: 31.8 seconds
I0804 23:11:25.013135 139664166737728 data_pipeline.py:588] Epoch construction complete. Time: 250.9 seconds
I0804 23:15:46.391308 139664166737728 data_pipeline.py:674] Eval construction complete. Time: 261.4 seconds
I0804 23:19:54.345858 139664166737728 data_pipeline.py:588] Epoch construction complete. Time: 248.0 seconds
I0804 23:24:09.182484 139664166737728 data_pipeline.py:588] Epoch construction complete. Time: 254.8 seconds
I0804 23:28:26.224653 139664166737728 data_pipeline.py:588] Epoch construction complete. Time: 257.0 seconds
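
The batch counts in the summary above can be reproduced with simple arithmetic. The sketch below is an interpretation, not something this tutorial states: it assumes one rating per user is held out for evaluation, 4 sampled negatives per training positive, and 999 negatives per evaluation positive. Those assumed values happen to match the logged numbers.

```python
TOTAL_RATINGS = 20_000_263
NUM_USERS = 138_493
TRAIN_NEGATIVES_PER_POSITIVE = 4    # assumption, not stated in this tutorial
EVAL_NEGATIVES_PER_POSITIVE = 999   # assumption, not stated in this tutorial


def batches_per_epoch(positives: int, negatives_per_positive: int, batch_size: int) -> int:
    """Ceiling of (positives plus their sampled negatives) divided by the batch size."""
    examples = positives * (1 + negatives_per_positive)
    return -(-examples // batch_size)  # integer ceiling division


# One held-out rating per user leaves the logged training positive count.
train_positives = TOTAL_RATINGS - NUM_USERS
train_batches = batches_per_epoch(train_positives, TRAIN_NEGATIVES_PER_POSITIVE, 99_000)
eval_batches = batches_per_epoch(NUM_USERS, EVAL_NEGATIVES_PER_POSITIVE, 160_000)

print(train_positives, train_batches, eval_batches)
```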

Set up and start training the Cloud TPU

  1. Set the Cloud TPU name variable.

      (vm)$ export TPU_NAME=local
    

Run the training and evaluation

The following steps run a sample training for 3 epochs.

  1. Add an environment variable for the model directory, where checkpoints and TensorBoard summaries are saved:

    (vm)$ export MODEL_DIR=${STORAGE_BUCKET}/ncf
    
  2. When creating your TPU, if you set the --version parameter to a version ending with -pjrt, set the following environment variables to enable the PJRT runtime:

      (vm)$ export NEXT_PLUGGABLE_DEVICE_USE_C_API=true
      (vm)$ export TF_PLUGGABLE_DEVICE_LIBRARY_PATH=/lib/libtpu.so
    
  3. Run the following command to train the NCF model:

    (vm)$ python3 ncf_keras_main.py \
         --model_dir=${MODEL_DIR} \
         --data_dir=${DATA_DIR} \
         --train_dataset_path=${DATA_DIR}/training_cycle_*/* \
         --eval_dataset_path=${DATA_DIR}/eval_data/* \
         --input_meta_data_path=${DATA_DIR}/metadata \
         --train_epochs=3 \
         --dataset=ml-20m \
         --eval_batch_size=160000 \
         --learning_rate=0.00382059 \
         --beta1=0.783529 \
         --beta2=0.909003 \
         --epsilon=1.45439e-07 \
         --num_factors=64 \
         --hr_threshold=0.635 \
         --keras_use_ctl=true \
         --layers=256,256,128,64 \
         --use_synthetic_data=false \
         --distribution_strategy=tpu \
         --download_if_missing=false
     

The training and evaluation take about 2 minutes and generate final output similar to:

Result is {'loss': <tf.Tensor: shape=(), dtype=float32, numpy=0.10950611>,
'train_finish_time': 1618016422.1377568, 'avg_exp_per_second': 3062557.5070816963}
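
The training command passes --hr_threshold, which refers to hit rate (HR), a ranking metric used in the NCF paper's evaluation. As a rough sketch of how such a metric can be computed, assuming one held-out item per user and a top-10 cutoff (neither is stated in this tutorial, and this is not the code path used by ncf_keras_main.py):

```python
def hit_rate_at_k(ranked_lists, held_out_items, k=10):
    """Fraction of users whose held-out item appears in their top-k ranked items."""
    if len(ranked_lists) != len(held_out_items):
        raise ValueError("need exactly one ranked list per held-out item")
    hits = sum(1 for ranked, target in zip(ranked_lists, held_out_items)
               if target in ranked[:k])
    return hits / len(held_out_items)


# Toy data: item IDs ranked by predicted score for three hypothetical users.
ranked = [
    [5, 9, 2, 7],   # held-out item 9 is ranked second: hit at k=2
    [1, 3, 8, 4],   # held-out item 4 is ranked fourth: miss at k=2
    [6, 2, 0, 1],   # held-out item 6 is ranked first: hit at k=2
]
held_out = [9, 4, 6]

hr = hit_rate_at_k(ranked, held_out, k=2)
print(f"HR@2 = {hr:.3f}")
```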

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  1. Disconnect from the Compute Engine instance, if you have not already done so:

    (vm)$ exit
    

    Your prompt should now be username@projectname, showing you are in the Cloud Shell.

  2. Delete your Cloud TPU resources.

      $ gcloud compute tpus tpu-vm delete ncf-tutorial \
        --zone=europe-west4-a
    
  3. Verify the resources have been deleted by running gcloud compute tpus tpu-vm list. The deletion might take several minutes. A response like the following indicates your instances have been successfully deleted.

      $ gcloud compute tpus tpu-vm list \
        --zone=europe-west4-a
    
    Listed 0 items.
    
  4. Run the gcloud CLI as shown, replacing bucket-name with the name of the Cloud Storage bucket you created for this tutorial:

    $ gcloud storage rm gs://bucket-name --recursive
    

What's next

The TensorFlow Cloud TPU tutorials generally train the model using a sample dataset. The results of this training are not usable for inference. To use a model for inference, you can train the model on a publicly available dataset or your own dataset. TensorFlow models trained on Cloud TPUs generally require datasets to be in TFRecord format.

You can use the dataset conversion tool sample to convert an image classification dataset into TFRecord format. If you are not using an image classification model, you will have to convert your dataset to TFRecord format yourself. For more information, see TFRecord and tf.Example.

Hyperparameter tuning

To improve the model's performance with your dataset, you can tune the model's hyperparameters. You can find information about hyperparameters common to all TPU supported models on GitHub. Information about model-specific hyperparameters can be found in the source code for each model. For more information on hyperparameter tuning, see Overview of hyperparameter tuning and Tune hyperparameters.

Inference

Once you have trained your model, you can use it for inference (also called prediction). You can use the Cloud TPU inference converter tool to prepare and optimize a TensorFlow model for inference on Cloud TPU v5e. For more information about inference on Cloud TPU v5e, see Cloud TPU v5e inference introduction.