This tutorial shows you how to serve a pre-trained PyTorch machine learning (ML) model on a GKE cluster using the TorchServe framework for scalable serving. The ML model used in this tutorial generates predictions based on user requests. You can use the information in this tutorial to help you to deploy and serve your own models at scale on GKE.
About the tutorial application
The application is a small Python web application created using the Fast Dash framework. You use the application to send prediction requests to the T5 model. This application captures user text inputs and language pairs and sends the information to the model. The model translates the text and returns the result to the application, which displays the result to the user. For more information about Fast Dash, see the Fast Dash documentation.
How it works
This tutorial deploys the workloads on a GKE Autopilot cluster. GKE fully manages Autopilot nodes, which reduces administrative overhead for node configuration, scaling, and upgrades. When you deploy the ML workload and application on Autopilot, GKE chooses the correct underlying machine type and size to run the workloads. For more information, see the Autopilot overview.
After you deploy the model, you get a prediction URL that your application can use to send prediction requests to the model. This method decouples the model from the application, allowing the model to scale independently of the web application.
Objectives
- Prepare a pre-trained T5 model from the Hugging Face repository for serving by packaging it as a container image and pushing it to Artifact Registry
- Deploy the model to an Autopilot cluster
- Deploy the Fast Dash application that communicates with the model
- Autoscale the model based on Prometheus metrics
Costs
In this document, you use the following billable components of Google Cloud: GKE, Cloud Storage, Artifact Registry, and Cloud Build.
To generate a cost estimate based on your projected usage, use the pricing calculator.
When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:

  gcloud init

- Create or select a Google Cloud project:

  - Create a Google Cloud project:

    gcloud projects create PROJECT_ID

    Replace PROJECT_ID with a name for the Google Cloud project that you are creating.

  - Select the Google Cloud project that you created:

    gcloud config set project PROJECT_ID

    Replace PROJECT_ID with your Google Cloud project name.

- Make sure that billing is enabled for your Google Cloud project.
- Enable the Kubernetes Engine, Cloud Storage, Artifact Registry, and Cloud Build APIs:

  gcloud services enable container.googleapis.com \
      storage.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com
Prepare the environment
Clone the example repository and open the tutorial directory:
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
cd kubernetes-engine-samples/ai-ml/t5-model-serving
Create the cluster
Run the following command:
gcloud container clusters create-auto ml-cluster \
--release-channel=RELEASE_CHANNEL \
--cluster-version=CLUSTER_VERSION \
--location=us-central1
Replace the following:
- RELEASE_CHANNEL: the release channel for your cluster. Must be one of rapid, regular, or stable. Choose a channel that has GKE version 1.28.3-gke.1203000 or later to use L4 GPUs. To see the versions available in a specific channel, see View the default and available versions for release channels.
- CLUSTER_VERSION: the GKE version to use. Must be 1.28.3-gke.1203000 or later.
This operation takes several minutes to complete.
Create an Artifact Registry repository
Create a new Artifact Registry standard repository with the Docker format in the same region as your cluster:
gcloud artifacts repositories create models \
    --repository-format=docker \
    --location=us-central1 \
    --description="Repo for T5 serving image"
Verify the repository name:
gcloud artifacts repositories describe models \
    --location=us-central1
The output is similar to the following:
Encryption: Google-managed key
Repository Size: 0.000MB
createTime: '2023-06-14T15:48:35.267196Z'
description: Repo for T5 serving image
format: DOCKER
mode: STANDARD_REPOSITORY
name: projects/PROJECT_ID/locations/us-central1/repositories/models
updateTime: '2023-06-14T15:48:35.267196Z'
Package the model
In this section, you package the model and the serving framework in a single container image using Cloud Build and push the resulting image to the Artifact Registry repository.
Review the Dockerfile for the container image:
This Dockerfile defines the following multi-stage build process:
- Download the model artifacts from the Hugging Face repository.
- Package the model using the Torch model archiver tool (torch-model-archiver). This creates a model archive (.mar) file that the inference server uses to load the model.
- Build the final image with TorchServe.
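The Dockerfile in the example repository is the source of truth for this build. As a rough, illustrative sketch only (the base images, file names, and handler script here are assumptions, not the tutorial's actual contents), the three stages might look like the following:

```dockerfile
# Stage 1: download the model artifacts from the Hugging Face repository.
FROM alpine/git AS downloader
RUN git clone --depth 1 https://huggingface.co/t5-small /model

# Stage 2: package the model into a .mar archive with torch-model-archiver.
FROM pytorch/torchserve AS archiver
COPY --from=downloader /model /home/model-server/model
RUN torch-model-archiver \
        --model-name t5-small \
        --version 1.0 \
        --handler custom_handler.py \
        --extra-files /home/model-server/model \
        --export-path /home/model-server/model-store

# Stage 3: final serving image containing only TorchServe and the archive.
FROM pytorch/torchserve
COPY --from=archiver /home/model-server/model-store /home/model-server/model-store
CMD ["torchserve", "--start", "--models", "t5-small=t5-small.mar", "--foreground"]
```

The multi-stage layout keeps the Git history and build tooling out of the final image, so only the model archive and the serving runtime are shipped.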
Build and push the image using Cloud Build:
gcloud builds submit model/ \
    --region=us-central1 \
    --config=model/cloudbuild.yaml \
    --substitutions=_LOCATION=us-central1,_MACHINE=gpu,_MODEL_NAME=t5-small,_MODEL_VERSION=1.0
The build process takes several minutes to complete. If you use a larger model size than t5-small, the build process might take significantly more time.

Check that the image is in the repository:
gcloud artifacts docker images list us-central1-docker.pkg.dev/PROJECT_ID/models
Replace PROJECT_ID with your Google Cloud project ID.

The output is similar to the following:
IMAGE                                                  DIGEST         CREATE_TIME          UPDATE_TIME
us-central1-docker.pkg.dev/PROJECT_ID/models/t5-small  sha256:0cd...  2023-06-14T12:06:38  2023-06-14T12:06:38
Deploy the packaged model to GKE
To deploy the image, modify the Kubernetes manifest in the example repository to match your environment.
Review the manifest for the inference workload:
Replace PROJECT_ID with your Google Cloud project ID. In the sed expression, substitute your project ID for the second PROJECT_ID; the first matches the placeholder in the manifest:

sed -i "s/PROJECT_ID/PROJECT_ID/g" "kubernetes/serving-gpu.yaml"
This ensures that the container image path in the Deployment specification matches the path to your T5 model image in Artifact Registry.
Create the Kubernetes resources:
kubectl create -f kubernetes/serving-gpu.yaml
To verify that the model deployed successfully, do the following:
Get the status of the Deployment and the Service:
kubectl get -f kubernetes/serving-gpu.yaml
Wait until the output shows ready Pods, similar to the following. Depending on the size of the image, the first image pull might take several minutes.
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/t5-inference   1/1     1            0           66s

NAME                   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
service/t5-inference   ClusterIP   10.48.131.86   <none>        8080/TCP,8081/TCP,8082/TCP   66s
Open a local port for the t5-inference Service:

kubectl port-forward svc/t5-inference 8080
Open a new terminal window and send a test request to the Service:
curl -v -X POST -H 'Content-Type: application/json' -d '{"text": "this is a test sentence", "from": "en", "to": "fr"}' "http://localhost:8080/predictions/t5-small/1.0"
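The same prediction request can also be sent from Python. The following sketch builds the JSON body that this tutorial's handler expects; the translate helper is illustrative and requires the third-party requests package and the port-forward from the previous step to be active:

```python
import json

# Endpoint exposed by `kubectl port-forward svc/t5-inference 8080`.
PREDICTION_URL = "http://localhost:8080/predictions/t5-small/1.0"

def build_payload(text: str, from_lang: str, to_lang: str) -> dict:
    """Build the JSON body that this tutorial's model handler expects."""
    return {"text": text, "from": from_lang, "to": to_lang}

def translate(text: str, from_lang: str = "en", to_lang: str = "fr") -> str:
    """Send a prediction request. Requires `pip install requests` and a
    reachable endpoint, so treat this as a sketch rather than a guarantee."""
    import requests  # third-party package
    resp = requests.post(
        PREDICTION_URL,
        json=build_payload(text, from_lang, to_lang),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text

# Print the request body that curl sends in the step above.
print(json.dumps(build_payload("this is a test sentence", "en", "fr")))
```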
If the test request fails and the Pod connection closes, check the logs:
kubectl logs deployments/t5-inference
If the output is similar to the following, TorchServe failed to install some model dependencies:
org.pytorch.serve.archive.model.ModelException: Custom pip package installation failed for t5-small
To resolve this issue, restart the Deployment:
kubectl rollout restart deployment t5-inference
The Deployment controller creates a new Pod. Repeat the previous steps to open a port on the new Pod.
Access the deployed model using the web application
Build and push the Fast Dash web application as a container image in Artifact Registry:
gcloud builds submit client-app/ \
    --region=us-central1 \
    --config=client-app/cloudbuild.yaml
Open kubernetes/application.yaml in a text editor and replace PROJECT_ID in the image: field with your project ID. Alternatively, run the following command:

sed -i "s/PROJECT_ID/PROJECT_ID/g" "kubernetes/application.yaml"
Create the Kubernetes resources:
kubectl create -f kubernetes/application.yaml
The Deployment and Service might take some time to fully provision.
To check the status, run the following command:
kubectl get -f kubernetes/application.yaml
Wait until the output shows ready Pods, similar to the following:
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/fastdash   1/1     1            0           1m

NAME               TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/fastdash   NodePort   203.0.113.12   <none>        8050/TCP   1m
The web application is now running, although it isn't exposed on an external IP address. To access the web application, open a local port:
kubectl port-forward service/fastdash 8050
In a browser, open the web interface:
- If you're using a local shell, open a browser and go to http://127.0.0.1:8050.
- If you're using Cloud Shell, click Web preview, and then click Change port. Specify port 8050.
To send a request to the T5 model, specify values in the TEXT, FROM LANG, and TO LANG fields in the web interface and click Submit. For a list of available languages, see the T5 documentation.
Enable autoscaling for the model
This section shows you how to enable autoscaling for the model based on metrics from Google Cloud Managed Service for Prometheus by doing the following:
- Install Custom Metrics Stackdriver Adapter
- Apply PodMonitoring and HorizontalPodAutoscaling configurations
Google Cloud Managed Service for Prometheus is enabled by default in Autopilot clusters running version 1.25 and later.
Install Custom Metrics Stackdriver Adapter
This adapter lets your cluster use metrics from Prometheus to make Kubernetes autoscaling decisions.
Deploy the adapter:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml
Create an IAM service account for the adapter to use:
gcloud iam service-accounts create monitoring-viewer
Grant the IAM service account the monitoring.viewer role on the project and the iam.workloadIdentityUser role:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com" \
    --role roles/monitoring.viewer

gcloud iam service-accounts add-iam-policy-binding monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]"
Replace PROJECT_ID with your Google Cloud project ID.

Annotate the Kubernetes ServiceAccount of the adapter to let it impersonate the IAM service account:
kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
    --namespace custom-metrics \
    iam.gke.io/gcp-service-account=monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com
Restart the adapter to propagate the changes:
kubectl rollout restart deployment custom-metrics-stackdriver-adapter \
    --namespace=custom-metrics
Apply PodMonitoring and HorizontalPodAutoscaling configurations
PodMonitoring is a Google Cloud Managed Service for Prometheus custom resource that enables metrics ingestion and target scraping in a specific namespace.
Deploy the PodMonitoring resource in the same namespace as the TorchServe Deployment:
kubectl apply -f kubernetes/pod-monitoring.yaml
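The pod-monitoring.yaml file in the repository defines the actual scrape target. As an illustrative sketch only (the Pod label selector is an assumption; port 8082 is TorchServe's metrics API, as shown in the Service output earlier), a minimal PodMonitoring resource might look like the following:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: t5-inference
spec:
  selector:
    matchLabels:
      model: t5            # assumed label on the TorchServe Pods
  endpoints:
  - port: 8082             # TorchServe metrics API port
    interval: 30s          # scrape interval
```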
Review the HorizontalPodAutoscaler manifest:
The HorizontalPodAutoscaler scales the number of T5 model Pods based on the cumulative duration of the request queue. Autoscaling is based on the ts_queue_latency_microseconds metric, which shows the cumulative queue duration in microseconds.

Create the HorizontalPodAutoscaler:
kubectl apply -f kubernetes/hpa.yaml
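The hpa.yaml manifest in the repository is authoritative. As an illustrative sketch of the general shape (the 7M-microsecond target mirrors the TARGETS column shown later in this tutorial, and the metric name follows Managed Service for Prometheus's prometheus.googleapis.com|&lt;metric&gt;|&lt;kind&gt; naming convention), such an autoscaler might look like this:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: t5-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: t5-inference
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: prometheus.googleapis.com|ts_queue_latency_microseconds|counter
      target:
        type: AverageValue
        averageValue: "7000000"   # illustrative threshold: 7M microseconds
```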
Verify autoscaling using a load generator
To test your autoscaling configuration, generate load for the serving application. This tutorial uses a Locust load generator to send requests to the prediction endpoint for the model.
Create the load generator:
kubectl apply -f kubernetes/loadgenerator.yaml
Wait for the load generator Pods to become ready.
Expose the load generator web interface locally:
kubectl port-forward svc/loadgenerator 8080
If you see an error message, try again when the Pod is running.
In a browser, open the load generator web interface:
- If you're using a local shell, open a browser and go to http://127.0.0.1:8080.
- If you're using Cloud Shell, click Web preview, and then click Change port. Enter port 8080.
Click the Charts tab to observe performance over time.
Open a new terminal window and watch the replica count of your horizontal Pod autoscalers:
kubectl get hpa -w
The number of replicas increases as the load increases. The scale-up might take approximately ten minutes. As new replicas start, the number of successful requests in the Locust chart increases.
NAME           REFERENCE                 TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
t5-inference   Deployment/t5-inference   71352001470m/7M   1         5         1          2m11s
Recommendations
- Build your model with the same version of the base Docker image that you'll use for serving.
- If your model has special package dependencies, or if the size of your dependencies is large, create a custom version of your base Docker image.
- Track the dependency tree of your model's packages. Ensure that your package dependencies support each other's versions. For example, pandas version 2.0.3 supports NumPy version 1.20.3 and later.
- Run GPU-intensive models on GPU nodes and CPU-intensive models on CPU nodes. This can improve the stability of model serving and ensures that you're consuming node resources efficiently.
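As an illustration of the dependency-pinning recommendation above, a model requirements file can pin mutually compatible versions explicitly so that image builds are reproducible (the version numbers below are illustrative, not the tutorial's actual pins):

```text
# requirements.txt for the model archive (illustrative pins)
torch==2.0.1
transformers==4.30.2
sentencepiece==0.1.99
numpy==1.24.3
pandas==2.0.3
```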
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
Delete a Google Cloud project:
gcloud projects delete PROJECT_ID
Delete individual resources
Delete the Kubernetes resources:
kubectl delete -f kubernetes/loadgenerator.yaml
kubectl delete -f kubernetes/hpa.yaml
kubectl delete -f kubernetes/pod-monitoring.yaml
kubectl delete -f kubernetes/application.yaml
kubectl delete -f kubernetes/serving-gpu.yaml
kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml
Delete the GKE cluster:
gcloud container clusters delete "ml-cluster" \
    --location="us-central1" --quiet
Delete the IAM service account and IAM policy bindings:
gcloud projects remove-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com" \
    --role roles/monitoring.viewer

gcloud iam service-accounts remove-iam-policy-binding monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]"

gcloud iam service-accounts delete monitoring-viewer
Delete the images in Artifact Registry. Optionally, delete the entire repository. For instructions, see the Artifact Registry documentation about Deleting images.
Component overview
This section describes the components used in this tutorial, such as the model, the web application, the framework, and the cluster.
About the T5 model
This tutorial uses a pre-trained multilingual T5 model. T5 is a text-to-text transformer that converts text from one language to another. In T5, inputs and outputs are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The T5 model can also be used for tasks such as summarization, question answering, or text classification. The model is trained on a large quantity of text from the Colossal Clean Crawled Corpus (C4) and Wiki-DPR.
For more information, see the T5 model documentation.
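Because T5 is text-to-text, the translation task itself is encoded as a natural-language prefix on the input string. The following minimal sketch shows that convention; the Hugging Face transformers calls are left as comments because they download model weights:

```python
def build_t5_input(text: str, src: str = "English", dst: str = "French") -> str:
    """T5 encodes the task as a plain-text prefix on the input string."""
    return f"translate {src} to {dst}: {text}"

# With the Hugging Face transformers library, the prompt would be fed to the
# model roughly like this (commented out to avoid downloading weights):
#   from transformers import T5Tokenizer, T5ForConditionalGeneration
#   tokenizer = T5Tokenizer.from_pretrained("t5-small")
#   model = T5ForConditionalGeneration.from_pretrained("t5-small")
#   ids = tokenizer(build_t5_input("this is a test sentence"),
#                   return_tensors="pt").input_ids
#   print(tokenizer.decode(model.generate(ids)[0], skip_special_tokens=True))

print(build_t5_input("this is a test sentence"))
# translate English to French: this is a test sentence
```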
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu presented the T5 model in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, published in the Journal of Machine Learning Research.
The T5 model supports various model sizes, with different levels of complexity that suit specific use cases. This tutorial uses the default size, t5-small, but you can also choose a different size. The following T5 sizes are distributed under the Apache 2.0 license:

- t5-small: 60 million parameters
- t5-base: 220 million parameters
- t5-large: 770 million parameters. 3 GB download.
- t5-3b: 3 billion parameters. 11 GB download.
- t5-11b: 11 billion parameters. 45 GB download.
For other available T5 models, see the Hugging Face repository.
About TorchServe
TorchServe is a flexible tool for serving PyTorch models. It supports both eager mode and TorchScript models out of the box, and can be used to deploy models in production or for rapid prototyping and experimentation.
What's next
- Serve an LLM with multiple GPUs.
- Explore reference architectures, diagrams, and best practices for Google Cloud in the Cloud Architecture Center.