This reference architecture series describes how you can design and deploy a high-performance online inference system for deep learning models by using an NVIDIA® T4 GPU and Triton Inference Server.
Using this architecture, you can create a system that serves machine learning models and takes advantage of GPU acceleration. Google Kubernetes Engine (GKE) lets you scale the system to a growing number of clients. You can improve throughput and reduce the latency of the system by applying the optimization techniques that are described in this series.
This series is intended for developers who are familiar with Google Kubernetes Engine and machine learning (ML) frameworks, including TensorFlow and NVIDIA TensorRT.
The following documents are included in this series:
- Reference architecture: Scalable TensorFlow inference system (this document)
- Deployment: Deploy a scalable TensorFlow inference system
- Measure and tune deployment: Measure and tune performance of a TensorFlow inference system
Architecture
The following diagram shows the architecture of the inference system.
This architecture includes the following components:
- Cloud Load Balancing: Sends the request traffic to the GKE cluster that's closest to the client.
- GKE cluster: Contains the cluster nodes and monitoring servers. If clients send requests from multiple regions, you can deploy GKE clusters to multiple regions. You deploy the Locust load testing tool on the same cluster.
- Cluster nodes with GPU accelerator: Contain Triton Inference Server Pods; a single Pod is deployed for each node (a single GPU cannot be shared with multiple Pods).
- Triton Inference Server: Serves ResNet-50 models that you create. The server provides an inference service through an HTTP or gRPC endpoint. The inference service lets remote clients request inferencing for any model that the server manages (see the client sketch after this list).
- NVIDIA T4: Improves inference performance. There must be one NVIDIA T4 GPU for each Pod. This GPU features Tensor Cores, which are specialized processing units that support and accelerate INT8 and FP16 calculations.
- Monitoring servers: Collect metrics data on GPU utilization and memory usage from Triton. You use Prometheus for event monitoring and alerting. You use Grafana to visualize and analyze the performance data that's stored in Prometheus.
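For illustration, the following Python sketch shows how a remote client might send an inference request to Triton's HTTP endpoint by using the tritonclient library. The endpoint address, the model name (tftrt_fp32_resnet), and the tensor names (input and probabilities) are assumptions for this sketch; use the values that your deployment actually exposes.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to the Triton HTTP endpoint (port 8000 by default).
# "localhost:8000" is a placeholder address for this sketch.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a single dummy ResNet-50 input: a 224x224 RGB image in FP32.
image = np.zeros((1, 224, 224, 3), dtype=np.float32)
infer_input = httpclient.InferInput("input", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

# Request the classification output and run inference.
output = httpclient.InferRequestedOutput("probabilities")
response = client.infer(
    model_name="tftrt_fp32_resnet", inputs=[infer_input], outputs=[output]
)
print(response.as_numpy("probabilities").shape)
```

The same request can be made over gRPC by using the tritonclient.grpc module, which provides an equivalent API.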
Design considerations
The following guidelines can help you to develop an architecture that meets your organization's requirements for reliability and performance.
Reliability
This architecture uses GKE for scalability and flexible resource management. Because the system is deployed on GKE, you can scale it to match the number of clients: you can deploy GKE clusters to multiple regions and increase the number of nodes in each cluster.
Performance optimization
When you tune performance, follow these general guidelines:
- Define performance metrics and target performance according to the use case of the system.
- Measure baseline performance before applying performance tuning.
- Apply one change and observe the improvement. If you apply multiple changes at a time, you cannot tell which change caused the improvement.
- Collect appropriate metrics to understand performance characteristics and to decide on the next action for performance tuning.
Following these guidelines, you measure the performance improvement that each of the following factors provides:
TensorRT (graph optimization). TensorRT applies graph optimizations for the NVIDIA T4 GPU. For example, it automatically modifies deep learning models so that they can be processed with Tensor Cores. First, you observe the inference performance without TensorRT as a baseline. Then, you observe the performance improvement after you apply the TensorRT graph optimization.
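As an illustration, the following sketch applies the TensorRT graph optimization to a SavedModel by using the TF-TRT converter that ships with TensorFlow 2.x, keeping the default FP32 precision. The directory paths are placeholders for your own model repository layout.

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Placeholder paths: point these at your exported ResNet-50 SavedModel and
# at the output location in your Triton model repository.
SAVED_MODEL_DIR = "models/resnet/original/saved_model"
OUTPUT_DIR = "models/resnet/tftrt_fp32/saved_model"

# Apply the TensorRT graph optimization while keeping FP32 precision.
params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP32)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=SAVED_MODEL_DIR, conversion_params=params
)
converter.convert()
converter.save(output_saved_model_dir=OUTPUT_DIR)
```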
FP16 conversion. NVIDIA T4 supports FP32 (32-bit floating point) and FP16 for floating-point calculations. When you convert the precision of variables from the default FP32 to FP16, you can improve the inference performance.
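In the TF-TRT sketch above, converting variables to FP16 is only a change to the precision mode. A self-contained version is shown below, again with placeholder paths.

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Same flow as the FP32 sketch, but variables are converted to FP16 so that
# calculations can run on the T4's Tensor Cores at half precision.
params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="models/resnet/original/saved_model",  # placeholder
    conversion_params=params,
)
converter.convert()
converter.save(output_saved_model_dir="models/resnet/tftrt_fp16/saved_model")
```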
INT8 quantization. Quantization is an optimization technique for deep learning models that improves the computation performance on GPUs. NVIDIA T4 supports INT8 (8-bit integer) variable types for quantization. Compared to the conversion to FP16, INT8 quantization can provide improved performance but potentially reduce accuracy. However, TensorRT uses a calibration process that minimizes the information loss during calculations.
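With TF-TRT, INT8 quantization additionally requires a calibration step. The following sketch uses random placeholder data as the calibration input; in practice, you provide batches of representative images so that the calibration can minimize information loss.

```python
import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt

def calibration_input_fn():
    # Placeholder calibration data: replace with real images that are
    # representative of your inference traffic.
    for _ in range(10):
        yield (np.random.rand(8, 224, 224, 3).astype(np.float32),)

params = trt.TrtConversionParams(
    precision_mode=trt.TrtPrecisionMode.INT8, use_calibration=True
)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="models/resnet/original/saved_model",  # placeholder
    conversion_params=params,
)
# The calibration run measures the dynamic range of activations so that
# TensorRT can map FP32 values to INT8 with minimal information loss.
converter.convert(calibration_input_fn=calibration_input_fn)
converter.save(output_saved_model_dir="models/resnet/tftrt_int8/saved_model")
```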
Batch size and number of instance groups. You can adjust the batch size and the number of instance groups by using Triton. For example, when you set the batch size to 16, inference requests are stored in a batch queue and 16 requests are processed as a single batch. Likewise, if you set the number of instance groups to 4, multiple requests are processed by 4 threads in parallel. In this scenario, there are 16 requests in each batch and 4 threads processing in parallel, which means that 64 requests are processed simultaneously on a single GPU.
Increasing the number of instance groups allows TensorRT to achieve higher GPU utilization. At the same time, by adjusting the batch size, you let Triton optimize calculations on the GPU. For example, it can combine multiple calculations from different requests into a single computational task on Tensor Cores.
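In Triton, you set these values in each model's config.pbtxt file. The following sketch writes a hypothetical configuration that matches the example above (batch size 16, four instance groups); the model name and repository layout are assumptions.

```python
from pathlib import Path

# Hypothetical Triton model configuration: max_batch_size caps how many
# requests the dynamic batcher groups into a single batch, and
# instance_group.count sets how many model instances run in parallel on the GPU.
CONFIG_PBTXT = """
name: "tftrt_fp16_resnet"
platform: "tensorflow_savedmodel"
max_batch_size: 16
instance_group [
  {
    count: 4
    kind: KIND_GPU
  }
]
dynamic_batching {
}
"""

config_path = Path("models/tftrt_fp16_resnet/config.pbtxt")  # placeholder path
config_path.parent.mkdir(parents=True, exist_ok=True)
config_path.write_text(CONFIG_PBTXT.lstrip())
```

With a configuration like this, Triton's dynamic batcher groups incoming requests into batches of up to 16, and the four model instances process those batches concurrently on the same GPU.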
Deployment
To deploy this architecture, see Deploy a scalable TensorFlow inference system.
To measure and tune the deployment, see Measure and tune performance of a TensorFlow inference system.
What's next
- Learn more about Google Kubernetes Engine (GKE).
- Learn more about Cloud Load Balancing.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.