Perform roll out operations for GKE Inference Gateway


This page shows you how to perform incremental roll out operations for GKE Inference Gateway. These operations gradually deploy new versions of your inference infrastructure, so you can update nodes, base models, and LoRA adapters in a safe and controlled way with minimal service disruption. This page also provides guidance on traffic splitting and rollbacks to help ensure reliable deployments.

This page is for GKE Identity and account admins and Developers who want to perform roll out operations for GKE Inference Gateway.

The following use cases are supported:

Roll out a node update

Node update roll outs safely migrate inference workloads to new node hardware or accelerator configurations. This process happens in a controlled manner without interrupting model service. Use node update roll outs to minimize service disruption during hardware upgrades, driver updates, or security issue resolution.

  1. Create a new InferencePool: deploy an InferencePool configured with the updated node or hardware specifications.

  2. Split traffic using an HTTPRoute: configure an HTTPRoute to distribute traffic between the existing and new InferencePool resources. Use the weight field in backendRefs to manage the traffic percentage directed to the new nodes.

  3. Maintain a consistent InferenceModel: retain the existing InferenceModel configuration to ensure uniform model behavior across both node configurations.

  4. Retain original resources: keep the original InferencePool and nodes active during the roll out to enable rollbacks if needed.

For example, you can create a new InferencePool named llm-new. Configure this pool with the same model configuration as your existing llm InferencePool, and deploy it on a new set of nodes within your cluster. Use an HTTPRoute object to split traffic between the original llm and the new llm-new InferencePool. This technique lets you incrementally update your model nodes.
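
For illustration, the new pool might look like the following minimal sketch. It assumes the v1alpha2 InferencePool API, model server Pods on the new nodes labeled app: llm-new and listening on port 8000, and an endpoint picker named llm-new-epp; all of these names and values are placeholders that you would adapt to your deployment.

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferencePool
    metadata:
      name: llm-new
    spec:
      # Selects the model server Pods that run on the new nodes (placeholder label).
      selector:
        app: llm-new
      # Port on which the model server accepts inference requests (assumed value).
      targetPortNumber: 8000
      # Endpoint picker extension for this pool (placeholder name).
      extensionRef:
        name: llm-new-epp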

The following diagram illustrates how GKE Inference Gateway performs a node update roll out.

Figure: Node update roll out process

To perform a node update roll out, follow these steps:

  1. Save the following sample manifest as routes-to-llm.yaml:

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: routes-to-llm
    spec:
      parentRefs:
        - name: my-inference-gateway
      rules:
      - backendRefs:
        - name: llm
          kind: InferencePool
          weight: 90
        - name: llm-new
          kind: InferencePool
          weight: 10
    
  2. Apply the sample manifest to your cluster:

    kubectl apply -f routes-to-llm.yaml
    

The original llm InferencePool receives most of the traffic, while the llm-new InferencePool receives the rest of the traffic. Increase the traffic weight gradually for the llm-new InferencePool to complete the node update roll out.
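
For example, to move to an even split, you could change the weight values in routes-to-llm.yaml (the 50/50 split is only illustrative) and re-apply the manifest with kubectl apply -f routes-to-llm.yaml:

      rules:
      - backendRefs:
        - name: llm
          kind: InferencePool
          weight: 50   # decreased from 90
        - name: llm-new
          kind: InferencePool
          weight: 50   # increased from 10

When llm-new reliably serves all traffic, you can set its weight to 100, remove or zero out the llm backend reference, and then decommission the original nodes.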

Roll out a base model

Base model update roll outs migrate your workloads to a new base LLM in phases while retaining compatibility with existing LoRA adapters. Use base model update roll outs to upgrade to improved model architectures or to address model-specific issues.

To roll out a base model update:

  1. Deploy new infrastructure: create new nodes and a new InferencePool configured with the new base model that you chose.
  2. Configure traffic distribution: use an HTTPRoute to split traffic between the existing InferencePool (which uses the old base model) and the new InferencePool (which uses the new base model). The backendRefs weight field controls the traffic percentage allocated to each pool.
  3. Maintain InferenceModel integrity: keep your InferenceModel configuration unchanged. This ensures that the system applies the same LoRA adapters consistently across both base model versions.
  4. Preserve rollback capability: retain the original nodes and InferencePool during the roll out to facilitate a rollback if necessary.

For example, you can create a new InferencePool named llm-pool-version-2. This pool deploys a new version of the base model on a new set of nodes. By configuring an HTTPRoute, as shown in the following example, you can incrementally split traffic between the original llm-pool and llm-pool-version-2. This lets you control base model updates in your cluster.

To perform a base model update roll out, follow these steps:

  1. Save the following sample manifest as routes-to-llm.yaml:

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: routes-to-llm
    spec:
      parentRefs:
        - name: my-inference-gateway
      rules:
      - backendRefs:
        - name: llm-pool
          kind: InferencePool
          weight: 90
        - name: llm-pool-version-2
          kind: InferencePool
          weight: 10
    
  2. Apply the sample manifest to your cluster:

    kubectl apply -f routes-to-llm.yaml
    

The original llm-pool InferencePool receives most of the traffic, while the llm-pool-version-2 InferencePool receives the rest. Increase the traffic weight gradually for the llm-pool-version-2 InferencePool to complete the base model update roll out.
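
Because the original llm-pool stays deployed during the roll out, rolling back only requires reverting the traffic split. For example, the following weights (illustrative values) route all traffic back to the original pool; edit routes-to-llm.yaml accordingly and re-apply it with kubectl apply -f routes-to-llm.yaml:

      rules:
      - backendRefs:
        - name: llm-pool
          kind: InferencePool
          weight: 100   # all traffic returns to the original pool
        - name: llm-pool-version-2
          kind: InferencePool
          weight: 0     # the new pool receives no traffic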

Roll out LoRA adapter updates

LoRA adapter update roll outs let you deploy new versions of fine-tuned models in phases, without altering the underlying base model or infrastructure. Use LoRA adapter update roll outs to test improvements, bug fixes, or new features in your LoRA adapters.

To update a LoRA adapter, follow these steps:

  1. Make adapters available: ensure that the new LoRA adapter versions are available on the model servers, as in the sketch that follows this list. For more information, see Adapter roll out.

  2. Modify the InferenceModel configuration: in your existing InferenceModel configuration, define multiple versions of your LoRA adapter. Assign a unique modelName to each version (for example, llm-v1, llm-v2).

  3. Distribute traffic: use the weight field in the InferenceModel specification to control the traffic distribution among the different LoRA adapter versions.

  4. Maintain a consistent poolRef: ensure that all LoRA adapter versions reference the same InferencePool. This prevents node or InferencePool redeployments. Retain previous LoRA adapter versions in the InferenceModel configuration to enable rollbacks.
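
How you make adapters available depends on your model server. As one illustration only, a vLLM-based server can load LoRA adapters at startup through its LoRA flags; the model name, adapter names, and paths below are placeholders, and your environment might instead use dynamic adapter loading:

    # Illustrative vLLM server invocation; model, adapter names, and paths are placeholders.
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --enable-lora \
      --lora-modules llm-v1=/adapters/llm-v1 llm-v2=/adapters/llm-v2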

The following example shows two LoRA adapter versions, llm-v1 and llm-v2. Both versions use the same base model. You define llm-v1 and llm-v2 within the same InferenceModel and assign weights to incrementally shift traffic from llm-v1 to llm-v2. This approach gives you a controlled roll out without requiring any changes to your nodes or InferencePool configuration.

To roll out LoRA adapter updates, follow these steps:

  1. Save the following sample manifest as inferencemodel-sample.yaml:

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceModel
    metadata:
      name: inferencemodel-sample
    spec:
      versions:
      - modelName: llm-v1
        criticality: Critical
        weight: 90
        poolRef:
          name: llm-pool
      - modelName: llm-v2
        criticality: Critical
        weight: 10
        poolRef:
          name: llm-pool
    
  2. Apply the sample manifest to your cluster:

    kubectl apply -f inferencemodel-sample.yaml
    

The llm-v1 version receives most of the traffic, while the llm-v2 version receives the rest. Increase the traffic weight gradually for the llm-v2 version to complete the LoRA adapter update roll out.
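
To check which adapter version handles a request, you can send a test request through the gateway and set the model name explicitly. The IP address, port, path, and request body below are placeholders; use the endpoint and API schema that your model server exposes:

    # GATEWAY_IP is a placeholder for the external IP address of your inference gateway.
    curl -X POST http://GATEWAY_IP/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llm-v2",
        "prompt": "Write a short sentence about rollouts.",
        "max_tokens": 50
      }'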

What's next