Serve open models using Hex-LLM premium container on Cloud TPU

Hex-LLM, a high-efficiency large language model (LLM) serving framework with XLA, is the Vertex AI LLM serving framework that's designed and optimized for Cloud TPU hardware. Hex-LLM combines LLM serving technologies such as continuous batching and paged attention with Vertex AI optimizations that are tailored for XLA and Cloud TPU. It provides high-efficiency, low-cost serving of open source models on Cloud TPU.

Hex-LLM is available in Model Garden through the model playground, one-click deployment, and notebook examples.

Features

Hex-LLM is based on open source projects with Google's own optimizations for XLA and Cloud TPU. Hex-LLM achieves high throughput and low latency when serving frequently used LLMs.

Hex-LLM includes the following optimizations:

  • Token-based continuous batching algorithm to help ensure models fully utilize the hardware with a large number of concurrent requests.
  • A complete rewrite of the PagedAttention kernel that is optimized for XLA.
  • Flexible and composable data parallelism and tensor parallelism strategies with highly optimized weight sharding methods to efficiently run LLMs on multiple Cloud TPU chips.
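
For example, the data parallelism and tensor parallelism degrees compose: under the usual convention, each data-parallel replica is sharded across the tensor-parallel chips, so the product of the two degrees matches the number of available chips. The following sketch is illustrative only (it isn't part of the serving API) and assumes a ct5lp-hightpu-4t machine, which provides four TPU v5e chips; the corresponding --data_parallel_size and --tensor_parallel_size arguments are described later on this page.

# Illustrative sketch only: how the parallelism degrees map onto Cloud TPU
# chips, assuming each data-parallel replica is sharded across
# tensor_parallel_size chips (so the product must match the chip count).
TPU_CHIPS = 4  # a ct5lp-hightpu-4t machine exposes 4 TPU v5e chips

def check_parallelism(data_parallel_size: int, tensor_parallel_size: int) -> None:
    required = data_parallel_size * tensor_parallel_size
    if required != TPU_CHIPS:
        raise ValueError(
            f"data_parallel_size * tensor_parallel_size = {required}, "
            f"but the machine has {TPU_CHIPS} chips"
        )

check_parallelism(data_parallel_size=1, tensor_parallel_size=4)  # one replica over 4 chips
check_parallelism(data_parallel_size=2, tensor_parallel_size=2)  # two replicas, 2 chips each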

Hex-LLM supports a wide range of dense and sparse LLMs:

  • Gemma 2B and 7B
  • Gemma 2 9B and 27B
  • Llama 2 7B, 13B, and 70B
  • Llama 3 8B and 70B
  • Mistral 7B and Mixtral 8x7B

Hex-LLM also provides a variety of features, such as the following:

  • Hex-LLM is included in a single container. Hex-LLM packages the API server, inference engine, and supported models into a single deployable Docker image.
  • Compatible with the Hugging Face model format. Hex-LLM can load a Hugging Face model from local disk, the Hugging Face Hub, or a Cloud Storage bucket.
  • Quantization using bitsandbytes and AWQ.
  • Dynamic LoRA loading. Hex-LLM can load LoRA weights by reading a request argument during serving, as sketched after this list.
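
As an illustration of dynamic LoRA loading, a prediction request might pass the adapter location as part of the request payload. The following sketch is assumption-based: the instance field names ("prompt", "dynamic-lora"), the resource IDs, and the adapter path are placeholders, so check the Colab Enterprise notebook example for the exact request schema used by your deployment.

# A minimal sketch of a request that selects a LoRA adapter at serving time.
# The instance field names ("prompt", "dynamic-lora") and the Cloud Storage
# path are illustrative placeholders; confirm the schema in the notebook.
from google.cloud import aiplatform

endpoint = aiplatform.Endpoint(
    "projects/PROJECT_ID/locations/us-west1/endpoints/ENDPOINT_ID"
)
response = endpoint.predict(
    instances=[
        {
            "prompt": "What is a car?",
            "max_tokens": 64,
            "dynamic-lora": "gs://your-bucket/path/to/lora-adapter",
        }
    ],
)
print(response.predictions)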

Get started in Model Garden

The Hex-LLM Cloud TPU serving container is integrated into Model Garden. You can access this serving technology through the playground, one-click deployment, and Colab Enterprise notebook examples for a variety of models.

Use playground

The Model Garden playground is a pre-deployed Vertex AI endpoint that you can reach by sending requests from the model card.

  1. Enter a prompt and, optionally, include arguments for your request.

  2. Click SUBMIT to get the model response quickly.

Try it out with Gemma!

Use one-click deployment

You can deploy a custom Vertex AI endpoint with Hex-LLM by using a model card.

  1. Navigate to the model card page and click Deploy.

  2. For the model variation that you want to use, select the Cloud TPU v5e machine type for deployment.

  3. Click Deploy at the bottom to begin the deployment process. You receive two email notifications: one when the model is uploaded and another when the endpoint is ready.

Use the Colab Enterprise notebook

For flexibility and customization, you can use Colab Enterprise notebook examples to deploy a Vertex AI endpoint with Hex-LLM by using the Vertex AI SDK for Python.

  1. Navigate to the model card page and click Open notebook.

  2. Select the Vertex Serving notebook. The notebook opens in Colab Enterprise.

  3. Run through the notebook to deploy a model by using Hex-LLM and send prediction requests to the endpoint. The code snippet for the deployment is as follows:

from google.cloud import aiplatform

# HEXLLM_DOCKER_URI is the Hex-LLM serving container image URI, set earlier
# in the notebook.
hexllm_args = [
    "--model=google/gemma-2-9b-it",
    "--tensor_parallel_size=4",
    "--hbm_utilization_factor=0.8",
    "--max_running_seqs=512",
]
hexllm_envs = {
    "PJRT_DEVICE": "TPU",
    "MODEL_ID": "google/gemma-2-9b-it",
    "DEPLOY_SOURCE": "notebook",
}
model = aiplatform.Model.upload(
    display_name="gemma-2-9b-it",
    serving_container_image_uri=HEXLLM_DOCKER_URI,
    serving_container_command=[
        "python", "-m", "hex_llm.server.api_server"
    ],
    serving_container_args=hexllm_args,
    serving_container_ports=[7080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    serving_container_environment_variables=hexllm_envs,
    serving_container_shared_memory_size_mb=(16 * 1024),
    serving_container_deployment_timeout=7200,
)

endpoint = aiplatform.Endpoint.create(display_name="gemma-2-9b-it-endpoint")
model.deploy(
    endpoint=endpoint,
    machine_type="ct5lp-hightpu-4t",
    deploy_request_timeout=1800,
    service_account="<your-service-account>",
    min_replica_count=1,
    max_replica_count=1,
)
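
After the deployment completes, you can send prediction requests through the same endpoint object. The following sketch continues from the snippet above; the instance fields shown (prompt, max_tokens, temperature) are assumptions based on common Model Garden serving examples, so verify the exact schema in the notebook.

# Continues from the deployment snippet above. The instance field names are
# assumptions; verify them against the notebook example for your model.
response = endpoint.predict(
    instances=[
        {
            "prompt": "What is machine learning?",
            "max_tokens": 128,
            "temperature": 0.7,
        }
    ],
)
print(response.predictions)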

You can modify the following Hex-LLM server launch arguments for custom serving:

  • --model: The model to load. You can specify a Hugging Face model ID, a local absolute path, or a Cloud Storage bucket path.
  • --tokenizer: The tokenizer to load. You can specify a Hugging Face model ID, a local absolute path, or a Cloud Storage bucket path. The default value is the same as that of --model.
  • --enable_jit: Whether to enable JIT mode. The default value is True.
  • --enable_lora: Whether to enable LoRA loading mode. The default value is False.
  • --max_lora_rank: The maximum LoRA rank supported for LoRA adapters, defined in requests. The default value is 16.
  • --data_parallel_size: The number of data parallel replicas. The default value is 1.
  • --tensor_parallel_size: The number of tensor parallel replicas. The default value is 1.
  • --max_running_seqs: The maximum number of requests that the server can process concurrently. The larger this value is, the higher the throughput the server can achieve, but at the potential cost of higher latency. The default value is 256.
  • --hbm_utilization_factor: The percentage of free Cloud TPU HBM that can be allocated for the KV cache after the model weights are loaded. Setting this argument to a lower value can help prevent Cloud TPU HBM out-of-memory errors. The default value is 0.9.
  • --seed: The seed for initializing all random number generators. Changing this argument may affect the generated output for the same prompt. The default value is 0.
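
For example, to enable dynamic LoRA loading and leave more HBM headroom, you might pass a set of arguments such as the following. This is a sketch of one possible configuration rather than a recommended or tested setting.

# A sketch of customized launch arguments passed as serving_container_args.
# The values are illustrative, not tuned recommendations.
hexllm_args = [
    "--model=google/gemma-2-9b-it",
    "--tensor_parallel_size=4",      # shard across the 4 chips of ct5lp-hightpu-4t
    "--hbm_utilization_factor=0.8",  # reserve more HBM headroom
    "--max_running_seqs=256",        # cap concurrency to limit tail latency
    "--enable_lora=True",            # accept LoRA adapters specified per request
    "--max_lora_rank=32",            # support adapters up to rank 32
]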

Request Cloud TPU quota

In Model Garden, your default quota is four Cloud TPU v5e chips in the us-west1 region. This quota applies to one-click deployments and Colab Enterprise notebook deployments. To request additional quota, see Request a higher quota.