Serve open models using Hex-LLM premium container on Cloud TPU

Hex-LLM, a high-efficiency large language model (LLM) serving framework with XLA, is the Vertex AI LLM serving framework that's designed and optimized for Cloud TPU hardware. Hex-LLM combines LLM serving technologies such as continuous batching and PagedAttention with Vertex AI optimizations that are tailored for XLA and Cloud TPU. The result is high-efficiency, low-cost LLM serving on Cloud TPU for open source models.

Hex-LLM is available in Model Garden through the model playground, one-click deployment, and notebooks.

Features

Hex-LLM is based on open source projects with Google's own optimizations for XLA and Cloud TPU. Hex-LLM achieves high throughput and low latency when serving frequently used LLMs.

Hex-LLM includes the following optimizations:

  • A token-based continuous batching algorithm to help ensure that models fully utilize the hardware with a large number of concurrent requests.
  • A complete rewrite of the attention kernels that are optimized for XLA.
  • Flexible and composable data parallelism and tensor parallelism strategies with highly optimized weight sharding methods to efficiently run LLMs on multiple Cloud TPU chips.

Hex-LLM supports a wide range of dense and sparse LLMs:

  • Gemma 2B and 7B
  • Gemma 2 9B and 27B
  • Llama 2 7B, 13B and 70B
  • Llama 3 8B and 70B
  • Llama 3.1 8B and 70B
  • Llama 3.2 1B and 3B
  • Llama Guard 3 1B and 8B
  • Mistral 7B
  • Mixtral 8x7B and 8x22B
  • Phi-3 mini and medium

Hex-LLM also provides a variety of features, such as the following:

  • Hex-LLM ships as a single container: it packages the API server, inference engine, and supported models into a single Docker image to be deployed.
  • Compatible with the Hugging Face model format. Hex-LLM can load a Hugging Face model from local disk, the Hugging Face Hub, or a Cloud Storage bucket (see the example after this list).
  • Quantization using bitsandbytes and AWQ.
  • Dynamic LoRA loading. Hex-LLM can load LoRA weights by reading a request argument during serving.
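
For example, the --model and --tokenizer server arguments (described later in "Configure the server") accept any of these source formats. The following is a small sketch; the bucket and local paths are placeholders:

# Any of these forms is a valid model source for Hex-LLM; the paths are placeholders.
model_sources = [
    "google/gemma-2-9b-it",                   # Hugging Face model ID
    "gs://your-bucket/models/gemma-2-9b-it",  # Cloud Storage bucket path
    "/models/gemma-2-9b-it",                  # local absolute path
]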

Get started in Model Garden

The Hex-LLM Cloud TPU serving container is integrated into Model Garden. You can access this serving technology through the playground, one-click deployment, and Colab Enterprise notebook examples for a variety of models.

Use playground

The Model Garden playground is a pre-deployed Vertex AI endpoint that you can reach by sending requests from the model card.

  1. Enter a prompt and, optionally, include arguments for your request.

  2. Click SUBMIT to get the model response quickly.

Try it out with Gemma!

Use one-click deployment

You can deploy a custom Vertex AI endpoint with Hex-LLM by using a model card.

  1. Navigate to the model card page and click Deploy.

  2. For the model variation that you want to use, select the Cloud TPU v5e machine type for deployment.

  3. Click Deploy at the bottom to begin the deployment process. You receive two email notifications: one when the model is uploaded and another when the endpoint is ready.

Use the Colab Enterprise notebook

For flexibility and customization, you can use Colab Enterprise notebook examples to deploy a Vertex AI endpoint with Hex-LLM by using the Vertex AI SDK for Python.

  1. Navigate to the model card page and click Open notebook.

  2. Select the Vertex Serving notebook. The notebook opens in Colab Enterprise.

  3. Run through the notebook to deploy a model by using Hex-LLM and send prediction requests to the endpoint. The code snippet for the deployment is as follows:

from google.cloud import aiplatform

# HEXLLM_DOCKER_URI is the URI of the Hex-LLM serving container image, as set
# earlier in the notebook.
hexllm_args = [
    "--model=google/gemma-2-9b-it",
    "--tensor_parallel_size=4",
    "--hbm_utilization_factor=0.8",
    "--max_running_seqs=512",
]
hexllm_envs = {
    "PJRT_DEVICE": "TPU",
    "MODEL_ID": "google/gemma-2-9b-it",
    "DEPLOY_SOURCE": "notebook",
}
# Upload the model with the Hex-LLM serving container configuration.
model = aiplatform.Model.upload(
    display_name="gemma-2-9b-it",
    serving_container_image_uri=HEXLLM_DOCKER_URI,
    serving_container_command=[
        "python", "-m", "hex_llm.server.api_server"
    ],
    serving_container_args=hexllm_args,
    serving_container_ports=[7080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    serving_container_environment_variables=hexllm_envs,
    serving_container_shared_memory_size_mb=(16 * 1024),
    serving_container_deployment_timeout=7200,
)

# Create an endpoint and deploy the model to a Cloud TPU v5e machine.
endpoint = aiplatform.Endpoint.create(display_name="gemma-2-9b-it-endpoint")
model.deploy(
    endpoint=endpoint,
    machine_type="ct5lp-hightpu-4t",
    deploy_request_timeout=1800,
    service_account="<your-service-account>",
    min_replica_count=1,
    max_replica_count=1,
)
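
After the deployment finishes, you can send prediction requests to the endpoint. The following is a minimal sketch; the instance fields (prompt, max_tokens, temperature) are assumed here for illustration, so check the notebook for the exact schema that the deployed model supports:

# Send a test prediction to the deployed endpoint.
# The instance fields below are illustrative; consult the notebook for the
# fields that the deployed model actually accepts.
instances = [
    {
        "prompt": "What is a TPU?",
        "max_tokens": 256,
        "temperature": 0.7,
    },
]
response = endpoint.predict(instances=instances)
print(response.predictions)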

Configure the server

You can set the following arguments for the Hex-LLM server launch; an example configuration follows the list:

  • --model: The model to load. You can specify a Hugging Face model ID, a local absolute path, or a Cloud Storage bucket path.
  • --tokenizer: The tokenizer to load. Can be a Hugging Face model ID, a local absolute path, or a Cloud Storage bucket path. The default value is the same as that of --model.
  • --enable_jit: Whether to enable JIT mode. The default value is True.
  • --data_parallel_size: The number of data parallel replicas. The default value is 1.
  • --tensor_parallel_size: The number of tensor parallel replicas. The default value is 1.
  • --num_hosts: The number of TPU VMs to use for multi-host serving jobs. Refer to TPU machine types for the settings of different topologies. The default value is 1.
  • --worker_distributed_method: The distributed method to launch the worker. Use mp for the multiprocessing module or ray for the Ray library. The default value is mp.
  • --max_model_len: The maximum context length the server can process. The default value is read from the model config files.
  • --max_running_seqs: The maximum number of requests the server can process concurrently. The larger this value, the higher the throughput the server can achieve, but with potential adverse effects on latency. The default value is 256.
  • --hbm_utilization_factor: The percentage of free Cloud TPU HBM that can be allocated for the KV cache after the model weights are loaded. Setting this argument to a lower value can help prevent Cloud TPU HBM out-of-memory errors. The default value is 0.9.
  • --enable_lora: Whether to enable LoRA loading mode. The default value is False.
  • --max_lora_rank: The maximum LoRA rank supported for LoRA adapters that are specified in requests. The default value is 16.
  • --seed: The seed for initializing all random number generators. Changing this argument may affect the generated output for the same prompt. The default value is 0.
  • --prefill_len_padding: Pads the sequence length to a multiple of this value. Increasing this value reduces model recompilation times but lowers inference performance. The default value is 512.
  • --decode_seqs_padding: Pads the number of sequences in a batch to a multiple of this value during decoding. Increasing this value reduces model recompilation times but lowers inference performance. The default value is 8.
  • --decode_blocks_padding: Pads the number of memory blocks used for a sequence's KV cache to a multiple of this value during decoding. Increasing this value reduces model recompilation times but lowers inference performance. The default value is 128.
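
For example, a launch configuration that loads weights from Cloud Storage and shards the model across four TPU chips might look like the following sketch; the bucket path is a placeholder and the values are illustrative, not recommendations:

# Illustrative Hex-LLM launch arguments; tune the values for your model and
# Cloud TPU topology. The Cloud Storage path is a placeholder.
hexllm_args = [
    "--model=gs://your-bucket/models/gemma-2-9b-it",  # weights in a Cloud Storage bucket
    "--tensor_parallel_size=4",      # shard the model across 4 TPU chips
    "--max_model_len=4096",          # cap the context length the server accepts
    "--max_running_seqs=256",        # concurrency versus latency trade-off
    "--hbm_utilization_factor=0.8",  # allocate 80% of free HBM to the KV cache
]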

You can also configure the server using the following environment variables; an example follows the list:

  • HEX_LLM_LOG_LEVEL: Controls the amount of logging information generated. Set this to one of the standard Python logging levels defined in the logging module.
  • HEX_LLM_VERBOSE_LOG: Enables or disables detailed logging output. Allowed values are true or false. The default value is false.
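
For example, you can pass these variables through serving_container_environment_variables in the deployment snippet shown earlier; the values here are illustrative:

# Illustrative environment variables for the Hex-LLM serving container.
hexllm_envs = {
    "PJRT_DEVICE": "TPU",
    "MODEL_ID": "google/gemma-2-9b-it",
    "HEX_LLM_LOG_LEVEL": "INFO",    # a standard Python logging level
    "HEX_LLM_VERBOSE_LOG": "true",  # enable detailed logging output
}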

Request Cloud TPU quota

In Model Garden, your default quota is 4 Cloud TPU v5e chips in the us-west1 region. This quota applies to one-click deployments and Colab Enterprise notebook deployments. To request additional quota, see Request a higher quota.