Hex-LLM, a high-efficiency large language model (LLM) serving framework with XLA, is the Vertex AI LLM serving framework that's designed and optimized for Cloud TPU hardware. Hex-LLM combines LLM serving technologies such as continuous batching and PagedAttention with Vertex AI optimizations that are tailored for XLA and Cloud TPU. The result is high-efficiency, low-cost serving of open source models on Cloud TPU.
Hex-LLM is available in Model Garden through the model playground, one-click deployment, and notebooks.
Features
Hex-LLM is based on open source projects with Google's own optimizations for XLA and Cloud TPU. Hex-LLM achieves high throughput and low latency when serving frequently used LLMs.
Hex-LLM includes the following optimizations:
- A token-based continuous batching algorithm that helps ensure models fully utilize the hardware when handling a large number of concurrent requests.
- A complete rewrite of the attention kernels that are optimized for XLA.
- Flexible and composable data parallelism and tensor parallelism strategies with highly optimized weight sharding methods to efficiently run LLMs on multiple Cloud TPU chips.
Hex-LLM supports a wide range of dense and sparse LLMs:
- Gemma 2B and 7B
- Gemma 2 9B and 27B
- Llama 2 7B, 13B and 70B
- Llama 3 8B and 70B
- Llama 3.1 8B and 70B
- Llama 3.2 1B and 3B
- Llama Guard 3 1B and 8B
- Mistral 7B
- Mixtral 8x7B and 8x22B
- Phi-3 mini and medium
Hex-LLM also provides a variety of features, such as the following:
- Hex-LLM ships as a single container. Hex-LLM packages the API server, inference engine, and supported models into a single Docker image for deployment.
- Compatible with the Hugging Face model format. Hex-LLM can load a Hugging Face model from local disk, the Hugging Face Hub, or a Cloud Storage bucket.
- Quantization using bitsandbytes and AWQ.
- Dynamic LoRA loading. Hex-LLM can load LoRA weights at serving time by reading an argument in the request (see the configuration sketch after this list).
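For example, because the --model server argument described later in this document accepts a Cloud Storage path, and --enable_lora and --max_lora_rank turn on LoRA loading, a launch configuration along the following lines could exercise both features. This is a minimal sketch: the bucket path, rank, and boolean flag syntax shown here are illustrative assumptions, not values confirmed by this document.

# Sketch of Hex-LLM launch arguments that load model weights from a
# Cloud Storage bucket and enable dynamic LoRA loading.
# The bucket path, rank, and boolean flag syntax are illustrative placeholders.
hexllm_args = [
    "--model=gs://your-bucket/gemma-2-9b-it",  # placeholder Cloud Storage path
    "--tensor_parallel_size=4",
    "--enable_lora=True",
    "--max_lora_rank=32",
]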
Get started in Model Garden
The Hex-LLM Cloud TPU serving container is integrated into Model Garden. You can access this serving technology through the playground, one-click deployment, and Colab Enterprise notebook examples for a variety of models.
Use playground
The Model Garden playground is a pre-deployed Vertex AI endpoint that you can reach by sending requests from the model card.
1. Enter a prompt and, optionally, include arguments for your request.
2. Click SUBMIT to get the model response quickly.
Use one-click deployment
You can deploy a custom Vertex AI endpoint with Hex-LLM by using a model card.
1. Navigate to the model card page and click Deploy.
2. For the model variation that you want to use, select the Cloud TPU v5e machine type for deployment.
3. Click Deploy at the bottom to begin the deployment process. You receive two email notifications: one when the model is uploaded and another when the endpoint is ready.
Use the Colab Enterprise notebook
For flexibility and customization, you can use Colab Enterprise notebook examples to deploy a Vertex AI endpoint with Hex-LLM by using the Vertex AI SDK for Python.
1. Navigate to the model card page and click Open notebook.
2. Select the Vertex Serving notebook. The notebook opens in Colab Enterprise.
3. Run through the notebook to deploy a model by using Hex-LLM and send prediction requests to the endpoint. The code snippet for the deployment is as follows:
# The Vertex AI SDK is initialized earlier in the notebook, and
# HEXLLM_DOCKER_URI (the Hex-LLM serving container image) is defined there.
from google.cloud import aiplatform

# Arguments passed to the Hex-LLM server at launch.
hexllm_args = [
    "--model=google/gemma-2-9b-it",
    "--tensor_parallel_size=4",
    "--hbm_utilization_factor=0.8",
    "--max_running_seqs=512",
]
# Environment variables for the serving container.
hexllm_envs = {
    "PJRT_DEVICE": "TPU",
    "MODEL_ID": "google/gemma-2-9b-it",
    "DEPLOY_SOURCE": "notebook",
}

# Upload the model with the Hex-LLM serving container configuration.
model = aiplatform.Model.upload(
    display_name="gemma-2-9b-it",
    serving_container_image_uri=HEXLLM_DOCKER_URI,
    serving_container_command=[
        "python", "-m", "hex_llm.server.api_server"
    ],
    serving_container_args=hexllm_args,
    serving_container_ports=[7080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    serving_container_environment_variables=hexllm_envs,
    serving_container_shared_memory_size_mb=(16 * 1024),
    serving_container_deployment_timeout=7200,
)

# Create an endpoint and deploy the model to a Cloud TPU v5e machine.
endpoint = aiplatform.Endpoint.create(display_name="gemma-2-9b-it-endpoint")
model.deploy(
    endpoint=endpoint,
    machine_type="ct5lp-hightpu-4t",
    deploy_request_timeout=1800,
    service_account="<your-service-account>",
    min_replica_count=1,
    max_replica_count=1,
)
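After the deployment finishes, you can send prediction requests to the endpoint with the same SDK. The endpoint.predict call below is standard Vertex AI SDK usage; the instance field names shown (prompt, max_tokens, temperature) are assumptions about the Hex-LLM request schema rather than values confirmed by this document.

# Minimal sketch of a prediction request against the deployed endpoint.
# The instance field names below are assumed, not confirmed by this document.
instances = [
    {
        "prompt": "What is a Cloud TPU?",
        "max_tokens": 256,
        "temperature": 0.7,
    },
]
response = endpoint.predict(instances=instances)
print(response.predictions)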
Configure the server
You can set the following arguments for the Hex-LLM server launch:
- --model: The model to load. You can specify a Hugging Face model ID, a local absolute path, or a Cloud Storage bucket path.
- --tokenizer: The tokenizer to load. Can be a Hugging Face model ID, a local absolute path, or a Cloud Storage bucket path. The default value is the same as that of --model.
- --enable_jit: Whether to enable JIT mode. The default value is True.
- --data_parallel_size: The number of data parallel replicas. The default value is 1.
- --tensor_parallel_size: The number of tensor parallel replicas. The default value is 1.
- --num_hosts: The number of TPU VMs to use for multi-host serving jobs. Refer to TPU machine types for the settings of different topologies. The default value is 1.
- --worker_distributed_method: The distributed method to launch the worker. Use mp for the multiprocessing module or ray for the Ray library. The default value is mp.
- --max_model_len: The maximum context length the server can process. The default value is read from the model config files.
- --max_running_seqs: The maximum number of requests the server can process concurrently. The larger this argument is, the higher the throughput the server can achieve, with potential adverse effects on latency. The default value is 256.
- --hbm_utilization_factor: The percentage of free Cloud TPU HBM that can be allocated for the KV cache after the model weights are loaded. Setting this argument to a lower value can effectively prevent Cloud TPU HBM out-of-memory errors. The default value is 0.9.
- --enable_lora: Whether to enable LoRA loading mode. The default value is False.
- --max_lora_rank: The maximum LoRA rank supported for LoRA adapters specified in requests. The default value is 16.
- --seed: The seed for initializing all random number generators. Changing this argument may affect the generated output for the same prompt. The default value is 0.
- --prefill_len_padding: Pads the sequence length to a multiple of this value. Increasing this value reduces model recompilation times but lowers inference performance. The default value is 512.
- --decode_seqs_padding: Pads the number of sequences in a batch to a multiple of this value during decoding. Increasing this value reduces model recompilation times but lowers inference performance. The default value is 8.
- --decode_blocks_padding: Pads the number of memory blocks used for a sequence's KV cache to a multiple of this value during decoding. Increasing this value reduces model recompilation times but lowers inference performance. The default value is 128.
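To make the interplay of these arguments concrete, the following sketch shows one way the launch arguments from the notebook example might be adjusted for a longer context and more concurrent requests. The specific values are illustrative tuning placeholders, not recommendations from this document.

# Illustrative launch arguments combining the flags described above.
# The values are placeholders for tuning, not recommendations.
hexllm_args = [
    "--model=google/gemma-2-9b-it",
    "--tensor_parallel_size=4",      # shard weights across the 4 chips of a ct5lp-hightpu-4t host
    "--max_model_len=8192",          # cap the context length the server accepts
    "--max_running_seqs=512",        # more concurrent requests: higher throughput, possibly higher latency
    "--hbm_utilization_factor=0.8",  # leave more HBM headroom to avoid out-of-memory errors
]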
You can also configure the server using the following environment variables:
- HEX_LLM_LOG_LEVEL: Controls the amount of logging information generated. Set this to one of the standard Python logging levels defined in the logging module.
- HEX_LLM_VERBOSE_LOG: Enables or disables detailed logging output. Allowed values are true or false. The default value is false.
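For example, these variables can be added to the environment dictionary passed to serving_container_environment_variables in the notebook example above. The DEBUG level shown here is only an illustration.

# Extend the serving container environment with Hex-LLM logging controls.
hexllm_envs = {
    "PJRT_DEVICE": "TPU",
    "MODEL_ID": "google/gemma-2-9b-it",
    "DEPLOY_SOURCE": "notebook",
    "HEX_LLM_LOG_LEVEL": "DEBUG",   # standard Python logging level
    "HEX_LLM_VERBOSE_LOG": "true",  # enable detailed logging output
}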
Request Cloud TPU quota
In Model Garden, your default quota is 4 Cloud TPU v5e chips in the us-west1 region. This quota applies to one-click deployments and Colab Enterprise notebook deployments. To request additional quota, see Request a higher quota.