Last updated (UTC): 2025-08-25.

# Best practices: Cloud Run jobs with GPUs

| **Preview: GPU support for Cloud Run jobs**
|
| This feature is subject to the "Pre-GA Offerings Terms" in the General Service Terms section
| of the [Service Specific Terms](/terms/service-terms#1).
|
| Pre-GA features are available "as is" and might have limited support.
|
| For more information, see the
| [launch stage descriptions](/products#product-launch-stages).

This page provides best practices for optimizing performance when using a Cloud Run job with GPUs for AI workloads such as training large language models (LLMs) with your preferred frameworks, fine-tuning, and performing batch or offline inference on LLMs.
To create a Cloud Run job that performs compute-intensive tasks or batch processing, you should:

- Use models that load fast and require minimal transformation into GPU-ready structures, and optimize how they are loaded.
- Use configurations that allow for maximum, efficient, concurrent execution to reduce the number of GPUs needed to reach a target requests-per-second rate while keeping costs down.

Recommended ways to load large ML models on Cloud Run
-----------------------------------------------------

Google recommends either storing ML models [inside container images](#model-container)
or [optimizing how they are loaded from Cloud Storage](#model-storage).

### Storing and loading ML models trade-offs

The following sections compare these options.

### Store models in container images

By storing the ML model in the container image, model loading benefits from Cloud Run's optimized container streaming infrastructure.
However, building container images that include ML models is a resource-intensive process,
especially when working with large models. In particular, the build process can become
bottlenecked on network throughput. When using Cloud Build, we recommend
using a more powerful build machine with increased compute and networking
performance. To do this, build an image using a
[build config file](/build/docs/build-push-docker-image#build_an_image_using_a_build_config_file)
that has the following steps:

```yaml
steps:
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'IMAGE', '.']
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'IMAGE']
images:
- IMAGE
options:
  machineType: 'E2_HIGHCPU_32'
  diskSizeGb: '500'
```

Note that if the layer containing the model differs between images (different hash), each image includes its own copy of the model.
This can incur additional Artifact Registry storage cost, because Artifact Registry stores one copy of the model per image when the model layer is unique to each image.

### Store models in Cloud Storage

To optimize loading ML models from Cloud Storage, whether you use [Cloud Storage volume mounts](/run/docs/configuring/jobs/cloud-storage-volume-mounts) or the Cloud Storage API or command line directly, you must use [Direct VPC](/run/docs/configuring/vpc-direct-vpc) with the egress setting set to `all-traffic`, along with [Private Google Access](/vpc/docs/configure-private-google-access).

### Load models from the internet

To optimize loading ML models from the internet, [route all traffic through the VPC network](/run/docs/configuring/vpc-direct-vpc) with the egress setting set to `all-traffic`, and set up [Cloud NAT](/nat/docs/overview) to reach the public internet at high bandwidth.

Build, deployment, runtime, and system design considerations
------------------------------------------------------------

The following sections describe considerations for build, deployment, runtime, and system design.

### At build time

Take the following considerations into account when you plan your build:

- Choose a good base image. Start with an image from the [Deep Learning Containers](/deep-learning-containers/docs/choosing-container) or the [NVIDIA container registry](https://catalog.ngc.nvidia.com/containers) for the ML framework you're using. These images have the latest performance-related packages installed. We don't recommend creating a custom image.
- Choose 4-bit quantized models to maximize concurrency, unless you can show that they affect result quality. Quantization produces smaller and faster models, reducing the amount of GPU memory needed to serve the model, and can increase parallelism at run time.
Ideally, models should be trained at the target bit depth rather than quantized down to it.
- Pick a model format with fast load times, such as GGUF, to minimize container startup time. These formats reflect the target quantization type more accurately and require fewer transformations when loaded onto the GPU. For security reasons, don't use pickle-format checkpoints.
- Create and warm LLM caches at build time. Start the LLM on the build machine while building the Docker image. Enable prompt caching and feed it common or example prompts to warm the cache for real-world use. Save the outputs it generates so they can be loaded at runtime.
- Save your own inference model that you generate during build time. This saves significant time compared to loading less efficiently stored models and applying transformations such as quantization at container startup.

### At deployment

Take the following considerations into account when you plan your deployment:

- Set a [task timeout of one hour or less](/run/docs/configuring/task-timeout) for job executions.
- If you are running parallel tasks in a job execution, set [parallelism](/run/docs/configuring/parallelism) to less than the lowest of the [applicable quota limits](/run/docs/configuring/jobs/gpu#request-quota) allocated to your project. By default, the GPU job instance quota is set to `5` for tasks that run in parallel. To request a quota increase, see [How to increase quota](/run/quotas#increase). GPU tasks start as quickly as possible and scale up to a maximum that depends on how much GPU quota you allocated for the project and the selected region. Deployments fail if you set parallelism higher than the GPU quota limit.

### At run time

- Actively manage your supported context length. The smaller the context window you support, the more queries you can run in parallel.
The details of how to do this depend on the framework.
- Use the LLM caches you generated at build time. Supply the same flags that you used at build time when you generated the prompt and prefix caches.
- Load from the saved model you wrote at build time. See [Storing and loading ML models trade-offs](#loading-storing-models-tradeoff) for a comparison of ways to load the model.
- Consider using a quantized key-value cache if your framework supports one. This can reduce per-query memory requirements and lets you configure more parallelism, but it can also affect quality.
- Tune the amount of GPU memory reserved for model weights, activations, and key-value caches. Set it as high as you can without getting an out-of-memory error.
- Check whether your framework has options for improving container startup performance (for example, model loading parallelization).

### At the system design level

- Add semantic caches where appropriate. In some cases, caching whole queries and responses can be a great way to limit the cost of common queries.
- Control variance in your preambles. Prompt caches are useful only when they contain the prompts in sequence; they are effectively prefix caches. Insertions or edits earlier in the sequence mean the cache is either missed entirely or only partially hit.
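The run-time trade-off between context length and parallelism can be made concrete with simple arithmetic. The following sketch estimates how many sequences fit concurrently in a fixed key-value-cache memory budget; all model dimensions and the memory budget are illustrative assumptions, not Cloud Run or framework values.

```python
# Estimate concurrent request capacity for a given context window,
# assuming a fixed GPU memory budget reserved for the KV cache.
# The model shape below is a hypothetical 8B-class transformer.

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    # Each token stores one key and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def max_concurrent_sequences(kv_budget_bytes: int, context_len: int,
                             per_token_bytes: int) -> int:
    # Each sequence reserves KV-cache space for its full context window.
    return kv_budget_bytes // (context_len * per_token_bytes)

# Assumptions: 32 layers, 8 KV heads, head dim 128, a 1-byte (quantized)
# KV cache, and a 16 GiB KV-cache budget.
per_token = kv_cache_bytes_per_token(32, 8, 128, 1)   # 65536 bytes/token
budget = 16 * 1024**3

for ctx in (2048, 8192, 32768):
    print(ctx, max_concurrent_sequences(budget, ctx, per_token))
```

Halving the supported context window doubles the estimated concurrency, which is why managing context length (and quantizing the KV cache) directly reduces the number of GPUs needed for a target throughput.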
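The whole-query caching idea from the system design section can be sketched as follows. This is a minimal exact-match cache under assumed semantics (a production semantic cache would typically match on embeddings rather than normalized strings, and the class name here is hypothetical, not a Cloud Run or framework API):

```python
# Minimal sketch of a whole-query response cache: identical queries are
# normalized, hashed, and served without touching the model.
import hashlib

class ResponseCache:
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivially different
        # phrasings of the same query hit the same entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        # Returns the cached response, or None on a cache miss.
        return self._store.get(self._key(query))

    def put(self, query: str, response: str) -> None:
        self._store[self._key(query)] = response

cache = ResponseCache()
cache.put("What is Cloud Run?", "A managed compute platform.")
print(cache.get("what is  cloud run?"))  # -> A managed compute platform.
```

For common queries, a hit here avoids a GPU inference entirely, which is the cost-limiting effect the section above describes.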