You can use the Dockerfiles and scripts
that we use to build our Model Garden containers as a reference or
starting point for building your own custom containers.
Serving inferences with NVIDIA NIM
NVIDIA Inference Microservices (NIM)
are pre-trained, optimized AI models that are packaged as microservices.
They're designed to simplify the deployment of high-performance, production-ready AI into applications.
NVIDIA NIM can be used together with
Artifact Registry and Vertex AI
to deploy generative AI models for online inference.
Settings for custom containers
This section describes fields in your model's
containerSpec that you may need to
specify when importing generative AI models.
You can specify these fields by using the Vertex AI REST API or the
gcloud ai models upload command. For more information, see
Container-related API fields.
sharedMemorySizeMb
Some generative AI models require more shared memory. Shared memory is an inter-process communication (IPC) mechanism that allows multiple processes to access and manipulate a common block of memory. The default shared memory size is 64 MB.
Some model servers, such as vLLM
or NVIDIA Triton, use shared memory to cache internal data during model inference. Without enough shared memory, some model servers can't serve inferences for generative models. The amount of shared memory needed, if any, is an implementation detail of your container and model. Consult your model server's documentation for guidelines.
Also, because shared memory can be used for cross-GPU communication, using more shared memory can improve performance on accelerators without NVLink capability (for example, L4) if the model container requires communication across GPUs.
For information on how to specify a custom value for shared memory, see Container-related API fields.
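As a minimal sketch, the shared-memory size can be raised in the model's containerSpec when importing through the REST API. The image URI, display name, and size below are illustrative placeholders, not recommendations:

```python
# Sketch of a Model resource body for the models.upload REST method.
# The display name and container image URI are hypothetical placeholders.
model_body = {
    "displayName": "my-generative-model",
    "containerSpec": {
        "imageUri": "us-docker.pkg.dev/my-project/my-repo/my-model-server:latest",
        "ports": [{"containerPort": 8080}],
        # Raise shared memory above the 64 MB default. The right value is
        # an implementation detail of your model server; consult its docs.
        "sharedMemorySizeMb": "16384",  # illustrative: 16 GiB
    },
}

# Like other int64 fields in Google Cloud REST APIs, the value is
# serialized as a decimal string in JSON.
assert model_body["containerSpec"]["sharedMemorySizeMb"].isdigit()
```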
startupProbe
A startup probe is an optional probe that is used to detect when the
container has started. This probe is used to delay the health probe and
liveness checks
until the container has started, which helps prevent slow-starting containers
from being shut down prematurely.
For more information, see Health checks.
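A startup probe can be sketched in the containerSpec like this. The probe command and the readiness-marker file are hypothetical conventions; use a check that your container actually supports:

```python
# Sketch of a startupProbe in a model's containerSpec (REST API shape).
startup_probe = {
    "exec": {
        # Succeeds once the server process has written its marker file;
        # /tmp/server-started is a hypothetical convention, not a standard path.
        "command": ["cat", "/tmp/server-started"],
    },
    "periodSeconds": 10,   # how often the probe runs
    "timeoutSeconds": 5,   # per-attempt timeout
}

container_spec = {
    "imageUri": "us-docker.pkg.dev/my-project/my-repo/my-model-server:latest",
    "startupProbe": startup_probe,
}
```

Until the startup probe succeeds, health and liveness checking is held off, so a container that takes minutes to initialize isn't killed mid-startup.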
healthProbe
The health probe checks whether a container is ready to accept traffic.
If no health probe is provided, Vertex AI uses a default health check that issues an HTTP request to the container's port and looks for a 200 OK response from the model server.
If your model server responds with 200 OK before the model is fully loaded, which is possible, especially for large models, the health check succeeds prematurely and Vertex AI routes traffic to the container before it's ready.
In these cases, specify a custom health probe that succeeds only after the model is fully loaded and ready to accept traffic.
For more information, see Health checks.
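A custom health probe might look like the sketch below. It assumes the model server exposes a /ready route that fails until the model is fully loaded, and that curl is available inside the container; both are assumptions to verify against your server:

```python
# Sketch of a custom healthProbe (REST API shape). The /ready endpoint
# is a hypothetical route; check what your model server actually exposes.
health_probe = {
    "exec": {
        # --fail makes curl exit non-zero on HTTP errors, so the probe
        # fails until /ready starts returning 2xx.
        "command": ["curl", "--fail", "--silent", "http://localhost:8080/ready"],
    },
    "periodSeconds": 10,
    "timeoutSeconds": 5,
}
```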
Limitations
Consider the following limitations when deploying generative AI models:
Generative AI models can only be deployed to a single machine. Multi-host deployment isn't supported.
For very large models that don't fit in the largest supported vRAM, such as Llama 3.1 405B, we recommend quantizing them to fit.
Last updated 2025-08-18 UTC.