Last updated (UTC): 2025-08-18.

# Train in a container using Google Kubernetes Engine

This page shows you how to run a training job in a Deep Learning Containers
instance, and run that container image on a Google Kubernetes Engine (GKE)
cluster.

Before you begin
----------------

Before you begin, make sure
you have completed the following steps.

1. Complete the setup steps in the Before you begin section of
   [Getting started with a local deep learning
   container](/deep-learning-containers/docs/getting-started-local).

2. Make sure that billing is enabled for your Google Cloud project.

   [Learn how to enable
   billing](https://cloud.google.com/billing/docs/how-to/modify-project)

3. Enable the Google Kubernetes Engine, Compute Engine, and
   Artifact Registry APIs.

   [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=container.googleapis.com,compute_component,artifactregistry.googleapis.com)

Open your command line tool
---------------------------

You can follow this guide using [Cloud Shell](https://cloud.google.com/shell)
or command-line tools locally. Cloud Shell comes preinstalled with the
`gcloud`, `docker`, and `kubectl` command-line tools used in this tutorial.
If you use Cloud Shell, you don't need to install these command-line tools
on your workstation.

### Cloud Shell

To use Cloud Shell, complete the following steps.

1. Go to the [Google Cloud console](https://console.cloud.google.com/).

2. Click the **Activate Cloud Shell** button at the top of the console
   window.

   A Cloud Shell session opens inside a new frame at the bottom of the
   console and displays a command-line prompt.

### Local command line

To use your local command line, complete the following step.

1. 
Using the gcloud CLI, install the
   [Kubernetes](https://kubernetes.io/) command-line tool. `kubectl` is
   used to communicate with Kubernetes, the cluster orchestration system
   that GKE clusters run on:

       gcloud components install kubectl

   When you completed the [getting started
   steps](/deep-learning-containers/docs/getting-started-local), you
   installed the Google Cloud CLI and
   [Docker](https://docs.docker.com/engine/installation/).

Create a GKE cluster
--------------------

Run the following command to create a two-node cluster in GKE named
`pytorch-training-cluster`:

    gcloud container clusters create pytorch-training-cluster \
        --num-nodes=2 \
        --zone=us-west1-b \
        --accelerator="type=nvidia-tesla-p100,count=1" \
        --machine-type="n1-highmem-2" \
        --scopes="gke-default,storage-rw"

For more information about these settings, see the [documentation on
creating clusters for running
containers](/sdk/gcloud/reference/container/clusters/create).

It may take several minutes to create the cluster.

Alternatively, instead of creating a cluster, you can use an existing
cluster in your Google Cloud project. 
If you do this, you might need to
run the following command to make sure that the `kubectl` command-line tool
has the proper credentials to access your cluster:

    gcloud container clusters get-credentials <var translate="no">YOUR_EXISTING_CLUSTER</var>

Next, [install the NVIDIA GPU device
drivers](/kubernetes-engine/docs/how-to/gpus#installing_drivers).

Create the Dockerfile
---------------------

There are many ways to build a container image. These steps show you how to
build one that runs a Python script named `trainer.py`.

To view a list of available container images, run:

    gcloud container images list \
        --repository="us-docker.pkg.dev/deeplearning-platform-release/gcr.io"

To help you select a container, see [Choosing a
container](/deep-learning-containers/docs/choosing-container).

The following example shows you how to place a Python script named
`trainer.py` into a specific PyTorch deep learning container.

To create the Dockerfile, write the following commands to a file named
`Dockerfile`. This step assumes that you have code to train a machine
learning model in a directory named `model-training-code` and that the main
Python module in that directory is named `trainer.py`. In this scenario, the
container is removed once the job completes, so your training script should
be configured to output to Cloud Storage (see [an example of a script that
outputs to Cloud
Storage](https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/pytorch/containers/quickstart/mnist/trainer/mnist.py))
or to [persistent
storage](/kubernetes-engine/docs/concepts/persistent-volumes).
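For illustration only, here is a minimal sketch of what
`model-training-code/trainer.py` could look like. It substitutes a toy
pure-Python loop for a real PyTorch model, and it persists its result under a
`MODEL_DIR` directory (a hypothetical environment variable, not part of this
tutorial) rather than calling the Cloud Storage client, so that the overall
shape — train, then write the output somewhere durable — is easy to see:

```python
import json
import os
import pathlib


def train(steps: int = 100, lr: float = 0.1) -> float:
    """Stand-in training loop: gradient descent on f(w) = (w - 3)^2.

    A real trainer.py would build and fit a PyTorch model here instead.
    """
    w = 0.0
    for _ in range(steps):
        w -= lr * 2 * (w - 3)  # df/dw = 2(w - 3)
    return w


def main() -> None:
    # MODEL_DIR is an assumed convention for this sketch: point it at a
    # mounted persistent volume, or replace this file write with an upload
    # to a Cloud Storage bucket.
    out_dir = pathlib.Path(os.environ.get("MODEL_DIR", "/tmp/model"))
    out_dir.mkdir(parents=True, exist_ok=True)
    weight = train()
    (out_dir / "model.json").write_text(json.dumps({"weight": weight}))


if __name__ == "__main__":
    main()
```

Because the container's filesystem disappears with the Pod, whatever replaces
the file write at the end of `main()` is the part that matters.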
    FROM us-docker.pkg.dev/deeplearning-platform-release/gcr.io/pytorch-gpu
    COPY model-training-code /train
    CMD ["python", "/train/trainer.py"]

Build and upload the container image
------------------------------------

To build and upload the container image to Artifact Registry, use the
following commands:

    export PROJECT_ID=$(gcloud config list project --format "value(core.project)")
    export IMAGE_REPO_NAME=pytorch_custom_container
    export IMAGE_TAG=$(date +%Y%m%d_%H%M%S)
    export IMAGE_URI=us-docker.pkg.dev/$PROJECT_ID/$IMAGE_REPO_NAME:$IMAGE_TAG

    docker build -f Dockerfile -t $IMAGE_URI ./

    docker push $IMAGE_URI

Deploy your application
-----------------------

Create a file named `pod.yaml` with the following contents, replacing
<var translate="no">IMAGE_URI</var> with your image's URI.

    apiVersion: v1
    kind: Pod
    metadata:
      name: gke-training-pod
    spec:
      containers:
      - name: my-custom-container
        image: <var translate="no">IMAGE_URI</var>
        resources:
          limits:
            nvidia.com/gpu: 1

Use the `kubectl` command-line tool to run the following command and deploy
your application:

    kubectl apply -f ./pod.yaml

To track the Pod's status, run the following command:

    kubectl describe pod gke-training-pod
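The `kubectl describe` output is meant for reading, not scripting. If you
want a script to watch the training Pod instead, `kubectl get pod
gke-training-pod -o json` emits structured status. A minimal sketch, assuming
you pass that JSON in as a string, of pulling out the Pod phase:

```python
import json


def pod_phase(pod_json: str) -> str:
    """Extract status.phase from `kubectl get pod NAME -o json` output.

    Kubernetes reports one of: Pending, Running, Succeeded, Failed, Unknown.
    """
    return json.loads(pod_json).get("status", {}).get("phase", "Unknown")


# Example with a trimmed-down Pod object, as kubectl would emit it:
sample = '{"kind": "Pod", "status": {"phase": "Succeeded"}}'
print(pod_phase(sample))  # Succeeded
```

You could poll this in a loop and stop when the phase reaches `Succeeded` or
`Failed`, then fetch the training output from wherever your script wrote it.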