[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-04-11。"],[[["\u003cp\u003eAI Platform Training allows you to run distributed PyTorch training jobs on a cluster of virtual machines, with environment variables configured to support the distribution of tasks.\u003c/p\u003e\n"],["\u003cp\u003eDistributed PyTorch training on AI Platform requires one master worker node to manage connections and one or more worker nodes to handle portions of the training workload.\u003c/p\u003e\n"],["\u003cp\u003eWhen configuring your distributed training job, you must specify the Docker container images for the master worker and worker nodes, where the worker node image defaults to the master worker image if not explicitly defined.\u003c/p\u003e\n"],["\u003cp\u003eYou must update your training code to initialize the cluster using \u003ccode\u003etorch.distributed.init_process_group\u003c/code\u003e with a specified backend (\u003ccode\u003egloo\u003c/code\u003e for CPU or \u003ccode\u003enccl\u003c/code\u003e for GPU) and use the \u003ccode\u003etorch.nn.parallel.DistributedDataParallel\u003c/code\u003e class to distribute training.\u003c/p\u003e\n"],["\u003cp\u003eAI Platform Training sets environment variables like \u003ccode\u003eWORLD_SIZE\u003c/code\u003e, \u003ccode\u003eRANK\u003c/code\u003e, \u003ccode\u003eMASTER_ADDR\u003c/code\u003e, and \u003ccode\u003eMASTER_PORT\u003c/code\u003e on each node to facilitate PyTorch cluster initialization.\u003c/p\u003e\n"]]],[],null,[]]