# Training on TPU slices

Last updated (UTC): 2025-08-11.

TPUs are designed to be scaled out to a TPU Pod. A TPU Pod is a collection of
TPU devices connected by dedicated high-speed network interfaces. A TPU Pod
lets you distribute the processing load across multiple TPUs. Each TPU board
is connected to a high-performance CPU-based host machine for tasks such as
loading and preprocessing data. To take full advantage of larger numbers of
TPUs, you must tune several training task parameters.

The setup for training with TPU Pods is different for each framework.
Use the following links for detailed information about training on Pods
with each framework:

- [JAX](/tpu/docs/jax-pods)
- [PyTorch](/tpu/docs/pytorch-pods)

The following sections explain some common issues, changes you need to make
in your models, and best practices to reduce or avoid Pod failures.

Scaling batch size and train steps
----------------------------------

To achieve linear scaling on larger TPU types, keep the per-core batch size
the same.

For example, if you use a batch size of 1024 on a v6e-8, use a batch size of
4096 (4 \* 1024) on a v6e-32. This fully utilizes the TPU hardware. You can
use smaller batch sizes, but your training won't scale linearly if you do so.

Some models include a `train_steps` flag where one step corresponds to
processing a single batch of data.
When you increase the batch size, scale down the number of training steps so
that the total number of training examples processed remains the same.

For example, if you have a batch size of 1000 for 100 steps, 100,000 examples
are processed during training. If you now have 4 workers and an effective
batch size of 4000, you must adjust the number of steps to 25 to process
those same 100,000 examples. If your model uses an `epochs` flag, you don't
need to scale the number of steps.

Larger batch sizes can change the convergence behavior of the model, so you
might also need to tune some hyperparameters, such as the learning rate.

Using regional Cloud Storage buckets in the same region as the TPU Pod
----------------------------------------------------------------------

In general, the best practice for TPU training is to always use resources in
the same region. Resource region is particularly important when using TPU
Pods because data transfer rates are higher when your Cloud Storage bucket
and your TPU are in the same region.

Ensure that you are using a regional Cloud Storage bucket in the same region
as the TPU for training datasets and checkpoints.

Workflow best practices for development on TPU Pods
---------------------------------------------------

When developing a new TPU workload, it is often best to begin development on
the smallest TPUs and progressively iterate to larger TPU sizes. Start by
using a small TPU version (for example, v6e-8).

- Test your workload for expected behavior.
- Test and validate performance using the performance tools.

After your workload is functional and reaches your performance targets, scale
up to a larger TPU type, such as a v6e-32. Gradually and iteratively increase
the TPU size while validating scalability (functionality and performance)
until you reach the TPU size that you want.
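The batch-size and step-count scaling rules described above can be sketched as a small helper. This is an illustrative sketch, not part of any TPU framework API; the function name and the per-core batch size of 128 are assumptions chosen so the numbers match the examples in this document.

```python
def scale_for_pod(per_core_batch_size: int, num_cores: int,
                  total_examples: int) -> tuple[int, int]:
    """Keep the per-core batch size fixed across TPU sizes and rescale
    the number of training steps so that the same total number of
    training examples is processed."""
    global_batch_size = per_core_batch_size * num_cores
    train_steps = total_examples // global_batch_size
    return global_batch_size, train_steps

# A v6e-8 run with a global batch size of 1024 (128 per core)
# over 102,400 examples takes 100 steps:
print(scale_for_pod(128, 8, 102_400))   # (1024, 100)

# The same workload on a v6e-32: 4x the cores gives 4x the global
# batch size, so only a quarter of the steps are needed:
print(scale_for_pod(128, 32, 102_400))  # (4096, 25)
```

Remember that hyperparameters such as the learning rate are not adjusted by arithmetic like this; they typically need retuning when the global batch size grows.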
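As a quick sanity check for the bucket-region guidance above, a pure-Python helper can compare a bucket's location with the region implied by a TPU zone. The helper and the example location names are hypothetical illustrations, not a Google Cloud API; it relies only on the convention that a zone name is a region name plus a one-letter suffix.

```python
def bucket_matches_tpu_region(bucket_location: str, tpu_zone: str) -> bool:
    """Return True if a regional bucket's location (e.g. "us-east5")
    matches the region of a TPU zone (e.g. "us-east5-b")."""
    # A zone name is a region name plus a one-letter suffix; strip it.
    tpu_region = tpu_zone.rsplit("-", 1)[0]
    return bucket_location.lower() == tpu_region.lower()

print(bucket_matches_tpu_region("us-east5", "us-east5-b"))      # True
print(bucket_matches_tpu_region("europe-west4", "us-east5-b"))  # False
```

In practice, you would feed this check with the bucket location and TPU zone you configured when creating the resources.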