Troubleshooting PyTorch - TPU
This guide provides troubleshooting information to
help you identify and resolve problems you might encounter while training
PyTorch models on Cloud TPU. For a more general guide to
getting started with Cloud TPU, see the
PyTorch quickstart.

Last updated 2025-08-28 UTC.

**Note:** If you aren't able to resolve your issue using this guide, see [Getting Support](/tpu/docs/getting-support) for further assistance.

Troubleshooting slow training performance
-----------------------------------------

If your model trains slowly, [generate and review a metrics report](https://pytorch.org/xla/release/r2.6/learn/troubleshoot.html#get-a-metrics-report).

To automatically analyze the metrics report and provide a summary, run
your workload with PT_XLA_DEBUG=1.

For more information about issues that might cause your model to train slowly,
see [Known performance caveats](https://pytorch.org/xla/release/r2.6/learn/troubleshoot.html#known-performance-caveats).

Performance profiling
---------------------

To profile your workload in-depth to discover bottlenecks, review these resources:

- [PyTorch/XLA performance profiling](https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm)
- [Sample MNIST training script with profiling](https://github.com/pytorch/xla/blob/master/test/test_profile_mp_mnist.py)

More debugging tools
--------------------

You can specify [environment variables](https://pytorch.org/xla/release/r2.6/learn/troubleshoot.html#environment-variables)
to control the behavior of the PyTorch/XLA software stack.

If you encounter an unexpected bug and need help, [file a GitHub issue](https://github.com/pytorch/xla).

Managing XLA tensors
--------------------

[XLA tensor Quirks](https://pytorch.org/xla/release/r2.6/learn/troubleshoot.html#xla-tensor-quirks)
describes what you should and shouldn't do when working with XLA tensors and
shared weights.
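
As a rough illustration of the slow-training advice in this guide (reviewing a metrics report and running with PT_XLA_DEBUG=1), the following Python sketch shows one way to do both. The helper names `run_with_xla_debug` and `metrics_report_or_none` are hypothetical and not part of PyTorch/XLA. The sketch assumes that `torch_xla.debug.metrics.metrics_report()` is available when `torch_xla` is installed, and it returns `None` otherwise. The demo at the end only echoes the variable from a child process; in practice you would pass the command that starts your own training script.

```python
import os
import subprocess
import sys


def run_with_xla_debug(command, extra_env=None):
    """Run `command` in a child process with PT_XLA_DEBUG=1 set.

    Hypothetical helper: it copies the current environment, adds the
    debug variable, and leaves the parent process environment unchanged.
    """
    env = dict(os.environ)
    env["PT_XLA_DEBUG"] = "1"
    if extra_env:
        # Optional extra variables for the child process only.
        env.update(extra_env)
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=False
    )


def metrics_report_or_none():
    """Return the PyTorch/XLA metrics report as a string.

    Returns None when torch_xla is not installed, so the sketch can
    still be run on a machine without a TPU software stack.
    """
    try:
        import torch_xla.debug.metrics as met
    except ImportError:
        return None
    return met.metrics_report()


# Demo: the child process sees PT_XLA_DEBUG=1 and prints its value.
result = run_with_xla_debug(
    [sys.executable, "-c", "import os; print(os.environ['PT_XLA_DEBUG'])"]
)
print(result.stdout.strip())

# Print the metrics report only if torch_xla is importable here.
report = metrics_report_or_none()
if report is not None:
    print(report)
```

Setting the variable only in the child environment keeps the debug output scoped to a single run. To capture a useful metrics report, call `metrics_report_or_none()` inside the training process after some training steps have executed, since the report reflects the work done by that process.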