Troubleshooting PyTorch - TPU
This guide provides troubleshooting information to help you identify and resolve problems you might encounter while training PyTorch models on Cloud TPU. For a more general guide to getting started with Cloud TPU, see the PyTorch quickstart.
Troubleshooting slow training performance
If your model trains slowly, generate and review a metrics report.
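As a minimal sketch of generating the report, the helper below returns the PyTorch/XLA metrics report as a string. It assumes the torch_xla package is installed and that the report is requested after at least a few training steps have run. The helper name get_metrics_report is hypothetical; met.metrics_report() is the PyTorch/XLA call.

```python
def get_metrics_report():
    """Return the PyTorch/XLA metrics report as text, or None if torch_xla is missing."""
    try:
        # Imported lazily so this sketch still loads on machines without torch_xla.
        import torch_xla.debug.metrics as met
    except ImportError:
        return None
    # The report is a plain-text dump of the metrics and counters
    # collected in the current process so far.
    return met.metrics_report()


if __name__ == "__main__":
    report = get_metrics_report()
    print(report if report is not None else "torch_xla is not installed")
```

Calling the helper at the end of a training loop, or every N steps, is usually enough for a first look.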
To have PyTorch/XLA automatically analyze the metrics report and provide a summary, run your workload with the environment variable PT_XLA_DEBUG=1 set.
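One way to set the variable is to launch the training script from a small Python wrapper, sketched below. The helper names xla_debug_env and run_with_xla_debug, and the script path passed in, are hypothetical. The sketch assumes the variable should already be present in the environment when the training process starts, which is why it is set on the child process rather than inside the running one.

```python
import os
import subprocess
import sys


def xla_debug_env(base_env=None):
    """Return a copy of the environment with PT_XLA_DEBUG enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["PT_XLA_DEBUG"] = "1"
    return env


def run_with_xla_debug(script_path):
    """Run a training script in a child process with PT_XLA_DEBUG=1."""
    # script_path is a hypothetical path to your own training script.
    return subprocess.run([sys.executable, script_path], env=xla_debug_env())
```

For example, run_with_xla_debug("train.py") would start a hypothetical train.py with the debug summary enabled, leaving the parent process's environment unchanged.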
For more information about issues that might cause your model to train slowly, see Known performance caveats.
Performance profiling
To profile your workload in depth and discover bottlenecks, review these resources:
More debugging tools
You can specify environment variables to control the behavior of the PyTorch/XLA software stack.
If you encounter an unexpected bug and need help, file a GitHub issue.
Managing XLA tensors
XLA Tensor Quirks describes what you should and shouldn't do when working with XLA tensors and shared weights.