Troubleshooting PyTorch - TPU

This guide provides troubleshooting information to help you identify and resolve problems you might encounter while training PyTorch models on Cloud TPU. For a more general guide to getting started with Cloud TPU, see the PyTorch quickstart.

Troubleshooting slow training performance

If your model trains slowly, generate and review a metrics report.
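As a minimal sketch of generating the report from Python, the snippet below assumes the `torch_xla` package is installed and uses its `torch_xla.debug.metrics` module. The helper name `get_metrics_report` is made up for this illustration. It returns `None` when `torch_xla` is not available, so the snippet can also run on a machine without it.

```python
def get_metrics_report():
    """Return the PyTorch/XLA metrics report as text, or None if torch_xla is missing."""
    try:
        # torch_xla is only present on machines set up for PyTorch/XLA.
        import torch_xla.debug.metrics as met
    except ImportError:
        return None
    # metrics_report() returns a text summary of the metrics and counters
    # collected so far in this process.
    return met.metrics_report()


report = get_metrics_report()
if report is None:
    print("torch_xla is not installed; no metrics report available")
else:
    print(report)
```

The report is most useful when printed after at least a few training steps have run, so that the metrics reflect real work.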

To get an automatic analysis and summary of the metrics report, run your workload with PT_XLA_DEBUG=1.
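One way to set the variable, sketched here under assumptions, is to launch the workload as a child process with PT_XLA_DEBUG=1 added to its environment. The helper name `run_with_xla_debug` and the `train.py` script mentioned in the comment are hypothetical. The child process in this example only echoes the variable, so the sketch runs without a TPU.

```python
import os
import subprocess
import sys


def run_with_xla_debug(command):
    """Run a command with PT_XLA_DEBUG=1 added to a copy of the current environment."""
    env = dict(os.environ)
    env["PT_XLA_DEBUG"] = "1"
    return subprocess.run(command, env=env, capture_output=True, text=True)


# Stand-in child process that only echoes the variable.
# In practice the command would be your own training script,
# for example [sys.executable, "train.py"] (hypothetical file name).
result = run_with_xla_debug(
    [sys.executable, "-c", "import os; print(os.environ['PT_XLA_DEBUG'])"]
)
print(result.stdout.strip())
```

For a real training run you would likely drop `capture_output=True` so that the debug output streams to the terminal as the workload runs.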

For more information about issues that might cause your model to train slowly, see Known performance caveats.

Performance profiling

To profile your workload in depth and find bottlenecks, review these resources:

More debugging tools

You can specify environment variables to control the behavior of the PyTorch/XLA software stack.

If you encounter an unexpected bug and need help, file a GitHub issue.

Managing XLA tensors

XLA Tensor Quirks describes what to do and what to avoid when working with XLA tensors and shared weights.