Anthos Service Mesh and Traffic Director are now Cloud Service Mesh. For more information, see the Cloud Service Mesh overview.

Resolving scaling issues in Cloud Service Mesh

This section explains common Cloud Service Mesh problems and how to resolve them. If you need additional assistance, see Getting support.

Scaling factors

Istiod sends configuration to each sidecar proxy over a long-lived gRPC stream. Several characteristics of this configuration distribution affect scaling:

The size of the configuration to generate:
- Determined by the total number of services, pods, and Istio resources.
- At large scale, use the Sidecar resource to reduce the size of the configuration that each proxy receives.
The rate of change in the environment:
- When a new service is created or the Istio configuration changes, full updates are sent to proxies.
- Adding new endpoints is inexpensive, because only incremental updates are sent.
The number of proxies for which configuration is generated:
- Determined by the number of gateways and pods with a sidecar.
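
As an illustration of reducing configuration size, the following is a minimal sketch of an Istio Sidecar resource that limits the services a namespace's proxies are configured for. The namespace name my-app is hypothetical, and the egress hosts shown are an assumption about what a typical workload needs; adjust them to the dependencies of your own services.

```yaml
# Sketch only: restricts outbound configuration for every proxy in one namespace.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: my-app          # hypothetical namespace
spec:
  egress:
  - hosts:
    - "./*"                  # services in the same namespace
    - "istio-system/*"       # services in the istio-system namespace
```

With a resource like this in place, Istiod generates configuration only for the listed hosts instead of for every service in the mesh. Any service that a workload calls but that is not matched by the hosts list is no longer part of the proxy's configuration, so review dependencies before you apply it.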

Scaling considerations

Istiod scales well both vertically (larger resource requests) and horizontally (more replicas). Make sure that your CPU limits are not too restrictive: if Istiod reaches its CPU limit, it might be throttled, which slows configuration distribution. If you encounter performance issues, consider upgrading to the latest version of Cloud Service Mesh, because each version includes performance optimizations.
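
For an in-cluster control plane that you install with an IstioOperator overlay, vertical scaling might look like the following sketch. The CPU and memory values are illustrative assumptions, not recommendations, and this fragment does not apply to a managed control plane.

```yaml
# Sketch only: raises Istiod resource requests through an IstioOperator overlay.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        resources:
          requests:
            cpu: 1000m       # illustrative value
            memory: 4096Mi   # illustrative value
```

This sketch sets requests only and leaves CPU limits unset, which avoids the throttling described above at the cost of less predictable node usage. If you do set a limit, leave enough headroom for bursts of configuration generation.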

For more guidance on scaling your mesh, see the Scalability best practices guide.

Unbalanced load

Large changes in cluster size might cause a temporarily unbalanced load, because proxies hold long-lived connections to Istiod. A 30-minute maximum connection age mitigates this: as connections are closed and re-established, the load naturally rebalances. These disconnects might result in error messages in Envoy, such as gRPC config stream closed: 13.

Mitigate this issue by running multiple replicas of Istiod (the default is 2 replicas) and by pre-scaling Istiod if you expect extreme cluster scale-ups.
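
One possible way to pre-scale an in-cluster control plane is to raise the minimum replica count of the Istiod autoscaler in an IstioOperator overlay, as in the following sketch. The replica counts are illustrative assumptions; choose values based on the size of the scale-up that you expect.

```yaml
# Sketch only: keeps more Istiod replicas running ahead of a large scale-up.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        hpaSpec:
          minReplicas: 3     # illustrative value
          maxReplicas: 10    # illustrative value
```

Raising minReplicas before the scale-up means new proxies spread across replicas that already exist, instead of concentrating on a few replicas until connections are recycled.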