Troubleshoot the Kubernetes controller manager

This page shows you how to resolve issues with the Kubernetes controller manager (kube-controller-manager) for Google Distributed Cloud.

Leader election lost

This error might be observed in a regional cluster or replicated control plane when kube-controller-manager (KCM) restarts unexpectedly. KCM might exit on its own or be restarted by the kubelet. The KCM logs might include leaderelection lost messages.

This scenario can occur when the leader checks if it's still actively leading as part of the KCM health check.

If the leader is no longer leading or the lease check fails, the health check reports the leader as unhealthy, and the leader is restarted.

You can retrieve the leader election status from the Lease resources of the coordination.k8s.io API group.

To see all leases, run the following kubectl command:
```
kubectl -n kube-system get lease
```

To check the status of a given lease, such as lease/kube-controller-manager, use the following kubectl describe command:

```
kubectl -n kube-system describe lease/kube-controller-manager
```

Under the Events section, check for LeaderElection events. Review which instance takes the leadership and when that happens. The following example output shows that after the first node was manually shut down, the second node immediately took over the leadership:

```
Events:
  Type    Reason          Age    From                     Message
  ----    ------          ----   ----                     -------
  Normal  LeaderElection  26m    kube-controller-manager  control-plane_056a86ec-84c5-48b8-b58d-86f3fde2ecdd became leader
  Normal  LeaderElection  5m20s  kube-controller-manager  control-plane2_b0475d49-7010-4f03-8a9d-34f82ed60cd4 became leader
```
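
Events are only retained for a limited time, so it can also help to read the current holder directly from the Lease spec. The following is a minimal sketch, not an official tool: lease_holder is a hypothetical helper name, and the default lease name and namespace are assumptions taken from the commands on this page. The fields holderIdentity, renewTime, and leaseTransitions are part of the standard Lease spec.

```shell
# Hypothetical helper: print the holder, last renewal time, and number of
# leadership transitions for a lease, separated by tabs.
# Usage: lease_holder [lease-name] [namespace]
lease_holder() {
  kubectl -n "${2:-kube-system}" get lease "${1:-kube-controller-manager}" \
    -o jsonpath='{.spec.holderIdentity}{"\t"}{.spec.renewTime}{"\t"}{.spec.leaseTransitions}{"\n"}'
}
```

Running lease_holder a few times in a row shows whether renewTime keeps advancing for the same holder. A holder that changes, or a leaseTransitions count that keeps growing, suggests that leadership is being lost repeatedly.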

You can also observe the process of losing and gaining leadership by using the kubernetes.io/anthos/leader_election_master_status metric grouped by name.

The leader election process only happens if the current leader fails. You can confirm the failure by looking at kubernetes.io/anthos/container/uptime and kubernetes.io/anthos/container/restart_count metrics filtered by a container_name of kube-controller-manager.
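
If the metrics aren't readily available, the restart count reported in the Pod status gives a rough equivalent. This sketch assumes that the KCM Pods have kube-controller-manager in their names and run in the namespace that you pass in (kube-system by default); both are assumptions that might not match your cluster layout. kcm_restarts is a hypothetical helper name.

```shell
# Hypothetical helper: list KCM pods with their container restart counts
# and pod start times. The header row is kept for readability.
# Usage: kcm_restarts [namespace]
kcm_restarts() {
  kubectl -n "${1:-kube-system}" get pods \
    -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount,STARTED:.status.startTime' \
    | grep -E '^NAME|kube-controller-manager'
}
```

A restart count that increases between two calls indicates that the container was restarted in the meantime, which is consistent with a leader that failed its health check.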

If the leader election process runs repeatedly or fails, review the following remediation considerations:

If KCM restarts every few minutes or more often, check the KCM logs for failed requests to the API server. Failed requests indicate connectivity issues between the components or that part of the service is overloaded.
If the controller manager fails to communicate with the API server for too long, the lease renewal fails and the KCM instance loses its leadership, even if the connection is later restored.
If the control plane is replicated, the new leader should smoothly take over without downtime. No action is required. The control plane of a multi-cloud or regional cluster is always replicated. Don't attempt to disable leader election for a replicated control plane. You can't re-enable leader election without downtime.
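
To check the logs as described in the considerations above, you can filter the output of the previous container instance for leader election messages. This is a sketch under assumptions: kcm_leader_errors is a hypothetical helper name, the Pod name is one that you supply, and the search patterns are typical leader election log messages whose exact wording might differ between Kubernetes versions.

```shell
# Hypothetical helper: show leader election related messages from the
# previous (restarted) container instance of a KCM pod.
# Usage: kcm_leader_errors <pod-name> [namespace]
kcm_leader_errors() {
  kubectl -n "${2:-kube-system}" logs "$1" --previous \
    | grep -iE 'leaderelection lost|failed to renew lease|error retrieving resource lock'
}
```

The --previous flag returns logs only if the container has restarted at least once. If the command reports that no previous container exists, drop the flag to read the logs of the running instance instead.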

What's next

If you need additional assistance, reach out to Cloud Customer Care.

You can also see Getting support for more information about support resources, including the following:

Requirements for opening a support case.
Tools to help you troubleshoot, such as logs and metrics.
Supported components, versions, and features of Google Distributed Cloud for VMware (software only).