The following sections describe issues you might encounter while using GKE on-prem, and how to resolve them.
Before you begin
Check the following sections before you begin troubleshooting an issue.
Diagnosing cluster issues using gkectl
Use gkectl diagnose commands to identify cluster issues and share cluster information with Google. See Diagnosing cluster issues.
Running gkectl commands verbosely
-v5
Logging gkectl errors to stderr
--alsologtostderr
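As a rough sketch, the two debugging flags can be combined on a single invocation, with stderr sent to a file for later inspection. The wrapper name gkectl_debug, the GKECTL_DEBUG_LOG variable, and the default log file name are illustrative assumptions and are not part of gkectl:

```shell
# Hypothetical helper: run any gkectl subcommand with both debugging flags
# and redirect stderr to a file instead of the terminal.
gkectl_debug() {
  # Assumed default file name; override by setting GKECTL_DEBUG_LOG.
  log_file="${GKECTL_DEBUG_LOG:-gkectl-debug.log}"
  # "$@" is the gkectl subcommand and its own flags, passed through untouched.
  gkectl "$@" -v5 --alsologtostderr 2>"$log_file"
}
```

For example, gkectl_debug diagnose cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] would write the verbose output to gkectl-debug.log in the current directory.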
Locating gkectl logs in the admin workstation
Even if you don't pass in its debugging flags, you can view gkectl logs in the following admin workstation directory:
/home/ubuntu/.config/gke-on-prem/logs
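If the directory holds many files, a small helper can pick out the newest one. This is a minimal sketch: the function name latest_gkectl_log is hypothetical, and it assumes nothing about how gkectl names its log files, only that the most recently modified file is the one of interest:

```shell
# Hypothetical helper: print the path of the most recently modified file
# in the gkectl log directory (defaults to the directory named above).
latest_gkectl_log() {
  log_dir="${1:-/home/ubuntu/.config/gke-on-prem/logs}"
  # ls -t sorts newest first; head keeps only the first entry.
  newest=$(ls -t "$log_dir" | head -n 1)
  # Print nothing (and return non-zero) if the directory is empty.
  [ -n "$newest" ] && printf '%s/%s\n' "$log_dir" "$newest"
}
```

You could then search that file for errors with grep -i error "$(latest_gkectl_log)".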
Locating Cluster API logs in the admin cluster
If a VM fails to start after the admin control plane has started, you can try debugging this by inspecting the Cluster API controllers' logs in the admin cluster:
Find the name of the Cluster API controllers Pod in the kube-system namespace, where [ADMIN_CLUSTER_KUBECONFIG] is the path to the admin cluster's kubeconfig file:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system get pods | grep clusterapi-controllers
Open the Pod's logs, where [POD_NAME] is the name of the Pod. Optionally, use grep or a similar tool to search for errors:
kubectl --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] -n kube-system logs [POD_NAME] vsphere-controller-manager
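The two steps above can be chained into one helper. This is a sketch only: the function name clusterapi_errors is hypothetical, it assumes a single matching Pod, and it treats any line containing "error" (case-insensitive) as interesting, which may be too broad or too narrow for your logs:

```shell
# Hypothetical helper: print error-looking lines from the
# vsphere-controller-manager container of the Cluster API controllers Pod.
clusterapi_errors() {
  kubeconfig="$1"   # path to the admin cluster's kubeconfig file
  # The first column of the matching row is the Pod name.
  pod=$(kubectl --kubeconfig "$kubeconfig" -n kube-system get pods \
        | grep clusterapi-controllers | awk '{print $1}' | head -n 1)
  kubectl --kubeconfig "$kubeconfig" -n kube-system logs "$pod" \
        vsphere-controller-manager | grep -i error
}
```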
Installation
Debugging F5 BIG-IP issues using the admin cluster control plane node's kubeconfig
After an installation, GKE on-prem generates a kubeconfig file in the home directory of your admin workstation named internal-cluster-kubeconfig-debug. This kubeconfig file is identical to your admin cluster's kubeconfig, except that it points directly at the admin cluster's control plane node, where the admin control plane runs. You can use the internal-cluster-kubeconfig-debug file to debug F5 BIG-IP issues.
gkectl check-config validation fails: can't find F5 BIG-IP partitions
- Symptoms
Validation fails because F5 BIG-IP partitions can't be found, even though they exist.
- Potential causes
An issue with the F5 BIG-IP API can cause validation to fail.
- Resolution
Try running gkectl check-config again.
gkectl prepare --validate-attestations fails: could not validate build attestation
- Symptoms
Running gkectl prepare with the optional --validate-attestations flag returns the following error:
could not validate build attestation for gcr.io/gke-on-prem-release/.../...: VIOLATES_POLICY
- Potential causes
An attestation might not exist for the affected image(s).
- Resolution
Try downloading and deploying the admin workstation OVA again, as instructed in Creating an admin workstation. If the issue persists, reach out to Google for assistance.
Debugging using the bootstrap cluster's logs
During installation, GKE on-prem creates a temporary bootstrap cluster. After a successful installation, GKE on-prem deletes the bootstrap cluster, leaving you with your admin cluster and user cluster. Generally, you should have no reason to interact with this cluster.
If something goes wrong during an installation, and you passed --cleanup-external-cluster=false to gkectl create cluster, you might find it useful to debug using the bootstrap cluster's logs. You can find the Pod, and then get its logs:
kubectl --kubeconfig /home/ubuntu/.kube/kind-config-gkectl get pods -n kube-system
kubectl --kubeconfig /home/ubuntu/.kube/kind-config-gkectl -n kube-system logs [POD_NAME]
Authentication plugin for GKE Enterprise
Failure running gkectl create-login-config
Issue 1:
- Symptoms
When running gkectl create-login-config, you encounter the following error:
Error getting clientconfig using [user_cluster_kubeconfig]
- Potential causes
This error means either the kubeconfig file passed to gkectl create-login-config is not for a user cluster or the ClientConfig CRD did not come up during cluster creation.
- Resolution
Run the following command to see if the ClientConfig CRD is in the cluster:
$ kubectl --kubeconfig [user_cluster_kubeconfig] get clientconfig default -n kube-public
Issue 2:
- Symptoms
When running gkectl create-login-config, you encounter the following error:
error merging with file [merge_file] because [merge_file] contains a cluster with the same name as the one read from [kubeconfig]. Please write to a new output file
- Potential causes
Each login configuration file must contain unique cluster names. If you are seeing this error, the login config data you are writing contains a cluster name that already exists in the destination file.
- Resolution
Write to a new --output file. Note the following:
- If --output is not provided, the login config data will be written to a file called kubectl-anthos-config.yaml in the current directory by default.
- If --output already exists, the command will try to merge the new login config to --output.
Failure running gcloud anthos auth login
Issue 1:
- Symptoms
Running login using the auth plugin and the generated login config YAML file fails.
- Potential causes
There might be an error in the OIDC configuration details.
- Resolution
Verify OIDC client registration with your administrator.
Issue 2:
- Symptoms
When a proxy is configured for HTTPS traffic, running the gcloud anthos auth login command fails with proxyconnect tcp in the error message. An example of the type of message you might see is proxyconnect tcp: tls: first record does not look like a TLS handshake.
- Potential causes
There might be an error in the https_proxy or HTTPS_PROXY environment variable configurations. If there's an https:// specified in the environment variables, then the Go HTTP client libraries might fail if the proxy is configured to handle HTTPS connections using other protocols such as SOCKS5.
- Resolution
Modify the https_proxy and HTTPS_PROXY environment variables to omit the https:// prefix. On Windows, modify the system environment variables. For example, change the value of the https_proxy environment variable from https://webproxy.example.com:8000 to webproxy.example.com:8000.
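On Linux or macOS, the prefix can be removed with shell parameter expansion rather than by retyping the value. A minimal sketch follows; the function name strip_https_scheme is an illustrative assumption:

```shell
# Hypothetical helper: remove a leading "https://" from a proxy address,
# leaving host:port. Values without that prefix are returned unchanged.
strip_https_scheme() {
  printf '%s\n' "${1#https://}"
}
```

For example, export https_proxy="$(strip_https_scheme "$https_proxy")" rewrites the variable in the current shell session only; persistent settings still need to be changed wherever they are defined.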
Failure using kubeconfig generated by gcloud anthos auth login to access cluster
- Symptoms
"Unauthorized" Error
If there is an Unauthorized error when using the kubeconfig generated by gcloud anthos auth login to access the cluster, this means that the apiserver is unable to authorize the user.
- Potential causes
Either the appropriate RBACs are missing or incorrect or there is an error in the OIDC configuration for the cluster.
- Resolution
Try the following steps to resolve the issue:
Parse the id-token from kubeconfig.
In the kubeconfig file that was generated by the login command, copy the id-token:
kind: Config
…
users:
- name: …
  user:
    auth-provider:
      config:
        id-token: [id-token]
        …
Follow the steps to install jwt-cli and run:
$ jwt [id-token]
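If jwt-cli is not available, the claims can also be read by decoding the middle segment of the token by hand. The sketch below is an alternative to the jwt command, not a replacement for it: it does not verify the signature, the function name decode_jwt_payload is hypothetical, and it assumes a base64 utility that accepts -d (as in GNU coreutils):

```shell
# Hypothetical helper: print the JSON payload (claims) of a JWT.
# This only decodes the token; it does NOT verify the signature.
decode_jwt_payload() {
  # The payload is the second dot-separated segment, base64url-encoded.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 -d accepts the input.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}
```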
Verify OIDC configuration
The oidc section filled out in config.yaml, which was used to create the cluster, contains the fields group and username, which are used to set the flags --oidc-group-claim and --oidc-username-claim in the apiserver. When the apiserver is presented with the token, it will look for that group-claim and username-claim and verify that the corresponding group or user has the correct permissions.
Verify that the claims set for group and username in the oidc section of config.yaml are present in the id-token.
Check RBACs that were applied.
Verify that there is an RBAC with the correct permissions for either the user specified by the username-claim or one of the groups listed under the group-claim from the previous step. The name of the user or group in the RBAC should be prefixed with the usernameprefix or groupprefix that was provided in the oidc section of config.yaml.
Note that if usernameprefix was left blank, and username is a value other than email, the prefix will default to issuerurl#. To disable username prefixes, usernameprefix should be set to -.
For more information about user and group prefixes, see Populating the oidc spec.
Note that the Kubernetes API server currently treats a backslash as an escape character. Therefore, if the name of the user or group contains \\, the API server will read it as a single \ when parsing the id_token. The RBAC applied for this user or group should therefore contain only a single backslash, or you might see an Unauthorized error.
Example:
config.yaml:
oidc:
  issuerurl: …
  username: "unique_name"
  usernameprefix: "-"
  group: "group"
  groupprefix: "oidc:"
  ...
id_token:
{
  ...
  "email": "cluster-developer@example.com",
  "unique_name": "EXAMPLE\\cluster-developer",
  "group": [
    "Domain Users",
    "EXAMPLE\\developers"
  ],
  ...
}
The following RBACs would grant this user the permissions of the pod-reader ClusterRole (note the single backslash in the name field instead of a double backslash):
Group RBAC:
apiVersion:
kind:
metadata:
  name: example-binding
subjects:
- kind: Group
  name: "oidc:EXAMPLE\developers"
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
User RBAC:
apiVersion:
kind:
metadata:
  name: example-binding
subjects:
- kind: User
  name: "EXAMPLE\cluster-developer"
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
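To make the naming rule in this example concrete, the following sketch derives the subject name for an RBAC binding from a prefix and a claim value. It simply mirrors the behavior described above (apply the prefix, collapse double backslashes); the function name rbac_subject_name is hypothetical, and treating "-" as "no prefix" for group prefixes as well as username prefixes is an assumption:

```shell
# Hypothetical helper: build the name to use in an RBAC subject from a
# prefix and a claim value copied from the id_token.
rbac_subject_name() {
  prefix="$1"   # usernameprefix or groupprefix; "-" is treated as no prefix
  claim="$2"    # claim value exactly as it appears in the token
  [ "$prefix" = "-" ] && prefix=""
  # Collapse each double backslash into a single backslash.
  printf '%s%s\n' "$prefix" "$claim" | sed 's/\\\\/\\/g'
}
```

With the values from the example, rbac_subject_name 'oidc:' 'EXAMPLE\\developers' yields the group name used in the Group RBAC, and rbac_subject_name '-' 'EXAMPLE\\cluster-developer' yields the user name used in the User RBAC.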
Check API Server logs
If the OIDC plugin configured in the kube apiserver does not start up correctly, the API server will return an "Unauthorized" error when presented with the id-token. To see if there were any issues with the OIDC plugin in the API server, run:
$ kubectl --kubeconfig=[admin_cluster_kubeconfig] logs statefulset/kube-apiserver -n [user_cluster_name]
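API server logs can be long, so it may help to filter them. The sketch below wraps the command above and keeps only lines that mention "oidc"; the function name is hypothetical, and the assumption that relevant messages contain that string may not hold for every failure:

```shell
# Hypothetical helper: show only API server log lines that mention OIDC.
apiserver_oidc_log_lines() {
  # $1: admin cluster kubeconfig path, $2: user cluster name (the namespace)
  kubectl --kubeconfig="$1" logs statefulset/kube-apiserver -n "$2" \
    | grep -i oidc
}
```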
- Symptoms
Unable to connect to the server: Get {DISCOVERY_ENDPOINT}: x509: certificate signed by unknown authority
- Potential causes
The refresh token in the kubeconfig expired.
- Resolution
Run the login command again.
Google Cloud console login
The following are common errors that might occur while using Google Cloud console to try to log in:
Login redirects to page with "URL not found" error
- Symptoms
After you attempt to log in, Google Cloud console redirects to a page that shows a "URL not found" error.
- Potential causes
Google Cloud console is not able to reach the GKE on-prem identity provider.
- Resolution
Try the following steps to resolve the issue:
- Set useHTTPProxy to true
If the IDP is not reachable over the public internet, then you will need to enable the OIDC HTTP Proxy to log in via Google Cloud console. In the oidc section of config.yaml, usehttpproxy should be set to true. If you have already created a cluster and want to turn on the proxy, you can edit the ClientConfig CRD directly. Run the following command and change useHTTPProxy to true:
$ kubectl edit clientconfig default -n kube-public
- useHTTPProxy is already set to true
If the HTTP proxy is enabled and you are still seeing this error, there might have been an issue with the proxy starting up. To get the logs of the proxy, run:
$ kubectl logs deployment/clientconfig-operator -n kube-system
Note that even if your IDP has a well-known CA, for the HTTP proxy to start, the field capath in the oidc section of config.yaml must be provided.
- IDP prompts for consent
If the authorization server prompts for consent, and you have not included the extraparam prompt=consent, then you might see this error. Run the following command, add prompt=consent to extraparams, and try logging in again:
$ kubectl edit clientconfig default -n kube-public
- RBACs are misconfigured
If you have not done so already, try authenticating using the Authentication Plugin for Anthos. If you are seeing an authorization error logging in with the plugin as well, then follow the troubleshooting steps to resolve the issue with the plugin, and then try logging in via Google Cloud console again.
- Try logging out and logging back in
In some cases, if some settings are changed on the storage service, you might need to log out explicitly. Go to the cluster details page, click Log out, and try logging back in.
Admin workstation
AccessDeniedException while downloading OVA
- Symptoms
Attempting to download the admin workstation OVA and signature returns the following error:
AccessDeniedException: 403 whitelisted-service-account@project.iam.gserviceaccount.com does not have storage.objects.list access to gke-on-prem-release
- Potential causes
Your allowlisted service account is not activated.
- Resolution
Make sure you have activated your allowlisted service account. If the issue persists, reach out to Google for assistance.
openssl can't validate admin workstation OVA
- Symptoms
Running openssl dgst against the admin workstation OVA file doesn't return Verified OK.
- Potential causes
An issue is present in the OVA file that prevents successful validation.
- Resolution
Try downloading and deploying the admin workstation OVA again, as instructed in Download the admin workstation OVA. If the issue persists, reach out to Google for assistance.
Connect
Unable to register a user cluster
If you encounter issues with registering user clusters, reach out to Google for assistance.
Cluster created during alpha was deregistered
Refer to Registering a user cluster in the Connect documentation.
You might also choose to delete and recreate the cluster.
Storage
Volume fails to attach
Symptoms
The output of gkectl diagnose cluster looks like the following:
Checking cluster object...PASS
Checking machine objects...PASS
Checking control plane pods...PASS
Checking gke-connect pods...PASS
Checking kube-system pods...PASS
Checking gke-system pods...PASS
Checking storage...FAIL
    PersistentVolume pvc-776459c3-d350-11e9-9db8-e297f465bc84: virtual disk "[datastore_nfs] kubevols/kubernetes-dynamic-pvc-776459c3-d350-11e9-9db8-e297f465bc84.vmdk" IS attached to machine "gsl-test-user-9b46dbf9b-9wdj7" but IS NOT listed in the Node.Status
1 storage errors
One or more Pods is stuck in ContainerCreating state with a warning like the following:
Events:
  Type     Reason              Age               From                     Message
  ----     ------              ----              ----                     -------
  Warning  FailedAttachVolume  6s (x6 over 31s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-776459c3-d350-11e9-9db8-e297f465bc84" : Failed to add disk 'scsi0:6'.
Potential causes
If a virtual disk is attached to the wrong virtual machine, it may be due to issue #32727 in Kubernetes 1.12.
Resolution
If a virtual disk is attached to the wrong virtual machine, you might need to manually detach it:
- Drain the node. See Safely draining a node. You might want to include the --ignore-daemonsets and --delete-local-data flags in your kubectl drain command.
- Power off the VM.
- Edit the VM's hardware config in vCenter to remove the volume.
- Power on the VM.
- Uncordon the node.
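The first and last steps of this procedure map to kubectl commands, sketched below as two small helpers. The function names are hypothetical. The flags are the ones named above; note that newer kubectl releases deprecate --delete-local-data in favor of --delete-emptydir-data, so you might need to adjust the flag for your kubectl version:

```shell
# Hypothetical helper: drain a node before powering off its VM.
drain_node() {
  # $1: kubeconfig path, $2: node name
  kubectl --kubeconfig "$1" drain "$2" --ignore-daemonsets --delete-local-data
}

# Hypothetical helper: make the node schedulable again after the VM is back.
uncordon_node() {
  # $1: kubeconfig path, $2: node name
  kubectl --kubeconfig "$1" uncordon "$2"
}
```

The steps in between (powering the VM off and on, and editing its hardware config) are performed in vCenter and are not covered by this sketch.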
Volume is lost
Symptoms
The output of gkectl diagnose cluster looks like the following:
Checking cluster object...PASS
Checking machine objects...PASS
Checking control plane pods...PASS
Checking gke-connect pods...PASS
Checking kube-system pods...PASS
Checking gke-system pods...PASS
Checking storage...FAIL
    PersistentVolume pvc-52161704-d350-11e9-9db8-e297f465bc84: virtual disk "[datastore_nfs] kubevols/kubernetes-dynamic-pvc-52161704-d350-11e9-9db8-e297f465bc84.vmdk" IS NOT found
1 storage errors
One or more Pods is stuck in ContainerCreating state with a warning like the following:
Events:
  Type     Reason              Age                   From                     Message
  ----     ------              ----                  ----                     -------
  Warning  FailedAttachVolume  71s (x28 over 42m)    attachdetach-controller  AttachVolume.Attach failed for volume "pvc-52161704-d350-11e9-9db8-e297f465bc84" : File []/vmfs/volumes/43416d29-03095e58/kubevols/kubernetes-dynamic-pvc-52161704-d350-11e9-9db8-e297f465bc84.vmdk was not found
Potential causes
If you see a "not found" error related to your VMDK file, it is likely that the virtual disk was permanently deleted. This can happen if an operator manually deletes a virtual disk or the virtual machine it is attached to. To prevent this, manage your virtual machines as described in Resizing a user cluster and Upgrading clusters.
Resolution
If a virtual disk was permanently deleted, you might need to manually clean up related Kubernetes resources:
- Delete the PVC that referenced the PV by running:
  kubectl delete pvc [PVC_NAME]
- Delete the Pod that referenced the PVC by running:
  kubectl delete pod [POD_NAME]
- Repeat step 2. (Yes, really. See Kubernetes issue 74374.)
Upgrades
About downtime during upgrades
Resource | Description
Admin cluster | When an admin cluster is down, user cluster control planes and workloads on user clusters continue to run, unless they were affected by a failure that caused the downtime.
User cluster control plane | Typically, you should expect no noticeable downtime to user cluster control planes. However, long-running connections to the Kubernetes API server might break and would need to be re-established. In those cases, the API caller should retry until it establishes a connection. In the worst case, there can be up to one minute of downtime during an upgrade.
User cluster nodes | If an upgrade requires a change to user cluster nodes, GKE on-prem recreates the nodes in a rolling fashion, and reschedules Pods running on these nodes. You can prevent impact to your workloads by configuring appropriate PodDisruptionBudgets and anti-affinity rules.
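As one way to configure a PodDisruptionBudget for a workload, kubectl can create it imperatively. The sketch below is illustrative only: the function name create_pdb is hypothetical, and the right selector and minimum availability depend entirely on your workload:

```shell
# Hypothetical helper: create a PodDisruptionBudget that keeps at least a
# given number of matching Pods available during voluntary disruptions.
create_pdb() {
  # $1: kubeconfig path, $2: PDB name, $3: label selector, $4: min available
  kubectl --kubeconfig "$1" create poddisruptionbudget "$2" \
    --selector="$3" --min-available="$4"
}
```

For example, create_pdb [USER_CLUSTER_KUBECONFIG] web-pdb app=web 1 asks Kubernetes to keep at least one Pod labeled app=web running while nodes are recreated; the names web-pdb and app=web are placeholders.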
Resizing user clusters
Resizing a user cluster fails
- Symptoms
A resize operation on a user cluster fails.
- Potential causes
Several factors could cause resize operations to fail.
- Resolution
If a resize fails, follow these steps:
Check the cluster's MachineDeployment status to see if there are any events or error messages:
kubectl describe machinedeployments [MACHINE_DEPLOYMENT_NAME]
Check if there are errors on the newly-created Machines:
kubectl describe machine [MACHINE_NAME]
Error: "no addresses can be allocated"
- Symptoms
After resizing a user cluster, kubectl describe machine [MACHINE_NAME] displays the following error:
Events:
  Type     Reason  Age                From                    Message
  ----     ------  ----               ----                    -------
  Warning  Failed  9s (x13 over 56s)  machineipam-controller  ipam: no addresses can be allocated
- Potential causes
There aren't enough IP addresses available for the user cluster.
- Resolution
Allocate more IP addresses for the cluster. Then, delete the affected Machine:
kubectl delete machine [MACHINE_NAME]
If the cluster is configured correctly, a replacement Machine is created with an IP address.
Sufficient number of IP addresses allocated, but Machine fails to register with cluster
- Symptoms
The network has enough addresses allocated, but the Machine still fails to register with the user cluster.
- Potential causes
There might be an IP conflict. The IP might be taken by another Machine or by your load balancer.
- Resolution
Check that the affected Machine's IP address is not taken. If there is a conflict, you need to resolve the conflict in your environment.
vSphere
Debugging with govc
If you encounter issues specific to vSphere, you can use govc to troubleshoot. For example, you can easily confirm permissions and access for your vCenter user accounts and collect vSphere logs.
Changing vCenter Certificate
If you are running a vCenter server in evaluation or default setup mode, and it has a generated TLS certificate, this certificate might change over time. If the certificate has changed, you need to let your running cluster(s) know about the new certificate:
Retrieve the new vCenter cert and save to a file:
true | openssl s_client -connect [VCENTER_IP_ADDRESS]:443 -showcerts 2>/dev/null | sed -ne '/-BEGIN/,/-END/p' > vcenter.pem
Now, for each cluster, delete the ConfigMap containing the vSphere and vCenter certificate, and create a new ConfigMap with the new cert. For example:
kubectl --kubeconfig kubeconfig delete configmap vsphere-ca-certificate -n kube-system
kubectl --kubeconfig kubeconfig delete configmap vsphere-ca-certificate -n user-cluster1
kubectl --kubeconfig kubeconfig create configmap -n user-cluster1 --dry-run vsphere-ca-certificate --from-file=ca.crt=vcenter.pem -o yaml | kubectl --kubeconfig kubeconfig apply -f -
kubectl --kubeconfig kubeconfig create configmap -n kube-system --dry-run vsphere-ca-certificate --from-file=ca.crt=vcenter.pem -o yaml | kubectl --kubeconfig kubeconfig apply -f -
Delete the clusterapi-controllers Pod for each cluster. When the Pod restarts, it begins using the new certificate. For example:
kubectl --kubeconfig kubeconfig -n kube-system get pods
kubectl --kubeconfig kubeconfig -n kube-system delete pod clusterapi-controllers-...
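Because the openssl command in the first step discards error output, a failed connection can leave vcenter.pem empty. Before replacing the ConfigMaps, it may be worth confirming that the file contains at least one certificate. A minimal sketch, with a hypothetical function name:

```shell
# Hypothetical helper: count the certificates in a PEM file.
pem_cert_count() {
  # grep -c prints the number of matching lines; each certificate has
  # exactly one BEGIN line. The -- stops grep from parsing the pattern
  # as an option.
  grep -c -- '-----BEGIN CERTIFICATE-----' "$1"
}
```

If pem_cert_count vcenter.pem prints 0, retrieve the certificate again before deleting the existing ConfigMaps.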
Miscellaneous
Terraform vSphere provider session limit
GKE on-prem uses Terraform's vSphere provider to bring up VMs in your vSphere environment. The provider's session limit is 1000 sessions. The current implementation doesn't close active sessions after use; instead, sessions are automatically closed after 300 seconds. You might encounter 503 errors if you have too many sessions running.
- Symptoms
If you have too many sessions running, you might encounter the following error:
Error connecting to CIS REST endpoint: Login failed: body: {"type":"com.vmware.vapi.std.errors.service_unavailable","value": {"messages":[{"args":["1000","1000"],"default_message":"Sessions count is limited to 1000. Existing sessions are 1000.", "id":"com.vmware.vapi.endpoint.failedToLoginMaxSessionCountReached"}]}}, status: 503 Service Unavailable
- Potential causes
There are too many Terraform provider sessions running in your environment.
- Resolution
Currently, this is working as intended. Sessions are automatically closed after 300 seconds. For more information, refer to GitHub issue #618.
Using a proxy for Docker: oauth2: cannot fetch token
- Symptoms
While using a proxy, you encounter the following error:
oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: proxyconnect tcp: tls: oversized record received with length 20527
- Potential causes
You might have provided an HTTPS proxy instead of HTTP.
- Resolution
In your Docker configuration, change the proxy address to http:// instead of https://.
Verifying that licenses are valid
Remember to verify that your licenses are valid, especially if you are using trial licenses. You might encounter unexpected failures if your F5, ESXi host, or vCenter licenses have expired.