This document provides troubleshooting steps for common issues that you might encounter with the container runtime on your Google Kubernetes Engine (GKE) nodes.
If you need additional assistance, reach out to Cloud Customer Care.

Mount paths with simple drive letters fail on Windows node pools with containerd
This issue has been resolved in containerd version 1.6.6 and higher.
GKE clusters running Windows Server node pools that use a containerd runtime version earlier than 1.6.6 might encounter errors like the following when starting containers:
failed to create containerd task : CreateComputeSystem : The parameter is incorrect : unknown
For more details, refer to GitHub issue #6589.
Solution
To resolve this issue, upgrade your node pools to the latest GKE versions that use containerd runtime version 1.6.6 or higher.
Container images with non-array pre-escaped CMD or ENTRYPOINT command lines fail on Windows node pools with containerd
This issue has been resolved in containerd version 1.6 and higher.
GKE clusters running Windows Server node pools that use containerd runtime version 1.5.x might encounter errors like the following when starting containers:
failed to start containerd task : hcs::System::CreateProcess : The system cannot find the file specified.: unknown
For more details, refer to GitHub issue #5067 and GitHub issue #6300.
Solution
To resolve this issue, upgrade your node pools to the latest GKE versions that use containerd runtime version 1.6.6 or higher.
Container image volumes with non-existing paths or Linux-like (forward slash) paths fail on Windows node pools with containerd
This issue has been resolved in containerd version 1.6 and higher.
GKE clusters running Windows Server node pools that use containerd runtime version 1.5.x might encounter errors like the following when starting containers:
failed to generate spec: failed to stat "<volume_path>": CreateFile : The system cannot find the path specified.
For more details, refer to GitHub issue #5671.
Solution
To resolve this issue, upgrade your node pools to the latest GKE versions that use containerd runtime version 1.6.x or higher.
/etc/mtab: No such file or directory
The Docker container runtime populates this symlink inside the container by default, but the containerd runtime does not.
For more details, refer to GitHub issue #2419.
Solution
To resolve this issue, manually create the symlink /etc/mtab during your image build:
ln -sf /proc/mounts /etc/mtab
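For example, the command can be added as a RUN step in your Dockerfile. The base image below is illustrative; use whatever image your build already starts from:

```dockerfile
FROM debian:stable-slim
# Recreate the /etc/mtab symlink that the Docker runtime used to
# populate by default but containerd does not.
RUN ln -sf /proc/mounts /etc/mtab
```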
Image pull error: not a directory
Affected GKE versions: all
When you build an image with kaniko, it might fail to be pulled with containerd with the error message "not a directory". This error happens if the image is built in a special way: when a previous command removes a directory and the next command recreates the same files in that directory.
The following Dockerfile example with npm illustrates this problem:
RUN npm cache clean --force
RUN npm install
For more details, refer to GitHub issue #4659.
Solution
To resolve this issue, build your image using docker build, which is unaffected by this issue.
If docker build isn't an option for you, then combine the commands into one. The following Dockerfile example combines RUN npm cache clean --force and RUN npm install:
RUN npm cache clean --force && npm install
Some file system metrics are missing and the metrics format is different
Affected GKE versions: all
The Kubelet /metrics/cadvisor endpoint provides Prometheus metrics, as documented in Metrics for Kubernetes system components.
If you install a metrics collector that depends on that endpoint, you might see
the following issues:
- The metrics format on the Docker node is k8s_<container-name>_<pod-name>_<namespace>_<pod-uid>_<restart-count>, but the format on the containerd node is <container-id>.
- Some file system metrics are missing on the containerd node, as follows:
- container_fs_inodes_free
- container_fs_inodes_total
- container_fs_io_current
- container_fs_io_time_seconds_total
- container_fs_io_time_weighted_seconds_total
- container_fs_limit_bytes
- container_fs_read_seconds_total
- container_fs_reads_merged_total
- container_fs_sector_reads_total
- container_fs_sector_writes_total
- container_fs_usage_bytes
- container_fs_write_seconds_total
- container_fs_writes_merged_total
Solution
You can mitigate this issue by using cAdvisor as a standalone DaemonSet.

- Find the latest cAdvisor release with the name pattern vX.Y.Z-containerd-cri (for example, v0.42.0-containerd-cri).
- Follow the steps in cAdvisor Kubernetes Daemonset to create the DaemonSet.
- Point the installed metrics collector to use the cAdvisor /metrics endpoint, which provides the full set of Prometheus container metrics.
Alternatives
- Migrate your monitoring solution to Cloud Monitoring, which provides the full set of container metrics.
- Collect metrics from the Kubelet summary API with an endpoint of /stats/summary.
Attach-based operations don't function correctly after container-runtime restarts on GKE Windows
Affected GKE versions: 1.21 to 1.21.5-gke.1802, 1.22 to 1.22.3-gke.700
GKE clusters running Windows Server node pools that use the containerd runtime (versions 1.5.4 and 1.5.7-gke.0) might experience issues if the container runtime is forcibly restarted: attach operations to existing running containers can't bind IO again. The issue doesn't cause API calls to fail, but data isn't sent or received. This includes data for attach and logs CLIs and APIs through the cluster API server.
Solution
To resolve this issue, upgrade to the patched container runtime version (1.5.7-gke.1) by upgrading to newer GKE releases.
Pods display failed to allocate for range 0: no IP addresses available in range set error message
Affected GKE versions: 1.24.6-gke.1500 or earlier, 1.23.14-gke.1800 or earlier, and 1.22.16-gke.2000 or earlier
GKE clusters running node pools that use containerd might experience IP leak issues and exhaust all the Pod IPs on a node. A Pod scheduled on an affected node displays an error message similar to the following:
failed to allocate for range 0: no IP addresses available in range set: 10.48.131.1-10.48.131.62
For more information about the issue, see containerd GitHub issue #5438 and GitHub issue #5768.
There is a known issue in GKE Dataplane V2 that can trigger this issue. However, it can also be triggered by other causes, including runc getting stuck.
Solution
To resolve this issue, follow the workarounds in Workarounds for Standard GKE clusters for GKE Dataplane V2.
Exec probe behavior difference when probe exceeds the timeout
Affected GKE versions: all
Exec probe behavior on containerd images is different from the behavior on dockershim images. When an exec probe defined for a Pod exceeds the declared Kubernetes timeoutSeconds threshold, dockershim images treat it as a probe failure. On containerd images, probe results returned after the declared timeoutSeconds threshold are ignored.
Solution
In GKE, the feature gate ExecProbeTimeout is set to false and cannot be changed. To resolve this issue, increase the timeoutSeconds threshold for all affected exec probes or implement the timeout functionality as part of the probe logic.
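One common way to implement the timeout inside the probe logic is to wrap the check command with the coreutils timeout utility, so a slow check exits non-zero (and the probe fails) instead of its late result being ignored. A minimal sketch, where sleep 3 stands in for a hypothetical slow health check:

```shell
# Wrap the health check so it terminates within the probe's budget.
# "timeout 1" kills the command after 1 second; the exit status is
# then 124, which the kubelet counts as a probe failure.
status=0
timeout 1 sleep 3 || status=$?
echo "probe exit status: ${status}"
```

In a Pod spec, the exec probe command would then look like ["/bin/sh", "-c", "timeout 5 /usr/local/bin/health-check"], with timeoutSeconds set above the wrapped timeout; the health-check path here is illustrative.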
Troubleshoot issues with private registries
This section provides troubleshooting information for private registry configurations in containerd.
Image pull fails with error x509: certificate signed by unknown authority
This issue occurs if GKE couldn't find a certificate for a specific private registry domain. You can check for this error in Cloud Logging using the following query:
Go to the Logs Explorer page in the Google Cloud console:
Run the following query:
("Internal error pulling certificate" OR "Failed to get credentials from metadata server" OR "Failed to install certificate")
To resolve this issue, try the following:
In GKE Standard, open the configuration file at the following path:

/etc/containerd/hosts.d/DOMAIN/config.toml

Replace DOMAIN with the FQDN for the registry.

Verify that your configuration file contains the correct FQDN.

Verify that the path to the certificate in the secretURI field in the configuration file is correct.

Verify that the certificate exists in Secret Manager.
Certificate not present
This issue occurs if GKE couldn't pull the certificate from Secret Manager to configure containerd on your nodes.
To resolve this issue, try the following:
- Ensure that the affected node runs Container-Optimized OS. Ubuntu and Windows nodes aren't supported.
- In your configuration file, ensure that the path to the secret in the secretURI field is correct.
- Check that your cluster's IAM service account has the correct permissions to access the secret.
- Check that the cluster has the cloud-platform access scope. For instructions, see Check access scopes.
Insecure registry option is not configured for local network (10.0.0.0/8)
Affected GKE versions: all
On containerd images, the insecure registry option is not configured for the local network 10.0.0.0/8. If you use insecure private registries, you might notice errors similar to the following:
pulling image: rpc error: code = Unknown desc = failed to pull and unpack image "IMAGE_NAME": failed to do request: Head "IMAGE_NAME": http: server gave HTTP response to HTTPS client
To resolve this issue, try the following:
- Use Artifact Registry
- Configure TLS on your private registries if your use case supports this option. You can use a containerd configuration file to tell GKE to use certificates that you store in Secret Manager to access your private registry. For instructions, see Access private registries with private CA certificates.
Configure privileged DaemonSets to modify your containerd configuration
For Standard clusters, try the following steps. This workaround isn't available in Autopilot because privileged containers are a security risk. If your environment is exposed to the internet, consider your risk tolerance before deploying this solution. In all cases, we strongly recommend that you configure TLS for your private registry and use the Secret Manager option instead.
Review the following manifest:
In the .spec.containers.env field, replace the REGISTRY_ADDRESS value of the ADDRESS variable with the address of your local HTTP registry in the format DOMAIN_NAME:PORT. For example:

containers:
- name: startup-script
  ...
  env:
  - name: ADDRESS
    value: "example.com:5000"
Deploy the DaemonSet:
kubectl apply -f insecure-registry-ds.yaml
The DaemonSet adds your insecure registry to the containerd configuration on every node.
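For reference, the registry entry that such a startup script writes into the containerd CRI configuration typically looks like the following sketch. The registry address is a placeholder, and the exact config.toml layout can vary by containerd version:

```toml
# Fragment of /etc/containerd/config.toml (containerd 1.x CRI plugin).
# "example.com:5000" stands in for your local HTTP registry address.
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."example.com:5000"]
  endpoint = ["http://example.com:5000"]
```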
containerd ignores any device mappings for privileged pods
Affected GKE versions: all
For privileged Kubernetes Pods, the container runtime ignores any device mappings passed to it through volumeDevices.devicePath, and instead makes every device on the host available to the container under /dev.
containerd leaks shim processes when nodes are under I/O pressure
Affected GKE versions: 1.25.0 to 1.25.15-gke.1040000, 1.26.0 to 1.26.10-gke.1030000, 1.27.0 to 1.27.6-gke.1513000, and 1.28.0 to 1.28.3-gke.1061000
When a GKE node is under I/O pressure, containerd might fail to delete the containerd-shim-runc-v2 processes when a Pod is deleted, resulting in process leaks. When the leak happens on a node, you'll see more containerd-shim-runc-v2 processes on the node than the number of Pods on that node. You might also see increased memory and CPU usage along with extra PIDs. For details, see the GitHub issue Fix leaked shim caused by high IO pressure.
To resolve this issue, upgrade your nodes to the following versions or later:
- 1.25.15-gke.1040000
- 1.26.10-gke.1030000
- 1.27.6-gke.1513000
- 1.28.3-gke.1061000
IPv6 address family is enabled on pods running containerd
Affected GKE versions: 1.18, 1.19, 1.20.0 to 1.20.9
The IPv6 address family is enabled for Pods running with containerd. The dockershim image disables IPv6 on all Pods, while the containerd image does not. For example, localhost resolves to the IPv6 address ::1 first. This typically isn't a problem, but it might result in unexpected behavior in certain cases.
Solution
To resolve this issue, use an IPv4 address such as 127.0.0.1 explicitly, or configure an application running in the Pod to work on both address families.
Node auto-provisioning only provisions Container-Optimized OS with Docker node pools
Affected GKE versions: 1.18, 1.19, 1.20.0 to 1.20.6-gke.1800
Node auto-provisioning allows autoscaling node pools with any supported image type, but can only create new node pools with the Container-Optimized OS with Docker image type.
Solution
To resolve this issue, upgrade your GKE clusters to version 1.20.6-gke.1800 or later. In these GKE versions, the default image type can be set for the cluster.
Conflict with 172.17/16 IP address range
Affected GKE versions: 1.18.0 to 1.18.14
The 172.17/16 IP address range is occupied by the docker0 interface on the node VM with containerd enabled. Traffic sent to or originating from that range might not be routed correctly (for example, a Pod might not be able to connect to a VPN-connected host with an IP address within 172.17/16).
GPU metrics not collected
Affected GKE versions: 1.18.0 to 1.18.18
GPU usage metrics are not collected when using containerd as a runtime on GKE versions before 1.18.18.
Solution
To resolve this issue, upgrade your clusters to GKE version 1.18.18 or later.
Images with config.mediaType set to application/octet-stream can't be used on containerd
Affected GKE versions: all
Images with config.mediaType set to "application/octet-stream" cannot be used on containerd. For more information, see GitHub issue #4756. These images are not compatible with the Open Container Initiative specification and are considered incorrect. These images work with Docker to provide backward compatibility, but containerd does not support them.
Symptom and diagnosis
Example error in node logs:
Error syncing pod <pod-uid> ("<pod-name>_<namespace>(<pod-uid>)"), skipping: failed to "StartContainer" for "<container-name>" with CreateContainerError: "failed to create containerd container: error unpacking image: failed to extract layer sha256:<some id>: failed to get reader from content store: content digest sha256:<some id>: not found"
The image manifest can usually be found in the registry where it is hosted. Once you have the manifest, check config.mediaType to determine if you have this issue:
"mediaType": "application/octet-stream",
Solution
Because the containerd community decided not to support such images, all versions of containerd are affected and there is no fix. The container image must be rebuilt with Docker version 1.11 or later, and you must ensure that the config.mediaType field is not set to "application/octet-stream".
CNI not initialized
Affected GKE versions: all
If you see an error similar to the following, the Container Network Interface (CNI) config isn't ready:
Error: "network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized".
There are two main reasons that this error occurs:
- The CNI hasn't finished installing
- The webhook is misconfigured
Ensure the CNI has finished installation
You might see this error in your log files during node bootstrapping while GKE installs the CNI config. If you see this error, but GKE is creating all nodes correctly, you can safely ignore this error.
This situation can happen because the CNI provides Pods with their network connectivity, so Pods need the CNI to work. However, Kubernetes uses taints to mark nodes that aren't ready and system Pods can tolerate these taints. This means that system Pods can start on a new node before the network is ready, resulting in the error.
To resolve this issue, wait for GKE to finish installing the CNI config. After CNI finishes configuring the network, the system Pods start successfully with no intervention required.
Fix misconfigured webhooks
If the CNI not initialized error persists and you notice that GKE is failing to create nodes during an upgrade, resize, or other action, you might have a misconfigured webhook.
If you have a custom webhook that intercepts the DaemonSet controller command to create a Pod and that webhook is misconfigured, you might see the error as a node error status in the Google Cloud console. This misconfiguration prevents GKE from creating a netd or calico-node Pod. If the netd or calico-node Pods started successfully while the error persists, contact Customer Care.
To fix any misconfigured webhooks, complete the following steps:
Identify misconfigured webhooks.
If you're using a cluster with Dataplane V1 network policy enforcement enabled, you can also check the status of the calico-typha Pod for information about which webhooks are causing this error:

kubectl describe pod -n kube-system -l k8s-app=calico-typha
If the Pod has an error, the output is similar to the following:
Events:
  Type     Reason        Age                     From                   Message
  ----     ------        ----                    ----                   -------
  Warning  FailedCreate  9m15s (x303 over 3d7h)  replicaset-controller  Error creating: admission webhook WEBHOOK_NAME denied the request [...]
In this output, WEBHOOK_NAME is the name of a failing webhook. Your output might include information about a different type of error.

If you want to keep the misconfigured webhooks, troubleshoot them. If they're not required, delete them by running the following commands:
kubectl delete mutatingwebhookconfigurations WEBHOOK_NAME
kubectl delete validatingwebhookconfigurations WEBHOOK_NAME
Replace WEBHOOK_NAME with the name of the misconfigured webhook that you want to remove.

Configure your webhooks to ignore system Pods.