Troubleshoot Google Distributed Cloud NFS and DataPlane v2 issues

This document details a manual procedure for Google Distributed Cloud when NFS mounts leave a volume or Pod stuck and your cluster was created with DataPlane v2 enabled.

This issue has been fixed for the following versions:

  • For minor version 1.16, version 1.16.4-gke.37 and higher.
  • For minor version 1.28, version 1.28.200-gke.111 and higher.

We recommend that you upgrade to a version where this issue is fixed. If you're unable to upgrade, use the procedures outlined in the following sections.

If you're using a version where this issue isn't fixed, you might encounter issues if you have workloads using ReadWriteMany volumes powered by storage drivers that are susceptible to this issue, such as (but not limited to):

  • Portworx (sharedv4 service volumes)
  • csi-nfs

NFS mounts on some storage architectures might become stuck when they're connected to an endpoint using a Kubernetes Service (ClusterIP) and DataPlane v2. This behavior is because of limitations in how Linux kernel socket code interacts with Cilium's eBPF program. Containers might become blocked on I/O or even be unkillable, as the defunct NFS mount can't be unmounted.
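
Processes blocked on a defunct NFS mount typically sit in uninterruptible sleep ("D" state), so you can confirm the symptom from an affected node. A minimal sketch (the helper name is illustrative):

```shell
# List tasks in uninterruptible sleep ("D" state), the usual signature of a
# process blocked on a defunct NFS mount. The helper name is illustrative.
blocked_tasks() {
  ps -eo state=,pid=,comm= | awk '$1 ~ /^D/ {print $2, $3}'
}

blocked_tasks
```

If this list keeps growing while I/O to the volume is stalled, the mount is likely wedged rather than merely slow.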

You might experience this issue if you use RWX storage hosted on NFS servers that run on a Kubernetes node, including software-defined or hyperconverged storage solutions such as Ondat or Portworx.

If you need additional assistance, reach out to Cloud Customer Care.

Review existing cluster configuration

Get some existing configuration values from your cluster. You use the values from these steps to create a kube-proxy manifest in the next section.

  1. Get the ClusterCIDR from cm/cilium-config:

    kubectl get cm -n kube-system cilium-config -o yaml | grep native-routing-cidr

    The following example output shows that you would use as the ClusterCIDR:

  2. Get the APIServerAdvertiseAddress and APIServerPort from the anetd DaemonSet:

    kubectl get ds -n kube-system anetd -o yaml | grep KUBERNETES -A 1

    The following example output shows that you would use as the APIServerAdvertiseAddress and 443 as the APIServerPort:

      value: "443"
  3. Get the RegistryCredentialsSecretName from the anetd DaemonSet:

    kubectl get ds -n kube-system anetd -o yaml | grep imagePullSecrets -A 1

    The following example output shows that you would use private-registry-creds as the RegistryCredentialsSecretName:

      - name: private-registry-creds
  4. Get the Registry from the anetd DaemonSet:

    kubectl get ds -n kube-system anetd -o yaml | grep image

    The following example output shows that you would use as the Registry:

  5. Get the KubernetesVersion from the image tag for kube-apiserver in the cluster namespace of the admin cluster:

    kubectl --kubeconfig ADMIN_KUBECONFIG get sts -n CLUSTER_NAME kube-apiserver -o yaml | grep image

    Replace ADMIN_KUBECONFIG with the kubeconfig file for your admin cluster and CLUSTER_NAME with the name of your user cluster.

    The following example output shows that you would use v1.26.2-gke.1001 as the KubernetesVersion:

    imagePullPolicy: IfNotPresent
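
The lookups above can be gathered in one pass. A sketch, assuming the keys appear in the kubectl output as shown in the example outputs; the extract_value helper is illustrative:

```shell
# Illustrative helper: print the value from the first matching "key: value"
# line on stdin, with surrounding quotes stripped.
extract_value() {
  grep "$1" | head -n 1 | sed -e 's/.*: *//' -e 's/"//g'
}

# Against a live cluster you would pipe the kubectl output from the steps
# above, for example:
#   CLUSTER_CIDR=$(kubectl get cm -n kube-system cilium-config -o yaml \
#     | extract_value native-routing-cidr)
#   API_SERVER_PORT=$(kubectl get ds -n kube-system anetd -o yaml \
#     | grep KUBERNETES_SERVICE_PORT -A 1 | extract_value value)

# Demonstration with canned output:
echo '    value: "443"' | extract_value value   # → 443
```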

Prepare kube-proxy manifests

Use the values obtained in the previous section to create and apply a YAML manifest that will deploy kube-proxy to your cluster.

  1. Create a manifest named kube-proxy.yaml in the editor of your choice:

    nano kube-proxy.yaml
  2. Copy and paste the following YAML definition:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      labels:
        k8s-app: kube-proxy
      name: kube-proxy
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          k8s-app: kube-proxy
      template:
        metadata:
          labels:
            k8s-app: kube-proxy
        spec:
          containers:
          - command:
            - kube-proxy
            - --v=2
            - --profiling=false
            - --iptables-min-sync-period=10s
            - --iptables-sync-period=1m
            - --oom-score-adj=-998
            - --ipvs-sync-period=1m
            - --ipvs-min-sync-period=10s
            - --cluster-cidr=ClusterCIDR
            env:
            - name: KUBERNETES_SERVICE_HOST
              value: APIServerAdvertiseAddress
            - name: KUBERNETES_SERVICE_PORT
              value: "APIServerPort"
            image: Registry/kube-proxy-amd64:KubernetesVersion
            imagePullPolicy: IfNotPresent
            name: kube-proxy
            resources:
              requests:
                cpu: 100m
                memory: 15Mi
            securityContext:
              privileged: true
            volumeMounts:
            - mountPath: /run/xtables.lock
              name: xtables-lock
            - mountPath: /lib/modules
              name: lib-modules
          imagePullSecrets:
          - name: RegistryCredentialsSecretName
          hostNetwork: true
          priorityClassName: system-node-critical
          serviceAccount: kube-proxy
          serviceAccountName: kube-proxy
          tolerations:
          - effect: NoExecute
            operator: Exists
          - effect: NoSchedule
            operator: Exists
          volumes:
          - hostPath:
              path: /run/xtables.lock
              type: FileOrCreate
            name: xtables-lock
          - hostPath:
              path: /lib/modules
              type: DirectoryOrCreate
            name: lib-modules
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: system:kube-proxy
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: system:node-proxier
    subjects:
    - kind: ServiceAccount
      name: kube-proxy
      namespace: kube-system
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: kube-proxy
      namespace: kube-system

    In this YAML manifest, set the following values:

    • ClusterCIDR: the native routing CIDR from cm/cilium-config.
    • APIServerAdvertiseAddress: the value of KUBERNETES_SERVICE_HOST, such as
    • APIServerPort: the value of KUBERNETES_SERVICE_PORT, such as 443.
    • Registry: the prefix of the Cilium image, such as
    • RegistryCredentialsSecretName: the image pull secret name, such as private-registry-creds.
    • KubernetesVersion: the image tag of kube-apiserver, such as v1.26.2-gke.1001.
  3. Save and close the manifest file in your editor.
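
Filling in the placeholders by hand is error-prone; sed can substitute them all in one pass. A sketch with hypothetical example values (use the ones you gathered from your own cluster), demonstrated against a two-line excerpt so it is runnable anywhere; assumes GNU sed for the -i flag:

```shell
# Hypothetical example values; replace with the ones from your cluster.
CLUSTER_CIDR="192.168.0.0/16"
REGISTRY="gcr.io/gke-on-prem-release"
K8S_VERSION="v1.26.2-gke.1001"

# Two-line excerpt standing in for the full kube-proxy.yaml.
cat > /tmp/kube-proxy-excerpt.yaml <<'EOF'
- --cluster-cidr=ClusterCIDR
image: Registry/kube-proxy-amd64:KubernetesVersion
EOF

# Substitute the placeholders in place (GNU sed).
sed -i \
  -e "s|ClusterCIDR|${CLUSTER_CIDR}|g" \
  -e "s|Registry/|${REGISTRY}/|g" \
  -e "s|KubernetesVersion|${K8S_VERSION}|g" \
  /tmp/kube-proxy-excerpt.yaml

cat /tmp/kube-proxy-excerpt.yaml
```

The same sed invocation, extended with the remaining placeholders, works against the full kube-proxy.yaml.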

Prepare anetd patch

Create and prepare an update for anetd:

  1. Create a manifest named cilium-config-patch.yaml in the editor of your choice:

    nano cilium-config-patch.yaml
  2. Copy and paste the following YAML definition:

    data:
      kube-proxy-replacement: "disabled"
      kube-proxy-replacement-healthz-bind-address: ""
      retry-kube-proxy-healthz-binding: "false"
      enable-host-reachable-services: "false"
  3. Save and close the manifest file in your editor.

Deploy kube-proxy and reconfigure anetd

Apply your configuration changes to your cluster. Create backups of your existing configuration before you apply the changes.

  1. Back up your current anetd and cilium-config configuration:

    kubectl get ds -n kube-system anetd -o yaml > anetd-original.yaml
    kubectl get cm -n kube-system cilium-config -o yaml > cilium-config-original.yaml
  2. Apply kube-proxy.yaml using kubectl:

    kubectl apply -f kube-proxy.yaml
  3. Check that the Pods are Running:

    kubectl get pods -n kube-system -o wide | grep kube-proxy

    The following example condensed output shows that the Pods are running correctly:

    kube-proxy-f8mp9    1/1    Running   1 (4m ago)    [...]
    kube-proxy-kndhv    1/1    Running   1 (5m ago)    [...]
    kube-proxy-sjnwl    1/1    Running   1 (4m ago)    [...]
  4. Patch the cilium-config ConfigMap using kubectl:

    kubectl patch cm -n kube-system cilium-config --patch-file cilium-config-patch.yaml
  5. Edit anetd using kubectl:

    kubectl edit ds -n kube-system anetd

    In the editor that opens up, edit the spec of anetd. Insert the following as the first item under initContainers:

    - name: check-kube-proxy-rules
      image: Image
      imagePullPolicy: IfNotPresent
      command:
      - sh
      - -ec
      - |
        if [ "$KUBE_PROXY_REPLACEMENT" != "strict" ]; then
          kube_proxy_forward() { iptables -L KUBE-FORWARD; }
          until kube_proxy_forward; do sleep 2; done
        fi;
      env:
      - name: KUBE_PROXY_REPLACEMENT
        valueFrom:
          configMapKeyRef:
            key: kube-proxy-replacement
            name: cilium-config
            optional: true
      securityContext:
        privileged: true

    Replace Image with the same image used in the other Cilium containers in the anetd DaemonSet, such as

  6. Save and close the manifest file in your editor.

  7. To apply these changes, reboot all nodes in your cluster. To minimize disruption, you can attempt to drain each node before the reboot. However, Pods that use RWX volumes might be stuck in a Terminating state because broken NFS mounts block the drain process.

    You can force delete blocked Pods to let the node drain correctly:

    kubectl delete pods --force --grace-period=0 --namespace POD_NAMESPACE POD_NAME

    Replace POD_NAME with the Pod you are trying to delete and POD_NAMESPACE with its namespace.
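
During a rolling reboot, hunting stuck Pods one at a time is tedious. The force-delete step above can be sketched as a loop over every Pod stuck in Terminating (the function name is illustrative; invoke it only when you are sure the Pods are blocked by defunct NFS mounts):

```shell
# Illustrative helper: force delete every Pod currently stuck in Terminating.
# Use only when the Pods are known to be blocked by defunct NFS mounts.
force_delete_stuck_pods() {
  kubectl get pods --all-namespaces --no-headers |
    awk '$4 == "Terminating" {print $1, $2}' |
    while read -r ns name; do
      kubectl delete pod "$name" --namespace "$ns" --force --grace-period=0
    done
}

# Run against the cluster with:
#   force_delete_stuck_pods
```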

What's next

If you need additional assistance, reach out to Cloud Customer Care.