Google Distributed Cloud air-gapped 1.13.3 release notes

August 30, 2024


Google Distributed Cloud (GDC) air-gapped 1.13.3 is available.
See the product overview to learn about the features of Distributed Cloud.

Cluster management:

H100 support:

  • With 1.13.3, Google Distributed Cloud air-gapped integrates with the latest NVIDIA Hopper H100 GPUs, paired with the newest 5th Generation Intel processors and a new A3 VM.

Vertex AI:



Updated the Rocky OS image version to 20240731 to apply the latest security patches and important updates.


Block storage:

  • Grafana pods stuck in Init state due to volume mount errors.

Database Service:

  • The dbs-fleet subcomponent has a reconciliation error when upgrading.

Identity and access management:

  • The gatekeeper-audit pods in the opa-system namespace frequently restart.

Monitoring:

  • The Cortex store gateway pods can crashloop on startup while syncing with the storage backend. The pods exceed their memory limits, causing Kubernetes to terminate them.
  • The Kube control-plane metrics proxy pods can crashloop with image pull backoff error.
  • A growth in WAL (write-ahead log) causes Prometheus to use a lot of memory. The system control plane VM node reports NodeHasInsufficientMemory and EvictionThresholdMet events because of this issue.

Networking:

  • The switch image failed to extract or pull an image.

Object storage:

  • Some object storage upgrade warnings can be ignored.

Operating system:

  • Pods are stuck in a ContainerCreating state on a single node.

Upgrade:

  • A Helm failure during upgrade causes a series of rollbacks.
  • When upgrading from HW2.0 and Ubuntu, the node upgrade incorrectly displays RockyLinux.
  • The dhcp-tftp-core-server pod is not drained.
  • The OrganizationUpgradeis stuck at node upgrade stage.
  • Intermittent connectivity failure to external cluster VIP.
  • Kernel fails to create container.
  • An Incorrect version of Trident error appears during upgrade.
  • During user cluster provisioning, some pods fail to be scheduled.

Virtual machines:

  • The NVIDIA device plugin DaemonSet fails with the driver rpc error message on cluster nodes with GPUs. This issue causes GPUs to be unavailable for virtual machines and pods.
  • System cluster VM not ready.
  • A data volume reports that the scratch space is not found.

Vertex AI:

  • The streaming_recognize pre-trained API function of Speech-to-Text fails because of an issue with the client library.
  • Job status polling is not supported for the batchTranslateDocument API.
  • batchTranslateDocument requests might cause storage client outage.
  • The first time you enable pre-trained APIs, the GDC console might show an inconsistent status after a few minutes.
  • Translation requests with more than 250 characters can crash translation-prediction-server pods.
  • The `GPUAllocation` for shared service cluster is not configured correctly.

Monitoring:

  • Fixed an issue where the Prober ConfigMap gets reset to include no probe jobs.

Networking:

  • Fixed an issue with a PodCIDR not assigned to nodes even though a ClusterCIDRConfig is created.

Operating system:

  • Fixed an issue with the bm-system-machine-preflight-check Ansible job for a bare metal or VM node failing with Either ip_tables or nf_tables kernel module must be loaded.

Physical servers:

  • Fixed an issue with the server bootstrap failing due to POST issues on the HPE server.

Vertex AI:

  • Fixed an issue where the MonitoringTarget shows a Not Ready status when user clusters are being created, causing pre-trained APIs to continually show an Enabling state in the user interface.

Add-on Manager: