August 30, 2024
See the product overview to learn about the features of Distributed Cloud.
Cluster management:
- Introduced a broader set of Multi-Instance GPU (MIG) profiles (uniform & mixed mode). You can create Google Kubernetes Engine clusters on GPU VMs (A3 VMs) with a variety of GPU slicing schemes and dynamically address the GPU resource needs of services hosting artificial intelligence (AI) workloads.
 
Hardware:
- New DL380a servers with the latest NVIDIA Hopper H100 GPUs (2x2 NVL), paired with the newest 5th Generation Intel processors are available.
 
Virtual machines:
- A new GPU-optimized A3 VM type is available. The A3 VM type has 4x NVIDIA H100 80GB GPUs attached, which can run your AI workloads requiring large language models up to 100B parameters.
 - Smaller A3 VM shapes are introduced, with 1x H100 80GB GPU and 2x H100 80GB GPUs attached per VM. This feature is in Preview.
 
Vertex AI:
- Included support for new file formats of document translation (DOC, PPT, TXT, XLS).
 - Added the API and support for batch document translation.
 - Supported a new format for the accelerator type of MIG GPUs in the resource pool for online predictions.
 - Supported the language auto-detect feature for inline translations and documents stored in buckets.
 - The API platform is in the production stage.
 
- CVE-2021-20230
 - CVE-2022-48655
 - CVE-2022-4968
 - CVE-2022-48674
 - CVE-2023-6270
 - CVE-2023-6597
 - CVE-2023-52752
 - CVE-2024-0397
 - CVE-2024-0450
 - CVE-2024-0760
 - CVE-2024-1724
 - CVE-2024-1737
 - CVE-2024-1975
 - CVE-2024-2201
 - CVE-2024-4032
 - CVE-2024-4076
 - CVE-2024-5569
 - CVE-2024-6655
 - CVE-2024-7264
 - CVE-2024-23307
 - CVE-2024-24861
 - CVE-2024-26583
 - CVE-2024-26584
 - CVE-2024-26585
 - CVE-2024-26586
 - CVE-2024-26642
 - CVE-2024-26643
 - CVE-2024-26828
 - CVE-2024-26886
 - CVE-2024-26889
 - CVE-2024-26907
 - CVE-2024-26922
 - CVE-2024-26923
 - CVE-2024-26925
 - CVE-2024-26926
 - CVE-2024-27019
 - CVE-2024-29068
 - CVE-2024-29069
 - CVE-2024-35235
 - CVE-2024-36016
 - CVE-2024-37370
 - CVE-2024-37371
 - CVE-2024-38428
 
Updated the Rocky OS image version to 20240731 to apply the latest security patches and important updates.
Billing:
-   
User fails to create 
BillingAccountBindingdue to validation webhook error. 
Block storage:
-   
Grafana pods stuck in 
Initstate due to volume mount errors. - There is a Trident multi-attach error.
 
Database Service:
-   
The 
dbs-fleetsubcomponent has a reconciliation error when upgrading. -   
The 
DBClustercreation fails after upgrade. 
Identity and access management:
-   
The 
gatekeeper-auditpods in theopa-systemnamespace frequently restart. 
Monitoring:
- The Cortex store gateway pods can crashloop on startup while syncing with the storage backend. The pods exceed their memory limits, causing Kubernetes to terminate them.
 - The Kube control-plane metrics proxy pods can crashloop with image pull backoff error.
 -   
A growth in WAL (write-ahead log) causes Prometheus to use a lot of memory. The system control plane VM node reports 
NodeHasInsufficientMemoryandEvictionThresholdMetevents because of this issue. 
Networking:
- The switch image failed to extract or pull an image.
 
Object storage:
- Some object storage upgrade warnings can be ignored.
 
Operating system:
-   
Pods are stuck in a 
ContainerCreatingstate on a single node. 
Physical servers:
- The DL380a server fails to provision.
 
Upgrade:
- A Helm failure during upgrade causes a series of rollbacks.
 - When upgrading from HW2.0 and Ubuntu, the node upgrade incorrectly displays RockyLinux.
 -   
The 
dhcp-tftp-core-serverpod is not drained. -   
The 
OrganizationUpgradeis stuck at node upgrade stage. - Intermittent connectivity failure to external cluster VIP.
 - Kernel fails to create container.
 -   
An 
Incorrect version of Tridenterror appears during upgrade. - During user cluster provisioning, some pods fail to be scheduled.
 -   
The tenant organization upgrade fails at the preflight check stage with 
ErrImagePull. - The root org upgrade is stuck on a failed signature job.
 - During upgrade, the task for a root organization fails due to missing service accounts.
 -   
Upgrade fails on 
shared-service-cluster upgrade - The node fails during the user cluster upgrade.
 - The root organization upgrade fails for preflight check.
 -   
There is a persistent timeout during the initial root 
organizationupgrade. -  
The 
obj-syslog-serversubcomponent fails reconciliation in the root org. 
Virtual machines:
-  
The NVIDIA device plugin 
DaemonSetfails with thedriver rpc errormessage on cluster nodes with GPUs. This issue causes GPUs to be unavailable for virtual machines and pods. - System cluster VM not ready.
 - A data volume reports that the scratch space is not found.
 -  
The 
obj-syslog-serversubcomponent fails reconciliation in the root org. 
Vertex AI:
-  
The 
streaming_recognizepre-trained API function of Speech-to-Text fails because of an issue with the client library. -  
Job status polling is not supported for the 
batchTranslateDocumentAPI. -  
batchTranslateDocumentrequests might cause performance issues. - The first time you enable pre-trained APIs, the GDC console might show an inconsistent status after a few minutes.
 -  
Translation requests with more than 250 characters can crash 
translation-prediction-serverpods. -   
The 
GPUAllocationfor shared service cluster is not configured correctly. - When upgrading from version 1.9.x to 1.13.3, the Operable Component Lifecycle Management (OCLCM) controller for Vertex AI subcomponents might show errors.
 -  
Translation requests might generate the 
RESOURCE_EXHAUSTEDerror code when the system frequency limit has been exceeded. -  
batchTranslateDocumentrequests return error503 "Batch Document translation is not implementedif theenableRAGoperable parameter is not set totruein the cluster. 
Monitoring:
- Fixed an issue where the Prober ConfigMap gets reset to include no probe jobs.
 
Networking:
-  
Fixed an issue with a 
PodCIDRnot assigned to nodes even though aClusterCIDRConfigis created. 
Operating system:
- Fixed an issue with the 
bm-system-machine-preflight-checkAnsible job for a bare metal or VM node failing withEither ip_tables or nf_tables kernel module must be loaded. 
Physical servers:
- Fixed an issue with the server bootstrap failing due to POST issues on the HPE server.
 
Upgrade:
- Fixed an issue with upgrade failing in the  
iac-zoneselection-globalsubcomponent. 
Vertex AI:
-  
Fixed an issue where the 
MonitoringTargetshows aNot Readystatus when user clusters are being created, causing pre-trained APIs to continually show anEnablingstate in the user interface. 
Add-on Manager:
The Google Distributed Cloud for bare metal version is updated to 1.29.300-gke.185 to apply the latest security patches and important updates.
See Google Distributed Cloud for bare metal 1.29.300-gke.185 release notes for details.
Upgrade:
- The upgrade documentation provides estimated durations for the different stages of the upgrade process.