August 30, 2024
See the product overview to learn about the features of Distributed Cloud.
Cluster management:
- Added the capability to use Multi-Instance GPU (MIG) profiles to partition GPU instances running container workloads.
H100 support:
- With 1.13.3, Google Distributed Cloud air-gapped integrates with the latest NVIDIA Hopper H100 GPUs, paired with the newest 5th Generation Intel processors and a new A3 VM.
Vertex AI:
- Included support for new file formats of document translation (DOC, PPT, TXT, XLS).
- Added the API and support for batch document translation.
- Supported a new format for the accelerator type of MIG GPUs in the resource pool for online predictions.
- Supported the language auto-detect feature for inline translations and documents stored in buckets.
- The API platform is in the production stage.
- CVE-2021-20230
- CVE-2022-48655
- CVE-2022-4968
- CVE-2022-48674
- CVE-2023-6270
- CVE-2023-6597
- CVE-2023-52752
- CVE-2024-0397
- CVE-2024-0450
- CVE-2024-0760
- CVE-2024-1724
- CVE-2024-1737
- CVE-2024-1975
- CVE-2024-2201
- CVE-2024-4032
- CVE-2024-4076
- CVE-2024-5569
- CVE-2024-6655
- CVE-2024-7264
- CVE-2024-23307
- CVE-2024-24861
- CVE-2024-26583
- CVE-2024-26584
- CVE-2024-26585
- CVE-2024-26586
- CVE-2024-26642
- CVE-2024-26643
- CVE-2024-26828
- CVE-2024-26886
- CVE-2024-26889
- CVE-2024-26907
- CVE-2024-26922
- CVE-2024-26923
- CVE-2024-26925
- CVE-2024-26926
- CVE-2024-27019
- CVE-2024-29068
- CVE-2024-29069
- CVE-2024-35235
- CVE-2024-36016
- CVE-2024-37370
- CVE-2024-37371
- CVE-2024-38428
Updated the Rocky OS image version to 20240731 to apply the latest security patches and important updates.
Block storage:
- Grafana pods stuck in `Init` state due to volume mount errors.
Database Service:
- The `dbs-fleet` subcomponent has a reconciliation error when upgrading.
Identity and access management:
- The `gatekeeper-audit` pods in the `opa-system` namespace frequently restart.
Monitoring:
- The Cortex store gateway pods can crashloop on startup while syncing with the storage backend. The pods exceed their memory limits, causing Kubernetes to terminate them.
- The Kube control-plane metrics proxy pods can crashloop with image pull backoff error.
- Growth in the WAL (write-ahead log) causes Prometheus to use a lot of memory. The system control plane VM node reports `NodeHasInsufficientMemory` and `EvictionThresholdMet` events because of this issue.
Networking:
- The switch image failed to extract or pull an image.
Object storage:
- Some object storage upgrade warnings can be ignored.
Operating system:
- Pods are stuck in a `ContainerCreating` state on a single node.
Upgrade:
- A Helm failure during upgrade causes a series of rollbacks.
- When upgrading from HW2.0 and Ubuntu, the node upgrade incorrectly displays RockyLinux.
- The `dhcp-tftp-core-server` pod is not drained.
- The `OrganizationUpgrade` is stuck at node upgrade stage.
- Intermittent connectivity failure to external cluster VIP.
- Kernel fails to create container.
- An `Incorrect version of Trident` error appears during upgrade.
- During user cluster provisioning, some pods fail to be scheduled.
Virtual machines:
- The NVIDIA device plugin `DaemonSet` fails with the `driver rpc error` message on cluster nodes with GPUs. This issue causes GPUs to be unavailable for virtual machines and pods.
- System cluster VM not ready.
- A data volume reports that the scratch space is not found.
Vertex AI:
- The `streaming_recognize` pre-trained API function of Speech-to-Text fails because of an issue with the client library.
- Job status polling is not supported for the `batchTranslateDocument` API.
- `batchTranslateDocument` requests might cause storage client outage.
- The first time you enable pre-trained APIs, the GDC console might show an inconsistent status after a few minutes.
- Translation requests with more than 250 characters can crash `translation-prediction-server` pods.
- The `GPUAllocation` for shared service cluster is not configured correctly.
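For the 250-character translation issue listed above, one possible client-side mitigation is to split long input into smaller pieces before sending it. The following is a minimal sketch only: these notes do not document an official workaround, the helper name `split_text` is hypothetical, and it assumes the limit applies to the character count of each request's text. Splitting on whitespace can also separate sentences, which may affect translation quality.

```python
def split_text(text: str, limit: int = 250) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Chunks break on whitespace where possible; a single word longer
    than `limit` is hard-split. Runs of whitespace are collapsed.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")

    chunks: list[str] = []
    current = ""
    for word in text.split():
        # Hard-split any word that cannot fit in a chunk on its own.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]

        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            # The current chunk is full; start a new one with this word.
            chunks.append(current)
            current = word

    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be sent as its own translation request and the results joined in order.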
Monitoring:
- Fixed an issue where the Prober ConfigMap gets reset to include no probe jobs.
Networking:
- Fixed an issue with a `PodCIDR` not assigned to nodes even though a `ClusterCIDRConfig` is created.
Operating system:
- Fixed an issue with the `bm-system-machine-preflight-check` Ansible job for a bare metal or VM node failing with `Either ip_tables or nf_tables kernel module must be loaded`.
Physical servers:
- Fixed an issue with the server bootstrap failing due to POST issues on the HPE server.
Vertex AI:
- Fixed an issue where the `MonitoringTarget` shows a `Not Ready` status when user clusters are being created, causing pre-trained APIs to continually show an `Enabling` state in the user interface.
Add-on Manager:
The Google Distributed Cloud for bare metal version is updated to 1.29.300-gke.185 to apply the latest security patches and important updates.
See Google Distributed Cloud for bare metal 1.29.300-gke.185 release notes for details.