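Troubleshooting

This page provides troubleshooting steps for some common issues and errors.

FAILED instance

The FAILED state means that the instance data has been lost, and the instance must be deleted.

Parallelstore instances in a FAILED state continue to be billed until they're deleted.

To retrieve an instance's state, follow the instructions at Manage instances: Retrieve an instance.

To delete an instance, read Manage instances: Delete an instance.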
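
As a rough illustration of the check described above, the following Python sketch decides whether an instance needs to be deleted. It assumes you have already captured the instance description as JSON and that the description exposes the state as a top-level `state` string. The field name, the sample instance name, and the helper name `needs_deletion` are assumptions for illustration, not taken from this page.

```python
import json


def needs_deletion(instance_json: str) -> bool:
    """Return True when the described instance reports the FAILED state."""
    instance = json.loads(instance_json)
    # Assumption: the state is exposed as a top-level "state" string.
    return instance.get("state") == "FAILED"


# Hypothetical instance description, shortened to the relevant field.
print(needs_deletion('{"name": "my-instance", "state": "FAILED"}'))
```

Because a FAILED instance keeps accruing charges, a check like this could be run periodically so that failed instances are noticed and deleted promptly.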

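Timeouts during dfuse mount or network tests

If the dfuse -m command times out when you mount your Parallelstore instance, or if network test commands such as self_test or daos health net-test time out, the cause may be a network connectivity issue.

To verify connectivity to the Parallelstore servers, run: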

self_test --use-daos-agent-env -r 1

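If the test reports a connection issue, there are two possible causes:

The DAOS agent may have selected the wrong network interface during setup

You may need to exclude network interfaces that are unable to reach the IPs in the access_points list.

Run ifconfig to list the available network interfaces. The output may show multiple network interfaces, such as eth0, docker0, ens8, and lo.
Stop the daos_agent.
Edit /etc/daos/daos_agent.yml to exclude the unwanted network interfaces. Uncomment the exclude_fabric_ifaces line and update the values. The entries you include are specific to your situation. For example: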
```
exclude_fabric_ifaces: ["docker0", "ens8", "lo"]
```
Restart the daos_agent.
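
The value for exclude_fabric_ifaces in the steps above can be derived mechanically once you know which interface should carry the traffic. The following Python sketch builds the line from the interface names reported by ifconfig. The helper name `exclude_line` and the choice of eth0 as the interface to keep are assumptions for illustration; which interface can actually reach the access_points IPs depends on your environment.

```python
import json


def exclude_line(all_ifaces, keep):
    """Build an exclude_fabric_ifaces line that excludes every interface except `keep`."""
    if keep not in all_ifaces:
        raise ValueError(f"{keep!r} is not among the listed interfaces")
    excluded = [name for name in all_ifaces if name != keep]
    # json.dumps emits a double-quoted list, which is also valid YAML flow syntax.
    return "exclude_fabric_ifaces: " + json.dumps(excluded)


# Interface names taken from the example ifconfig output above.
print(exclude_line(["eth0", "docker0", "ens8", "lo"], keep="eth0"))
```

The printed line is meant to be pasted into /etc/daos/daos_agent.yml by hand; the sketch does not edit the file or restart the agent.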

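The instance or client IP address conflicts with internal IP addresses

Parallelstore instances and clients cannot use an IP address from the 172.17.0.0/16 subnet range. See Known issues for more information.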
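
To check a client address or a planned subnet against this restriction, a small helper built on Python's standard ipaddress module may be enough. This is a minimal sketch; the function name `conflicts` and the sample addresses are illustrative assumptions.

```python
import ipaddress

# The range that Parallelstore instances and clients cannot use.
RESERVED = ipaddress.ip_network("172.17.0.0/16")


def conflicts(value: str) -> bool:
    """Return True if an IP address or CIDR range overlaps 172.17.0.0/16."""
    # strict=False lets a plain host address be treated as a /32 network.
    network = ipaddress.ip_network(value, strict=False)
    return network.overlaps(RESERVED)


for candidate in ["172.17.3.4", "10.128.0.5", "172.16.0.0/12"]:
    print(candidate, conflicts(candidate))
```

Note that a wider range such as 172.16.0.0/12 is also flagged, because it contains the restricted subnet.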

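`ENOSPC` when there is unused capacity in the instance

If your instance uses minimum or (the default) balanced striping, you may encounter ENOSPC errors even if the existing files are not using all of the instance's capacity. This is likely to happen when writing large files, generally greater than 8 GiB, or when importing such files from Cloud Storage.

Use maximum file striping to reduce the likelihood of these errors.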

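Google Kubernetes Engine troubleshooting

The following section lists some common issues and steps to resolve them.

`Transport endpoint is not connected` in workload Pods

This error is caused by dfuse termination. In most cases, dfuse was terminated because it ran out of memory. Use the Pod annotations gke-parallelstore/[cpu-limit|memory-limit] to allocate more resources to the Parallelstore sidecar container. If you don't know how much memory to allocate to the sidecar, you can set gke-parallelstore/memory-limit: "0" to remove the sidecar memory limit. Note that this only works on Standard clusters; on Autopilot clusters, you cannot use the value 0 to unset the sidecar container's resource limits and requests. You must explicitly set a larger resource limit for the sidecar container.

After you modify the annotations, you must restart the workload Pod. Adding annotations to a running workload doesn't dynamically modify the resource allocation.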
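
One way to apply the annotations and restart the Pods in a single step is to patch the Pod template of the owning workload. The following Python sketch only builds the patch document. It assumes the workload is a Deployment, and the helper name `sidecar_limit_patch` and the sample values are hypothetical. On Autopilot clusters, pass an explicit limit rather than "0".

```python
import json
from typing import Optional


def sidecar_limit_patch(memory_limit: str, cpu_limit: Optional[str] = None) -> str:
    """Build a JSON merge patch that sets the sidecar limit annotations on a Pod template."""
    annotations = {"gke-parallelstore/memory-limit": memory_limit}
    if cpu_limit is not None:
        annotations["gke-parallelstore/cpu-limit"] = cpu_limit
    # The annotations live on the Pod template, not on the Deployment itself.
    patch = {"spec": {"template": {"metadata": {"annotations": annotations}}}}
    return json.dumps(patch)


# Remove the sidecar memory limit (Standard clusters only).
print(sidecar_limit_patch("0"))
```

The resulting JSON could be passed to `kubectl patch deployment DEPLOYMENT_NAME -n NAMESPACE --patch '...'`. Because the patch changes the Pod template, the Deployment rolls out replacement Pods, which satisfies the restart requirement described above.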

Pod event warnings

If your workload Pods cannot start up, check the Pod events:

kubectl describe pod POD_NAME -n NAMESPACE

The following sections describe solutions for common errors.

CSI driver enablement issues

Common CSI driver enablement errors are as follows:

MountVolume.MountDevice failed for volume "volume" : kubernetes.io/csi:
attacher.MountDevice failed to create newCsiDriverClient:
driver name parallelstore.csi.storage.gke.io not found in the list of registered CSI drivers

MountVolume.SetUp failed for volume "volume" : kubernetes.io/csi:
mounter.SetUpAt failed to get CSI client:
driver name parallelstore.csi.storage.gke.io not found in the list of registered CSI drivers

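These warnings indicate that the CSI driver is not enabled, or not running.

If your cluster was just scaled, updated, or upgraded, this warning is normal and should be transient. It can take a few minutes for the CSI driver Pods to become functional after cluster operations.

Otherwise, confirm that the CSI driver is enabled on your cluster. See Enable the CSI driver for details. If the CSI driver is enabled, each node shows a running Pod named parallelstore-csi-node-id.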

AttachVolume.Attach failures

After the Pod is scheduled to a node, the volume is attached to the node and, if node mount is used, the mounter Pod is created.

This happens on the controller and involves the AttachVolume step from attachdetach-controller.

Error code	Pod event warning	Solution
InvalidArgument	`AttachVolume.Attach failed for volume "volume" : rpc error: code = InvalidArgument desc = an error occurred while preparing mount options: invalid mount options`	Invalid mount flags were passed to the PersistentVolume or StorageClass. See the supported dfuse mount options for details.
NotFound	`AttachVolume.Attach failed for volume "volume" : rpc error: code = NotFound desc = failed to get instance "instance"`	The Parallelstore instance doesn't exist. Verify that the PersistentVolume's volumeHandle has the correct format.

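MountVolume.MountDevice failures

After the volume is attached to a node, the volume is staged to the node.

This happens on the node and involves the MountVolume.MountDevice step from kubelet.

Error code	Pod event warning	Solution
FailedPrecondition	`MountVolume.MountDevice failed for volume "volume" : rpc error: code = FailedPrecondition desc = mounter pod "pod" expected to exist but was not found`	This error is usually caused by manually deleting the mounter Pod. Delete all the workloads that consume the PVC and redeploy them. This creates a new mounter Pod.
DeadlineExceeded	`MountVolume.MountDevice failed for volume "volume": rpc error: code = DeadlineExceeded desc = context deadline exceeded`	The Parallelstore instance can't be reached. Verify that your VPC network and access points are configured correctly.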

MountVolume.SetUp failures

After the volume is staged to the node, the volume is mounted and provided to the container on the Pod. This happens on the node and involves the MountVolume.SetUp step from kubelet.

Pod mount

Error code	Pod event warning	Solution
ResourceExhausted	`MountVolume.SetUp failed for volume "volume" : rpc error: code = ResourceExhausted desc = the sidecar container failed with error: signal: killed` `MountVolume.SetUp failed for volume "volume" : rpc error: code = ResourceExhausted desc = the sidecar container terminated due to OOMKilled, exit code: 137`	The dfuse process was terminated, usually because of an out-of-memory (OOM) condition. Consider increasing the sidecar container memory limit with the `gke-parallelstore/memory-limit` annotation. If you're unsure how much memory to allocate to the parallelstore-sidecar, we recommend setting `gke-parallelstore/memory-limit: "0"` to remove the memory limit imposed by Parallelstore.
Aborted	`MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = NodePublishVolume request is aborted due to rate limit` `MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = An operation with the given volume key key already exists`	The volume mount operation was aborted because of rate limiting or an existing operation. This warning is normal and should be transient.
InvalidArgument	`MountVolume.SetUp failed for volume "volume" : rpc error: code = InvalidArgument desc =`	If you provided invalid arguments in the StorageClass or PersistentVolume, the error log indicates the fields that contain the invalid arguments. For dynamic provisioning, check the StorageClass. For static provisioning, check the PersistentVolume.
FailedPrecondition	`MountVolume.SetUp failed for volume "volume" : rpc error: code = FailedPrecondition desc = can not find the sidecar container in Pod spec`	The Parallelstore sidecar container was not injected. Verify that the `gke-parallelstore/volumes: "true"` Pod annotation is set correctly.

Node mount

Error code	Pod event warning	Solution
Aborted	`MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = NodePublishVolume request is aborted due to rate limit` `MountVolume.SetUp failed for volume "volume" : rpc error: code = Aborted desc = An operation with the given volume key key already exists`	The volume mount operation was aborted because of rate limiting or an existing operation. This warning is normal and should be transient.
InvalidArgument	`MountVolume.SetUp failed for volume "volume" : rpc error: code = InvalidArgument desc =`	If you provided invalid arguments in the StorageClass or PersistentVolume, the error log indicates the fields that contain the invalid arguments. For dynamic provisioning, check the StorageClass. For static provisioning, check the PersistentVolume.
FailedPrecondition	`MountVolume.SetUp failed for volume "volume" : rpc error: code = FailedPrecondition desc = mounter pod expected to exist but was not found`	The Parallelstore mounter Pod doesn't exist. If the mounter Pod was deleted by accident, recreate all workloads to trigger recreation of the mounter Pod.
DeadlineExceeded	`MountVolume.SetUp failed for volume "volume" : rpc error: code = DeadlineExceeded desc = timeout waiting for mounter pod gRPC server to become available`	The mounter Pod's gRPC server did not start. Check the mounter Pod logs for any errors.
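
The tables above are keyed by the failing step and the gRPC error code, both of which appear in the warning text. The following Python sketch extracts that pair from a warning so you can find the matching row. The helper name `classify` is hypothetical, and the pattern is based only on the warning formats shown on this page, so other message shapes may not match.

```python
import re

# Matches the step name and the gRPC code in warnings shaped like the table entries above.
_PATTERN = re.compile(
    r"(?P<step>AttachVolume\.Attach|MountVolume\.MountDevice|MountVolume\.SetUp)"
    r" failed for volume .*?rpc error: code = (?P<code>\w+)"
)


def classify(warning: str):
    """Return (step, error_code) for a Pod event warning, or None if it doesn't match."""
    match = _PATTERN.search(warning)
    if match is None:
        return None
    return match.group("step"), match.group("code")


sample = (
    'MountVolume.MountDevice failed for volume "volume": '
    "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
)
print(classify(sample))
```

Warnings without an `rpc error` part, such as the CSI driver enablement messages, return None and are handled by the CSI driver section instead.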

Troubleshooting VPC networks

Permission denied to add peering for service `servicenetworking.googleapis.com`

ERROR: (gcloud.services.vpc-peerings.connect) User [$(USER)] does not have 
permission to access services instance [servicenetworking.googleapis.com]
(or it may not exist): Permission denied to add peering for service
'servicenetworking.googleapis.com'.

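This error means that your user account doesn't have the servicenetworking.services.addPeering IAM permission.

See Access control with IAM for instructions on adding one of the following roles to your account:

roles/compute.networkAdmin or
roles/servicenetworking.networksAdmin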

Cannot modify allocated ranges in CreateConnection

ERROR: (gcloud.services.vpc-peerings.connect) The operation
"operations/[operation_id]" resulted in a failure "Cannot modify allocated
ranges in CreateConnection. Please use UpdateConnection.

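This error is returned when you have already created a VPC peering on this network with different IP ranges. There are two possible solutions:

Replace the existing IP ranges: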

gcloud services vpc-peerings update \
  --network=NETWORK_NAME \
  --ranges=IP_RANGE_NAME \
  --service=servicenetworking.googleapis.com \
  --force

Or, add the new IP range to the existing connection:

Retrieve the list of existing IP ranges for the peering:

EXISTING_RANGES=$(
  gcloud services vpc-peerings list \
    --network=NETWORK_NAME \
    --service=servicenetworking.googleapis.com \
    --format="value(reservedPeeringRanges.list())"
)

Then, add the new range to the peering:

gcloud services vpc-peerings update \
  --network=NETWORK_NAME \
  --ranges=$EXISTING_RANGES,IP_RANGE_NAME \
  --service=servicenetworking.googleapis.com
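
If the list of existing ranges is empty or already contains the new range, plain concatenation in the --ranges value produces a stray comma or a duplicate entry. The following Python sketch shows one way to build a clean value. It assumes the existing names arrive separated by commas or newlines; the helper name `merged_ranges` and the range names are hypothetical.

```python
def merged_ranges(existing: str, new_range: str) -> str:
    """Combine existing range names with a new one, skipping blanks and duplicates."""
    names = []
    for name in existing.replace("\n", ",").split(",") + [new_range]:
        name = name.strip()
        if name and name not in names:
            names.append(name)
    return ",".join(names)


# Hypothetical range names for illustration.
print(merged_ranges("range-a,range-b", "range-c"))
print(merged_ranges("", "range-c"))
```

The returned string is intended to be used as the --ranges value in the update command above.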

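IP address range exhausted

Instance creation may fail with the following range exhausted error:

ERROR: (gcloud.alpha.parallelstore.instances.create) FAILED_PRECONDITION: Invalid
resource state for "NETWORK_RANGES_NOT_AVAILABLE": IP address range exhausted

If you see this error message, follow the VPC guide to either recreate the IP range or extend the existing IP range.

If you're recreating a Parallelstore instance, you must recreate the IP range instead of extending it.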

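Maintenance blocked due to restrictive Pod disruption budget

The Google Cloud console may display the following error message, which indicates that maintenance can't proceed because a Pod Disruption Budget (PDB) is configured to allow zero Pod evictions: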

GKE can't perform maintenance because the Pod Disruption Budget allows for 0 Pods evictions.

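If you see this error message, complete the following steps to identify the problematic Pod:

Click the error message to open the error insight panel.
In the Unpermissive Pod Disruption Budgets section, check the name of the Pod.
If the Pod is parallelstorecsi-mount, you can ignore this error, because it doesn't prevent maintenance. For any other Pod, examine your PDB.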
