Troubleshoot etcd for GKE on AWS

This page shows you how to resolve issues with etcd for GKE on AWS.

etcd data disk is full

When the etcd data disk is full, you might observe the problem in one of the following ways:

  • The etcd logs might show write errors that report no space left on the device:

    rafthttp: failed to save KV snapshot (write /var/etcd/data/member/snap/tmp720030520: no space left on device)

    You might also see timeout errors for connections to peers:

    rafthttp: health check for peer [peer-id] could not connect: dial tcp [peer-ip]:2380: i/o timeout

  • The etcd server doesn't start. The serial port logs might indicate that etcd can't start due to lack of space:

    failed on file /dev/stdout (No space left on device)
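The symptoms above can be checked from a shell on the affected node. The following is a minimal sketch, not a definitive procedure: the function names `has_no_space_error` and `etcd_disk_usage` are hypothetical helpers, and the default path `/var/etcd/data` is assumed from the log line shown above.

```shell
# Hypothetical helper: succeeds if the log text on stdin contains the
# "no space left on device" error (case-insensitive match).
has_no_space_error() {
  grep -qi 'no space left on device'
}

# Hypothetical helper: print disk usage for the etcd data directory.
# The default path is an assumption based on the log line above;
# pass a different path as the first argument if yours differs.
etcd_disk_usage() {
  df -h "${1:-/var/etcd/data}"
}
```

To use the first helper, pipe the saved etcd or serial port log text into `has_no_space_error` and check its exit status.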

To determine the size of your etcd instance, use one of the following methods:


  1. Connect to one of the master nodes using SSH and run the following command:

    ETCDCTL_API=3 etcdctl --write-out=table endpoint status

    The DB SIZE column indicates the size used, as shown in the following condensed example output:

    |    ENDPOINT      |        ID        | VERSION | DB SIZE |
    |                  | 4917a7ab173fabe7 |  3.5.0  |   45 kB |
    |                  | 59796ba9cd1bcd72 |  3.5.0  |   45 kB |
    |                  | 94df724b66343e6c |  3.5.0  |   45 kB |
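
If you only need the member ID and database size, the table output can be filtered. This is a sketch under assumptions: `db_sizes` is a hypothetical helper, and it assumes the column order shown above (ENDPOINT, ID, VERSION, DB SIZE), which might differ in other etcdctl versions.

```shell
# Hypothetical helper: read `etcdctl --write-out=table endpoint status`
# output on stdin and print "<member ID> <DB SIZE>" for each data row.
# Assumes ID is the second column and DB SIZE is the fourth.
db_sizes() {
  awk -F'|' 'NF >= 5 {
    id = $3; size = $5
    # Trim leading and trailing whitespace from both fields.
    gsub(/^[ \t]+|[ \t]+$/, "", id)
    gsub(/^[ \t]+|[ \t]+$/, "", size)
    # Skip the header row and any row without an ID.
    if (id != "" && id != "ID") print id, size
  }'
}
```

For example, on a node where etcdctl is configured: `ETCDCTL_API=3 etcdctl --write-out=table endpoint status | db_sizes`.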


Alternatively, check the size in Cloud Monitoring:

  1. In the Google Cloud console, go to the Cloud Monitoring page.

  2. Select Metrics explorer.

  3. Select the etcd_mvcc_db_total_size_in_bytes metric.
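
The metric reports a raw byte count, so it can help to relate it to the capacity of the data disk. The following sketch assumes you copy both numbers by hand; `db_percent_of_disk` is a hypothetical helper. The etcd database is not the only content on the data disk, so treat the result as a lower bound on disk usage.

```shell
# Hypothetical helper: percentage of the data disk taken up by the etcd database.
#   $1 = value of etcd_mvcc_db_total_size_in_bytes
#   $2 = capacity of the data disk, in bytes
db_percent_of_disk() {
  awk -v db="$1" -v disk="$2" 'BEGIN { printf "%.1f\n", db * 100 / disk }'
}
```

For example, `db_percent_of_disk 2147483648 10737418240` compares a 2 GiB database against a 10 GiB disk.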

To resolve this issue, resize the data disk for etcd using the appropriate procedure for your storage provider and operating system. Add enough additional space to account for future etcd growth.

  1. After the disk is resized, check if there's still a warning on disk space:

    ETCDCTL_API=3 etcdctl alarm list

  2. If the output still reports a NOSPACE alarm, disarm the alarm as follows:

    ETCDCTL_API=3 etcdctl alarm disarm
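
The two steps above can be combined so that the alarm is disarmed only when NOSPACE is still reported. This is a minimal sketch: `disarm_if_nospace` is a hypothetical helper, and it assumes etcdctl is on the PATH and already configured to reach the cluster.

```shell
# Hypothetical helper: disarm the etcd alarm only if NOSPACE is still reported.
# Run this only after the data disk has been resized.
disarm_if_nospace() {
  # Capture the alarm list; stop if etcdctl itself fails.
  alarms="$(ETCDCTL_API=3 etcdctl alarm list)" || return 1
  case "$alarms" in
    *NOSPACE*)
      ETCDCTL_API=3 etcdctl alarm disarm
      ;;
    *)
      echo "No NOSPACE alarm reported; nothing to disarm."
      ;;
  esac
}
```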

