We recently encountered a Kubernetes cluster that had suffered a catastrophic etcd failure: the three nodes running etcd had suddenly been reduced to one, leaving the cluster without quorum. Repairing the situation required action on a number of fronts.
We had no viable backups, so we had to rely on the db file that was left on the remaining etcd node.
Our main problem was that the Kubernetes kube-apiserver service was stuck in a restart loop, as it was unable to obtain data from etcd. Without a working control plane, no Kubernetes operations were possible, so to restore the API we first needed to fix the etcd issue.
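For context, the symptom on a Docker-based control plane looks roughly like the sketch below; the container name filter is illustrative rather than taken from our cluster:

```
# Illustrative diagnostics on the affected control-plane node (Docker runtime assumed)
docker ps -a --filter name=kube-apiserver     # shows the apiserver container constantly restarting
docker logs --tail 20 $(docker ps -aq --filter name=kube-apiserver | head -n 1)
# the log output is dominated by errors about unreachable etcd endpoints
```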
Attempts to run etcdctl commands in the etcd container were thwarted by the corrupt state of that container: we could not run any command to remove the failed members of the etcd cluster or to inspect the data.
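For reference, these are the kinds of commands we were trying, and failing, to run; the member ID below is a placeholder, not a value from our cluster:

```
# Inside the surviving etcd container (etcd v3 API assumed)
etcdctl member list                         # inspect cluster membership
etcdctl endpoint status --write-out=table   # check leader and data status
etcdctl member remove <failed-member-id>    # drop a dead member -- not possible without quorum
```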
The loss-of-quorum discussion in Andrei Kvapil’s blog post “Breaking down and fixing etcd cluster” (see blog), together with the Stack Overflow question “How to start a stopped Docker container with a different command?” (see stackoverflow), provided us with enough information to solve our problem.
To return our etcd service to a working state we needed to (a rough sketch follows the list):

- edit the etcd container's config.v2.json file and add the --force-new-cluster flag to the Args array
- restart the container, after which etcdctl commands could be executed in the etcd container
- edit the config.v2.json file again, this time to remove the --force-new-cluster flag from the Args array
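The sketch below follows the Stack Overflow approach of editing the container's on-disk configuration. The container ID, the paths, and the use of jq and systemctl are illustrative assumptions about the setup; in particular, the Docker daemon generally needs to be stopped while config.v2.json is edited, since the file is only read when the daemon starts:

```
# Assumptions: etcd runs as a plain Docker container and <etcd-container-id>
# is a placeholder for the full container ID.
CID=<etcd-container-id>

systemctl stop docker                         # config.v2.json is only read at daemon start-up
cd /var/lib/docker/containers/$CID

# add --force-new-cluster to the Args array (jq used here for illustration)
jq '.Args += ["--force-new-cluster"]' config.v2.json > config.v2.json.new \
  && mv config.v2.json.new config.v2.json

systemctl start docker
docker start $CID                             # etcd comes back up as a single-member cluster

# ... run the etcdctl member/endpoint commands and let the apiserver recover ...

# then reverse the edit so the flag is not applied on every future restart
systemctl stop docker
jq '.Args -= ["--force-new-cluster"]' config.v2.json > config.v2.json.new \
  && mv config.v2.json.new config.v2.json
systemctl start docker
docker start $CID
```

Leaving --force-new-cluster in place permanently is risky, which is why removing the flag again was part of the procedure rather than an afterthought.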