背景
电脑休眠状态意外断电导致虚拟机直接进入关机状态。
问题
kubectl命令报错
bash
[root@master01 ~]#kubectl get node
The connection to the server master01.kktb.org:6443 was refused - did you specify the right host or port?
kubelet服务报错
bash
Oct 15 08:39:37 master01.kktb.org kubelet[747]: E1015 08:39:37.251784 747 kubelet.go:2424] "Error getting node" err="node \"master01.kktb.org\" not found"
Oct 15 08:39:37 master01.kktb.org kubelet[747]: E1015 08:39:37.329952 747 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://master01.kktb.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master01.kktb.org?timeout=10s": dial tcp 10.0.6.5:6443: connect: connection refused
Oct 15 08:39:37 master01.kktb.org kubelet[747]: E1015 08:39:37.352614 747 kubelet.go:2424] "Error getting node" err="node \"master01.kktb.org\" not found"
Oct 15 08:39:37 master01.kktb.org kubelet[747]: I1015 08:39:37.384900 747 kubelet_node_status.go:70] "Attempting to register node" node="master01.kktb.org"
Oct 15 08:39:37 master01.kktb.org kubelet[747]: E1015 08:39:37.385258 747 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://master01.kktb.org:6443/api/v1/nodes\": dial tcp 10.0.6.5:6443: connect: connection refused" node="master01.kktb.org"
6443端口无任何程序监听,判断可能是etcd出现了故障
master节点上的容器
bash
[root@master01 kubelet.service.d]#docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9924571324d2 e3ed7dee73e9 "kube-scheduler --au..." 18 minutes ago Up 18 minutes k8s_kube-scheduler_kube-scheduler-master01.kktb.org_kube-system_bca604f760be36f9a20d9b66b0bf821d_7
9d6171792af2 88784fb4ac2f "kube-controller-man..." 18 minutes ago Up 18 minutes k8s_kube-controller-manager_kube-controller-manager-master01.kktb.org_kube-system_13461bcfe1de9f0e85e957cde36c42d2_10
eb3fdcf0b392 registry.k8s.io/pause:3.6 "/pause" 18 minutes ago Up 18 minutes k8s_POD_etcd-master01.kktb.org_kube-system_67c8baa117269300c9b5e18d5b0a0e44_5
302e36ffb616 registry.k8s.io/pause:3.6 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-scheduler-master01.kktb.org_kube-system_bca604f760be36f9a20d9b66b0bf821d_5
a8f8f02011c7 registry.k8s.io/pause:3.6 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-controller-manager-master01.kktb.org_kube-system_13461bcfe1de9f0e85e957cde36c42d2_5
3fde7a0d5aae registry.k8s.io/pause:3.6 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-apiserver-master01.kktb.org_kube-system_5846def9d7d3ade7311eb0f023db33ff_5
镜像都存在
bash
[root@master01 kubelet.service.d]#docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
istio/proxyv2 1.19.1 b3547b3ef18b 2 weeks ago 251MB
ubuntu jammy 3565a89d9e81 2 weeks ago 77.8MB
ubuntu latest 3565a89d9e81 2 weeks ago 77.8MB
flannel/flannel v0.22.3 e23f7ca36333 3 weeks ago 70.2MB
flannel/flannel-cni-plugin v1.2.0 a55d1bad692b 2 months ago 8.04MB
hello-world latest 9c7a54a9a43c 5 months ago 13.3kB
k8s.gcr.io/kube-apiserver v1.24.0 529072250ccc 17 months ago 130MB
k8s.gcr.io/kube-proxy v1.24.0 77b49675beae 17 months ago 110MB
k8s.gcr.io/kube-scheduler v1.24.0 e3ed7dee73e9 17 months ago 51MB
k8s.gcr.io/kube-controller-manager v1.24.0 88784fb4ac2f 17 months ago 119MB
k8s.gcr.io/etcd 3.5.3-0 aebe758cef4c 18 months ago 299MB
k8s.gcr.io/pause 3.7 221177c6082a 19 months ago 711kB
ikubernetes/proxy v0.1.1 a4fcedf206e8 23 months ago 62.4MB
k8s.gcr.io/coredns/coredns v1.8.6 a4ca41631cc7 2 years ago 46.8MB
registry.k8s.io/pause 3.6 6270bb605e12 2 years ago 683kB
ikubernetes/demoapp v1.0 3342b7518915 3 years ago 92.7MB
ps -a看到退出的容器
bash
[root@master01 kubelet.service.d]#docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
07195c9438d3 aebe758cef4c "etcd --advertise-cl..." 34 seconds ago Exited (2) 33 seconds ago k8s_etcd_etcd-master01.kktb.org_kube-system_67c8baa117269300c9b5e18d5b0a0e44_30
532fabab4479 529072250ccc "kube-apiserver --ad..." 3 minutes ago Exited (1) 3 minutes ago k8s_kube-apiserver_kube-apiserver-master01.kktb.org_kube-system_5846def9d7d3ade7311eb0f023db33ff_27
etcd容器报错
bash
[root@master01 kubelet.service.d]#docker logs 07195c9438d3
{"level":"info","ts":"2023-10-15T09:31:34.713Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd","--advertise-client-urls=https://10.0.6.5:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--experimental-initial-corrupt-check=true","--initial-advertise-peer-urls=https://10.0.6.5:2380","--initial-cluster=master01.kktb.org=https://10.0.6.5:2380","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://10.0.6.5:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://10.0.6.5:2380","--name=master01.kktb.org","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]}
{"level":"info","ts":"2023-10-15T09:31:34.714Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
{"level":"info","ts":"2023-10-15T09:31:34.714Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://10.0.6.5:2380"]}
{"level":"info","ts":"2023-10-15T09:31:34.714Z","caller":"embed/etcd.go:479","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2023-10-15T09:31:34.714Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://10.0.6.5:2379","https://127.0.0.1:2379"]}
{"level":"info","ts":"2023-10-15T09:31:34.714Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.3","git-sha":"0452feec7","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"master01.kktb.org","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://10.0.6.5:2380"],"listen-peer-urls":["https://10.0.6.5:2380"],"advertise-client-urls":["https://10.0.6.5:2379"],"listen-client-urls":["https://10.0.6.5:2379","https://127.0.0.1:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{"level":"info","ts":"2023-10-15T09:31:34.716Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"655.189µs"}
{"level":"info","ts":"2023-10-15T09:31:35.189Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":450049,"snapshot-size":"11 kB"}
{"level":"warn","ts":"2023-10-15T09:31:35.190Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":450049,"snapshot-file-path":"/var/lib/etcd/member/snap/000000000006de01.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2023-10-15T09:31:35.190Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
是由于恢复快照数据失败。查看其他两个节点是否也有同样报错。
幸运的是节点2的etcd启动正常
bash
[root@master02 snap]#docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
daa59123cbbf aebe758cef4c "etcd --advertise-cl..." 50 seconds ago Up 50 seconds k8s_etcd_etcd-master02.kktb.org_kube-system_3564c10547c1c588ab6c79bcad0e90d0_19
一些关键字日志
bash
[root@master02 snap]#docker logs -f daa59123cbbf
{"level":"info","ts":"2023-10-15T09:39:06.599Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"2.526769ms"}
{"level":"warn","ts":"2023-10-15T09:39:06.599Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000004-00000000000685b1.wal.broken"}
{"level":"info","ts":"2023-10-15T09:39:07.119Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":450048,"snapshot-size":"11 kB"}
{"level":"info","ts":"2023-10-15T09:39:07.119Z","caller":"etcdserver/server.go:521","msg":"recovered v3 backend from snapshot","backend-size-bytes":6635520,"backend-size":"6.6 MB","backend-size-in-use-bytes":2555904,"backend-size-in-use":"2.6 MB"}
{"level":"warn","ts":"2023-10-15T09:39:07.119Z","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000004-00000000000685b1.wal.broken"}
{"level":"info","ts":"2023-10-15T09:39:07.161Z","caller":"etcdserver/raft.go:483","msg":"restarting local member","cluster-id":"e3d7c840b56e7ccb","local-member-id":"77c05bfe945af1a1","commit-index":458027}
{"level":"info","ts":"2023-10-15T09:39:07.161Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"77c05bfe945af1a1 switched to configuration voters=(6311186738797484145 8628998035010679201 11257876798449243399)"}
{"level":"info","ts":"2023-10-15T09:39:07.161Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"77c05bfe945af1a1 became follower at term 16"}
{"level":"warn","ts":"2023-10-15T09:40:06.425Z","caller":"etcdhttp/metrics.go:86","msg":"/health error","output":"{\"health\":\"false\",\"reason\":\"RAFT NO LEADER\"}","status-code":503}
{"level":"warn","ts":"2023-10-15T09:40:07.179Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"5795d6e69d2e1871","rtt":"0s","error":"dial tcp 10.0.6.5:2380: connect: connection refused"}
{"level":"warn","ts":"2023-10-15T09:40:07.179Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"9c3c034d28add507","rtt":"0s","error":"dial tcp 10.0.6.7:2380: connect: connection refused"}
{"level":"warn","ts":"2023-10-15T09:40:07.179Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"9c3c034d28add507","rtt":"0s","error":"dial tcp 10.0.6.7:2380: connect: connection refused"}
{"level":"warn","ts":"2023-10-15T09:40:07.179Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"5795d6e69d2e1871","rtt":"0s","error":"dial tcp 10.0.6.5:2380: connect: connection refused"}
解决
解决问题的方法,直接将节点1的etcd数据目录文件夹移除,尝试将etcd2上面的数据拷贝过来
bash
# 先将节点1的数据文件移动至opt中
[root@master01 ~]#mv /var/lib/etcd/member /opt/
# 节点2拷贝数据到节点1
[root@master02 snap]#scp -r /var/lib/etcd/member 10.0.6.5:/var/lib/etcd/
etcd、api-server服务恢复
集群状态
bash
sh-5.1# etcdctl --endpoints=https://10.0.6.5:2379,https://10.0.6.6:2379,https://10.0.6.7:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member list --write-out=table
+------------------+---------+-------------------+-----------------------+-----------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+-------------------+-----------------------+-----------------------+------------+
| 5795d6e69d2e1871 | started | master01.kktb.org | https://10.0.6.5:2380 | https://10.0.6.5:2379 | false |
| 77c05bfe945af1a1 | started | master02.kktb.org | https://10.0.6.6:2380 | https://10.0.6.6:2379 | false |
| 9c3c034d28add507 | started | master03.kktb.org | https://10.0.6.7:2380 | https://10.0.6.7:2379 | false |
+------------------+---------+-------------------+-----------------------+-----------------------+------------+
kubectl命令使用正常
bash
[root@master01 ~]#kubectl get node
NAME STATUS ROLES AGE VERSION
master01.kktb.org Ready control-plane 22d v1.24.3
master02.kktb.org Ready control-plane 22d v1.24.3
master03.kktb.org Ready control-plane 22d v1.24.3