We were notified in advance that the machines would be power-cycled, but I had run into power loss before and kubelet, docker, and the services on top of them usually came back up on their own, so I didn't think much of it and didn't shut the cluster down beforehand. To my complete surprise, after booting today I found the Master node was down:
[root@k8s-master wal]# kubectl get nodes
The connection to the server 192.168.113.152:6443 was refused - did you specify the right host or port?
[root@k8s-master wal]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Fri 2022-04-22 09:18:34 CST; 1h 4min ago
Docs: https://kubernetes.io/docs/
Main PID: 1017 (kubelet)
Tasks: 17 (limit: 75136)
Memory: 152.0M
CGroup: /system.slice/kubelet.service
└─1017 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-c>
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.289217 1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.390603 1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.491934 1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.593267 1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.693429 1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.793744 1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.894905 1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.995215 1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:56 k8s-master kubelet[1017]: E0421 22:22:56.095825 1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
4 月 22 10:22:56 k8s-master kubelet[1017]: E0421 22:22:56.196740 1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
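Since kube-apiserver and etcd run as static pods, their logs can still be pulled straight from the container runtime even while `kubectl` is unusable. A sketch of how to get at them on this docker-runtime cluster (`<container-id>` is a placeholder taken from the first command's output):

```shell
# List the control-plane containers, including exited ones
docker ps -a | grep -E 'kube-apiserver|etcd'
# Show timestamped logs for one of them
docker logs --timestamps <container-id>
```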
The kube-apiserver container log shows it cannot reach etcd on 127.0.0.1:2379:
2022-04-22T02:21:31.203199623Z I0422 02:21:31.202994 1 server.go:565] external host was not specified, using 192.168.113.152
2022-04-22T02:21:31.231588697Z I0422 02:21:31.231374 1 server.go:172] Version: v1.23.5
2022-04-22T02:21:31.389807666Z I0422 02:21:31.389666 1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
2022-04-22T02:21:31.390038398Z I0422 02:21:31.389913 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
2022-04-22T02:21:31.390047004Z I0422 02:21:31.389934 1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
2022-04-22T02:21:31.390767833Z I0422 02:21:31.390608 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
2022-04-22T02:21:31.390778978Z I0422 02:21:31.390639 1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
2022-04-22T02:21:31.392738574Z W0422 02:21:31.392578 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:32.391488027Z W0422 02:21:32.391315 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:32.394099214Z W0422 02:21:32.393918 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:33.393065852Z W0422 02:21:33.392633 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:33.835046849Z W0422 02:21:33.833992 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:35.158712152Z W0422 02:21:35.158257 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:36.424633514Z W0422 02:21:36.424302 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:38.117720953Z W0422 02:21:38.117282 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:41.173338381Z W0422 02:21:41.172823 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:42.894615034Z W0422 02:21:42.894174 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:47.511910137Z W0422 02:21:47.511585 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:50.459764076Z W0422 02:21:50.459450 1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
2022-04-22T02:21:51.393624485Z E0422 02:21:51.393263 1 run.go:74] "command failed" err="context deadline exceeded"
The etcd container log reveals the actual failure: the BoltDB data file is corrupted, so etcd panics on startup:
2022-04-22T02:26:23.131382248Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"etcdmain/etcd.go:72","msg":"Running: ","args":["etcd","--advertise-client-urls=https://192.168.113.152:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--initial-advertise-peer-urls=https://192.168.113.152:2380","--initial-cluster=k8s-master=https://192.168.113.152:2380","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://192.168.113.152:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://192.168.113.152:2380","--name=k8s-master","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]}
2022-04-22T02:26:23.131412789Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"etcdmain/etcd.go:115","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
2022-04-22T02:26:23.131416304Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.113.152:2380"]}
2022-04-22T02:26:23.131418664Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:478","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
2022-04-22T02:26:23.131421030Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.113.152:2379"]}
2022-04-22T02:26:23.131423955Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.1","git-sha":"e8732fb5f","go-version":"go1.16.3","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"k8s-master","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.113.152:2380"],"listen-peer-urls":["https://192.168.113.152:2380"],"advertise-client-urls":["https://192.168.113.152:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.113.152:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
2022-04-22T02:26:23.133621480Z panic: freepages: failed to get all reachable pages (page 478: multiple references)
2022-04-22T02:26:23.133635080Z
2022-04-22T02:26:23.133637664Z goroutine 120 [running]:
2022-04-22T02:26:23.133639420Z go.etcd.io/bbolt.(*DB).freepages.func2(0xc00047e540)
2022-04-22T02:26:23.133648190Z /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.etcd.io/[email protected]/db.go:1056 +0xe9
2022-04-22T02:26:23.133650552Z created by go.etcd.io/bbolt.(*DB).freepages
2022-04-22T02:26:23.133652184Z /home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.etcd.io/[email protected]/db.go:1054 +0x1cd
Restarting etcd didn't help either. After some searching, it turns out this is a known issue when running inside VMware: an unclean power-off can corrupt the etcd data files. RIP!
1
rrfeng 2022-04-22 10:41:43 +08:00 via Android
Restore from a previous snapshot.
This thing needs periodic offline backups too...
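A periodic offline backup like the one suggested here can be taken with `etcdctl snapshot save`. A hedged sketch, assuming etcdctl v3 is installed on the master and using this cluster's kubeadm certificate paths (the /backup path is made up):

```shell
#!/bin/sh
# Take a dated snapshot of etcd; run this from cron, e.g. daily.
ETCDCTL_API=3 etcdctl snapshot save "/backup/etcd-$(date +%F).db" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Sanity-check that the snapshot file is readable
ETCDCTL_API=3 etcdctl snapshot status "/backup/etcd-$(date +%F).db"
```

Copying the snapshot off the VM matters here, since the failure mode was the VM's own disk state.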
2
seagull7558 OP @rrfeng Do you mean deleting the newest .snap file under the /var/lib/etcd/member/snap directory?
3
FullBridgeRect 2022-04-22 10:57:26 +08:00
Is etcd set up for high availability? As long as you're within the fault-tolerance limit, you can just bring up a new node and rebuild, right?
4
seagull7558 OP @FullBridgeRect No HA. It's just a dev environment set up to simulate our configuration. I admit I was lazy and skipped HA, and skipped backups too.
5
FullBridgeRect 2022-04-22 11:06:34 +08:00
@seagull7558 Then it's gone 😑
6
disk 2022-04-22 11:14:54 +08:00
Back everything up first, then delete the *.wal or *.snap files and see if it starts. If not, see whether you can recover from the newest .snap. If that fails too, it's not worth trying to recover any further; the cost is too high.
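The "back up first" step in #6 can be done by copying the whole data directory aside, so every experiment stays reversible. A minimal sketch using this cluster's data dir (the destination paths are made up):

```shell
# Preserve the damaged state before touching anything
cp -a /var/lib/etcd "/var/lib/etcd.bak-$(date +%F)"
# Move (rather than delete) the WAL files aside, then see whether etcd starts
mv /var/lib/etcd/member/wal /root/etcd-wal-moved
```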
7
seagull7558 OP @disk I tried deleting the wal and snap directories outright; that loses all of the K8S cluster's configuration.
8
zhaoyeye 2022-04-23 00:07:41 +08:00 via Android
@seagull7558 I restored from a snapshot just a couple of days ago. Plenty of pitfalls along the way, anyway.
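For reference, the restore path suggested in #1 and #8 looks roughly like this on a single-master kubeadm cluster. A hedged sketch, assuming a snapshot file exists at /backup/etcd-snap.db (hypothetical path) and reusing the member name and peer URL from the etcd startup log above:

```shell
# Rebuild a data dir from the snapshot
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snap.db \
  --name k8s-master \
  --initial-cluster k8s-master=https://192.168.113.152:2380 \
  --initial-advertise-peer-urls https://192.168.113.152:2380 \
  --data-dir /var/lib/etcd-restored
# Swap it into place; the etcd static pod mounts /var/lib/etcd
mv /var/lib/etcd /var/lib/etcd.broken
mv /var/lib/etcd-restored /var/lib/etcd
```

Only state captured in the snapshot comes back; anything written afterwards is lost.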
9
stevefan1999 2022-07-17 09:10:31 +08:00 via Android
etcd uses BoltDB underneath, so in theory no data recovery should ever be needed. If the cluster has other HA masters, just remove the broken node entirely and rebuild it; failing that, find a backup.