Fixing a [KubeSphere] login failure

Version info

  • KubeSphere: v3.1.1
  • k8s: v1.20.6
  • OS: CentOS Linux release 7.9.2009 (Core)

Symptoms

The KubeSphere console cannot be logged into.
The ks-apiserver pods are in CrashLoopBackOff:

```shell
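# NOTE: the original capture omitted its command; presumably something like:
#   kubectl get po -A | grep ks-apiserver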
kubesphere-system              ks-apiserver-64f5ffb787-5jpxx                  0/1     CrashLoopBackOff   7          12m
kubesphere-system              ks-apiserver-64f5ffb787-6kp2m                  0/1     CrashLoopBackOff   7          12m
kubesphere-system              ks-apiserver-64f5ffb787-vg5h9                  0/1     CrashLoopBackOff   7          12m
```

ks-apiserver events & logs

```shell
$ kubectl describe  -n kubesphere-system po ks-apiserver-64f5ffb787-vg5h9
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  2m4s                default-scheduler  Successfully assigned kubesphere-system/ks-apiserver-64f5ffb787-vg5h9 to master1
  Normal   Pulling    2m3s                kubelet            Pulling image "registry.cn-beijing.aliyuncs.com/kubesphereio/ks-apiserver:v3.1.1"
  Normal   Pulled     99s                 kubelet            Successfully pulled image "registry.cn-beijing.aliyuncs.com/kubesphereio/ks-apiserver:v3.1.1" in 23.839356408s
  Normal   Created    44s (x4 over 99s)   kubelet            Created container ks-apiserver
  Normal   Started    44s (x4 over 99s)   kubelet            Started container ks-apiserver
  Warning  BackOff    15s (x11 over 97s)  kubelet            Back-off restarting failed container
  Normal   Pulled     1s (x4 over 98s)    kubelet            Container image "registry.cn-beijing.aliyuncs.com/kubesphereio/ks-apiserver:v3.1.1" already present on machine


$ kubectl logs ks-apiserver-64f5ffb787-vg5h9 -n kubesphere-system
W1013 09:42:09.264572       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
W1013 09:42:09.266720       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
Error: failed to connect to redis service, please check redis status, error: EOF
2023/10/13 09:42:09 failed to connect to redis service, please check redis status, error: EOF
```
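
The error itself says ks-apiserver could not reach its redis backend. Before digging into redis-ha, here is a quick sketch for probing that dependency directly (the Service and ConfigMap names are what a default KubeSphere v3.1.x install creates; verify against your cluster):

```shell
# Does the redis Service exist, and does it still have endpoints?
kubectl -n kubesphere-system get svc redis
kubectl -n kubesphere-system get endpoints redis

# ks-apiserver reads its redis address from the kubesphere-config ConfigMap:
kubectl -n kubesphere-system get cm kubesphere-config -o yaml | grep -B1 -A3 redis
```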

Tracing the cause

redis-ha status

```shell
kubesphere-system              redis-ha-haproxy-7cdc76dff9-gbh5b              1/1     Running            85         171d
kubesphere-system              redis-ha-haproxy-7cdc76dff9-wc86v              1/1     Running            0          34h
kubesphere-system              redis-ha-haproxy-7cdc76dff9-xw28x              1/1     Running            18         158d
kubesphere-system              redis-ha-server-0                              2/2     Running            29         171d
kubesphere-system              redis-ha-server-1                              2/2     Running            0          9m58s
kubesphere-system              redis-ha-server-2                              2/2     Running            0          9m53s
```
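
The restart counters above (85 on one haproxy pod, 29 on redis-ha-server-0) hint at earlier instability. A minimal sketch for checking why a container restarted, using plain kubectl:

```shell
# Exit code and reason of the previous container instance:
kubectl -n kubesphere-system describe po redis-ha-server-0 | grep -A5 'Last State'

# Logs of the previous (crashed) redis container:
kubectl -n kubesphere-system logs redis-ha-server-0 -c redis --previous
```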

redis-ha logs

```shell
$ kubectl -n kubesphere-system logs -l app=redis-ha-haproxy
[WARNING] 285/000146 (8) : Server check_if_redis_is_master_2/R2 is DOWN, reason: Layer4 connection problem, info: "Connection refused at step 1 of tcp-check (connect)", check duration: 2ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 285/000146 (8) : Server bk_redis_master/R2 is DOWN, reason: Layer4 connection problem, info: "Connection refused at step 1 of tcp-check (connect)", check duration: 2ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 285/000147 (8) : Server check_if_redis_is_master_0/R0 is DOWN, reason: Layer7 timeout, info: " at step 5 of tcp-check (expect string '10.233.62.231')", check duration: 1000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 285/000147 (8) : backend 'check_if_redis_is_master_0' has no server available!
[WARNING] 285/000147 (8) : Server check_if_redis_is_master_1/R0 is DOWN, reason: Layer7 timeout, info: " at step 5 of tcp-check (expect string '10.233.30.237')", check duration: 1000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 285/000147 (8) : backend 'check_if_redis_is_master_1' has no server available!
[WARNING] 285/000147 (8) : Server bk_redis_master/R0 is DOWN, reason: Layer7 timeout, info: " at step 5 of tcp-check (expect string 'role:master')", check duration: 1000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 285/000147 (8) : backend 'bk_redis_master' has no server available!
[WARNING] 285/013414 (8) : Server check_if_redis_is_master_0/R1 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/013423 (8) : Server check_if_redis_is_master_0/R2 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[ALERT] 284/235728 (8) : backend 'check_if_redis_is_master_2' has no server available!
[WARNING] 284/235905 (8) : Server check_if_redis_is_master_2/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 284/235915 (8) : Server check_if_redis_is_master_2/R0 is DOWN, reason: Layer7 timeout, info: " at step 2 of tcp-check (send)", check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 284/235915 (8) : backend 'check_if_redis_is_master_2' has no server available!
[WARNING] 285/000013 (8) : Server check_if_redis_is_master_2/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/000023 (8) : Server check_if_redis_is_master_2/R0 is DOWN, reason: Layer7 timeout, info: " at step 2 of tcp-check (send)", check duration: 1000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 285/000023 (8) : backend 'check_if_redis_is_master_2' has no server available!
[WARNING] 285/000133 (8) : Server check_if_redis_is_master_2/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/013414 (8) : Server check_if_redis_is_master_0/R1 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/013422 (8) : Server check_if_redis_is_master_0/R2 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[ALERT] 284/235728 (8) : backend 'check_if_redis_is_master_2' has no server available!
[WARNING] 284/235905 (8) : Server check_if_redis_is_master_2/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 284/235915 (8) : Server check_if_redis_is_master_2/R0 is DOWN, reason: Layer7 timeout, info: " at step 2 of tcp-check (send)", check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 284/235915 (8) : backend 'check_if_redis_is_master_2' has no server available!
[WARNING] 285/000013 (8) : Server check_if_redis_is_master_2/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/000023 (8) : Server check_if_redis_is_master_2/R0 is DOWN, reason: Layer7 timeout, info: " at step 2 of tcp-check (send)", check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 285/000023 (8) : backend 'check_if_redis_is_master_2' has no server available!
[WARNING] 285/000133 (8) : Server check_if_redis_is_master_2/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 2ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/013414 (8) : Server check_if_redis_is_master_0/R1 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 285/013423 (8) : Server check_if_redis_is_master_0/R2 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 1ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

$ kubectl -n kubesphere-system  exec -it redis-ha-server-0 redis-cli info replication
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulting container name to redis.
Use 'kubectl describe pod/redis-ha-server-0 -n kubesphere-system' to see all of the containers in this pod.
# Replication
role:slave
master_host:10.233.60.98
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:1698241
master_link_down_since_seconds:1697161808
slave_priority:100
slave_read_only:1
connected_slaves:0
min_slaves_good_slaves:0
master_replid:8321092349f590a2cc6603e90bb214fb2cfdc74f
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1698241
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

$ kubectl -n kubesphere-system  exec -it redis-ha-server-0 -- sh -c 'for i in `seq 0 2`; do nc -vz redis-ha-server-$i.redis-ha.kubesphere-system.svc 6379; done'
Defaulting container name to redis.
Use 'kubectl describe pod/redis-ha-server-0 -n kubesphere-system' to see all of the containers in this pod.
redis-ha-server-0.redis-ha.kubesphere-system.svc (10.233.99.40:6379) open
redis-ha-server-1.redis-ha.kubesphere-system.svc (10.233.98.59:6379) open
redis-ha-server-2.redis-ha.kubesphere-system.svc (10.233.97.45:6379) open
[root@master1 ~]# kubectl -n kubesphere-system logs -l app=redis-ha
error: a container name must be specified for pod redis-ha-server-0, choose one of: [redis sentinel] or one of the init containers: [config-init]
[root@master1 ~]# kubectl -n kubesphere-system logs -l app=redis-ha-server-0
[root@master1 ~]# kubectl -n kubesphere-controls-system exec -it `kubectl -n kubesphere-controls-system get po -l kubesphere.io/username=admin -o jsonpath="{.items[0].metadata.name}"` -- sh -c 'nc -vz redis.kubesphere-system.svc:6379'
redis.kubesphere-system.svc:6379 (10.233.6.110:6379) open
```
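
One check that was not run at the time but would likely have been telling: ask the sentinels which master they are advertising and compare it with the live pod IPs. The sketch below assumes the redis-ha chart's default master group name, mymaster; the sentinel container name comes from the error output above:

```shell
# Which master do the sentinels currently advertise?
kubectl -n kubesphere-system exec redis-ha-server-0 -c sentinel -- \
  redis-cli -p 26379 sentinel get-master-addr-by-name mymaster

# Compare against the actual pod IPs:
kubectl -n kubesphere-system get po -l app=redis-ha -o wide
```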

redis-ha intervention

At this point redis-ha looks perfectly healthy, yet ks-apiserver apparently still cannot use it. (In hindsight, the info replication output above is suspicious: redis-ha-server-0 reports role:slave with master_link_status:down, and its master_host 10.233.60.98 matches none of the three live server IPs, so the replication topology may still have pointed at a stale master.) The cheap next move was simply restarting redis-ha:

```shell
$ kubectl delete pods -n kubesphere-system -l app=redis-ha
```

It turned out to make no difference at all.

ks-apiserver intervention

```shell
$ kubectl delete pods -n kubesphere-system -l app=ks-apiserver
```

Again, no effect.

kubelet intervention

By this point the plan was to go through the kubelet logs, but with the boss pressing for a quick fix, a straight restart seemed worth trying first:

```shell
# Run on every master node
$ systemctl restart kubelet.service
```
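
After kubelet comes back on each master, confirming node and pod health is cheap:

```shell
kubectl get nodes
kubectl -n kubesphere-system get po -w
```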

This time two of the three ks-apiserver pods recovered immediately and the third was still coming up; after restarting ks-apiserver once more, everything was fully back to normal.
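
For the record, the second ks-apiserver restart here was the same label-based pod deletion as before; an equivalent that goes through the Deployment controller also works (the Deployment name matches the pod prefix above):

```shell
$ kubectl -n kubesphere-system rollout restart deployment ks-apiserver
$ kubectl -n kubesphere-system rollout status deployment ks-apiserver
```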

ks-apiserver in newer versions

For what it's worth, the newer KubeSphere v3.4.0 no longer depends on redis-ha at all, so this fragility was presumably a bug:

```shell
$ kubectl -n kubesphere-system get po 
NAME                                    READY   STATUS    RESTARTS      AGE
ks-apiserver-bc897dfdf-l59k7            1/1     Running   0             14d
ks-console-6c64b9cb79-m7n2z             1/1     Running   1 (18d ago)   27d
ks-controller-manager-8f5855c98-kd7kg   1/1     Running   4 (18d ago)   27d
ks-installer-5b59ccbf49-6d745           1/1     Running   1 (18d ago)   27d
```
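
To double-check that nothing redis-related remains on a v3.4.0 cluster (assuming the old labels and Service name were simply dropped):

```shell
$ kubectl -n kubesphere-system get po -l app=redis-ha
$ kubectl -n kubesphere-system get svc redis
```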

Inspecting the kubelet logs

The logs were fed straight to a VS Code plugin (GitHub Copilot) for analysis:

```text
GitHub Copilot: Based on the log entries in kubelet.logs, the following events stand out:

1. Reflector ListAndWatch failed: the Secret kubesphere-token-gbwkf could not be fetched, with the error "an error on the server ("") has prevented the request from succeeding (get secrets)".
2. Events could not be posted: "Post "https://lb.kubesphere.local:6443/api/v1/namespaces/kube-system/events": read tcp xxxxxx:58772->xxxxxx:6443: use of closed network connection".
3. Lease renewal failed: "Operation cannot be fulfilled on leases.coordination.k8s.io "master1": the object has been modified; please apply your changes to the latest version and try again".
4. The kubelet service stopped.
5. The kubelet service started again, with some warnings, e.g. that certain GA features will be removed in a future release.

None of this points to a definitive cause, but the kubelet clearly had problems around both its stop and its start, possibly due to network or configuration issues. Further inspection of the kubelet configuration and the network environment is recommended.
```
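
For reference, the kubelet.logs file fed to the plugin can be captured with standard systemd tooling (the time window below is illustrative):

```shell
# Collect kubelet logs from the incident window on each master:
journalctl -u kubelet --since "2023-10-13 09:00" --until "2023-10-13 12:00" \
  -o short-precise > kubelet.logs
```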

Wrap-up

1. No definitive root cause was found. The likely story: a cluster network fault broke redis-ha; redis-ha then recovered on its own, but ks-apiserver could not, and manual restarts of kubelet and ks-apiserver were needed to resolve it.

2. The logs showed no obvious anomaly (or the search approach was off); the fix was essentially experience plus luck. If anyone has a better troubleshooting approach, please share it.

3. As for why such an old version is still running: legacy baggage, as these things go. A migration is supposed to happen eventually, but when is anyone's guess.
