Troubleshooting PLEG Problems in a K8s Cluster

1. Background

Troubleshooting a k8s cluster can be a real pain.

Today a colleague came to me about the cluster: a node was reporting PLEG is not healthy, and some nodes in the cluster had gone NotReady. What causes this?

2. Kubernetes source code analysis

PLEG is not healthy is a problem that shows up fairly often.

PLEG stands for Pod Lifecycle Event Generator.

First, the PLEG code lives in the kubelet. Here is the comment from the kubelet source:

// GenericPLEG is an extremely simple generic PLEG that relies solely on
// periodic listing to discover container changes. It should be used
// as temporary replacement for container runtimes do not support a proper
// event generator yet.
//
// Note that GenericPLEG assumes that a container would not be created,
// terminated, and garbage collected within one relist period. If such an
// incident happens, GenenricPLEG would miss all events regarding this
// container. In the case of relisting failure, the window may become longer.
// Note that this assumption is not unique -- many kubelet internal components
// rely on terminated containers as tombstones for bookkeeping purposes. The
// garbage collector is implemented to work with such situations. However, to
// guarantee that kubelet can handle missing container events, it is
// recommended to set the relist period short and have an auxiliary, longer
// periodic sync in kubelet as the safety net.
type GenericPLEG struct {
	// The period for relisting.
	relistPeriod time.Duration
	// The container runtime.
	runtime kubecontainer.Runtime
	// The channel from which the subscriber listens events.
	eventChannel chan *PodLifecycleEvent
	// The internal cache for pod/container information.
	podRecords podRecords
	// Time of the last relisting.
	relistTime atomic.Value
	// Cache for storing the runtime states required for syncing pods.
	cache kubecontainer.Cache
	// For testability.
	clock clock.Clock
	// Pods that failed to have their status retrieved during a relist. These pods will be
	// retried during the next relisting.
	podsToReinspect map[types.UID]*kubecontainer.Pod
}

In other words, the kubelet periodically lists the pods and records the result.

Once started, PLEG runs a periodic task that calls the relist function at a fixed interval.

// Start spawns a goroutine to relist periodically.
func (g *GenericPLEG) Start() {
	go wait.Until(g.relist, g.relistPeriod, wait.NeverStop)
}

The key code in the relist function:

	// Get all the pods.
	podList, err := g.runtime.GetPods(true)
	if err != nil {
		klog.ErrorS(err, "GenericPLEG: Unable to retrieve pods")
		return
	}

	g.updateRelistTime(timestamp)

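We can see that the kubelet periodically goes through the CRI (over docker.sock or containerd.sock) to list the pods, and then updates the relist time. Note that when the call fails, relist returns early and the relist time is not updated.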

Next, look at the Healthy function, the health-check handler that is called periodically:

// Healthy check if PLEG work properly.
// relistThreshold is the maximum interval between two relist.
func (g *GenericPLEG) Healthy() (bool, error) {
	relistTime := g.getRelistTime()
	if relistTime.IsZero() {
		return false, fmt.Errorf("pleg has yet to be successful")
	}
	// Expose as metric so you can alert on `time()-pleg_last_seen_seconds > nn`
	metrics.PLEGLastSeen.Set(float64(relistTime.Unix()))
	elapsed := g.clock.Since(relistTime)
	if elapsed > relistThreshold {
		return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)
	}
	return true, nil
}

It subtracts the last relist time from the current time. If the result exceeds relistThreshold, PLEG is reported as unhealthy.

	// The threshold needs to be greater than the relisting period + the
	// relisting time, which can vary significantly. Set a conservative
	// threshold to avoid flipping between healthy and unhealthy.
	relistThreshold = 3 * time.Minute

Thinking this through one step further, the problem is narrowed down to the CRI container runtime.

3. Pinpointing the error

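The root cause of this problem is the container runtime timing out, which means dockerd or containerd is malfunctioning. On the affected machine, the kubelet log was full of messages about the CRI timing out or being unavailable: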

Nov 02 13:41:43 app04 kubelet[8411]: E1102 13:41:43.111882    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.036729    8411 kubelet.go:2396] "Container runtime not ready" runtimeReady="RuntimeReady=false reason:DockerDaemonNotReady messag
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.112993    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.113027    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.113041    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114281    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114319    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114335    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.344912    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.345214    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.345501    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.630715    8411 kubelet.go:2040] "Skipping pod synchronization" err="[container runtime is down, PLEG is not healthy: pleg was las
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115226    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115265    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115280    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116608    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116647    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116667    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.081612    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.081611    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082134    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082201    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082378    8411 remote_runtime.go:6

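The options are to restart the runtime or to investigate the runtime daemon itself. Here is the dockerd log from the same machine: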

Nov 02 12:58:45 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:46 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:47 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:48 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:49 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:50 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:51 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:52 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:53 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:54 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:55 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:56 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:57 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:58 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:59 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:00 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:01 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:02 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:03 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:04 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s

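It turned out that the CRI server side (dockerd) was holding too many open sockets and had hit its open-file limit, so accept failed. Raising the ulimit for open files appropriately fixes this.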
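
Before raising the limit, it helps to confirm how close the daemon is to it. The following sketch reads the soft "Max open files" limit from `/proc/<pid>/limits` and counts the entries in `/proc/<pid>/fd` on Linux. The helper `parseOpenFileLimit` and the command-line usage are assumptions made for illustration, and reading another process's `fd` directory usually requires root.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseOpenFileLimit extracts the soft "Max open files" value from the
// contents of /proc/<pid>/limits.
func parseOpenFileLimit(limits string) (uint64, error) {
	for _, line := range strings.Split(limits, "\n") {
		if !strings.HasPrefix(line, "Max open files") {
			continue
		}
		// Fields: Max open files <soft> <hard> files
		fields := strings.Fields(line)
		if len(fields) < 4 {
			return 0, fmt.Errorf("unexpected limits line: %q", line)
		}
		return strconv.ParseUint(fields[3], 10, 64)
	}
	return 0, errors.New("no \"Max open files\" line found")
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcheck <pid>")
		os.Exit(2)
	}
	procDir := "/proc/" + os.Args[1]

	raw, err := os.ReadFile(procDir + "/limits")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read limits:", err)
		os.Exit(1)
	}
	limit, err := parseOpenFileLimit(string(raw))
	if err != nil {
		fmt.Fprintln(os.Stderr, "parse limits:", err)
		os.Exit(1)
	}

	// Each entry in /proc/<pid>/fd is one open file descriptor.
	entries, err := os.ReadDir(procDir + "/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read fd dir:", err)
		os.Exit(1)
	}
	fmt.Printf("open fds: %d, soft limit: %d\n", len(entries), limit)
}
```

For the incident above this would be run against the dockerd PID shown in the log (8435). If the count sits at the limit, the "too many open files" diagnosis is confirmed. It is also worth finding out what is holding those descriptors open, since a higher limit only buys time if something is leaking connections.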
