K8S集群中PLEG问题排查

一、背景

k8s集群排障真的很麻烦

今天集群有同事找我，节点报 PLEG is not healthy 集群中有的节点出现了NotReady，这是什么原因呢？

二、kubernetes源码分析

PLEG is not healthy 也是一个经常出现的问题

POD 生命周期事件生成器

先说下PLEG 这部分代码在kubelet 里，我们看一下在kubelet中的注释:

复制代码

// GenericPLEG is an extremely simple generic PLEG that relies solely on
// periodic listing to discover container changes. It should be used
// as temporary replacement for container runtimes do not support a proper
// event generator yet.
//
// Note that GenericPLEG assumes that a container would not be created,
// terminated, and garbage collected within one relist period. If such an
// incident happens, GenenricPLEG would miss all events regarding this
// container. In the case of relisting failure, the window may become longer.
// Note that this assumption is not unique -- many kubelet internal components
// rely on terminated containers as tombstones for bookkeeping purposes. The
// garbage collector is implemented to work with such situations. However, to
// guarantee that kubelet can handle missing container events, it is
// recommended to set the relist period short and have an auxiliary, longer
// periodic sync in kubelet as the safety net.
type GenericPLEG struct {
	// The period for relisting.
	relistPeriod time.Duration
	// The container runtime.
	runtime kubecontainer.Runtime
	// The channel from which the subscriber listens events.
	eventChannel chan *PodLifecycleEvent
	// The internal cache for pod/container information.
	podRecords podRecords
	// Time of the last relisting.
	relistTime atomic.Value
	// Cache for storing the runtime states required for syncing pods.
	cache kubecontainer.Cache
	// For testability.
	clock clock.Clock
	// Pods that failed to have their status retrieved during a relist. These pods will be
	// retried during the next relisting.
	podsToReinspect map[types.UID]*kubecontainer.Pod
}

也就是说kubelet 会定时把拉取pod 的列表，然后记录下结果。

运行代码后会执行一个定时任务，定时调用relist函数

复制代码

// Start spawns a goroutine to relist periodically.
func (g *GenericPLEG) Start() {
	go wait.Until(g.relist, g.relistPeriod, wait.NeverStop)
}

relist函数里关键代码:

复制代码

	// Get all the pods.
	podList, err := g.runtime.GetPods(true)
	if err != nil {
		klog.ErrorS(err, "GenericPLEG: Unable to retrieve pods")
		return
	}

	g.updateRelistTime(timestamp)

我们可以看到kubelet 定期调用 docker.sock 或者containerd.sock 去调用CRI 去拉取pod列表，然后更新下relist时间。

我们在看Health 函数，是被定时调用的健康检查处理函数：

复制代码

// Healthy check if PLEG work properly.
// relistThreshold is the maximum interval between two relist.
func (g *GenericPLEG) Healthy() (bool, error) {
	relistTime := g.getRelistTime()
	if relistTime.IsZero() {
		return false, fmt.Errorf("pleg has yet to be successful")
	}
	// Expose as metric so you can alert on `time()-pleg_last_seen_seconds > nn`
	metrics.PLEGLastSeen.Set(float64(relistTime.Unix()))
	elapsed := g.clock.Since(relistTime)
	if elapsed > relistThreshold {
		return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)
	}
	return true, nil
}

他是用当前时间减去 relist更新时间，得到的时间如果超过relistThreshold就代表可能不健康

复制代码

	// The threshold needs to be greater than the relisting period + the
	// relisting time, which can vary significantly. Set a conservative
	// threshold to avoid flipping between healthy and unhealthy.
	relistThreshold = 3 * time.Minute

进一步思考这个问题，我们就把问题锁定在了CRI 容器运行时的地方

三、锁定错误

这个问题出错的根源是在容器运行时超时，意味着dockerd 或者 contaienrd 出现故障，我们到那台机器上看到kubelet 的日志发现很多CRI 超时的不可用的日志

复制代码

Nov 02 13:41:43 app04 kubelet[8411]: E1102 13:41:43.111882    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.036729    8411 kubelet.go:2396] "Container runtime not ready" runtimeReady="RuntimeReady=false reason:DockerDaemonNotReady messag
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.112993    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.113027    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:44 app04 kubelet[8411]: E1102 13:41:44.113041    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114281    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114319    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.114335    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.344912    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.345214    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.345501    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:45 app04 kubelet[8411]: E1102 13:41:45.630715    8411 kubelet.go:2040] "Skipping pod synchronization" err="[container runtime is down, PLEG is not healthy: pleg was las
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115226    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115265    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:46 app04 kubelet[8411]: E1102 13:41:46.115280    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116608    8411 remote_runtime.go:351] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown des
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116647    8411 kuberuntime_sandbox.go:285] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = Cannot connect to
Nov 02 13:41:47 app04 kubelet[8411]: E1102 13:41:47.116667    8411 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = Cannot connect to the
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.081612    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.081611    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082134    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082201    8411 remote_runtime.go:673] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = Cannot con
Nov 02 13:41:48 app04 kubelet[8411]: E1102 13:41:48.082378    8411 remote_runtime.go:6

想办法重启运行时或者去排查containerd

复制代码

Nov 02 12:58:45 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:46 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:47 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:48 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:49 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:50 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:51 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:52 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:53 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:54 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:55 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:56 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:57 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:58 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:58:59 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:00 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:01 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:02 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:03 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Nov 02 12:59:04 app04 dockerd[8435]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s

发现是CRI 服务端接受太多套接字，导致accept 失败了，可以适当调大ulimit