Background
A few days ago, a colleague from QA came to me: while testing another feature, he had noticed that a custom-resource controller I wrote kept restarting, and asked me to look into it.
The scene looked roughly like this:
```
$ kubectl -n my-namespace get pod
NAME              READY   STATUS    RESTARTS     AGE
my-contorller-1   1/1     Running   2 (2m ago)   3h15m
my-contorller-2   1/1     Running   1 (5m ago)   3h15m
```
After some digging, I located the crash log:
```
"setup: problem running manager" err="leader election lost"
```
That settled the immediate cause: a failure in the controller's leader-election mechanism crashed the pod and made it restart.
Which raises the next questions: why did the election mechanism fail, and why does that crash the pod outright?
Reconstructing the Incident
The problem logs

Logs from mycontroller.

Fetching the configmap took 5285 ms:
```
I1108 21:05:36.894625 1 round_trippers.go:466] curl -v -XGET 'https://10.192.0.1/api/v1/namespaces/mynamespace1/configmaps/yufatang'
I1108 21:05:42.179809 1 round_trippers.go:570] HTTP Statistics: GetConnection 0 ms ServerProcessing 5284 ms Duration 5285 ms
```
Fetching the lease then got only the leftover budget (roughly 10000 − 5285 ms); after 4721 ms the budget was exhausted and the context hit its deadline:
```
I1108 21:05:42.180257 1 round_trippers.go:466] curl -v -XGET 'https://10.192.0.1/apis/coordination.k8s.io/v1/namespaces/mynamespace1/leases/yufatang'
I1108 21:05:46.895662 1 round_trippers.go:570] HTTP Statistics: GetConnection 0 ms Duration 4721 ms
E1108 21:05:46.895924 1 leaderelection.go:330] error retrieving resource lock mynamespace1/yufatang: Get "https://10.192.0.1/apis/coordination.k8s.io/v1/namespaces/mynamespace1/leases/yufatang": context deadline exceeded
I1108 21:05:46.895977 1 leaderelection.go:283] failed to renew lease mynamespace1/yufatang: timed out waiting for the condition
```
On the context timeout, the process crashed and exited, and the pod restarted:
```
E1108 21:05:46.896140 1 logr.go:279] "setup: problem running manager" err="leader election lost"
```
The corresponding APIServer logs.

For the configmap GET above:
```
I1108 21:05:42.178452 1 trace.go:205] Trace[701725521]: "Get" url:/api/v1/namespaces/mynamespace1/configmaps/yufatang,user-agent:operator/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election,audit-id:322792b4-c839-4626-b1c7-5219d934d45d,client:10.244.132.9,accept:application/json, */*,protocol:HTTP/2.0 (08-Nov-2023 21:05:36.897) (total time: 5280ms):
Trace[701725521]: ---"About to write a response" 5280ms (21:05:42.178)
Trace[701725521]: [5.280607945s] [5.280607945s] END
```
For the lease GET above:
```
I1108 21:05:46.899793 1 trace.go:205] Trace[1216648764]: "Get" url:/apis/coordination.k8s.io/v1/namespaces/mynamespace1/leases/yufatang,user-agent:operator/v0.0.0 (linux/amd64) kubernetes/$Format/leader-election,audit-id:e956aef9-4a3c-4a0e-9d41-9836b8440b96,client:10.244.132.9,accept:application/json, */*,protocol:HTTP/2.0 (08-Nov-2023 21:05:42.181) (total time: 4717ms):
Trace[1216648764]: [4.717883864s] [4.717883864s] END
E1108 21:05:46.903204 1 timeout.go:141] post-timeout activity - time-elapsed: 7.29645ms, GET "/apis/coordination.k8s.io/v1/namespaces/mynamespace1/leases/yufatang" result: <nil>
E1108 21:05:46.905219 1 writers.go:117] apiserver was unable to write a JSON response: http: Handler timeout
E1108 21:05:46.905552 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http: Handler timeout"}: http: Handler timeout
```
The controller logs show two consecutive slow requests; the second one hit a context deadline, and then the program crashed.
The key detail is that at the moment of the timeout, the two requests' durations summed to almost exactly 10 s. That number is too neat to be a coincidence, so I suspected some duration parameter was bounding the context for the election action. Going back to the leader-election options used when creating the manager, I found these:
```go
// LeaseDuration is the duration that non-leader candidates will
// wait to force acquire leadership. This is measured against time of
// last observed ack. Default is 15 seconds.
LeaseDuration *time.Duration

// RenewDeadline is the duration that the acting controlplane will retry
// refreshing leadership before giving up. Default is 10 seconds.
RenewDeadline *time.Duration

// RetryPeriod is the duration the LeaderElector clients should wait
// between tries of actions. Default is 2 seconds.
RetryPeriod *time.Duration
```
As the comment says, RenewDeadline is how long the acting control plane keeps retrying to refresh its leadership before giving up, defaulting to 10 s.
Case closed: 10 s is exactly the budget the crash log pointed at.
How do we fix it?

My first instinct was to increase this parameter. Not being entirely sure, I searched around:
plenty of people have hit this problem, and the suggested fix is almost always to raise these timeouts. At this point the surface-level questions (why it crashed, how to fix it) are answered.
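As a sketch, the durations can be passed in when creating the manager, since manager.Options exposes all three election durations as pointer fields. The concrete values below are illustrative assumptions, not recommendations; whatever you pick must keep LeaseDuration > RenewDeadline > RetryPeriod:

```go
leaseDuration := 60 * time.Second
renewDeadline := 40 * time.Second
retryPeriod := 5 * time.Second

mgr, err := manager.New(config.GetConfigOrDie(), manager.Options{
	LeaderElection:   true,
	LeaderElectionID: "yufatang",
	LeaseDuration:    &leaseDuration,
	RenewDeadline:    &renewDeadline,
	RetryPeriod:      &retryPeriod,
})
```

A longer RenewDeadline gives slow apiserver round-trips more headroom, at the cost of slower failover when a leader actually dies.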
Still, bumping a timeout feels like a stopgap, and it begs the question: why does raising this parameter help? How does the mechanism actually work?
Let's dig into that question alongside the source code.
Leader-Election Fundamentals

First, a rough outline of how controller leader election works.
In the design of the k8s.io/client-go package, controller leader election is essentially multiple pods competing for a resource lock - you can think of it as a distributed lock.
What, distributed locks aren't familiar? No worries - I'll cover them in a dedicated post another time.
Back to the point: how does Kubernetes build this distributed lock?
The diagram below sketches the election mechanism discussed in this post, using a two-replica pod as the example:

(Figure: leader-election flow diagram)
The actions in the diagram:
- At startup, Pod A grabs the lease first and sets itself as leader, then refreshes it every two seconds
- Pod B fails to get the lease, so it retries every two seconds, trying to acquire it
- One of Pod A's election attempts times out (10 s by default), and it crashes and restarts
- After a few more fetches, Pod B sees that the lease information hasn't changed and its term has expired, so it takes the lease and sets itself as leader
There are a few points worth unpacking, one by one.

How is "grabbing the lease first" decided?

Every Kubernetes resource carries a resourceVersion. The orange lines in the diagram are update requests, and updating the lease requires carrying that version number.
When two freshly started processes both try to update the lease at version 1, their requests necessarily arrive at different times; the later request carries a resourceVersion <= the current one, so the apiserver rejects it. That's how the first leader emerges.
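The optimistic-concurrency idea behind resourceVersion can be sketched with plain Go. This is a toy in-memory store, not the apiserver's actual code; the names are made up for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// Lease is a toy record guarded by an optimistic version number,
// mimicking a Kubernetes object's resourceVersion.
type Lease struct {
	Holder          string
	ResourceVersion int
}

// store mimics the apiserver: an update succeeds only if the caller's
// version matches the stored one; otherwise it is rejected as a conflict.
type store struct{ obj Lease }

var errConflict = errors.New("conflict: resourceVersion mismatch")

func (s *store) Update(l Lease) error {
	if l.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	l.ResourceVersion++ // the server bumps the version on a successful write
	s.obj = l
	return nil
}

func main() {
	s := &store{obj: Lease{ResourceVersion: 1}}
	// Pod A and Pod B both observed version 1 and race to claim the lease.
	a := Lease{Holder: "pod-a", ResourceVersion: 1}
	b := Lease{Holder: "pod-b", ResourceVersion: 1}
	fmt.Println("pod-a update:", s.Update(a)) // first write wins, version becomes 2
	fmt.Println("pod-b update:", s.Update(b)) // stale version 1 is rejected
	fmt.Println("leader:", s.obj.Holder)      // pod-a
}
```

The first write wins and bumps the version; the loser's stale write is rejected, so exactly one leader emerges no matter how close the race is.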
How is the term computed, and how is expiry detected?

The LeaderElector object has a field observedTime, a locally kept timestamp recording when the lease was last observed:
```go
// Source: https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/leaderelection.go#L181
// LeaderElector is a leader election client.
type LeaderElector struct {
	...
	observedTime time.Time
}
```
Whenever the fetched lease information changes, this field is set to the current time and kept locally.
Expiry is then judged by adding the lease's LeaseDuration to that timestamp and comparing against the current time:
```go
// Source: https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/leaderelection.go#L319
if len(oldLeaderElectionRecord.HolderIdentity) > 0 &&
	le.observedTime.Add(time.Second*time.Duration(oldLeaderElectionRecord.LeaseDurationSeconds)).After(now.Time) &&
	!le.IsLeader() {
	klog.V(4).Infof("lock is held by %v and has not yet expired", oldLeaderElectionRecord.HolderIdentity)
	return false
}
```
So in a scenario like Pod A's restart in the diagram, Pod B keeps fetching the lease and sees no change, which means observedTime never advances; once observedTime plus the lease duration falls behind the current time, the lease is judged expired and Pod B sets itself as leader.
Why does a context timeout crash the program? Why not back off and retry the election?

These questions were asked in that order, but they are best answered in reverse:
- Backing off and retrying the election does happen

In the source, this logic is a loop governed by a context:
```go
// Source: https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/leaderelection.go#L245
// acquire loops calling tryAcquireOrRenew and returns true immediately when tryAcquireOrRenew succeeds.
// Returns false if ctx signals done.
func (le *LeaderElector) acquire(ctx context.Context) bool {
	...
	wait.JitterUntil(func() {
		succeeded = le.tryAcquireOrRenew(ctx)
		le.maybeReportTransition()
		if !succeeded {
			klog.V(4).Infof("failed to acquire lease %v", desc)
			return
		}
		le.config.Lock.RecordEvent("became leader")
		le.metrics.leaderOn(le.config.Name)
		klog.Infof("successfully acquired lease %v", desc)
		cancel()
	}, le.config.RetryPeriod, JitterFactor, true, ctx.Done())
}
```
Note the second and last arguments to wait.JitterUntil: the interval between failed retries, and the stop channel that ends the loop.
- Why is the outcome of a context timeout a crash?

Something has to end it eventually. In the source, this design shows up here:
```go
// Source: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/manager/internal.go#L560
OnStoppedLeading: func() {
	if cm.onStoppedLeading != nil {
		cm.onStoppedLeading()
	}
	// Make sure graceful shutdown is skipped if we lost the leader lock without
	// intending to.
	cm.gracefulShutdownTimeout = time.Duration(0)
	// Most implementations of leader election log.Fatal() here.
	// Since Start is wrapped in log.Fatal when called, we can just return
	// an error here which will cause the program to exit.
	cm.errChan <- errors.New("leader election lost")
},
```
This callback is registered when the election client starts, and runs when leadership is lost. You can think of it as a graceful exit, even if it doesn't look particularly graceful.
That covers the core of the election mechanism. Next, let's go straight to the source and walk through the whole path from controller initialization to election.
Given a controller manager declared in the main process like this, how does it start up, step by step?
```go
mgr, err := manager.New(config.GetConfigOrDie(), manager.Options{
	LeaderElection:   true,
	LeaderElectionID: "yufatang",
})
```
Source Walkthrough

Every source snippet in this post links to its exact location; copy the link if you want to verify anything.
Below is the manager's start function. Its receiver is controllerManager, which is responsible for starting the various components - webhooks, HTTPS servers, leader election, and so on - all of which belong to an object called runnables.
As the source comment says, Start returns only when the context is cancelled or when a component errors out, which is exactly the crash-on-election-failure behavior described above:
```go
// Source: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/manager/internal.go#L318
// Start starts the manager and waits indefinitely.
// There is only two ways to have start return:
// An error has occurred during in one of the internal operations,
// such as leader election, cache start, webhooks, and so on.
// Or, the context is cancelled.
func (cm *controllerManager) Start(ctx context.Context) (err error)
```
Concretely, this is the block that kicks off the election:
```go
// Source: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/manager/internal.go#L423
// Start the leader election and all required runnables.
{
	ctx, cancel := context.WithCancel(context.Background())
	cm.leaderElectionCancel = cancel
	go func() {
		// The resource lock is non-nil whenever LeaderElection=true was set at registration.
		// More elaborate setups also distinguish lock types: leaseLock, MultiLock, etc.
		if cm.resourceLock != nil { // LeaderElection=true: compete for leadership first, work only after winning
			if err := cm.startLeaderElection(ctx); err != nil {
				cm.errChan <- err
			}
		} else { // leader election not enabled: just start working
			// Treat not having leader election enabled the same as being elected.
			if err := cm.startLeaderElectionRunnables(); err != nil {
				cm.errChan <- err
			}
			close(cm.elected)
		}
	}()
}
```
Then comes the initialization of the election machinery:
```go
// Source: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/manager/internal.go#L560
func (cm *controllerManager) startLeaderElection(ctx context.Context) (err error) {
	l, err := leaderelection.NewLeaderElector(leaderelection.LeaderElectionConfig{
		Lock:          cm.resourceLock,
		LeaseDuration: cm.leaseDuration,
		RenewDeadline: cm.renewDeadline,
		RetryPeriod:   cm.retryPeriod,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(_ context.Context) {
				if err := cm.startLeaderElectionRunnables(); err != nil {
					cm.errChan <- err
					return
				}
				close(cm.elected)
			},
			OnStoppedLeading: func() {
				if cm.onStoppedLeading != nil {
					cm.onStoppedLeading()
				}
				// Make sure graceful shutdown is skipped if we lost the leader lock without
				// intending to.
				cm.gracefulShutdownTimeout = time.Duration(0)
				// Most implementations of leader election log.Fatal() here.
				// Since Start is wrapped in log.Fatal when called, we can just return
				// an error here which will cause the program to exit.
				cm.errChan <- errors.New("leader election lost")
			},
		},
		ReleaseOnCancel: cm.leaderElectionReleaseOnCancel,
		Name:            cm.leaderElectionID,
	})
	if err != nil {
		return err
	}
	// Start the leader elector process
	go func() {
		l.Run(ctx)
		<-ctx.Done()
		close(cm.leaderElectionStopped)
	}()
	return nil
}
```
This code mainly builds a LeaderElector, wires in the user-supplied parameters, and registers two callbacks: one that starts the runnables once this instance becomes leader, and the graceful-exit one for when leadership is lost.
It then launches a goroutine running Run:
```go
// Source: https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/leaderelection.go#L204
// Run starts the leader election loop. Run will not return
// before leader election loop is stopped by ctx or it has
// stopped holding the leader lease
func (le *LeaderElector) Run(ctx context.Context) {
	defer runtime.HandleCrash()
	defer le.config.Callbacks.OnStoppedLeading()
	if !le.acquire(ctx) {
		return // ctx signalled done
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	go le.config.Callbacks.OnStartedLeading(ctx)
	le.renew(ctx)
}
```
Run either returns because acquiring leadership failed (the context is done), in which case the deferred OnStoppedLeading callback fires on the way out, or it becomes leader and stays in the renew loop until the context ends.
The acquisition logic:
```go
// Source: https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/leaderelection.go#L245
// acquire loops calling tryAcquireOrRenew and returns true immediately when tryAcquireOrRenew succeeds.
// Returns false if ctx signals done.
func (le *LeaderElector) acquire(ctx context.Context) bool {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	succeeded := false
	desc := le.config.Lock.Describe()
	klog.Infof("attempting to acquire leader lease %v...", desc)
	wait.JitterUntil(func() {
		succeeded = le.tryAcquireOrRenew(ctx)
		le.maybeReportTransition()
		if !succeeded {
			klog.V(4).Infof("failed to acquire lease %v", desc)
			return
		}
		le.config.Lock.RecordEvent("became leader")
		le.metrics.leaderOn(le.config.Name)
		klog.Infof("successfully acquired lease %v", desc)
		cancel()
	}, le.config.RetryPeriod, JitterFactor, true, ctx.Done())
	return succeeded
}
```
This is just a context-governed loop. wait.JitterUntil, as its name suggests, keeps calling the function until the election succeeds (at which point cancel() stops the loop), while honoring the context: if the context is done, the loop exits and acquire returns false; otherwise it tries again every le.config.RetryPeriod.
The function that actually attempts to take leadership is tryAcquireOrRenew. It's fairly long, so instead of the full source here is pseudocode; read it alongside the election diagram above:
```go
// Source: https://github.com/kubernetes/client-go/blob/master/tools/leaderelection/leaderelection.go#L319
// tryAcquireOrRenew tries to acquire a leader lease if it is not already acquired,
// else it tries to renew the lease if it has already been acquired. Returns true
// on success else returns false.
func (le *LeaderElector) tryAcquireOrRenew(ctx context.Context) bool {
	// 1. Fetch the existing election lock record.
	//    a. If it doesn't exist: create it, update the locally kept observation
	//       (the local "notebook" mentioned earlier, used to compare the cluster
	//       record against the lock's lifetime), and return true.
	// 2. A record exists, so compare it with the local copy:
	//    a. If it changed, update the local copy and the observed time.
	//    b. Expiry check: local observed time + lease duration vs. now. If the
	//       lock has not expired and someone else holds it, return false.
	// 3. Am I currently the leader?
	//    a. Yes: carry over the lock's acquire time (when leadership began) and
	//       the transition count from the old record.
	//    b. No: a leadership transition is happening, so increment the count.
	// 4. Update the lock record in the cluster.
	//    This ties back to the ResourceLock mentioned earlier: several resource
	//    kinds may need updating. In the logs above both a configmap and a lease
	//    were updated, because the code ran controller-runtime v0.7-v0.11, whose
	//    default was the combined configmapLeaseLock; v0.12 and later default
	//    to leaseLock.
	// 5. If the update succeeds, refresh the local observation record.
}
```
If a context timeout occurs along the way, the program crashes and exits.
If not, the loop either keeps trying to acquire leadership, or returns true and Run proceeds into normal work and the lease-renewal logic.
Renewing the lease works much like acquiring it above: a loop that keeps calling tryAcquireOrRenew, except that each renew attempt runs under a context bounded by RenewDeadline.
Normal work means entering the controller's actual business logic - watching and reconciling the resource objects it manages - a story for another time.
Wrapping Up

Starting from a production issue and its logs, this post walked through the symptom, the cause, and the fix, then traced the relevant source code to describe the full startup of a controller manager with leaderElection enabled:
- controller manager initialization
- election mechanism initialization
- running the election and trying to acquire leadership
- on success, entering the normal work loop

I hope this was helpful.