Case Study | Jank Caused by SurfaceFlinger Stuck in Runnable for Long Stretches

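Symptom

While repeatedly launching and exiting apps, we hit jank caused by SurfaceFlinger (Sf) sitting in the Runnable state for long stretches.

Take one segment as an example.

In this frame, Sf spent nearly 1.5 s in composite.

Sf was Runnable for 473 ms. This is counterintuitive: the Sf thread runs at RT priority 97, and its priority never dropped during this window.

Let's look at the CPU selection behavior in this window.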

TimerDispatch-1729 ( 1241) [000] .... 625.956417: sched_waking: comm=surfaceflinger pid=1241 prio=97 target_cpu=6
// idle_exit_latency=1 means CPU 7 is idle at this point; it is also online (online=1) and not paused (paused=0)
// and it is not under high IRQ load (high_irq_load=0)
TimerDispatch-1729 ( 1241) [000] .... 625.956434: sched_cpu_util: cpu=7 nr_running=0 cpu_util=511 cpu_max_util=427 capacity=825 capacity_orig=1024 idle_exit_latency=1 online=1 paused=0 nr_rtg_high_prio_tasks=0 busy_with_softirqs=0 high_irq_ctrl=1 high_irq_load=0 irqload=31 min_highirq_load=1024 irq_ratio=100
// Sf is restricted to CPUs 4-7 (task_mask=240), and since CPU 7 is idle right now (idle_cpus=128), CPU 7 is selected
TimerDispatch-1729 ( 1241) [000] .... 625.956437: sched_select_task_rq_rt: pid=1241 policy=16384 target_cpu=7 lowest_mask=240 rt_aggre_preempt_enable=0 idle_cpus=128 cfs_cpus=112 rt_cpus=0 cfs_lowest_cpu=5 cfs_lowest_prio=124 cfs_lowest_pid=1891 rt_lowest_cpu=-1 rt_lowest_prio=97 rt_lowest_pid=-1 task_util_est=264 uclamp_min=308 uclamp_max=409 uclamp_task_util=308 sd_flag=8 sync=0 task_mask=240 cpuctl_grp_id=2 cpuset_grp_id=8 act_mask=255

// Sf previously ran on CPU 6 and is now placed on CPU 7, but CPU 7 never receives the IPI

// Sf is woken again 66 ms later; target_cpu is still 7
TimerDispatch-1729 ( 1241) [001] .... 626.023055: sched_waking: comm=surfaceflinger pid=1241 prio=97 target_cpu=7
// CPU 7 has briefly exited idle at this point (nr_running=1)
TimerDispatch-1729 ( 1241) [001] .... 626.023073: sched_cpu_util: cpu=7 nr_running=1 cpu_util=616 cpu_max_util=616 capacity=982 capacity_orig=1024 idle_exit_latency=4294967295 online=1 paused=0 nr_rtg_high_prio_tasks=0 busy_with_softirqs=0 high_irq_ctrl=1 high_irq_load=0 irqload=34 min_highirq_load=1024 irq_ratio=100

// This time CPU 5 is selected (LB_RT_LOWEST_PRIO_NORMAL; note policy=32769, versus 16384 at the first wakeup), because the CFS task TaskSnapshotPer running on CPU 5 has prio 130, the lowest priority among the candidates
TimerDispatch-1729 ( 1241) [001] .... 626.023077: sched_select_task_rq_rt: pid=1241 policy=32769 target_cpu=5 lowest_mask=240 rt_aggre_preempt_enable=0 idle_cpus=0 cfs_cpus=240 rt_cpus=0 cfs_lowest_cpu=5 cfs_lowest_prio=130 cfs_lowest_pid=2294 rt_lowest_cpu=-1 rt_lowest_prio=97 rt_lowest_pid=-1 task_util_est=264 uclamp_min=110 uclamp_max=409 uclamp_task_util=110 sd_flag=8 sync=0 task_mask=240 cpuctl_grp_id=2 cpuset_grp_id=8 act_mask=255
// Sf actually gets to run on CPU 5, a full 66 ms after the first wakeup
TaskSnapshotPer-2294 ( 1875) [005] .... 626.023125: sched_switch: prev_comm=TaskSnapshotPer prev_pid=2294 prev_prio=130 prev_state=R ==> next_comm=surfaceflinger next_pid=1241 next_prio=97
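The mask fields in sched_select_task_rq_rt are printed as decimal bitmasks, and the delay is simply the gap between the first sched_waking and the final sched_switch. The following is a minimal Python sketch for decoding both, assuming bit n of each mask stands for CPU n; the helper names mask_to_cpus and latency_ms are made up for illustration and are not part of any tracing tool.

```python
from decimal import Decimal


def mask_to_cpus(mask: int) -> list[int]:
    """Return the CPU ids whose bits are set in a decimal cpumask value.

    Assumption: bit n of the mask corresponds to CPU n.
    """
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def latency_ms(start_s: str, end_s: str) -> float:
    """Gap between two ftrace timestamps (seconds, given as strings) in ms."""
    return float((Decimal(end_s) - Decimal(start_s)) * 1000)


# Values copied from the trace above.
print(mask_to_cpus(240))  # task_mask / lowest_mask -> [4, 5, 6, 7]
print(mask_to_cpus(128))  # idle_cpus at the first wakeup -> [7]
print(mask_to_cpus(112))  # cfs_cpus at the first wakeup -> [4, 5, 6]
# First sched_waking (625.956417) to sched_switch on CPU 5 (626.023125).
print(latency_ms("625.956417", "626.023125"))
```

Decoded this way, the first selection reads as "allowed on CPUs 4-7, only CPU 7 idle", which matches the comments in the trace, and the wakeup-to-run gap comes out at roughly 66.7 ms.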

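So the natural suspicion is that something went wrong with interrupts during this window.

Cross-checking with the trace:

A single SCHED_SOFTIRQ took 1.7 s to process.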

// out_balance=0 means no pull balance is needed this time
<idle>-0     (-----) [007] .... 625.568063: sched_find_busiest_group: src_cpu=4 dst_cpu=7 out_balance=0 reason=1

surfaceflinger-1241 ( 1241) [007] .... 625.569155: sched_waking: comm=ksoftirqd/7 pid=60 prio=120 target_cpu=7
surfaceflinger-1241 ( 1241) [007] .... 625.569163: sched_wakeup: comm=ksoftirqd/7 pid=60 prio=120 target_cpu=007

625.576026: softirq_entry: vec=7 [action=SCHED]

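A quick note on SCHED_SOFTIRQ: this softirq is very common in scheduling and is usually raised from the tick. It shows up in the trace as vec=7 [action=SCHED]; the enum is defined in include/linux/interrupt.h: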

enum
{
	HI_SOFTIRQ=0,
	TIMER_SOFTIRQ,
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	BLOCK_SOFTIRQ,
	IRQ_POLL_SOFTIRQ,
	TASKLET_SOFTIRQ,
	SCHED_SOFTIRQ,
	HRTIMER_SOFTIRQ,
	RCU_SOFTIRQ,    /* Preferable RCU should always be the last softirq */
	NR_SOFTIRQS
};
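To double-check that vec=7 in the trace really is SCHED_SOFTIRQ, the enum order can be mirrored in a small lookup table. This is only an illustrative Python sketch: the list is copied from the enum above, and the name softirq_name is hypothetical.

```python
# Order copied from the enum above; the list index is the softirq vector
# number that ftrace prints as "vec=N".
SOFTIRQ_NAMES = [
    "HI",
    "TIMER",
    "NET_TX",
    "NET_RX",
    "BLOCK",
    "IRQ_POLL",
    "TASKLET",
    "SCHED",
    "HRTIMER",
    "RCU",
]


def softirq_name(vec: int) -> str:
    """Map a softirq vector number to its short action name."""
    return SOFTIRQ_NAMES[vec]


print(softirq_name(7))  # SCHED
```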

It is generally raised in two scenarios:

idle balance
periodic balance, i.e. tick balance

The first can be ruled out, because CPU 7 switched to ksoftirqd from Sf, not from swapper/7:
surfaceflinger-1241 ( 1241) [007] .... 625.570755: sched_switch: prev_comm=surfaceflinger prev_pid=1241 prev_prio=97 prev_state=S ==> next_comm=ksoftirqd/7 next_pid=60 next_prio=120
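The same check can be done mechanically when there are many such switches to go through. The Python sketch below pulls prev_comm and next_comm out of a sched_switch line and tests whether the previous task was the idle task; it assumes the idle task shows up as swapper/N, as it does in this trace, and the regex and function names are illustrative only.

```python
import re

# Matches the sched_switch payload as it appears in the trace above.
SWITCH_RE = re.compile(
    r"sched_switch: prev_comm=(?P<prev>.+?) prev_pid=\d+.*?"
    r"==> next_comm=(?P<next>.+?) next_pid=\d+"
)


def parse_switch(line: str) -> tuple[str, str]:
    """Return (prev_comm, next_comm) from one sched_switch trace line."""
    m = SWITCH_RE.search(line)
    if m is None:
        raise ValueError("not a sched_switch line")
    return m.group("prev"), m.group("next")


def switched_from_idle(line: str) -> bool:
    """True if the previous task was the idle task (assumed to be swapper/N)."""
    prev, _ = parse_switch(line)
    return prev.startswith("swapper")


# Line copied from the trace above.
LINE = (
    "surfaceflinger-1241 ( 1241) [007] .... 625.570755: sched_switch: "
    "prev_comm=surfaceflinger prev_pid=1241 prev_prio=97 prev_state=S ==> "
    "next_comm=ksoftirqd/7 next_pid=60 next_prio=120"
)
print(parse_switch(LINE))        # ('surfaceflinger', 'ksoftirqd/7')
print(switched_from_idle(LINE))  # False
```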

The handler for SCHED_SOFTIRQ is run_rebalance_domains. The matching softirq_exit event is a long time coming, which means run_rebalance_domains itself takes a long time to return.
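One way to spot such long-running handlers across a whole trace is to pair each softirq_entry with the next softirq_exit for the same CPU and vector. The Python sketch below does that; the two sample lines are synthetic (they are not from this trace, and their timestamps are invented purely to exercise the code), and it assumes each line carries the usual [cpu] column before the timestamp.

```python
import re
from decimal import Decimal

EVENT_RE = re.compile(
    r"\[(?P<cpu>\d+)\].*?(?P<ts>\d+\.\d+): "
    r"softirq_(?P<kind>entry|exit): vec=(?P<vec>\d+)"
)


def softirq_durations(lines):
    """Return a list of (cpu, vec, duration_ms) for each entry/exit pair."""
    open_entries = {}  # (cpu, vec) -> entry timestamp
    results = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None:
            continue
        key = (int(m["cpu"]), int(m["vec"]))
        ts = Decimal(m["ts"])
        if m["kind"] == "entry":
            open_entries[key] = ts
        elif key in open_entries:
            start = open_entries.pop(key)
            results.append((key[0], key[1], float((ts - start) * 1000)))
    return results


# Synthetic sample lines, for illustration only.
SAMPLE = [
    "ksoftirqd/7-60 [007] ..s. 100.000000: softirq_entry: vec=7 [action=SCHED]",
    "ksoftirqd/7-60 [007] ..s. 100.250000: softirq_exit: vec=7 [action=SCHED]",
]
print(softirq_durations(SAMPLE))  # [(7, 7, 250.0)]
```

An entry with no matching exit is left in open_entries and is not reported, so a handler that has not returned by the end of the capture needs a separate check.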

The code below is based on kernel 6.1.

trace_softirq_entry(vec_nr);
h->action(h);
trace_softirq_exit(vec_nr);

__init void init_sched_fair_class(void)
{
#ifdef CONFIG_SMP
	int i;

	for_each_possible_cpu(i) {
		zalloc_cpumask_var_node(&per_cpu(load_balance_mask, i), GFP_KERNEL, cpu_to_node(i));
		zalloc_cpumask_var_node(&per_cpu(select_rq_mask,    i), GFP_KERNEL, cpu_to_node(i));
	}

	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
#ifdef CONFIG_NO_HZ_COMMON
	nohz.next_balance = jiffies;
	nohz.next_blocked = jiffies;
	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
#endif
#endif /* SMP */
}

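To summarize:

trigger_load_balance() runs periodically on each CPU via scheduler_tick().

Once the next periodic rebalance event for the current runqueue is due, it raises a softirq. The actual load-balancing work is done by run_rebalance_domains()->rebalance_domains(), which executes in softirq context (SCHED_SOFTIRQ).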
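The tick-driven trigger can be pictured with a toy model. The Python sketch below is a deliberately simplified illustration, not kernel code: it keeps only the comparison of jiffies against the runqueue's next_balance, ignores jiffies wraparound, the NOHZ kick and the CPU-state checks, and uses a fixed, hypothetical interval in place of the value that rebalance_domains() would compute.

```python
def simulate_ticks(n_ticks: int, next_balance: int, interval: int) -> list[int]:
    """Return the jiffies values at which a tick would raise SCHED_SOFTIRQ.

    Toy model of the periodic trigger: one loop iteration stands for one
    scheduler tick on a single CPU.
    """
    raised = []
    for jiffies in range(n_ticks):
        # Stand-in for the "is the next rebalance due?" check, without
        # handling jiffies wraparound.
        if jiffies >= next_balance:
            raised.append(jiffies)
            # In the kernel the next deadline is recomputed inside the
            # softirq; a fixed, hypothetical interval stands in for that.
            next_balance = jiffies + interval
    return raised


print(simulate_ticks(10, next_balance=2, interval=3))  # [2, 5, 8]
```

The point of the model is only that the softirq is raised from the tick path and the heavy work happens later, in softirq context, which is why a slow rebalance shows up as a long gap between softirq_entry and softirq_exit.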

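The next step is to find out why rebalance_domains takes so long. Most likely it is waiting on a lock, though the rq lock can already be ruled out, because tasks kept being enqueued during this window.

To be continued...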