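Troubleshooting high containerd CPU usage

Overview

On one node of a non-production k8s cluster, containerd's CPU usage kept swinging between 95% and 130%, while on the other nodes it stayed below 20%, typically around 1%-5%.

Troubleshooting process

Locating the process with high CPU usage

top showed containerd's CPU usage staying high, mostly fluctuating between 100% and 120%.
pidstat gives a closer look: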


linux > pidstat -p 1163 1   # sample PID 1163 and print its CPU usage at 1-second intervals

33分27秒   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
33分28秒     0      1163   68.00   56.00    0.00    0.00  124.00     2  containerd
33分29秒     0      1163   46.00   49.00    0.00    0.00   95.00     2  containerd
33分30秒     0      1163    1.00    1.00    0.00    0.00    2.00     2  containerd
33分31秒     0      1163    4.00    1.00    0.00    0.00    5.00     2  containerd
33分32秒     0      1163    1.00    1.00    0.00    0.00    2.00     2  containerd
33分33秒     0      1163    4.00    3.00    0.00    0.00    7.00     2  containerd
33分34秒     0      1163   58.00   60.00    0.00    0.00  118.00     2  containerd
33分35秒     0      1163   56.00   58.00    0.00    0.00  114.00     2  containerd
33分36秒     0      1163   59.00   57.00    0.00    0.00  116.00     2  containerd
33分37秒     0      1163   59.00   60.00    0.00    0.00  119.00     2  containerd
33分38秒     0      1163   69.00   57.00    0.00    0.00  126.00     2  containerd
33分39秒     0      1163   50.00   49.00    0.00    0.00   99.00     2  containerd
33分40秒     0      1163    2.00    1.00    0.00    0.00    3.00     2  containerd
33分41秒     0      1163    3.00    1.00    0.00    0.00    4.00     2  containerd
33分42秒     0      1163    1.00    1.00    0.00    0.00    2.00     2  containerd
33分43秒     0      1163    3.00    2.00    0.00    0.00    5.00     2  containerd
33分44秒     0      1163   56.00   61.00    0.00    0.00  117.00     2  containerd
33分45秒     0      1163   60.00   58.00    0.00    0.00  118.00     2  containerd
33分46秒     0      1163   57.00   60.00    0.00    0.00  117.00     2  containerd
33分47秒     0      1163   55.00   63.00    0.00    0.00  118.00     2  containerd
33分48秒     0      1163   71.00   59.00    0.00    0.00  130.00     2  containerd
33分49秒     0      1163   51.00   50.00    0.00    0.00  101.00     2  containerd
33分50秒     0      1163    1.00    1.00    0.00    0.00    2.00     2  containerd
33分51秒     0      1163    4.00    2.00    0.00    0.00    6.00     2  containerd
33分52秒     0      1163    2.00    1.00    0.00    0.00    3.00     2  containerd
33分53秒     0      1163    3.00    3.00    0.00    0.00    6.00     2  containerd
33分54秒     0      1163   58.00   60.00    0.00    0.00  118.00     2  containerd


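#1. The real timestamps have the form 14时33分xx秒 (hour, minute, second in a Chinese locale); the leading 14时 was trimmed so the lines do not wrap
#2. The output shows a repeating cycle: about 4-5s of near-idle CPU, followed by roughly 6s at 95%-130%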
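
To put numbers on the burst pattern instead of eyeballing it, the pidstat samples can be summarized with awk. This is a minimal sketch, assuming the column layout shown above (%CPU is the third field from the end). The function name summarize_pidstat and the 90% threshold are illustrative choices, not part of the original investigation.

```shell
#!/bin/sh
# summarize_pidstat [THRESHOLD] [COMMAND]
# Reads pidstat output on stdin and counts how many samples of COMMAND
# are at or above THRESHOLD percent CPU.
# Assumption: %CPU is the third field from the end, as in the output above.
summarize_pidstat() {
    threshold="${1:-90}"
    cmd="${2:-containerd}"
    awk -v t="$threshold" -v c="$cmd" '
        # Keep only sample lines: last field is the command name, and the
        # first field is a timestamp rather than a summary label ending in ":".
        $NF == c && $1 !~ /:$/ {
            total++
            if ($(NF-2) + 0 >= t) busy++
        }
        END { printf "busy=%d total=%d\n", busy, total }
    '
}
```

A possible use is `pidstat -p 1163 1 60 | summarize_pidstat 90 containerd`, which reports how many of the one-second samples were spent in a burst.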

Sampling the process's CPU usage with perf

On the target node, run perf against the containerd PID to collect CPU samples:

linux > sudo perf record -F 99 -p 1179 -g -- sleep 60

linux > sudo perf script -i perf.data > out.perf

Installing FlameGraph

linux > git clone https://github.com/brendangregg/FlameGraph.git

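Generating the flame graph

linux > cd FlameGraph   # the repository cloned in the previous step
linux > ./stackcollapse-perf.pl ../out.perf | ./flamegraph.pl > flamegraph.svg   # assumes out.perf is in the parent directory, where perf script wrote it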

The resulting CPU flame graph is shown below.

Reading the flame graph, most of the CPU time is spent in os.Lstat, a Go standard library function that stats a file through a system call.
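
The same conclusion can be cross-checked numerically from the folded stacks that stackcollapse-perf.pl produces (one line per stack, frames separated by ";" and followed by a sample count). This is a sketch only: the helper name frame_share is made up for illustration, and it does a plain substring match on the frame name.

```shell
#!/bin/sh
# frame_share PATTERN
# Reads folded stacks ("frame1;frame2;...;leaf COUNT") on stdin and prints
# the percentage of samples whose stack contains PATTERN anywhere.
frame_share() {
    pattern="$1"
    awk -v p="$pattern" '
        {
            total += $NF                          # last field is the sample count
            if (index($0, p) > 0) hit += $NF      # substring match on the whole stack
        }
        END { if (total > 0) printf "%.1f\n", 100 * hit / total }
    '
}
```

For example, `./stackcollapse-perf.pl ../out.perf | frame_share os.Lstat` prints the inclusive share of samples that passed through os.Lstat, which corresponds to the width of that frame in the flame graph.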

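Finding the corresponding operation in the containerd source

The relevant part of the containerd source:

// DiskUsage counts the number of inodes and disk usage for the resources under
// path.
// In other words, it computes how much disk space the given paths use.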
func DiskUsage(ctx context.Context, roots ...string) (Usage, error) {
	return diskUsage(ctx, roots...)
}

Following the call chain further:

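func diskUsage(ctx context.Context, roots ...string) (Usage, error) {
	// ... (omitted)
	for _, root := range roots {
		// The key part is this filepath.Walk call, which visits every file under root.
		if err := filepath.Walk(root, func(path string, fi os.FileInfo, err error) error {
			// ... (omitted)
		}); err != nil {
			// ... (omitted)
		}
	}

	// ... (omitted)
}

// Expanding further: filepath.Walk in the Go standard library.
func Walk(root string, fn WalkFunc) error {
	// ... (omitted)
	if err != nil {
		// ... (omitted)
	} else {
		err = walk(root, info, fn)
	}
	// ... (omitted)
}


func walk(path string, info fs.FileInfo, walkFn WalkFunc) error {
	// ... (omitted)
	for _, name := range names {
		// ... (omitted) the important call is the lstat below, made once per directory entry
		fileInfo, err := lstat(filename)
		// ... (omitted)
	}
	return nil
}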


func Lstat(name string) (FileInfo, error) {
	testlog.Stat(name)
	return lstatNolog(name)
}

// ... several intermediate calls omitted; the chain eventually reaches:

func fstatat(fd int, path string, stat *Stat_t, flags int) (err error) {
	var _p0 *byte
	_p0, err = BytePtrFromString(path)
	if err != nil {
		return
	}
	_, _, e1 := Syscall6(SYS_NEWFSTATAT, uintptr(fd), uintptr(unsafe.Pointer(_p0)), uintptr(unsafe.Pointer(stat)), uintptr(flags), 0, 0)
	if e1 != 0 {
		err = errnoErr(e1)
	}
	return
}

# So the system call that ends up being made is newfstatat (SYS_NEWFSTATAT)

Capturing the call details with bpftrace

Install bpftrace first (installation is not covered here).
The bpftrace script:

// 1179 is the PID of the containerd process on the node
tracepoint:syscalls:sys_enter_newfstatat {
    if (pid == 1179) {
        printf("1179 filename: %s, flag: %d\n", str(args->filename,100), args->flag);
    }
}

# Save the script above as trace_lstat.bt

The shell wrapper script:

#!/bin/sh
export BPFTRACE_MAX_STRLEN=200
bpftrace trace_lstat.bt

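# The BPFTRACE_MAX_STRLEN environment variable is needed because bpftrace's str() truncates strings at a default length limit that is too short for these paths, and the limit has to be raised through this setting. The output is still capped at the 100 bytes requested by str() in the script.

The output looks like this: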

... (omitted)
1179 filename: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/43491/fs/home/cpaas/xxl-job/202, flag: 256
1179 filename: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/43491/fs/home/cpaas/xxl-job/202, flag: 256
1179 filename: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/43491/fs/home/cpaas/xxl-job/202, flag: 256
1179 filename: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/43491/fs/home/cpaas/xxl-job/202, flag: 256
1179 filename: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/43491/fs/home/cpaas/xxl-job/202, flag: 256
1179 filename: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/43491/fs/home/cpaas/xxl-job/202, flag: 256
1179 filename: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapsho
... (omitted)
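
With hundreds of thousands of trace lines, it helps to aggregate them by snapshot ID rather than read them. The following is a rough sketch that assumes the line format shown above; count_by_snapshot is an illustrative name, and lines truncated before the snapshot ID are simply skipped.

```shell
#!/bin/sh
# count_by_snapshot
# Reads the bpftrace output on stdin and counts newfstatat calls per
# overlayfs snapshot ID (the numeric path component after "snapshots/").
count_by_snapshot() {
    awk '
        match($0, /snapshots\/[0-9]+/) {
            # "snapshots/" is 10 characters long; keep only the digits after it
            id = substr($0, RSTART + 10, RLENGTH - 10)
            calls[id]++
        }
        END { for (id in calls) printf "%d %s\n", calls[id], id }
    ' | sort -rn
}
```

Assuming the trace was redirected to a file such as trace.out (a hypothetical name), `count_by_snapshot < trace.out | head` lists the snapshots that containerd stats most often, busiest first.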

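Analysis of the captured output:
1. The captured output file has 380562 lines in total
linux > find /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs -type f | wc -l

# number of files reported
1104159


This makes it fairly certain that the problem lies here: containerd is stat-ing an enormous number of files under the snapshot directory.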

#!/bin/bash
# The following bash script (AI-generated) counts the files in each snapshot directory

# Root directory to scan
root_dir="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots"

# Iterate over every subdirectory of the root directory
for dir in "$root_dir"/*/; do
    # Strip the trailing slash
    dir=${dir%/}
    # Extract the subdirectory name (the snapshot ID)
    folder_name=$(basename "$dir")
    # Count the regular files under the directory with find
    file_count=$(find "$dir" -type f | wc -l)
    # Report only directories with more than 10000 files
    if [ "$file_count" -gt 10000 ]; then
        echo "$folder_name: $file_count files"
    fi
done


#1. Output: snapshots with more than 10000 files
83674: 21389 files
85266: 16118 files
85333: 22828 files
85932: 11998 files
86634: 430300 files
86635: 430193 files

One of these directories contains a very large number of files:

linux > find /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/86634/fs/home/export-import-server/xxl-job -type f|wc -l
430842

Breaking down the file count under this directory

# Same counting script as above, with only root_dir changed; the result:
2025-03-22: 55409 files
2025-03-23: 103679 files
2025-03-24: 103680 files
2025-03-25: 103330 files
2025-03-26: 64412 files

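Looking into any one of these directories shows that they all hold xxl-job logs. xxl-job writes a small log file for every execution, so with many frequently running tasks the count grows quickly: 5 tasks that each run every 5s already produce 5 × 86400 / 5 = 86400 files per day. With more than 100,000 log files on a full day here, the real workload must have more tasks or run them more often.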
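
The estimate above is simple enough to script for other task counts and intervals. A small sketch, assuming one log file per execution as described; files_per_day is an illustrative helper, and the second call is only a hypothetical workload, since the real task mix is not known.

```shell
#!/bin/sh
# files_per_day TASKS INTERVAL_SECONDS
# One log file per execution: each task runs 86400 / INTERVAL_SECONDS times a day.
files_per_day() {
    tasks="$1"
    interval="$2"
    echo $(( tasks * 86400 / interval ))
}

files_per_day 5 5    # the example from the text: prints 86400
files_per_day 6 5    # hypothetical workload: prints 103680
```

The second figure happens to equal the count observed for 2025-03-24, so a load of about 1.2 executions per second would be enough to explain the numbers.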
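Manually deleting the logs from the previous days fixes the immediate problem, but the application also needs to change so that it limits both the number of log files it produces and how long they are kept.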
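
The manual cleanup could look roughly like the following. This is a sketch under assumptions, not the exact commands that were run: purge_old_logs is an illustrative name, the directory and retention period are whatever fits the application, and -delete is a GNU/BSD find extension. It only lists files unless --delete is passed, so the selection can be reviewed first.

```shell
#!/bin/sh
# purge_old_logs DIR DAYS [--delete]
# Selects regular files under DIR that are older than DAYS full days
# (find -mtime +DAYS). Prints them by default; removes them only when
# --delete is given.
purge_old_logs() {
    dir="$1"
    days="$2"
    mode="${3:-}"
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    if [ "$mode" = "--delete" ]; then
        find "$dir" -type f -mtime +"$days" -delete
    else
        find "$dir" -type f -mtime +"$days" -print
    fi
}
```

A dry run such as `purge_old_logs /path/to/xxl-job/logs 3` (placeholder path) shows what would be removed; adding --delete performs the removal.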
After cleaning up the expired xxl-job log files, containerd's CPU usage looks like this:

[root@h-172-16-15-144-DEV-k8s-node fs]# pidstat -p 1163 1
Linux 5.14.0-479.el9.x86_64 (h-172-16-15-144-DEV-k8s-node) 	2025年03月26日 	_x86_64_	(16 CPU)

13分21秒   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
13分22秒     0      1163    9.00    1.00    0.00    0.00   10.00     2  containerd
13分23秒     0      1163    0.00    1.00    0.00    0.00    1.00     2  containerd
13分24秒     0      1163    8.00    9.00    0.00    0.00   17.00     2  containerd
13分25秒     0      1163    3.00    1.00    0.00    0.00    4.00     2  containerd
13分26秒     0      1163    1.00    1.00    0.00    0.00    2.00     2  containerd
13分27秒     0      1163    2.00    1.00    0.00    0.00    3.00     2  containerd
13分28秒     0      1163    1.00    1.00    0.00    0.00    2.00     2  containerd
13分29秒     0      1163    1.00    0.00    0.00    0.00    1.00     2  containerd
13分30秒     0      1163    0.00    1.00    0.00    0.00    1.00     2  containerd
13分31秒     0      1163   10.00    0.00    0.00    0.00   10.00     2  containerd
13分32秒     0      1163    2.00    1.00    0.00    0.00    3.00     2  containerd
13分33秒     0      1163    0.00    0.00    0.00    0.00    0.00     2  containerd
13分34秒     0      1163    9.00    9.00    0.00    0.00   18.00     2  containerd
13分35秒     0      1163    1.00    1.00    0.00    0.00    2.00     2  containerd
13分36秒     0      1163    0.00    0.00    0.00    0.00    0.00     2  containerd
13分37秒     0      1163    1.00    1.00    0.00    0.00    2.00     2  containerd
13分38秒     0      1163    1.00    0.00    0.00    0.00    1.00     2  containerd
13分39秒     0      1163    3.00    1.00    0.00    0.00    4.00     2  containerd
13分40秒     0      1163    0.00    1.00    0.00    0.00    1.00     2  containerd
13分41秒     0      1163   11.00    1.00    0.00    0.00   12.00     2  containerd
13分42秒     0      1163    1.00    0.00    0.00    0.00    1.00     2  containerd
13分43秒     0      1163    0.00    0.00    0.00    0.00    0.00     2  containerd
13分44秒     0      1163    8.00   10.00    0.00    0.00   18.00     2  containerd
13分45秒     0      1163    0.00    0.00    0.00    0.00    0.00     2  containerd
13分46秒     0      1163    1.00    0.00    0.00    0.00    1.00     2  containerd
13分47秒     0      1163   11.00    2.00    0.00    0.00   13.00     2  containerd
13分48秒     0      1163    0.00    0.00    0.00    0.00    0.00     2  containerd
13分49秒     0      1163    3.00    2.00    0.00    0.00    5.00     2  containerd
13分50秒     0      1163    1.00    0.00    0.00    0.00    1.00     2  containerd
13分51秒     0      1163    1.00    0.00    0.00    0.00    1.00     2  containerd
13分52秒     0      1163    2.00    1.00    0.00    0.00    3.00     2  containerd
13分53秒     0      1163    0.00    1.00    0.00    0.00    1.00     2  containerd
13分54秒     0      1163   18.00    9.00    0.00    0.00   27.00     2  containerd
13分55秒     0      1163    2.00    1.00    0.00    0.00    3.00     2  containerd
13分56秒     0      1163    2.00    2.00    0.00    0.00    4.00     2  containerd
13分57秒     0      1163    1.00    1.00    0.00    0.00    2.00     2  containerd
13分58秒     0      1163    0.00    0.00    0.00    0.00    0.00     2  containerd
13分59秒     0      1163    2.00    1.00    0.00    0.00    3.00     2  containerd
14分00秒     0      1163    1.00    0.00    0.00    0.00    1.00     2  containerd

# CPU usage has dropped; the peak is now only 27%

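Root cause analysis

containerd runs a periodic task that measures the disk usage of the snapshots it manages: the DiskUsage function.
DiskUsage walks each snapshot directory and issues a newfstatat system call for every file it finds.
With this many files, each walk burns a lot of CPU.
The file-heavy snapshots belong to containers whose pods write xxl-job logs into the container filesystem. The files are small, but there were over a million of them on this node because they were never cleaned up, so every containerd scan was very CPU-intensive.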