K8s containerd Registry Mirror Configuration: Fixing ImagePullBackOff with 1ms-helper

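Environment and Symptoms

This post records an image pull problem I hit while initializing a new K8s node. The node was already Ready and the Pod had been scheduled onto the target worker, but the container never came up: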

Failed to pull image
context deadline exceeded
Back-off pulling image
ImagePullBackOff

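Errors like these usually do not start in the workload YAML. `context deadline exceeded` means the pull timed out, and the common cause is that containerd on the target node has no working registry mirror configured. In K8s the image is pulled by the container runtime on the node where the Pod runs, so troubleshooting has to go back to containerd on that node.

This post covers only K8s with containerd; other runtimes are out of scope.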

References:

Config helper: https://mdoc.cc/mliev/1ms/v1.0.0/71
K8s setup: https://mdoc.cc/mliev/1ms/v1.0.0/17

1. Check the node and the containerd version first

From the cluster side, look at the nodes, the failing Pods, and recent events:

kubectl get node -owide
kubectl get pod -A -owide | grep -E 'ImagePullBackOff|ErrImagePull|ContainerCreating'
kubectl get events -A --sort-by=.lastTimestamp
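
To go from the event list to the node that actually needs fixing, the failing Pods can be grouped by node. The sketch below is one way to do that. The function name `pull_failures_by_node` and the four-column input layout are my own choices for illustration, not something kubectl or 1ms-helper provides.

```bash
#!/usr/bin/env bash
# Count image pull failures per node.
# Input (stdin): one line per Pod, "NAMESPACE POD NODE REASON".
# One way to produce that input (assumes kubectl access to the cluster):
#   kubectl get pod -A --no-headers -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,NODE:.spec.nodeName,REASON:.status.containerStatuses[*].state.waiting.reason'
pull_failures_by_node() {
  # Field 4 may hold several comma-separated reasons (one per container),
  # so a substring match is used rather than an exact comparison.
  awk '$4 ~ /ImagePullBackOff|ErrImagePull/ { count[$3]++ }
       END { for (node in count) print node, count[node] }' | sort
}
```

Piping the kubectl command from the comment into `pull_failures_by_node` prints one line per affected node with its failure count, which tells you where to log in next.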

Then log in to the target node and check containerd:

containerd --version
systemctl status containerd

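The containerd version matters because it determines where the mirror configuration goes:

| Version | Typical configuration |
| --- | --- |
| `< 1.5` | `registry.mirrors` entries in `/etc/containerd/config.toml` |
| `>= 1.5` | `config_path` in `/etc/containerd/config.toml` points to a directory, and each registry gets its own `hosts.toml` |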
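
The version check in the table can be scripted. This is a minimal sketch that assumes GNU `sort -V` is available; `config_style` is a hypothetical helper name, and the 1.5 threshold is taken from the table above.

```bash
#!/usr/bin/env bash
# Print which registry configuration style applies to a containerd version.
# Argument: a version string such as "1.6.20" or "v1.7.2".
config_style() {
  local ver="${1#v}"   # drop a leading "v" if present
  local lowest
  # sort -V orders version strings; if 1.5 sorts first (or ties), ver >= 1.5.
  lowest="$(printf '%s\n' "1.5" "$ver" | sort -V | head -n1)"
  if [ "$lowest" = "1.5" ]; then
    echo "hosts.toml"
  else
    echo "mirrors"
  fi
}

# Example (assumes the version is the third field of `containerd --version`):
#   config_style "$(containerd --version | awk '{print $3}')"
```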

2. Standardize the configuration with 1ms-helper

Install the config helper first:

curl -sSL https://static.1ms.run/1ms-helper/install.sh | bash

Then run this on every K8s node that needs to pull images:

sudo 1ms-helper config:k8s

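This step does several things:

Detects the Kubernetes cluster and the node's role.
Detects the containerd version.
Generates the matching configuration for new or old containerd.
Handles the core K8s image registries first.
Restarts containerd and kubelet when necessary.

In a multi-node cluster, running it once on the control plane is not enough. An image is pulled on whichever node the Pod is scheduled to, so every such node needs the configuration.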
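
For more than a handful of nodes, the per-node commands can be generated from the node list. The sketch below only prints the commands so they can be reviewed before anything runs. It assumes node names are resolvable SSH hosts and that the helper is already installed on each node; `helper_commands` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Print (do not execute) one ssh command per node name read from stdin.
# Expected input: the output of `kubectl get nodes -o name`, i.e. "node/<name>".
helper_commands() {
  local node
  while IFS= read -r node; do
    node="${node#node/}"   # strip the "node/" prefix kubectl adds
    if [ -n "$node" ]; then
      printf 'ssh %s sudo 1ms-helper config:k8s\n' "$node"
    fi
  done
}

# Example:
#   kubectl get nodes -o name | helper_commands
```

Printing instead of executing is deliberate: the helper may restart containerd and kubelet, so it is safer to roll through nodes one at a time.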

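3. Configuration for containerd 1.5 and later

Make sure /etc/containerd/config.toml sets config_path. The table name below is the one used by containerd 1.x; containerd 2.x moved the CRI image settings to a different plugin table, so check the generated config on those versions: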

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"

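Create one directory per registry (root privileges are needed under /etc). A directory only takes effect once it contains a hosts.toml:

BASE=/etc/containerd/certs.d
for r in docker.io ghcr.io gcr.io registry.k8s.io nvcr.io quay.io; do
  mkdir -p "$BASE/$r"
done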

registry.k8s.io:

# /etc/containerd/certs.d/registry.k8s.io/hosts.toml
server = "https://registry.k8s.io"

[host."https://k8s.1ms.run"]
  capabilities = ["pull", "resolve"]

docker.io:

# /etc/containerd/certs.d/docker.io/hosts.toml
server = "https://registry-1.docker.io"

[host."https://docker.1ms.run"]
  capabilities = ["pull", "resolve"]

ghcr.io:

# /etc/containerd/certs.d/ghcr.io/hosts.toml
server = "https://ghcr.io"

[host."https://ghcr.1ms.run"]
  capabilities = ["pull", "resolve"]

quay.io:

# /etc/containerd/certs.d/quay.io/hosts.toml
server = "https://quay.io"

[host."https://quay.1ms.run"]
  capabilities = ["pull", "resolve"]
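
The four hosts.toml files above share one shape, so they can be written by a small function instead of by hand. This is a sketch under that assumption; `write_hosts_toml` is a hypothetical helper, and the registry and mirror URLs in the example calls are the ones listed above.

```bash
#!/usr/bin/env bash
# Write <base>/<registry>/hosts.toml with a single pull/resolve mirror.
# Arguments: base directory, registry name, upstream server URL, mirror URL.
write_hosts_toml() {
  local base="$1" registry="$2" server="$3" mirror="$4"
  mkdir -p "$base/$registry"
  cat > "$base/$registry/hosts.toml" <<EOF
server = "$server"

[host."$mirror"]
  capabilities = ["pull", "resolve"]
EOF
}

# Example calls (run as root when the base is /etc/containerd/certs.d):
#   BASE=/etc/containerd/certs.d
#   write_hosts_toml "$BASE" registry.k8s.io https://registry.k8s.io      https://k8s.1ms.run
#   write_hosts_toml "$BASE" docker.io       https://registry-1.docker.io https://docker.1ms.run
#   write_hosts_toml "$BASE" ghcr.io         https://ghcr.io              https://ghcr.1ms.run
#   write_hosts_toml "$BASE" quay.io         https://quay.io              https://quay.1ms.run
```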

Restart after configuring:

systemctl restart containerd
systemctl status containerd
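
`systemctl status` only shows that the service is running. To see whether config_path made it into the merged configuration, one option is to filter the output of `containerd config dump`. The helper name `effective_config_path` is hypothetical, and the sketch assumes the dump prints the key as `config_path = "..."` (single or double quotes).

```bash
#!/usr/bin/env bash
# Read `containerd config dump` output on stdin and print every config_path value.
# An empty output line means the key exists but is set to "".
effective_config_path() {
  sed -n "s/^[[:space:]]*config_path[[:space:]]*=[[:space:]]*[\"']\\(.*\\)[\"'][[:space:]]*\$/\\1/p"
}

# Example:
#   containerd config dump | effective_config_path
```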

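4. Configuration for containerd before 1.5

On older versions the usual approach is to list the mirrors directly in /etc/containerd/config.toml. Mirrors are keyed by the registry name used in the image reference, so the k8s.gcr.io entry below does not cover images referenced as registry.k8s.io; those would need their own entry: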

[plugins."io.containerd.grpc.v1.cri".registry]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
      endpoint = ["https://docker.1ms.run"]

    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."k8s.gcr.io"]
      endpoint = ["https://k8s.1ms.run"]

    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."gcr.io"]
      endpoint = ["https://gcr.1ms.run"]

    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
      endpoint = ["https://ghcr.1ms.run"]
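
One thing to watch on nodes that were upgraded in place: to my knowledge containerd 1.x refuses a CRI config that sets a non-empty config_path and still defines registry.mirrors. A quick pre-restart check can catch that. This is a rough text-based sketch, not a TOML parser, and `has_registry_conflict` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Return 0 if the given config file sets a non-empty config_path
# AND still contains registry.mirrors tables; return non-zero otherwise.
has_registry_conflict() {
  local file="$1"
  grep -Eq "^[[:space:]]*config_path[[:space:]]*=[[:space:]]*[\"'][^\"']+" "$file" &&
    grep -Eq '^[[:space:]]*\[plugins\..*\.registry\.mirrors' "$file"
}

# Example:
#   if has_registry_conflict /etc/containerd/config.toml; then
#     echo "config_path and registry.mirrors are both set; keep only one style"
#   fi
```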

Restart the service:

systemctl restart containerd
systemctl status containerd

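5. Verification commands

Verify with crictl pull directly against the node's runtime. A successful kubectl apply only means the API server accepted the object, not that the image was pulled: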

crictl pull docker.io/library/redis:7-alpine
crictl pull registry.k8s.io/pause:3.10
crictl pull ghcr.io/containerd/nerdctl:latest
crictl pull nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
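
When several registries are checked at once, a loop that reports pass or fail per image is easier to read than four separate outputs. A minimal sketch follows; `verify_pulls` and the `PULL_CMD` override are my own additions for illustration, with `crictl pull` as the default command.

```bash
#!/usr/bin/env bash
# Pull each image given as an argument and print OK or FAIL per image.
# PULL_CMD overrides the pull command (default: "crictl pull").
# Returns non-zero if at least one pull failed.
verify_pulls() {
  local pull_cmd="${PULL_CMD:-crictl pull}" image failed=0
  for image in "$@"; do
    # $pull_cmd is intentionally unquoted so "crictl pull" splits into two words.
    if $pull_cmd "$image" >/dev/null 2>&1; then
      echo "OK   $image"
    else
      echo "FAIL $image"
      failed=1
    fi
  done
  return "$failed"
}

# Example (crictl usually needs root):
#   verify_pulls docker.io/library/redis:7-alpine registry.k8s.io/pause:3.10
```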

Then go back to the cluster side and check the events:

kubectl get pod -A | grep -E 'ImagePullBackOff|ErrImagePull'
kubectl get events -A --sort-by=.lastTimestamp

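If crictl pull succeeds but the workload Pod still fails, check the following:

Whether the image name and tag actually exist.
Whether the ImagePullSecret in the namespace is correct.
Whether the ServiceAccount references the right secret.
Whether the image address in the workload YAML is mistyped.
Whether node DNS is misbehaving; 1ms-helper check:dns gives a first-pass check.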
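
For the image-address item, it helps to know which registry an image reference resolves to, since that decides which mirror entry applies. The sketch below follows the usual Docker-style rule as I understand it: the part before the first slash is a registry only if it contains a dot or a colon, or is `localhost`; otherwise the image belongs to docker.io. `image_registry_host` is a hypothetical helper, not part of crictl or kubectl.

```bash
#!/usr/bin/env bash
# Print the registry host an image reference resolves to.
image_registry_host() {
  local ref="$1" first
  # No slash at all (e.g. "redis:7-alpine") means a Docker Hub image.
  if [[ "$ref" != */* ]]; then
    echo "docker.io"
    return
  fi
  first="${ref%%/*}"   # text before the first "/"
  if [[ "$first" == *.* || "$first" == *:* || "$first" == "localhost" ]]; then
    echo "$first"
  else
    echo "docker.io"   # e.g. "library/redis" is a Docker Hub namespace
  fi
}

# Example:
#   image_registry_host registry.k8s.io/pause:3.10
```

If the host printed here has no matching mirror entry on the node, the pull goes straight to the upstream registry.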

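Summary

When debugging image pulls in K8s, don't look only at the Deployment YAML. In a standard K8s + containerd setup, registry mirror configuration lives in the node's runtime.

My order of operations:

Check the events and confirm the failure is ImagePullBackOff.
Go to the target node and confirm the containerd version.
Standardize the configuration with 1ms-helper config:k8s.
Check that the new-style hosts.toml or the old-style mirrors are correct.
Verify against several registries with crictl pull.

This workflow is worth adding to the initialization checklist for new K8s nodes. It avoids a lot of configuration drift when scaling out workers later.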