metrics-server deployment error

Symptoms

E1126 07:37:20.876269       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.174.135:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.174.135 because it doesn't contain any IP SANs" node="server-06"
E1126 07:37:20.881132       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.174.136:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.174.136 because it doesn't contain any IP SANs" node="server-07"
E1126 07:37:20.886884       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.174.133:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.174.133 because it doesn't contain any IP SANs" node="server-04"
E1126 07:37:20.891583       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.174.134:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.174.134 because it doesn't contain any IP SANs" node="server-05"
I1126 07:37:24.595189       1 server.go:192] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I1126 07:37:34.596509       1 server.go:192] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E1126 07:37:35.842888       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.174.131:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.174.131 because it doesn't contain any IP SANs" node="server-03"
E1126 07:37:35.843603       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.174.136:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.174.136 because it doesn't contain any IP SANs" node="server-07"
E1126 07:37:35.865009       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.174.137:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.174.137 because it doesn't contain any IP SANs" node="server-08"
E1126 07:37:35.869804       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.174.135:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.174.135 because it doesn't contain any IP SANs" node="server-06"
E1126 07:37:35.875473       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.174.133:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.174.133 because it doesn't contain any IP SANs" node="server-04"
E1126 07:37:35.880808       1 scraper.go:149] "Failed to scrape node" err="Get \"https://192.168.174.134:10250/metrics/resource\": tls: failed to verify certificate: x509: cannot validate certificate for 192.168.174.134 because it doesn't contain any IP SANs" node="server-05"

Explanation

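Whether a control-plane node is initialized or joined, or a worker node is joined, the kubelet serving certificate is self-signed by default. When a third-party component such as metrics-server queries the kubelet, the TLS handshake verifies the kubelet serving certificate. If the IP being requested is not listed in the certificate's SANs, verification fails and the handshake is aborted. The kubelet setting serverTLSBootstrap changes this: the kubelet requests its serving certificate from the Kubernetes CA instead of signing it itself.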

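# Default behavior (serverTLSBootstrap not set)
kubelet starts
  ↓
generates its own certificate (self-signed)
  ↓
Issuer: CN = server-03-ca@1764041318  # ❌ signed by the node itself
Subject: CN = server-03@1764041318
  ↓
other components do not trust this certificate
  ↓
--kubelet-insecure-tls is needed to skip verification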

After setting serverTLSBootstrap to true

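kubelet starts
  ↓
no longer generates its own certificate
  ↓
requests one from the Kubernetes CA instead
  ↓
Issuer: CN = kubernetes  # ✅ signed by a trusted CA
Subject: CN = system:node:server-03, O = system:nodes
  ↓
components that trust the cluster CA accept this certificate
  ↓
--kubelet-insecure-tls is not needed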
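
To see which of the two cases a node is in, one option is to look at the certificate the kubelet actually presents on port 10250. The following is a minimal sketch, not part of the original procedure: the function names `fetch_kubelet_cert` and `show_cert_identity` are made up for illustration, and the `-ext` option assumes OpenSSL 1.1.1 or newer.

```shell
# Print issuer, subject and SANs of a PEM certificate read from stdin.
# Assumption: OpenSSL 1.1.1+ (older releases lack the -ext option).
show_cert_identity() {
  openssl x509 -noout -issuer -subject -ext subjectAltName
}

# Fetch the serving certificate a kubelet presents on port 10250.
# Argument: node IP, for example 192.168.174.131 from the log above.
fetch_kubelet_cert() {
  openssl s_client -connect "$1:10250" </dev/null 2>/dev/null \
    | openssl x509 -outform PEM
}
```

A possible invocation is `fetch_kubelet_cert 192.168.174.131 | show_cert_identity`. An issuer of the form `server-03-ca@...` points to the self-signed case, while `CN = kubernetes` together with an `IP Address:` entry in the SAN list points to a CA-signed certificate.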

Solution

Edit the kubelet configuration on every node:

# Edit the kubelet configuration
vim /var/lib/kubelet/config.yaml

Add or modify:

serverTLSBootstrap: true
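
With several nodes, editing the file by hand on each one gets tedious. As a rough sketch of how the change could be scripted, the helper below sets the field idempotently. The function name `enable_server_tls_bootstrap` is hypothetical, it assumes GNU `sed`, and it assumes the key sits at the top level of the YAML file with no indentation.

```shell
# Set serverTLSBootstrap: true in a kubelet config file, idempotently.
# Argument: path to the file (defaults to the path used in this post).
# Assumptions: GNU sed, and the key is a top-level, unindented YAML field.
enable_server_tls_bootstrap() {
  cfg="${1:-/var/lib/kubelet/config.yaml}"
  if grep -q '^serverTLSBootstrap:' "$cfg"; then
    # Key already present: rewrite its value in place.
    sed -i 's/^serverTLSBootstrap:.*/serverTLSBootstrap: true/' "$cfg"
  else
    # Make sure the file ends with a newline before appending.
    if [ -n "$(tail -c1 "$cfg")" ]; then
      echo >> "$cfg"
    fi
    # Key absent: append it as a top-level field.
    printf 'serverTLSBootstrap: true\n' >> "$cfg"
  fi
}
```

Running `enable_server_tls_bootstrap` with no argument targets `/var/lib/kubelet/config.yaml`; running it a second time leaves the file unchanged.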

Restart the kubelet

systemctl restart kubelet

Approve the CSRs on the control-plane node (kubelet serving CSRs are not approved automatically)

# List the CSRs waiting for approval
kubectl get csr

# Approve every pending CSR
kubectl get csr | grep Pending | awk '{print $1}' | xargs kubectl certificate approve
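
The one-liner above approves every pending CSR regardless of who requested it. A narrower variant, sketched here under assumptions, only picks requests addressed to the `kubernetes.io/kubelet-serving` signer. The function name `pending_kubelet_serving_csrs` is hypothetical, and the column positions are an assumption about the `kubectl get csr` table layout (NAME first, SIGNERNAME third, CONDITION last), which may differ on older kubectl versions.

```shell
# Read `kubectl get csr --no-headers` output on stdin and print the names of
# CSRs that are Pending and addressed to the kubelet-serving signer.
# Assumption: NAME is column 1, SIGNERNAME column 3, CONDITION the last column.
pending_kubelet_serving_csrs() {
  awk '$3 == "kubernetes.io/kubelet-serving" && $NF == "Pending" { print $1 }'
}
```

It could be used as `kubectl get csr --no-headers | pending_kubelet_serving_csrs | xargs -r kubectl certificate approve`, where `-r` (GNU xargs) skips the approve call when nothing matches. It is still worth checking that each requestor is a node you expect before approving.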

Result

metrics-server can now scrape node metrics:

root@server-03:~# kubectl top nodes
NAME        CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)   
server-03   108m         5%       1054Mi          27%         
server-04   110m         5%       828Mi           21%         
server-05   98m          4%       907Mi           24%         
server-06   26m          1%       392Mi           10%         
server-07   18m          0%       369Mi           9%          
server-08   20m          1%       367Mi           9% 