Symptom
After upgrading the Kubernetes version of an AWS EKS cluster, commands such as `kubectl logs` or `kubectl exec` fail with the error `remote error: tls: internal error`.
Running `kubectl get csr -A` shows CSRs stuck in the Pending state (one of them is inspected right after the listing below), and all of the affected pods are running on newly created EKS nodes:
```
kubectl get csr -A
NAME        AGE   SIGNERNAME                      REQUESTOR   REQUESTEDDURATION   CONDITION
csr-lxtzg   23h   kubernetes.io/kubelet-serving               <none>              Pending
csr-m2dlr   16h   kubernetes.io/kubelet-serving               <none>              Pending
csr-m66ft   21h   kubernetes.io/kubelet-serving               <none>              Pending
csr-m6nnk   17h   kubernetes.io/kubelet-serving               <none>              Pending
csr-m8j6l   22h   kubernetes.io/kubelet-serving               <none>              Pending
csr-mb4kz   21h   kubernetes.io/kubelet-serving               <none>              Pending
csr-mcw4w   23h   kubernetes.io/kubelet-serving               <none>              Pending
```
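These are kubelet-serving CSRs: until they are approved, the kubelet on each new node has no serving certificate, so the API server's connections to it for logs/exec fail with the TLS error above. To see which node a pending CSR came from, you can describe one of them; a quick check using csr-lxtzg from the listing above:

```sh
# The requesting user shown in the output is the kubelet identity
# (system:node:...) of the node that created the CSR
kubectl describe csr csr-lxtzg

# List only the CSRs that are still pending
kubectl get csr | grep Pending
```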
Cause
Running `kubectl describe cm aws-auth -n kube-system` showed that the aws-auth ConfigMap contained duplicate role entries. Removing the duplicates fixes the problem: you can edit the ConfigMap in place with `kubectl edit cm aws-auth -n kube-system`, or dump it with `kubectl get cm aws-auth -n kube-system -o yaml > aws-auth.yaml`, edit the YAML file, and apply it again (a sketch of this follows the ConfigMap output below).
```yaml
Name:         aws-auth
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>

Data
====
mapRoles:
----
- groups:
  - system:masters
  rolearn: arn:aws:iam::xxxxxx:role/Fed_Account
- groups:
  - system:bootstrappers
  - system:nodes
  rolearn: arn:aws:iam::xxxxxx:role/eks-stg-shared-worker
  username: system:node:{{EC2PrivateDNSName}}
mapUsers:
----
[]

BinaryData
====

Events:  <none>
```
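A minimal sketch of the dump-edit-apply approach mentioned above:

```sh
# Dump the current aws-auth ConfigMap to a file
kubectl get cm aws-auth -n kube-system -o yaml > aws-auth.yaml

# Edit aws-auth.yaml and remove the duplicated mapRoles entries
# (if the same rolearn appears more than once, keep only one copy)

# Apply the cleaned-up ConfigMap back to the cluster
kubectl apply -f aws-auth.yaml
```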
Other notes
This error can also occur if the hostname type of the node subnet is set to the resource name, or if a policy on the role is misconfigured, for example a condition like the following (a CLI check for the subnet setting is sketched further below):
```json
"Condition": {
  "ArnLike": {
    "aws:SourceArn": "arn:aws:eks:region-code:your-account-id:cluster/cluster-name"
  }
}
```
You can try disabling that policy and then restarting the affected pods.
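To rule out the hostname-type cause mentioned above, the subnet setting can be checked (and switched back to IP-based hostnames) with the AWS CLI; a sketch, where subnet-xxxxxx is a placeholder for the node subnet ID:

```sh
# Show the hostname type configured for instances launched in this subnet
aws ec2 describe-subnets --subnet-ids subnet-xxxxxx \
  --query 'Subnets[].PrivateDnsNameOptionsOnLaunch'

# Switch new instances in this subnet back to IP-based hostnames
aws ec2 modify-subnet-attribute --subnet-id subnet-xxxxxx \
  --private-dns-hostname-type-on-launch ip-name
```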
For the CSRs stuck in Pending, you can also try approving them manually:

```sh
kubectl get csr | grep Pending | awk '{print $1}' | xargs -I {} kubectl certificate approve {}
```
However, this had no effect on our EKS cluster.
For more discussion, see the links below:
https://github.com/awslabs/amazon-eks-ami/issues/610
https://stackoverflow.com/questions/71696747/kubectl-exec-logs-on-eks-returns-remote-error-tls-internal-error