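Joining Nodes to an EKS Cluster Across VPCs and Regions in AWS China

This document records the full process of joining nodes to an EKS cluster across VPCs and across regions in AWS China, covering:

  • Joining a node from a different VPC in the same region (Beijing)

  • Joining a node in the Ningxia region to the Beijing EKS cluster across regions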

    Nodes (4):
    ip-10-200-1-254.cn-northwest-1.compute.internal Ready: True ← Ningxia cross-region node ✅
    ip-172-31-18-54.cn-north-1.compute.internal Ready: True ← Beijing cross-VPC node ✅
    ip-192-168-3-84.cn-north-1.compute.internal Ready: True ← Beijing managed node group
    fargate-ip-192-168-46-192... Ready: True ← Fargate node

Same region, cross-VPC

Create the Access Entry:

aws eks create-access-entry \
  --cluster-name clustername \
  --principal-arn arn:aws-cn:iam::xxxxxxxxxxx:role/myEKSNodeRole \
  --type EC2_LINUX \
  --kubernetes-groups system:nodes \
  --username system:node:{{EC2PrivateDNSName}} \
  --region cn-north-1
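
Before moving on to the node itself, it can help to confirm that the entry exists with the expected type. The following is a minimal sketch, assuming the AWS CLI is configured with permission to call describe-access-entry; the function name access_entry_type is hypothetical and not part of the original procedure.

```shell
# Hypothetical helper: print the type of an existing access entry.
# Arguments: cluster name, principal ARN, region.
access_entry_type() {
  aws eks describe-access-entry \
    --cluster-name "$1" \
    --principal-arn "$2" \
    --region "$3" \
    --query 'accessEntry.type' \
    --output text
}
```

Calling access_entry_type with clustername, the myEKSNodeRole ARN, and cn-north-1 should print EC2_LINUX for the entry created above.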

nodeadm configuration for the AL2023 AMI

apiVersion: node.eks.aws/v1alpha1 
kind: NodeConfig
spec:
  cluster:
    name: clustername
    apiServerEndpoint: https://F4B05xxxxxxxF79A8CA3EE4.yl4.cn-north-1.eks.amazonaws.com.cn
    certificateAuthority: LS0tLS1CRUdJTi...
    cidr: 10.100.0.0/16
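
The four values in this configuration can be read from aws eks describe-cluster (the cluster.endpoint, cluster.certificateAuthority.data, and cluster.kubernetesNetworkConfig.serviceIpv4Cidr fields). As a rough sketch of how the file could be generated rather than typed by hand, the hypothetical helper below renders the same NodeConfig from its arguments; the function name and argument order are assumptions made for illustration.

```shell
# Hypothetical helper: print a NodeConfig document for nodeadm.
# Arguments: cluster name, API server endpoint, base64 CA data, service CIDR.
render_node_config() {
  cat <<EOF
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: $1
    apiServerEndpoint: $2
    certificateAuthority: $3
    cidr: $4
EOF
}
```

Redirecting the output to a file gives a configuration with the same shape as the one shown above, which keeps the cross-VPC and cross-region nodes on identical settings.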

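The Beijing cross-VPC node in the same region joined the cluster successfully, with status Ready: True.

Cross-region

When the Ningxia node tried to join the Beijing EKS cluster, the kubelet log showed: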

E0522 15:30:00.123456 12345 round_trippers.go:553] 
Request: POST https://F4B054AExxxxxx79A8CA3EE4.yl4.cn-north-1.eks.amazonaws.com.cn/api/v1/nodes
Response: 401 Unauthorized
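
To pull lines like these out of a busy kubelet log, a small filter is enough. This is a sketch only: the function name auth_failures is hypothetical, and the pattern simply matches the 401/403 wording seen in the excerpt above.

```shell
# Hypothetical helper: keep only log lines that mention an HTTP 401 or 403
# response. Reads kubelet log text on stdin.
auth_failures() {
  grep -E 'Response: (401|403)|Unauthorized|Forbidden'
}
```

On the node this could be used as journalctl -u kubelet --no-pager | auth_failures, assuming kubelet runs as the systemd unit named kubelet.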

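A 401 Unauthorized response means authentication failed. This was puzzling, because:

  1. The same IAM role, myEKSNodeRole, works on the Beijing node
  2. VPC Peering is in place and the EKS API endpoint responds to ping
  3. The STS presigned URL resolves to the correct IAM identity

Initial hypotheses:

  1. Network unreachable → ruled out: reachability over VPC Peering was verified
  2. Token generation problem → ruled out: the same role generates a valid token in Beijing
  3. IAM permission problem → ruled out: the same role works on the Beijing node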
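
Since ping only exercises ICMP, the first hypothesis can be ruled out more firmly by checking that an HTTPS request actually gets a response from the endpoint. A minimal sketch, assuming curl is available on the node; the function name endpoint_reachable and the /version path are illustrative choices, and any HTTP status (including 401) counts as reachable here.

```shell
# Hypothetical helper: succeed if an HTTPS request to the endpoint gets any
# HTTP response at all. curl prints 000 when no response was received.
endpoint_reachable() {
  code=$(curl -sk --max-time 5 -o /dev/null -w '%{http_code}' "$1/version") || true
  [ "$code" != "000" ]
}
```

For example, endpoint_reachable https://EKS-ENDPOINT && echo reachable confirms that TCP 443 and TLS work across the peering connection, independent of whether authentication succeeds.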

Generate a token on each of the two nodes and test it:

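# Run on each node (Beijing and Ningxia) with the node's instance-role credentials
TOKEN=$(aws eks get-token --cluster-name clustername --region cn-north-1 \
  --query status.token --output text)

# Check the token against STS: it is "k8s-aws-v1." + base64url(presigned GetCallerIdentity URL)
B64=$(printf '%s' "${TOKEN#k8s-aws-v1.}" | tr '_-' '/+')
while [ $(( ${#B64} % 4 )) -ne 0 ]; do B64="${B64}="; done  # restore stripped padding
curl -s -H "x-k8s-aws-id: clustername" "$(printf '%s' "$B64" | base64 -d)"
# Both tokens resolve to the correct IAM identity

# Test the EKS API (-k skips CA verification for this quick check)
curl -sk -o /dev/null -w '%{http_code}\n' -H "Authorization: Bearer $TOKEN" \
  "https://EKS-ENDPOINT/api/v1/nodes"
# Beijing token: 403 Forbidden (authenticated; authorization depends on the role)
# Ningxia token: 401 Unauthorized (authentication failed)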

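STS accepts the Ningxia token and resolves the correct IAM identity, so authentication at the IAM level succeeds. The EKS API nevertheless returns 401, which points to the EKS authentication webhook.

The Beijing token gets 403 Forbidden (authenticated, just not authorized), while the Ningxia token gets 401 Unauthorized (not authenticated). EKS therefore handles the two tokens differently.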
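
The distinction between the two status codes drives the rest of the diagnosis, so it is worth spelling out. The helper below is only an illustration of that reading; the function name classify_status is hypothetical.

```shell
# Hypothetical helper: interpret the HTTP status returned by the API server.
classify_status() {
  case "$1" in
    401) echo "authentication failed" ;;
    403) echo "authenticated, not authorized" ;;
    2??) echo "authenticated and authorized" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}
```

With this reading, a 403 means the identity was recognized and only RBAC stands in the way, whereas a 401 means the request never got past authentication.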

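Based on this difference, we suspect that authentication for an EC2_LINUX access entry works as follows:

  1. First, the IAM signature is verified (this step passed, since STS validation succeeded)
  2. Next, the username template {{EC2PrivateDNSName}} is resolved
  3. To resolve the template, EKS has to call the DescribeInstances API in the cluster's own region to look up the instance
  4. The Ningxia instance ID i-07aexxxxxxxe2a does not exist in EC2 in the Beijing region, so authentication fails

This would explain why:

  • STS validation succeeds (STS only checks the signature)
  • The EKS webhook rejects the request (the webhook also checks that the instance exists)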
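
The suspected behaviour can be modelled in a few lines. This is purely an illustration of the hypothesis above, not EKS's actual implementation: both function names are hypothetical, and the region is read from the private DNS name only because the node names in this article carry it.

```shell
# Extract the region label from an EC2 private DNS name such as
# ip-10-200-1-254.cn-northwest-1.compute.internal
region_of_private_dns() {
  printf '%s\n' "$1" | awk -F. '{ print $2 }'
}

# Model of the suspected check: the instance lookup only succeeds when the
# instance lives in the same region as the cluster.
# Arguments: node private DNS name, cluster region.
lookup_would_succeed() {
  [ "$(region_of_private_dns "$1")" = "$2" ]
}
```

Under this model, lookup_would_succeed ip-172-31-18-54.cn-north-1.compute.internal cn-north-1 succeeds, while the Ningxia node name fails against cn-north-1, matching the observed 403 versus 401.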

Delete the existing Access Entry

aws eks delete-access-entry \
  --cluster-name clustername \
  --principal-arn arn:aws-cn:iam::xxxxxxxxxxx:role/myEKSNodeRole \
  --region cn-north-1

Create a STANDARD-type Access Entry

aws eks create-access-entry \
  --cluster-name clustername \
  --principal-arn arn:aws-cn:iam::xxxxxxxxxxx:role/myEKSNodeRole \
  --type STANDARD \
  --username "eks-cross-region-node" \
  --region cn-north-1

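A STANDARD-type access entry only verifies the IAM signature and performs no EC2 instance validation, which is exactly what the cross-region scenario needs.

However, the STANDARD type has one restriction: kubernetes-groups with the system: prefix (such as system:nodes) cannot be used. We therefore cannot rely on the built-in node permissions directly and have to create RBAC manually.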
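
Because this restriction only surfaces when the API call is rejected, a small pre-check can catch it earlier in a script. A minimal sketch: the function name groups_allowed_for_standard is hypothetical, and it encodes only the system: prefix rule described here, not every rule EKS may enforce.

```shell
# Hypothetical helper: succeed only if none of the given group names starts
# with "system:". Prints the first offending name to stderr.
groups_allowed_for_standard() {
  for g in "$@"; do
    case "$g" in
      system:*) echo "rejected: $g" >&2; return 1 ;;
    esac
  done
  return 0
}
```

For example, groups_allowed_for_standard system:nodes fails, which is why the entry above is created with a plain username and no groups.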

Create custom RBAC, modeled on the permissions of the system:node ClusterRole but trimmed down where appropriate.

# Get an admin token
TOKEN=$(aws eks get-token --cluster-name clustername --region cn-north-1 --query status.token --output text)

# Create the ClusterRole
curl -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  "https://EKS-ENDPOINT/apis/rbac.authorization.k8s.io/v1/clusterroles" -d '{
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "cross-region-node-role"},
    "rules": [
      {"apiGroups": [""], "resources": ["nodes"], "verbs": ["get","list","watch","create","update","patch","delete"]},
      {"apiGroups": [""], "resources": ["nodes/status"], "verbs": ["patch","update"]},
      {"apiGroups": [""], "resources": ["pods"], "verbs": ["get","list","watch","delete"]},
      {"apiGroups": [""], "resources": ["pods/status"], "verbs": ["patch","update"]},
      {"apiGroups": [""], "resources": ["events"], "verbs": ["create","patch","update"]},
      {"apiGroups": [""], "resources": ["configmaps","secrets","endpoints","services"], "verbs": ["get","list","watch"]},
      {"apiGroups": [""], "resources": ["serviceaccounts/token"], "verbs": ["create"]},
      {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"], "verbs": ["get","list","watch","create","update","patch","delete"]},
      {"apiGroups": ["storage.k8s.io"], "resources": ["csinodes","csidrivers","storageclasses","volumeattachments"], "verbs": ["get","list","watch","create","update","patch"]},
      {"apiGroups": ["certificates.k8s.io"], "resources": ["certificatesigningrequests"], "verbs": ["get","list","watch","create"]},
      {"apiGroups": ["authentication.k8s.io"], "resources": ["tokenreviews"], "verbs": ["create"]},
      {"apiGroups": ["authorization.k8s.io"], "resources": ["subjectaccessreviews"], "verbs": ["create"]}
    ]
  }'

# Create the ClusterRoleBinding
curl -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  "https://EKS-ENDPOINT/apis/rbac.authorization.k8s.io/v1/clusterrolebindings" -d '{
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRoleBinding",
    "metadata": {"name": "cross-region-node-binding"},
    "roleRef": {
      "apiGroup": "rbac.authorization.k8s.io",
      "kind": "ClusterRole",
      "name": "cross-region-node-role"
    },
    "subjects": [{
      "kind": "User",
      "name": "eks-cross-region-node",
      "apiGroup": "rbac.authorization.k8s.io"
    }]
  }'
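
Before restarting kubelet, the binding can be checked by impersonating the mapped user. This is a sketch under the assumption that kubectl is configured with an admin identity allowed to impersonate users; the function name node_user_can is hypothetical.

```shell
# Hypothetical helper: ask the API server whether the mapped user may perform
# a verb on a resource. Arguments: verb, resource.
node_user_can() {
  kubectl auth can-i "$1" "$2" --as eks-cross-region-node
}
```

node_user_can create nodes and node_user_can patch nodes/status should both print yes once the ClusterRole and ClusterRoleBinding above are in place.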

Restart kubelet on the node

systemctl restart kubelet
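
After the restart, registration can take a moment. Rather than polling by hand, kubectl wait can block until the node reports Ready. A minimal sketch, run from a machine with admin access to the cluster; the function name wait_for_node_ready and the 120s default timeout are assumptions.

```shell
# Hypothetical helper: block until the named node reports Ready.
# Arguments: node name, optional timeout (defaults to 120s).
wait_for_node_ready() {
  kubectl wait --for=condition=Ready "node/$1" --timeout="${2:-120s}"
}
```

For the Ningxia node in this article, that would be wait_for_node_ready ip-10-200-1-254.cn-northwest-1.compute.internal.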

Verify the result

$ kubectl get nodes
NAME                                               STATUS   ROLES    AGE   VERSION
ip-10-200-1-254.cn-northwest-1.compute.internal    Ready    <none>   1m    v1.28.x 
ip-172-31-18-54.cn-north-1.compute.internal        Ready    <none>   1h    v1.28.x
ip-192-168-3-84.cn-north-1.compute.internal        Ready    <none>   2d    v1.28.x
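
To see at a glance that the cluster now spans both regions, the node names can be grouped by the region label they carry. A small sketch; the function name nodes_per_region is hypothetical, and it relies on node names following the ip-….<region>.compute.internal pattern shown above.

```shell
# Hypothetical helper: read `kubectl get nodes` output on stdin and print
# "<region> <count>" for each region found in the node names.
nodes_per_region() {
  awk 'NR > 1 { split($1, p, "."); n[p[2]]++ }
       END { for (r in n) print r, n[r] }' | LC_ALL=C sort
}
```

It is meant to be used as kubectl get nodes | nodes_per_region, which prints one line per region with the number of nodes registered from it.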