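Joining Nodes to an EKS Cluster Across VPCs and Across Regions in the AWS China Regions

This document records the full process of joining nodes to an EKS cluster across VPCs and across regions in the AWS China regions, covering:

  • Joining a node from a different VPC in the same region (Beijing)

  • Joining a node in the Ningxia region to an EKS cluster in the Beijing region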

    Nodes (4):
    ip-10-200-1-254.cn-northwest-1.compute.internal Ready: True ← Ningxia cross-region node ✅
    ip-172-31-18-54.cn-north-1.compute.internal Ready: True ← Beijing cross-VPC node ✅
    ip-192-168-3-84.cn-north-1.compute.internal Ready: True ← Beijing managed node group
    fargate-ip-192-168-46-192... Ready: True ← Fargate node

Same region, cross-VPC

Create the Access Entry:

# For EC2_LINUX entries, EKS assigns the group system:nodes and the username
# system:node:{{EC2PrivateDNSName}} automatically, so neither is passed here.
aws eks create-access-entry \
  --cluster-name clustername \
  --principal-arn arn:aws-cn:iam::xxxxxxxxxxx:role/myEKSNodeRole \
  --type EC2_LINUX \
  --region cn-north-1

nodeadm configuration for the AL2023 AMI

apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: clustername
    apiServerEndpoint: https://F4B05xxxxxxxF79A8CA3EE4.yl4.cn-north-1.eks.amazonaws.com.cn
    certificateAuthority: LS0tLS1CRUdJTi...
    cidr: 10.100.0.0/16
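
The three cluster-specific values in this NodeConfig (endpoint, CA, and service CIDR) can be read from the cluster instead of being copied by hand. The following is a minimal sketch, assuming the AWS CLI is installed and allowed to call eks:DescribeCluster. The helper names describe_cluster_field and render_node_config are hypothetical and are not part of nodeadm or the AWS CLI.

```shell
#!/bin/sh
# Hypothetical helper: read one field of the DescribeCluster response as text.
# Arguments: cluster name, region, JMESPath query.
describe_cluster_field() {
  aws eks describe-cluster --name "$1" --region "$2" \
    --query "$3" --output text
}

# Hypothetical helper: print a nodeadm NodeConfig for the given cluster.
# Arguments: cluster name, region.
render_node_config() {
  cluster="$1"
  region="$2"
  endpoint=$(describe_cluster_field "$cluster" "$region" cluster.endpoint)
  ca=$(describe_cluster_field "$cluster" "$region" cluster.certificateAuthority.data)
  cidr=$(describe_cluster_field "$cluster" "$region" cluster.kubernetesNetworkConfig.serviceIpv4Cidr)
  cat <<EOF
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: $cluster
    apiServerEndpoint: $endpoint
    certificateAuthority: $ca
    cidr: $cidr
EOF
}

# Usage (on a machine with AWS credentials):
#   render_node_config clustername cn-north-1 > nodeconfig.yaml
```

Generating the file this way avoids a mistyped endpoint or a truncated CA, either of which would stop the kubelet from connecting to the API server.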

The cross-VPC node in the Beijing region joins the cluster successfully and reports Ready: True.

Cross-region

When a node in the Ningxia region tried to join the EKS cluster in Beijing, the kubelet log showed:

E0522 15:30:00.123456 12345 round_trippers.go:553] 
Request: POST https://F4B054AExxxxxx79A8CA3EE4.yl4.cn-north-1.eks.amazonaws.com.cn/api/v1/nodes
Response: 401 Unauthorized

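401 Unauthorized means authentication failed. This is odd, because:

  1. The same IAM role, myEKSNodeRole, works correctly on the Beijing node
  2. VPC peering is in place, and the EKS API endpoint is reachable
  3. The presigned STS URL resolves to the correct IAM identity

Initial hypotheses, each already ruled out:

  1. Network unreachable → verified reachable over VPC peering
  2. Token generation problem → the same role generates a valid token in Beijing
  3. IAM permission problem → the same role works correctly on the Beijing node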

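Generate a token on each node and test it:

# On the Beijing node: generate an EKS bearer token from the instance role
TOKEN_BEIJING=$(aws eks get-token --cluster-name clustername --region cn-north-1 \
  --query status.token --output text)

# On the Ningxia node: same command
TOKEN_NINGXIA=$(aws eks get-token --cluster-name clustername --region cn-north-1 \
  --query status.token --output text)

# Test against STS: the token is "k8s-aws-v1." plus a base64url-encoded
# presigned GetCallerIdentity URL, so decode it and call the URL directly
B64=$(printf '%s' "${TOKEN_NINGXIA#k8s-aws-v1.}" | tr '_-' '/+')
while [ $(( ${#B64} % 4 )) -ne 0 ]; do B64="${B64}="; done
curl -s -H "x-k8s-aws-id: clustername" "$(printf '%s' "$B64" | base64 -d)"
# Both tokens return the correct IAM identity

# Test against the EKS API (ca.crt is the base64-decoded certificateAuthority value)
curl -s --cacert ca.crt -H "Authorization: Bearer $TOKEN_NINGXIA" \
  "https://EKS-ENDPOINT/api/v1/nodes"
# Beijing token: 403 Forbidden (authenticated; authorization depends on the role)
# Ningxia token: 401 Unauthorized (authentication failed)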

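STS accepts the Ningxia token and resolves the correct IAM identity, so authentication at the IAM layer succeeds. The EKS API still returns 401, which points to the EKS authentication webhook.

The Beijing token gets 403 Forbidden (authenticated, but not authorized), while the Ningxia token gets 401 Unauthorized (authentication failed). EKS therefore handles the two tokens differently.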
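
The 401 versus 403 distinction is the key diagnostic signal here, so it can help to reduce each probe to its status code. This is a small sketch; the function names classify_status and probe_status are hypothetical, and the endpoint, token, and CA file are supplied by the caller.

```shell
#!/bin/sh
# Map an API server HTTP status code to the stage it implicates.
classify_status() {
  case "$1" in
    401) echo "authentication failed: the token was rejected" ;;
    403) echo "authenticated, but not authorized for this request" ;;
    2??) echo "authenticated and authorized" ;;
    *)   echo "inconclusive: status $1" ;;
  esac
}

# Print only the HTTP status code the API server returns for a bearer token.
# Arguments: endpoint URL, bearer token, path to the cluster CA file.
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' --cacert "$3" \
    -H "Authorization: Bearer $2" "$1/api/v1/nodes"
}

# Usage:
#   classify_status "$(probe_status https://EKS-ENDPOINT "$TOKEN_NINGXIA" ca.crt)"
```

Running the same probe with both tokens makes the difference explicit: a 403 means the request got past authentication, whereas a 401 means it never did.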

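Based on this difference, we hypothesize that authentication for an EC2_LINUX access entry works as follows:

  1. First verify the IAM signature (this step passes, since the STS check succeeds)
  2. Then resolve the username template {{EC2PrivateDNSName}}
  3. To resolve the template, EKS calls the DescribeInstances API in the cluster's region to look up the instance
  4. The Ningxia instance ID i-07aexxxxxxxe2a does not exist in EC2 in the Beijing region, so authentication fails

This would explain why:

  • The STS check succeeds (STS only checks the signature)
  • The EKS webhook rejects the request (the webhook also checks that the instance exists)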
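
If the hypothesis above holds, whether an EC2_LINUX entry can work for a node depends only on whether the node's region matches the cluster's region. The sketch below illustrates that prediction using the two node names from this article. It assumes private DNS names of the form ip-a-b-c-d.<region>.compute.internal, and the function names are hypothetical. It illustrates the hypothesis and does not confirm how EKS behaves internally.

```shell
#!/bin/sh
# Extract the region label from a private DNS name of the form
# ip-a-b-c-d.<region>.compute.internal (the form shown in this article).
region_from_private_dns() {
  printf '%s\n' "$1" | cut -d. -f2
}

# Return 0 when the node's region equals the cluster's region.
same_region_as_cluster() {
  [ "$(region_from_private_dns "$1")" = "$2" ]
}

# Predict whether the hypothesized instance lookup would find each node.
for node in ip-172-31-18-54.cn-north-1.compute.internal \
            ip-10-200-1-254.cn-northwest-1.compute.internal; do
  if same_region_as_cluster "$node" cn-north-1; then
    echo "$node: a lookup in cn-north-1 should find the instance"
  else
    echo "$node: instance is in $(region_from_private_dns "$node"), a lookup in cn-north-1 would miss it"
  fi
done
```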

Delete the existing Access Entry

aws eks delete-access-entry \
  --cluster-name clustername \
  --principal-arn arn:aws-cn:iam::xxxxxxxxxxx:role/myEKSNodeRole \
  --region cn-north-1

Create a STANDARD Access Entry

aws eks create-access-entry \
  --cluster-name clustername \
  --principal-arn arn:aws-cn:iam::xxxxxxxxxxx:role/myEKSNodeRole \
  --type STANDARD \
  --username "eks-cross-region-node" \
  --region cn-north-1

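A STANDARD access entry only verifies the IAM signature and performs no EC2 instance validation, which is exactly what the cross-region case needs.

STANDARD entries do have one restriction: kubernetes-groups with the system: prefix (such as system:nodes) cannot be used. The built-in node permissions are therefore not directly available, and RBAC has to be created by hand.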
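
Before moving on to RBAC, it may be worth confirming that the recreated entry really has type STANDARD. This is a minimal sketch using aws eks describe-access-entry; the function names access_entry_type and require_standard_entry are hypothetical.

```shell
#!/bin/sh
# Print the type of an access entry, e.g. STANDARD or EC2_LINUX.
# Arguments: cluster name, principal ARN, region.
access_entry_type() {
  aws eks describe-access-entry --cluster-name "$1" --principal-arn "$2" \
    --region "$3" --query accessEntry.type --output text
}

# Succeed only when the access entry is of type STANDARD.
require_standard_entry() {
  entry_type=$(access_entry_type "$1" "$2" "$3") || return 1
  if [ "$entry_type" = "STANDARD" ]; then
    echo "ok: access entry is STANDARD"
  else
    echo "unexpected access entry type: $entry_type" >&2
    return 1
  fi
}

# Usage:
#   require_standard_entry clustername \
#     arn:aws-cn:iam::xxxxxxxxxxx:role/myEKSNodeRole cn-north-1
```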

Create custom RBAC, modeled on the permissions of the system:node ClusterRole but trimmed down where appropriate.

# Get an admin token
TOKEN=$(aws eks get-token --cluster-name clustername --region cn-north-1 --query status.token --output text)

# Fetch the cluster CA so curl can verify the API server certificate
aws eks describe-cluster --name clustername --region cn-north-1 \
  --query cluster.certificateAuthority.data --output text | base64 -d > ca.crt

# Create the ClusterRole
curl --cacert ca.crt -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  "https://EKS-ENDPOINT/apis/rbac.authorization.k8s.io/v1/clusterroles" -d '{
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "cross-region-node-role"},
    "rules": [
      {"apiGroups": [""], "resources": ["nodes"], "verbs": ["get","list","watch","create","update","patch","delete"]},
      {"apiGroups": [""], "resources": ["nodes/status"], "verbs": ["patch","update"]},
      {"apiGroups": [""], "resources": ["pods"], "verbs": ["get","list","watch","delete"]},
      {"apiGroups": [""], "resources": ["pods/status"], "verbs": ["patch","update"]},
      {"apiGroups": [""], "resources": ["events"], "verbs": ["create","patch","update"]},
      {"apiGroups": [""], "resources": ["configmaps","secrets","endpoints","services"], "verbs": ["get","list","watch"]},
      {"apiGroups": [""], "resources": ["serviceaccounts/token"], "verbs": ["create"]},
      {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"], "verbs": ["get","list","watch","create","update","patch","delete"]},
      {"apiGroups": ["storage.k8s.io"], "resources": ["csinodes","csidrivers","storageclasses","volumeattachments"], "verbs": ["get","list","watch","create","update","patch"]},
      {"apiGroups": ["certificates.k8s.io"], "resources": ["certificatesigningrequests"], "verbs": ["get","list","watch","create"]},
      {"apiGroups": ["authentication.k8s.io"], "resources": ["tokenreviews"], "verbs": ["create"]},
      {"apiGroups": ["authorization.k8s.io"], "resources": ["subjectaccessreviews"], "verbs": ["create"]}
    ]
  }'

# Create the ClusterRoleBinding for the username set on the access entry
curl --cacert ca.crt -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  "https://EKS-ENDPOINT/apis/rbac.authorization.k8s.io/v1/clusterrolebindings" -d '{
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRoleBinding",
    "metadata": {"name": "cross-region-node-binding"},
    "roleRef": {
      "apiGroup": "rbac.authorization.k8s.io",
      "kind": "ClusterRole",
      "name": "cross-region-node-role"
    },
    "subjects": [{
      "kind": "User",
      "name": "eks-cross-region-node",
      "apiGroup": "rbac.authorization.k8s.io"
    }]
  }'
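
Once the role and binding exist, the grants can be spot-checked before touching the node. The sketch below impersonates the mapped username with kubectl auth can-i; it assumes kubectl is configured with an identity that is allowed to impersonate users, and the function name check_node_rbac is hypothetical. It samples only a few of the rules rather than verifying all of them.

```shell
#!/bin/sh
# Spot-check several verbs granted to the mapped user by impersonating it.
# Argument: the Kubernetes username set on the access entry.
check_node_rbac() {
  user="$1"
  failed=0
  for pair in "create nodes" "get pods" "create events" \
              "get secrets" "update leases.coordination.k8s.io"; do
    # $pair is intentionally unquoted so it splits into verb and resource.
    # kubectl exits non-zero on "no", so the exit status is ignored here.
    answer=$(kubectl auth can-i $pair --as "$user" 2>/dev/null) || true
    echo "$pair: ${answer:-no}"
    [ "$answer" = "yes" ] || failed=1
  done
  return "$failed"
}

# Usage:
#   check_node_rbac eks-cross-region-node
```

A "no" for any pair usually means the ClusterRoleBinding subject does not match the username on the access entry.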

Restart kubelet on the node

systemctl restart kubelet

Verify the result

$ kubectl get nodes
NAME                                               STATUS   ROLES    AGE   VERSION
ip-10-200-1-254.cn-northwest-1.compute.internal    Ready    <none>   1m    v1.28.x
ip-172-31-18-54.cn-north-1.compute.internal        Ready    <none>   1h    v1.28.x
ip-192-168-3-84.cn-north-1.compute.internal        Ready    <none>   2d    v1.28.x