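Amazon ECS Managed Instances is a managed infrastructure option in which Amazon ECS automatically manages the EC2 instances that run container workloads, handling capacity management, operating system maintenance, and security patching. With ECS Managed Instances, users can choose which instance types to use, including bare metal and GPU instances when the workload calls for a specific type, and multiple tasks can be packed onto the same instance.
Key differences from Fargate:
- Selectable instance types (GPU, bare metal, network-optimized, and so on)
- Multiple tasks can be packed onto the same instance, which lowers cost
- Privileged containers and the full set of Linux capabilities are supported
- Security patches are applied automatically every 14 days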
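Roles and permissions for ECS Managed Instances
ECS Managed Instances requires two IAM roles:
Infrastructure role (used by the ECS service)
- Role name: ecsInfrastructureRoleForManagedInstances
- Trusted entity: ecs.amazonaws.com
- Attached policy: AmazonECSInfrastructureRolePolicyForManagedInstances
Instance role (used by the EC2 instances)
- Role name: ECSManagedInstancesRole
- Trusted entity: ec2.amazonaws.com
- Attached policy: AmazonECSInstanceRolePolicyForManagedInstances
Testing surfaced a PassRole permission problem: the PassRole statement in the AWS managed policy AmazonECSInfrastructureRolePolicyForManagedInstances only allows passing roles whose names match ecsInstanceRole*: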
```json
{
  "Sid": "PassInstanceRoleForManagedInstances",
  "Effect": "Allow",
  "Action": ["iam:PassRole"],
  "Resource": ["arn:aws-cn:iam::*:role/ecsInstanceRole*"],
  "Condition": {
    "StringLike": {
      "iam:PassedToService": "ec2.*"
    }
  }
}
```
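If the instance role name does not start with ecsInstanceRole (as with ECSManagedInstancesRole in this example), an error appears in the service events, and the same error shows up in the task's launch failure reason.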
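The mismatch can be reproduced locally. The sketch below approximates the `*` wildcard in the managed policy's Resource pattern with a shell glob. This is an assumption made for illustration: IAM has its own matching rules, but for a simple prefix pattern like this one the outcome should be the same. The function name `matches_passrole_pattern` and the sample role names other than ECSManagedInstancesRole are hypothetical.

```bash
#!/usr/bin/env bash
# Approximate the managed policy's Resource pattern "role/ecsInstanceRole*"
# with a shell glob. Illustration only; this is not the IAM evaluation engine.
matches_passrole_pattern() {
  local role_name="$1"
  case "$role_name" in
    ecsInstanceRole*) return 0 ;;  # name starts with the allowed prefix
    *)                return 1 ;;  # anything else is not covered
  esac
}

# Compare two names that carry the prefix with the role used in this walkthrough.
for name in ecsInstanceRole ecsInstanceRoleForManagedInstances ECSManagedInstancesRole; do
  if matches_passrole_pattern "$name"; then
    echo "$name: covered by the managed policy"
  else
    echo "$name: NOT covered, needs an extra iam:PassRole statement"
  fi
done
```

Naming the instance role with the ecsInstanceRole prefix from the start would avoid the extra statement altogether.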

To fix this, add an inline policy to the infrastructure role:
```bash
aws iam put-role-policy \
  --role-name ecsInfrastructureRoleForManagedInstances \
  --policy-name PassRoleForManagedInstances \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws-cn:iam::<ACCOUNT_ID>:role/<YOUR_INSTANCE_ROLE_NAME>"
    }]
  }'
```
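One way to confirm that the inline policy took effect is the IAM policy simulator. The following is a minimal sketch, assuming the caller is allowed to call iam:SimulatePrincipalPolicy and that the role names match the ones used above; the helper name `can_pass_instance_role` is hypothetical. The simulation here supplies no iam:PassedToService context, so it mainly exercises the unconditioned inline statement.

```bash
#!/usr/bin/env bash
# Ask the IAM policy simulator whether the infrastructure role may pass
# the given instance role. Prints the decision, for example "allowed" or
# "implicitDeny".
can_pass_instance_role() {
  local account_id="$1" instance_role="$2"
  aws iam simulate-principal-policy \
    --policy-source-arn "arn:aws-cn:iam::${account_id}:role/ecsInfrastructureRoleForManagedInstances" \
    --action-names iam:PassRole \
    --resource-arns "arn:aws-cn:iam::${account_id}:role/${instance_role}" \
    --query 'EvaluationResults[0].EvalDecision' \
    --output text
}

# Usage (replace the placeholder with a real account ID):
#   can_pass_instance_role <ACCOUNT_ID> ECSManagedInstancesRole
```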
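Deploying the httpinfo service
Note that tasks from a previously existing task definition that is only compatible with EC2 will not be scheduled onto managed instances.

The compatibilities field in the task definition is shown below.

A task definition created with the following command automatically lists both EC2 and MANAGED_INSTANCES in its compatibilities after registration: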
```bash
aws ecs register-task-definition \
  --family httpinfo \
  --network-mode awsvpc \
  --requires-compatibilities EC2 \
  --cpu 256 --memory 512 \
  --execution-role-arn arn:aws-cn:iam::<ACCOUNT_ID>:role/ecsTaskExecutionRole \
  --container-definitions '[{
    "name": "httpinfo",
    "image": "<ACCOUNT_ID>.dkr.ecr.cn-north-1.amazonaws.com.cn/nginx:latest",
    "essential": true,
    "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/httpinfo",
        "awslogs-region": "cn-north-1",
        "awslogs-stream-prefix": "httpinfo"
      }
    }
  }]'
```
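Whether a registered revision can actually land on managed instances can be checked from the CLI instead of the console. A minimal sketch, assuming the default region and credentials are already configured; the helper name `supports_managed_instances` is hypothetical.

```bash
#!/usr/bin/env bash
# Succeeds if the latest ACTIVE revision of the given family lists
# MANAGED_INSTANCES among its compatibilities.
supports_managed_instances() {
  local family="$1"
  # --output text prints the list on one tab-separated line;
  # split it into one value per line before matching exactly.
  aws ecs describe-task-definition \
    --task-definition "$family" \
    --query 'taskDefinition.compatibilities' \
    --output text | tr '\t' '\n' | grep -qx 'MANAGED_INSTANCES'
}

# Usage:
#   supports_managed_instances httpinfo && echo "can run on managed instances"
```

If the check fails for an older task definition, registering a new revision is the approach this walkthrough takes.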
Create the service. The corresponding console configuration is as follows.

The equivalent CLI command:
```bash
aws ecs create-service \
  --cluster worktest \
  --service-name httpinfo-svc \
  --task-definition httpinfo \
  --desired-count 1 \
  --scheduling-strategy REPLICA \
  --capacity-provider-strategy '[{"capacityProvider":"managed_ng1","weight":1,"base":1}]' \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-07xxx302a37","subnet-02xxxxxxacdd"]
    }
  }'
```
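After the service is created, its events are the first place where problems such as the PassRole error described earlier show up. The sketch below reads recent events and waits for the service to stabilize; the helper names `recent_service_events` and `wait_until_stable` are hypothetical, and the cluster and service names are the ones used in this walkthrough.

```bash
#!/usr/bin/env bash
# Print up to COUNT service events, one per line, as "timestamp<TAB>message".
recent_service_events() {
  local cluster="$1" service="$2" count="${3:-5}"
  aws ecs describe-services \
    --cluster "$cluster" \
    --services "$service" \
    --query "services[0].events[:${count}].[createdAt,message]" \
    --output text
}

# Block until the service reaches a steady state (or the waiter times out).
wait_until_stable() {
  local cluster="$1" service="$2"
  aws ecs wait services-stable --cluster "$cluster" --services "$service"
}

# Usage:
#   recent_service_events worktest httpinfo-svc
#   wait_until_stable worktest httpinfo-svc
```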
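Once the instance has launched, it looks as follows.

The instance's tags show that it is a managed instance, and it is not possible to log in to the instance through SSH or SSM.

The EC2 instances behind ECS Managed Instances are launched by the ECS service directly through EC2 Fleet, not through an Auto Scaling group. The complete launch chain can be read from the instance's tags and attributes: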
| Attribute | Value |
|---|---|
| Launch method | EC2 Fleet (fleet-1c847fxxxxcbc950) |
| Launch Template | lt-00a1xxx27008 (v2) |
| Instance type | c5a.large (2 vCPU / 4 GiB, selected automatically by ECS) |
| AMI | ami-0bxxxxae428 (ECS-optimized AMI) |
| Instance role | ECSManagedInstancesRole |
| Operator | Managed: true, Principal: ecs.amazonaws.com |
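The same attributes can be pulled with a single describe-instances call. This is a sketch under a few assumptions: the AWS CLI version in use is recent enough to return the Operator field, and the fleet and launch template information is exposed through tags whose keys start with aws:ec2. The helper name `describe_managed_instance` is hypothetical.

```bash
#!/usr/bin/env bash
# Summarize how a managed instance was launched: type, AMI, instance
# profile, operator, and the aws:ec2* tags set at launch.
describe_managed_instance() {
  local instance_id="$1"
  aws ec2 describe-instances \
    --instance-ids "$instance_id" \
    --query "Reservations[0].Instances[0].{InstanceType:InstanceType,Ami:ImageId,InstanceProfile:IamInstanceProfile.Arn,Operator:Operator,LaunchTags:Tags[?starts_with(Key, 'aws:ec2')]}" \
    --output json
}

# Usage (instance ID is a placeholder):
#   describe_managed_instance <INSTANCE_ID>
```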
Deploying the netshoot privileged container
Task definition (privileged mode plus ECS Exec)
```bash
aws ecs register-task-definition \
  --family netshoot \
  --network-mode awsvpc \
  --requires-compatibilities EC2 \
  --cpu 256 --memory 512 \
  --execution-role-arn arn:aws-cn:iam::<ACCOUNT_ID>:role/ecsTaskExecutionRole \
  --task-role-arn arn:aws-cn:iam::<ACCOUNT_ID>:role/ecsTaskRole \
  --container-definitions '[{
    "name": "netshoot",
    "image": "<ACCOUNT_ID>.dkr.ecr.cn-north-1.amazonaws.com.cn/netshoot:latest",
    "essential": true,
    "command": ["sleep","3600"],
    "privileged": true,
    "linuxParameters": {
      "capabilities": {
        "add": ["NET_ADMIN","SYS_ADMIN","NET_RAW","SYS_PTRACE"]
      }
    },
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/netshoot",
        "awslogs-region": "cn-north-1",
        "awslogs-stream-prefix": "netshoot"
      }
    }
  }]'
```
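Key configuration notes:
- privileged (set to true): enables privileged mode
- NET_ADMIN: network administration (packet capture, modifying routes, and so on)
- SYS_ADMIN: system administration
- NET_RAW: raw sockets (required by tcpdump)
- SYS_PTRACE: process tracing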
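Whether these capabilities are really in effect can be checked from inside the container by decoding the CapEff mask in /proc/self/status. The sketch below assumes a Linux container with bash and awk available; the helper names `has_cap` and `current_cap_mask` are hypothetical, while the bit numbers come from the Linux capability definitions.

```bash
#!/usr/bin/env bash
# Capability bit numbers as defined in <linux/capability.h>.
CAP_NET_ADMIN=12
CAP_NET_RAW=13
CAP_SYS_PTRACE=19
CAP_SYS_ADMIN=21

# has_cap HEX_MASK BIT: succeeds if BIT is set in the hex capability mask.
has_cap() {
  local mask="$1" bit="$2"
  (( (16#${mask} >> bit) & 1 ))
}

# Effective capability mask of the current process (a hex string).
current_cap_mask() {
  awk '/^CapEff:/ {print $2}' /proc/self/status
}

# Usage inside the container:
#   mask="$(current_cap_mask)"
#   has_cap "$mask" "$CAP_NET_RAW"   && echo "NET_RAW present"
#   has_cap "$mask" "$CAP_SYS_ADMIN" && echo "SYS_ADMIN present"
```

Since privileged mode normally grants every capability, all four checks are expected to pass in this task; the explicit list matters mostly when privileged mode is turned off.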
The task role needs SSM permissions
ECS Exec relies on SSM, so the task role needs the following permissions:
```bash
aws iam put-role-policy \
  --role-name ecsTaskRole \
  --policy-name ECSExecPolicy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    }]
  }'
```
Run the task (with ECS Exec enabled)
```bash
aws ecs run-task \
  --cluster worktest \
  --task-definition netshoot \
  --count 1 \
  --enable-execute-command \
  --capacity-provider-strategy '[{"capacityProvider":"managed_ng1","weight":1}]' \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-077cxxxxxx02a37","subnet-027xxxxxxxacdd"]
    }
  }'
```
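Before calling execute-command, it can help to confirm that the ECS Exec agent inside the task has started. A minimal sketch, assuming the task was launched with --enable-execute-command as above; the helper name `exec_agent_status` is hypothetical.

```bash
#!/usr/bin/env bash
# Print the last reported status of the ExecuteCommandAgent for a task,
# for example RUNNING or PENDING. Prints "None" when no agent is reported.
exec_agent_status() {
  local cluster="$1" task_id="$2"
  aws ecs describe-tasks \
    --cluster "$cluster" \
    --tasks "$task_id" \
    --query "tasks[0].containers[].managedAgents[] | [?name=='ExecuteCommandAgent'].lastStatus | [0]" \
    --output text
}

# Usage (task ID is a placeholder):
#   exec_agent_status worktest <TASK_ID>
```

A status that never reaches RUNNING often points at missing ssmmessages permissions or at the task having no network path to the SSM endpoints.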
Enter the container and capture packets
```bash
aws ecs execute-command \
  --cluster worktest \
  --task <TASK_ID> \
  --container netshoot \
  --interactive \
  --command "/bin/bash" \
  --region cn-north-1

# Network diagnostic tools available inside the container:
tcpdump -i any port 80 -nn
tshark -i any -f "port 80"
ip addr show              # show network interfaces
ip route                  # show the routing table
ss -tlnp                  # show listening ports
curl <httpinfo-task-ip>   # test HTTP connectivity
```
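In addition, an inspection of the managed instance shows that HttpPutResponseHopLimit is set to 1, and the managed instance is protected by a resource policy that prevents the metadata configuration from being modified externally.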
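The hop limit can be read back with describe-instances. A minimal sketch, assuming read access to EC2 in the same account; the helper name `imds_hop_limit` is hypothetical.

```bash
#!/usr/bin/env bash
# Print the instance metadata PUT response hop limit of an instance.
imds_hop_limit() {
  local instance_id="$1"
  aws ec2 describe-instances \
    --instance-ids "$instance_id" \
    --query 'Reservations[0].Instances[0].MetadataOptions.HttpPutResponseHopLimit' \
    --output text
}

# Usage (instance ID is a placeholder):
#   imds_hop_limit <INSTANCE_ID>
```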
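Earlier, a public subnet had been selected when the managed instances were created. As a result, awsvpc tasks could not reach the internet (presumably because their task network interfaces had no public IP address and the subnet offered no NAT path), and ECS Exec did not work. The task was therefore switched to host network mode to test packet capture, which then ran normally: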
The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.
Starting session with SessionId: ecs-execute-command-uae6enqrk6ggvfvn5jdi7ou3fq
# tcpdump -i any port 80 -nn