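After a production-outage call woke me up, I spent an afternoon building a zero-touch alerting system

Friday night, 11 p.m., the phone rang: CPU at 95%, users complaining about lag.

I opened my laptop, logged in to the server, and scaled out by hand. The whole thing took about 20 minutes. Looking back, all of it could have been automated.

The next day I spent an afternoon wiring it up with CloudWatch: scale out automatically when CPU spikes, and post to Slack when the bill runs over. I haven't been woken by a midnight call since.

What does CloudWatch do? The one-sentence version

Three things: collect metrics → set alarms → take actions.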


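Metric breaches threshold → CloudWatch Alarm → SNS Topic → Lambda/Email/Slack
                                   ↓
                          Auto Scaling Action

EC2 CPU, Lambda error rate, and RDS connection count are all metrics. When a metric crosses the value you set, an alarm fires. Once it fires, it can send a notification, scale out automatically, or run a Lambda function, whatever you need.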

Scenario 1: Scale out automatically when CPU spikes

Prerequisite: you have an Auto Scaling group (ASG) running 2 to 6 EC2 instances.

Create the scale-out policy


import boto3

autoscaling = boto3.client('autoscaling', region_name='ap-northeast-1')

response = autoscaling.put_scaling_policy(
    AutoScalingGroupName='my-web-asg',
    PolicyName='scale-out-on-high-cpu',
    PolicyType='SimpleScaling',
    AdjustmentType='ChangeInCapacity',
    ScalingAdjustment=1,
    Cooldown=300  # after scaling out, wait 5 minutes before scaling again
)

scale_out_arn = response['PolicyARN']
print(f'Scale-out policy ARN: {scale_out_arn}')

Set up the CloudWatch alarm


cloudwatch = boto3.client('cloudwatch', region_name='ap-northeast-1')

cloudwatch.put_metric_alarm(
    AlarmName='high-cpu-alarm',
    AlarmDescription='Scale out when CPU > 80% for 3 minutes',
    MetricName='CPUUtilization',
    Namespace='AWS/EC2',
    Statistic='Average',
    Period=60,  # 1-minute datapoints assume detailed monitoring; basic monitoring reports every 5 minutes
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator='GreaterThanThreshold',
    Dimensions=[
        {
            'Name': 'AutoScalingGroupName',
            'Value': 'my-web-asg'
        }
    ],
    AlarmActions=[scale_out_arn],
    Unit='Percent'
)

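Why 3 consecutive minutes? Because momentary spikes are common: a slow database query or a GC pause can push CPU up briefly. Three minutes in a row means the fleet really can't keep up.

Cooldown=300 matters too: a new instance needs at least 2-3 minutes from launch until it is serving traffic behind the load balancer, so a 5-minute cooldown prevents scaling out twice for the same spike.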
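To make the two settings concrete, here is a small offline simulation. It is a simplified model rather than CloudWatch's actual evaluation engine: it assumes one datapoint per minute, treats the alarm as firing only when the last N datapoints all exceed the threshold, and starts the cooldown at the moment the scaling action is triggered. The function names and the sample CPU series are made up for illustration.

```python
from typing import List


def alarm_states(datapoints: List[float], threshold: float,
                 evaluation_periods: int) -> List[bool]:
    """Per minute: True if the last `evaluation_periods` datapoints all exceed the threshold."""
    states = []
    for i in range(len(datapoints)):
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        full_window = len(window) == evaluation_periods
        states.append(full_window and all(v > threshold for v in window))
    return states


def scale_out_minutes(states: List[bool], cooldown_minutes: int) -> List[int]:
    """Minutes at which a scale-out would be triggered under a simple cooldown."""
    triggered = []
    last = None
    for minute, in_alarm in enumerate(states):
        if in_alarm and (last is None or minute - last >= cooldown_minutes):
            triggered.append(minute)
            last = minute
    return triggered


# Hypothetical CPU series: a one-minute spike at minute 1, sustained load from minute 3.
cpu = [50, 95, 60, 85, 90, 92, 88, 91, 93, 95, 96]
states = alarm_states(cpu, threshold=80.0, evaluation_periods=3)
print(scale_out_minutes(states, cooldown_minutes=5))  # → [5, 10]
```

The isolated spike at minute 1 never triggers anything, and the sustained load triggers one scale-out at minute 5 and a second only after the cooldown has elapsed.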

Don't forget to scale in

If you don't scale in when traffic drops, you're paying for nothing:


response = autoscaling.put_scaling_policy(
    AutoScalingGroupName='my-web-asg',
    PolicyName='scale-in-on-low-cpu',
    PolicyType='SimpleScaling',
    AdjustmentType='ChangeInCapacity',
    ScalingAdjustment=-1,
    Cooldown=300
)

scale_in_arn = response['PolicyARN']

cloudwatch.put_metric_alarm(
    AlarmName='low-cpu-alarm',
    AlarmDescription='Scale in when CPU < 30% for 10 minutes',
    MetricName='CPUUtilization',
    Namespace='AWS/EC2',
    Statistic='Average',
    Period=60,
    EvaluationPeriods=10,
    Threshold=30.0,
    ComparisonOperator='LessThanThreshold',
    Dimensions=[
        {
            'Name': 'AutoScalingGroupName',
            'Value': 'my-web-asg'
        }
    ],
    AlarmActions=[scale_in_arn],
    Unit='Percent'
)

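The scale-in window is 10 minutes, looser than the 3 minutes for scale-out. The reasoning is simple: scaling out by mistake costs a little extra money, while scaling in by mistake can make the service unstable.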
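Since the two alarms differ only in threshold, duration, operator, and target policy, one way to keep them in sync is to build both parameter sets from a single helper. This is a sketch, not the author's code: `cpu_alarm_params` and `check_thresholds` are hypothetical names, and the policy ARNs below are placeholders standing in for `scale_out_arn` and `scale_in_arn`.

```python
def cpu_alarm_params(alarm_name, asg_name, threshold, minutes, operator, policy_arn):
    """Build keyword arguments for cloudwatch.put_metric_alarm for an ASG CPU alarm."""
    return {
        'AlarmName': alarm_name,
        'MetricName': 'CPUUtilization',
        'Namespace': 'AWS/EC2',
        'Statistic': 'Average',
        'Period': 60,
        'EvaluationPeriods': minutes,
        'Threshold': float(threshold),
        'ComparisonOperator': operator,
        'Dimensions': [{'Name': 'AutoScalingGroupName', 'Value': asg_name}],
        'AlarmActions': [policy_arn],
        'Unit': 'Percent',
    }


def check_thresholds(scale_out, scale_in):
    """Reject configurations where the two alarms could fire on the same CPU value."""
    if scale_in['Threshold'] >= scale_out['Threshold']:
        raise ValueError('scale-in threshold must be below scale-out threshold')


# Placeholder ARNs: in the article these come from put_scaling_policy.
scale_out = cpu_alarm_params('high-cpu-alarm', 'my-web-asg', 80, 3,
                             'GreaterThanThreshold', 'arn:placeholder:scale-out')
scale_in = cpu_alarm_params('low-cpu-alarm', 'my-web-asg', 30, 10,
                            'LessThanThreshold', 'arn:placeholder:scale-in')
check_thresholds(scale_out, scale_in)

# With a real client: cloudwatch.put_metric_alarm(**scale_out)
```

Keeping a gap between the two thresholds (30% and 80% here) is what stops the group from bouncing between scale-out and scale-in.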

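Scenario 2: Slack notification when the bill runs over

I find this one even more useful than the CPU alarm. Who hasn't been startled by the end-of-month bill?

SNS topic + email subscription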


sns = boto3.client('sns', region_name='us-east-1')  # billing metrics exist only in us-east-1, so the alarm and its topic live there

topic = sns.create_topic(Name='billing-alerts')
topic_arn = topic['TopicArn']

sns.subscribe(
    TopicArn=topic_arn,
    Protocol='email',
    Endpoint='your-email@example.com'
)
# Remember to confirm the subscription from the email you receive
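If you want to check whether the confirmation link has been clicked, SNS reports unconfirmed subscriptions with the literal string `PendingConfirmation` in place of a subscription ARN. The helper below is a minimal sketch: `pending_subscriptions` is a hypothetical name, it takes the SNS client as an argument, and it reads only the first page of results.

```python
def pending_subscriptions(sns_client, topic_arn):
    """Return the endpoints on a topic that have not confirmed their subscription yet."""
    response = sns_client.list_subscriptions_by_topic(TopicArn=topic_arn)
    return [
        sub['Endpoint']
        for sub in response.get('Subscriptions', [])
        if sub.get('SubscriptionArn') == 'PendingConfirmation'
    ]


# Usage with the client and topic created above:
# print(pending_subscriptions(sns, topic_arn))
```

An empty list means every subscriber on the first page has confirmed.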

Billing alarm


cloudwatch_billing = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch_billing.put_metric_alarm(
    AlarmName='monthly-bill-over-100',
    AlarmDescription='Month-to-date bill exceeds $100',
    MetricName='EstimatedCharges',
    Namespace='AWS/Billing',
    Statistic='Maximum',
    Period=21600,        # evaluate every 6 hours
    EvaluationPeriods=1,
    Threshold=100.0,
    ComparisonOperator='GreaterThanThreshold',
    Dimensions=[
        {
            'Name': 'Currency',
            'Value': 'USD'
        }
    ],
    AlarmActions=[topic_arn],
    Unit='None'
)

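Pitfalls:

Billing metrics exist only in us-east-1, no matter which Region your resources run in
You have to enable them in the console first: Billing → Billing Preferences → Receive Billing Alerts
EstimatedCharges is the month-to-date running total, not a daily increment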
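Because EstimatedCharges is a running total, a per-day figure has to be derived by subtracting consecutive readings. The sketch below works on plain `(date, total)` pairs with invented sample numbers; `daily_increments` is a hypothetical helper, and it assumes that a drop in the total means a new billing month has started.

```python
def daily_increments(readings):
    """Turn month-to-date totals into per-day spend.

    `readings` is a list of (date_string, month_to_date_usd) sorted by date.
    The first reading has no predecessor, so it is skipped.
    """
    increments = []
    for (_, prev_total), (day, total) in zip(readings, readings[1:]):
        if total < prev_total:
            # The counter restarted: everything so far belongs to the new month.
            delta = total
        else:
            delta = total - prev_total
        increments.append((day, round(delta, 2)))
    return increments


# Invented sample data spanning a month boundary.
sample = [
    ('2024-05-30', 90.0),
    ('2024-05-31', 97.5),
    ('2024-06-01', 3.2),
    ('2024-06-02', 7.0),
]
print(daily_increments(sample))
```

This also shows why the $100 threshold above is compared against the running total: the alarm fires once cumulative spend for the month crosses the line, regardless of how any single day looked.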

Hooking up Slack

Email is too easy to miss. Use a Lambda function to forward alerts to Slack:


# lambda_function.py
import json
import urllib.request

SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

def lambda_handler(event, context):
    message = event['Records'][0]['Sns']['Message']
    
    try:
        alarm = json.loads(message)
        alarm_name = alarm.get('AlarmName', 'Unknown')
        new_state = alarm.get('NewStateValue', 'Unknown')
        reason = alarm.get('NewStateReason', '')
        
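        text = "🚨 *CloudWatch Alert*\n"
        text += f"*Alarm:* {alarm_name}\n"
        text += f"*State:* {new_state}\n"
        text += f"*Reason:* {reason[:200]}"
    except (ValueError, AttributeError, TypeError):
        # The message was not a JSON object in the expected shape: forward it as plain text
        text = f"⚠️ CloudWatch alarm:\n{message[:500]}"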
    
    payload = json.dumps({'text': text}).encode('utf-8')
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={'Content-Type': 'application/json'}
    )
    urllib.request.urlopen(req, timeout=5)  # fail fast instead of hanging until the Lambda timeout
    
    return {'statusCode': 200}

Subscribe the Lambda function to the SNS topic:


sns.subscribe(
    TopicArn=topic_arn,
    Protocol='lambda',
    Endpoint='arn:aws:lambda:us-east-1:123456789012:function:slack-notifier'
)
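One step the subscription call does not cover: when the subscription is created through the API rather than the console, SNS also needs permission to invoke the function, otherwise deliveries fail. A minimal sketch, assuming the function is named `slack-notifier` as in the ARN above; `allow_sns_to_invoke` and the statement ID are hypothetical names.

```python
def allow_sns_to_invoke(lambda_client, function_name, topic_arn,
                        statement_id='allow-sns-billing-alerts'):
    """Grant the given SNS topic permission to invoke the Lambda function."""
    return lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=statement_id,          # any unique ID within the function's policy
        Action='lambda:InvokeFunction',
        Principal='sns.amazonaws.com',
        SourceArn=topic_arn,               # restrict the grant to this one topic
    )


# Usage, with the topic created earlier:
# import boto3
# lambda_client = boto3.client('lambda', region_name='us-east-1')
# allow_sns_to_invoke(lambda_client, 'slack-notifier', topic_arn)
```

Scoping the grant with `SourceArn` keeps other topics in the account from triggering the notifier.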

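Scenario 3: You can monitor business metrics too

CloudWatch isn't limited to the metrics AWS provides out of the box. Push your own business metrics and you can alarm on them the same way.

For example, API response time: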


cloudwatch.put_metric_data(
    Namespace='MyApp/API',
    MetricData=[
        {
            'MetricName': 'ResponseTime',
            'Value': 235.5,
            'Unit': 'Milliseconds',
            'Dimensions': [
                {
                    'Name': 'Endpoint',
                    'Value': '/api/v1/users'
                }
            ]
        }
    ]
)

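Push one data point when each request finishes, then set an alarm that fires when the average response time stays above 1 second for 5 consecutive minutes: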


cloudwatch.put_metric_alarm(
    AlarmName='api-slow-response',
    AlarmDescription='API response time > 1s',
    MetricName='ResponseTime',
    Namespace='MyApp/API',
    Statistic='Average',
    Period=60,
    EvaluationPeriods=5,
    Threshold=1000.0,
    ComparisonOperator='GreaterThanThreshold',
    Dimensions=[
        {
            'Name': 'Endpoint',
            'Value': '/api/v1/users'
        }
    ],
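    AlarmActions=[topic_arn],  # note: the topic should be in this alarm's Region; the billing topic above was created in us-east-1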
    Unit='Milliseconds'
)
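On a busy endpoint, one put_metric_data call per request adds up in API calls. An alternative is to collect samples in memory and publish them once per interval as a statistic set, using the `StatisticValues` field that put_metric_data accepts in place of `Value`. The sketch below only builds the payload; `build_response_time_datum` is a hypothetical helper and the sample latencies are invented.

```python
def build_response_time_datum(endpoint, samples_ms):
    """Summarize a batch of latencies as one MetricData entry (a statistic set)."""
    if not samples_ms:
        raise ValueError('need at least one sample')
    return {
        'MetricName': 'ResponseTime',
        'Dimensions': [{'Name': 'Endpoint', 'Value': endpoint}],
        'StatisticValues': {
            'SampleCount': float(len(samples_ms)),
            'Sum': float(sum(samples_ms)),
            'Minimum': float(min(samples_ms)),
            'Maximum': float(max(samples_ms)),
        },
        'Unit': 'Milliseconds',
    }


# Invented latencies collected over one minute.
datum = build_response_time_datum('/api/v1/users', [120.0, 235.5, 980.0])

# With a real client:
# cloudwatch.put_metric_data(Namespace='MyApp/API', MetricData=[datum])
```

The Average statistic used by the alarm above still works on data published this way. Percentiles such as P99 generally do not, so keep per-request values if you plan to alarm on P99.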

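What does it cost?

Item	Free tier	Price beyond free tier
Basic monitoring (5-minute interval)	Free for all EC2 instances	n/a
Detailed monitoring (1-minute interval)	n/a	$0.30/instance/month
Alarms	First 10 free	$0.10/alarm/month
Custom metrics	n/a	$0.30/metric/month
API calls	First 1 million free	$0.01 per 1,000 calls

10 alarms plus a handful of custom metrics comes to under $5 a month.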
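The arithmetic behind that estimate can be written out as a small function. It uses only the prices from the table above, which may not match current or regional pricing, so treat the output as a rough figure; `monthly_cost` and its parameters are made up for illustration.

```python
def monthly_cost(alarms, custom_metrics, detailed_instances=0, api_calls=0):
    """Estimate a monthly CloudWatch bill in USD from the article's price table."""
    alarm_cost = max(0, alarms - 10) * 0.10            # first 10 alarms are free
    metric_cost = custom_metrics * 0.30                # per custom metric
    detailed_cost = detailed_instances * 0.30          # per instance, as listed in the table
    api_cost = max(0, api_calls - 1_000_000) / 1000 * 0.01  # first 1M calls are free
    return round(alarm_cost + metric_cost + detailed_cost + api_cost, 2)


print(monthly_cost(alarms=10, custom_metrics=5))  # → 1.5
```

Per-request metric pushes are the line item most likely to grow: in this model, 1.5 million API calls in a month would add $5 on their own.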

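My must-have alarm checklist

Alarm condition	Duration	Why
CPU > 80%	3 consecutive minutes	Scale out
CPU < 30%	10 consecutive minutes	Scale in to save money
Disk > 85%	5 consecutive minutes	A full disk takes the service down
Memory > 90%	5 consecutive minutes	Act before an OOM kill
5xx > 1%	2 consecutive minutes	Service is failing
P99 > 2s	5 consecutive minutes	Poor user experience
Monthly bill > 80% of budget	Checked every 6 hours	No end-of-month surprises
Lambda errors > 5%	3 consecutive minutes	Function is failing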
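The disk and memory rows depend on metrics that EC2 does not publish on its own; they come from the CloudWatch Agent running on the instance. As a rough sketch, the script below generates a minimal agent configuration that collects memory and root-disk usage. The 60-second interval and the choice of `/` as the only monitored mount are assumptions, and `agent_config` is a hypothetical helper name.

```python
import json


def agent_config(interval_seconds=60):
    """Build a minimal CloudWatch Agent config that reports memory and root-disk usage."""
    return {
        'metrics': {
            # Tag each datapoint with the instance it came from.
            'append_dimensions': {'InstanceId': '${aws:InstanceId}'},
            'metrics_collected': {
                'mem': {
                    'measurement': ['mem_used_percent'],
                    'metrics_collection_interval': interval_seconds,
                },
                'disk': {
                    'measurement': ['used_percent'],
                    'resources': ['/'],
                    'metrics_collection_interval': interval_seconds,
                },
            },
        }
    }


config_json = json.dumps(agent_config(), indent=2)
print(config_json)
```

With the agent's defaults these metrics land in the `CWAgent` namespace as `mem_used_percent` and `disk_used_percent`, which is what the alarms for those two rows would reference instead of `AWS/EC2`.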

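5 pitfalls

Billing metrics exist only in us-east-1: many people create the billing alarm in another Region and never see any data
There are three alarm states: OK, ALARM, and INSUFFICIENT_DATA. A new alarm starts in INSUFFICIENT_DATA, so don't panic
Memory needs the agent: EC2 doesn't publish memory metrics by default, so install the CloudWatch Agent
Alarm names must be unique per account and Region: calling put_metric_alarm with an existing name overwrites that alarm, so use a naming scheme like {environment}-{service}-{metric}
SNS email subscriptions must be confirmed: after creating the subscription, click the link in the email, or nothing will be delivered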
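The naming scheme from the fourth pitfall is easy to enforce with a tiny helper. This is one possible normalization, not a CloudWatch requirement: lowercasing and replacing non-alphanumeric runs with hyphens are choices made here, and `alarm_name` is a hypothetical function.

```python
import re


def alarm_name(environment, service, metric):
    """Build an alarm name of the form {environment}-{service}-{metric}."""
    parts = []
    for part in (environment, service, metric):
        cleaned = re.sub(r'[^a-z0-9]+', '-', part.lower()).strip('-')
        if not cleaned:
            raise ValueError(f'empty name component: {part!r}')
        parts.append(cleaned)
    return '-'.join(parts)


print(alarm_name('prod', 'Web API', 'CPU High'))  # → prod-web-api-cpu-high
```

Generating names this way makes an accidental overwrite of another team's alarm much less likely than free-form names such as `high-cpu-alarm`.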

The code uses AWS CloudWatch + Auto Scaling + SNS + Lambda and was verified with boto3.

📌 CloudWatch free tier: 10 alarms + 1 million API calls + EC2 basic monitoring, enough to get started. 🔗 Docs: docs.aws.amazon.com/cloudwatch/