Bedrock Guardrails 实战：给 AI Agent 装上安全护栏

上个月我的 AI Agent 在生产环境差点出了事------用户通过 prompt injection 让 Agent 吐出了系统提示词的内容。虽然没造成实质损失，但吓出一身冷汗。

这件事之后我开始认真研究 AI 安全防护，最后落地的方案是 Amazon Bedrock Guardrails。

为什么需要护栏

AI Agent 在生产环境面临的风险：

Prompt Injection：用户构造特殊输入让模型执行非预期操作
信息泄露：模型可能输出训练数据中的敏感信息
有害内容：模型生成不当内容（暴力、歧视等）
话题偏离：Agent 被引导讨论与业务无关的话题
PII 泄露：输出中包含个人身份信息

这些问题靠 prompt engineering 能缓解，但不能彻底解决。需要一层独立于模型的安全检查机制。

Bedrock Guardrails 是什么

它是 Bedrock 平台级的安全过滤层，独立于底层模型运行。核心能力：

内容过滤（Content Filters）

按类别过滤有害内容，每个类别可以设置严格程度：

python 复制代码

import boto3

bedrock = boto3.client('bedrock', region_name='us-east-1')

guardrail = bedrock.create_guardrail(
    name='production-agent-guardrail',
    description='Production AI Agent safety guardrail',
    contentPolicyConfig={
        'filtersConfig': [
            {'type': 'SEXUAL', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'},
            {'type': 'VIOLENCE', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'},
            {'type': 'HATE', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'},
            {'type': 'INSULTS', 'inputStrength': 'MEDIUM', 'outputStrength': 'HIGH'},
            {'type': 'MISCONDUCT', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'},
            {'type': 'PROMPT_ATTACK', 'inputStrength': 'HIGH', 'outputStrength': 'NONE'}
        ]
    },
    blockedInputMessaging='Your request contains content that violates our usage policy.',
    blockedOutputsMessaging='The response was filtered for safety reasons.'
)

注意 PROMPT_ATTACK 类别------这是专门检测 prompt injection 的，会识别诸如"忽略之前的指令"、"你的系统提示是什么"之类的攻击模式。

话题控制（Topic Policy）

限制 Agent 只讨论业务相关话题：

python 复制代码

topicPolicyConfig={
    'topicsConfig': [
        {
            'name': 'investment-advice',
            'definition': 'Providing specific investment recommendations or financial advice',
            'examples': [
                'Should I buy AAPL stock?',
                'What cryptocurrency should I invest in?'
            ],
            'type': 'DENY'
        },
        {
            'name': 'off-topic-requests',
            'definition': 'Requests unrelated to the product or service domain',
            'examples': [
                'Can you help me with my homework?',
                'Write me a poem about cats'
            ],
            'type': 'DENY'
        }
    ]
}

PII 脱敏（Sensitive Information）

自动检测和处理个人身份信息：

python 复制代码

sensitiveInformationPolicyConfig={
    'piiEntitiesConfig': [
        {'type': 'EMAIL', 'action': 'ANONYMIZE'},
        {'type': 'PHONE', 'action': 'ANONYMIZE'},
        {'type': 'NAME', 'action': 'ANONYMIZE'},
        {'type': 'US_SOCIAL_SECURITY_NUMBER', 'action': 'BLOCK'},
        {'type': 'CREDIT_DEBIT_CARD_NUMBER', 'action': 'BLOCK'}
    ],
    'regexesConfig': [
        {
            'name': 'internal-project-code',
            'description': 'Internal project codes in format PRJ-XXXX',
            'pattern': 'PRJ-[A-Z0-9]{4}',
            'action': 'ANONYMIZE'
        }
    ]
}

ANONYMIZE 会用占位符替换（如 {EMAIL}），BLOCK 会直接拦截整个响应。

实际部署效果

部署后一周的数据：

指标	数值
总请求数	12,847
输入被拦截	23 次（0.18%）
输出被过滤	7 次（0.05%）
Prompt Attack 检测	15 次
PII 脱敏	42 次
误报（人工复核）	2 次

15 次 Prompt Attack 中有 12 次是真实攻击尝试。没有护栏的话，这些请求会直接到达模型。

集成方式

python 复制代码

# 在调用模型时指定 guardrail
response = bedrock_runtime.invoke_model(
    modelId='anthropic.claude-3-sonnet-20240229-v1:0',
    guardrailIdentifier='guardrail-id-here',
    guardrailVersion='DRAFT',
    body=json.dumps({
        'messages': [{'role': 'user', 'content': user_input}],
        'max_tokens': 1024
    })
)

# 检查是否被拦截
result = json.loads(response['body'].read())
if result.get('stop_reason') == 'guardrail_intervened':
    print('Request was blocked by guardrail')

注意事项

延迟影响：Guardrails 检查会增加 50-200ms 延迟，对实时对话影响不大
成本：按检查的文本量计费，大约 $0.75/1000 文本单元
不能替代应用层验证：Guardrails 是安全兜底，业务逻辑的输入校验还是要做
持续调优：根据拦截日志定期调整过滤强度，减少误报

总结

给 AI Agent 加安全护栏不是可选项，是必选项。Bedrock Guardrails 提供了一套开箱即用的方案------内容过滤、话题控制、PII 脱敏、prompt injection 检测------不需要自己训练分类模型，配置即用。

成本可控（对比自建安全层的人力成本），延迟可接受，检测准确度在实际生产中表现不错。

🔗 Amazon Bedrock Guardrails：https://aws.amazon.com/cn/bedrock/guardrails/

🔗 Guardrails 文档：https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html