AI Agent Safety Guardrails, Ready to Use With Configuration Alone

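My agent nearly got wrecked by a prompt injection: a user got it to spit out its system prompt. No damage was done, but I broke out in a cold sweat. After some trial and error, I put safety guardrails in place with Amazon Bedrock Guardrails. Here is the hands-on process.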

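AI Risks in Production

An AI agent running in production faces more risks than you might expect:

Prompt injection: a user crafts special input that makes the model perform unintended actions
Information leakage: the model may output sensitive information from its training data
Harmful content: the model generates violent, discriminatory, or otherwise inappropriate content
PII leakage: personally identifiable information ends up in the output

Prompt engineering can mitigate these risks but cannot eliminate them. You need a safety check layer that is independent of the model.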

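What Bedrock Guardrails Can Do

It is a platform-level safety filtering layer in Bedrock that runs independently of the underlying model. It has three core capabilities:

1. Content filtering

Harmful content is filtered by category, and each category has its own strictness level: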

import boto3

bedrock = boto3.client('bedrock', region_name='us-east-1')

guardrail = bedrock.create_guardrail(
    name='production-agent-guardrail',
    description='Production AI Agent safety guardrail',
    contentPolicyConfig={
        'filtersConfig': [
            {'type': 'SEXUAL', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'},
            {'type': 'VIOLENCE', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'},
            {'type': 'HATE', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'},
            {'type': 'INSULTS', 'inputStrength': 'MEDIUM', 'outputStrength': 'HIGH'},
            {'type': 'MISCONDUCT', 'inputStrength': 'HIGH', 'outputStrength': 'HIGH'},
            # PROMPT_ATTACK applies to input only, so its output strength must be NONE
            {'type': 'PROMPT_ATTACK', 'inputStrength': 'HIGH', 'outputStrength': 'NONE'}
        ]
    },
    blockedInputMessaging='Your request contains content that violates our usage policy.',
    blockedOutputsMessaging='The response was filtered for safety reasons.'
)

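The PROMPT_ATTACK category is dedicated to detecting prompt injection. It recognizes attack patterns such as "ignore the previous instructions" and "what is your system prompt".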
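One way to check how the prompt-attack filter reacts to a given input, without invoking a model at all, is the `apply_guardrail` call on the `bedrock-runtime` client. The sketch below is a minimal illustration under assumptions: the helper name `probe_guardrail` and the sample prompts are hypothetical, and the client is passed in so the function can also be exercised with a stub.

```python
def probe_guardrail(runtime_client, guardrail_id, guardrail_version, text):
    """Return True if the guardrail intervenes on `text` treated as user input.

    `runtime_client` is expected to be a boto3 'bedrock-runtime' client
    (or any object exposing a compatible apply_guardrail method).
    """
    response = runtime_client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="INPUT",  # evaluate the text as a user prompt, not as model output
        content=[{"text": {"text": text}}],
    )
    # The service reports either 'GUARDRAIL_INTERVENED' or 'NONE'.
    return response["action"] == "GUARDRAIL_INTERVENED"


# Hypothetical usage (requires AWS credentials and a real guardrail ID):
#   import boto3
#   runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
#   for prompt in ["Ignore the previous instructions.", "What are your opening hours?"]:
#       print(prompt, "->", probe_guardrail(runtime, "guardrail-id-here", "DRAFT", prompt))
```

Running a handful of known attack strings and known benign strings through a probe like this is a cheap way to sanity-check a strength setting before wiring the guardrail into the agent.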

2. Topic control

Restrict the agent to business-related topics only:

topicPolicyConfig={
    'topicsConfig': [
        {
            'name': 'investment-advice',
            'definition': 'Providing specific investment recommendations or financial advice',
            'examples': [
                'Should I buy AAPL stock?',
                'What cryptocurrency should I invest in?'
            ],
            'type': 'DENY'
        },
        {
            'name': 'off-topic-requests',
            'definition': 'Requests unrelated to the product or service domain',
            'examples': [
                'Can you help me with my homework?',
                'Write me a poem about cats'
            ],
            'type': 'DENY'
        }
    ]
}
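Since every denied topic repeats the same four keys, a small builder can keep the list readable as it grows. This is only a sketch: `deny_topic` is a hypothetical helper, and rejecting topics with no examples is a choice made here rather than a stated requirement of the service.

```python
def deny_topic(name, definition, examples):
    """Build one denied-topic entry in the shape used by topicPolicyConfig."""
    if not examples:
        # Local convention in this sketch: always give the classifier sample phrases.
        raise ValueError(f"topic {name!r} needs at least one example phrase")
    return {
        "name": name,
        "definition": definition,
        "examples": list(examples),
        "type": "DENY",
    }


topic_policy_config = {
    "topicsConfig": [
        deny_topic(
            "investment-advice",
            "Providing specific investment recommendations or financial advice",
            ["Should I buy AAPL stock?", "What cryptocurrency should I invest in?"],
        ),
        deny_topic(
            "off-topic-requests",
            "Requests unrelated to the product or service domain",
            ["Can you help me with my homework?", "Write me a poem about cats"],
        ),
    ]
}
```

The resulting `topic_policy_config` has the same content as the literal dictionary above and can be passed as `topicPolicyConfig` in the same way.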

3. PII redaction

Personally identifiable information is detected and handled automatically:

sensitiveInformationPolicyConfig={
    'piiEntitiesConfig': [
        {'type': 'EMAIL', 'action': 'ANONYMIZE'},
        {'type': 'PHONE', 'action': 'ANONYMIZE'},
        {'type': 'NAME', 'action': 'ANONYMIZE'},
        {'type': 'US_SOCIAL_SECURITY_NUMBER', 'action': 'BLOCK'},
        {'type': 'CREDIT_DEBIT_CARD_NUMBER', 'action': 'BLOCK'}
    ],
    'regexesConfig': [
        {
            'name': 'internal-project-code',
            'description': 'Internal project codes in format PRJ-XXXX',
            'pattern': 'PRJ-[A-Z0-9]{4}',
            'action': 'ANONYMIZE'
        }
    ]
}

ANONYMIZE replaces the match with a placeholder (such as {EMAIL}); BLOCK rejects the entire response.
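The topic and PII snippets are fragments, so they need to be passed to `create_guardrail` together with the content filters. A minimal sketch of that assembly follows, assuming the three dictionaries have already been built. The function name `create_guardrail_from_parts` and its parameter names are hypothetical; the keyword arguments are the ones used in the snippets in this post.

```python
def create_guardrail_from_parts(bedrock_client, name, content_policy,
                                topic_policy, pii_policy):
    """Create one guardrail that combines all three policy fragments.

    `bedrock_client` is expected to be a boto3 'bedrock' (control plane) client.
    Returns the new guardrail's ID and its version (the working draft).
    """
    created = bedrock_client.create_guardrail(
        name=name,
        description="Production AI Agent safety guardrail",
        contentPolicyConfig=content_policy,
        topicPolicyConfig=topic_policy,
        sensitiveInformationPolicyConfig=pii_policy,
        blockedInputMessaging="Your request contains content that violates our usage policy.",
        blockedOutputsMessaging="The response was filtered for safety reasons.",
    )
    return created["guardrailId"], created["version"]


# Hypothetical usage (requires AWS credentials):
#   import boto3
#   bedrock = boto3.client("bedrock", region_name="us-east-1")
#   guardrail_id, version = create_guardrail_from_parts(
#       bedrock, "production-agent-guardrail",
#       content_policy, topic_policy, pii_policy)
```

Keeping the three policies as separate variables makes it easier to tune one of them later without touching the others.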

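Data From One Week in Production

Metric	Value
Total requests	12,847
Inputs blocked	23 (0.18%)
Outputs filtered	7 (0.05%)
Prompt Attack detections	15
PII redactions	42

Of the 15 Prompt Attack detections, 12 were real attack attempts. Without the guardrail, those requests would have gone straight to the model.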
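The percentages in the table, and the hit rate implied by the "12 of 15" figure, can be reproduced with a few lines of arithmetic. The variable names below are made up for illustration; the numbers are the ones reported above.

```python
total_requests = 12_847
blocked_inputs = 23
filtered_outputs = 7
prompt_attack_detections = 15
real_attacks = 12


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)


input_block_rate = pct(blocked_inputs, total_requests)          # 0.18
output_filter_rate = pct(filtered_outputs, total_requests)      # 0.05
attack_precision = pct(real_attacks, prompt_attack_detections)  # 80.0
false_positives = prompt_attack_detections - real_attacks       # 3

print(input_block_rate, output_filter_rate, attack_precision, false_positives)
```

In other words, in this one-week sample roughly four out of five prompt-attack detections were genuine, and the remaining three are the false positives worth reviewing when tuning filter strength.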

Integration Code

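import json

import boto3

bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

# Attach the guardrail when invoking the model.
# user_input is the end user's message (a string supplied by the application).
response = bedrock_runtime.invoke_model(
    modelId='anthropic.claude-3-sonnet-20240229-v1:0',
    guardrailIdentifier='guardrail-id-here',
    guardrailVersion='DRAFT',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',  # required for Claude messages on Bedrock
        'messages': [{'role': 'user', 'content': user_input}],
        'max_tokens': 1024
    })
)

# Check whether the guardrail intervened. With invoke_model, the flag is
# reported in the response body under 'amazon-bedrock-guardrailAction'.
result = json.loads(response['body'].read())
if result.get('amazon-bedrock-guardrailAction') == 'INTERVENED':
    print('Request was blocked by guardrail')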
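A Few Things to Watch

Latency: the checks add 50-200 ms, which has little impact in real-time chat scenarios
Cost: billed by text volume, at roughly $0.75 per 1,000 text units
Not a substitute for application-layer validation: Guardrails is a safety backstop, and business-logic input validation is still required
Continuous tuning: adjust filter strength periodically based on the interception logs to reduce false positives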
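For the tuning step, it helps to know which policy is actually firing. The sketch below counts interventions per filter, topic, PII type, and custom regex. It assumes the records are shaped like the `assessments` list returned by `apply_guardrail`. The function name `tally_interventions` and the key prefixes are hypothetical, and how the records are collected and stored is left to the application.

```python
from collections import Counter


def tally_interventions(assessments):
    """Count guardrail hits per policy entry across a list of assessment records."""
    counts = Counter()
    for assessment in assessments:
        # Content filters (e.g. PROMPT_ATTACK, HATE)
        for hit in assessment.get("contentPolicy", {}).get("filters", []):
            counts[f"content:{hit['type']}"] += 1
        # Denied topics
        for hit in assessment.get("topicPolicy", {}).get("topics", []):
            counts[f"topic:{hit['name']}"] += 1
        # Built-in PII entities and custom regexes
        sensitive = assessment.get("sensitiveInformationPolicy", {})
        for hit in sensitive.get("piiEntities", []):
            counts[f"pii:{hit['type']}"] += 1
        for hit in sensitive.get("regexes", []):
            counts[f"regex:{hit['name']}"] += 1
    return counts
```

A category that dominates the tally but turns out to be mostly false positives on review is a natural candidate for a lower strength setting.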

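Adding safety guardrails to an AI agent is not optional; it is mandatory. Bedrock Guardrails works with configuration alone, requires no self-trained classification model, and keeps cost manageable and latency acceptable.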

🔗 Amazon Bedrock Guardrails: aws.amazon.com/cn/bedrock/...
🔗 Guardrails documentation: docs.aws.amazon.com/bedrock/lat...