Azure AI Foundry 全栈可观测性:Metrics / Traces / Logs / SLO / 成本归因

系列 :Azure AI Foundry 深度实战
期号 :#5
发布日期 :2026-03-04
字数 :约 30,000 字
阅读时间:约 60 分钟


摘要

在 Blog #4 中,我们系统讲解了从单体 Agent 到多 Agent 协作的完整架构演进。

然而,当 AI Agent 真正进入生产环境,一个新的核心挑战浮出水面:你如何知道你的 Agent 正在正常工作?

  • 它的响应是否符合预期质量?
  • 某次工具调用失败是偶发还是系统性问题?
  • 本月 GPT-4.1 的 Token 费用为何暴增 300%?
  • SLA 承诺的 P99 延迟是否达标?
  • 内容安全模块触发了多少次,哪些 prompt 触发了暴力/仇恨内容过滤?

这些问题,都需要**全栈可观测性(Full-Stack Observability)**来回答。

本文将系统讲解 Azure AI Foundry 可观测性体系的五大支柱:

| 支柱 | 技术 | 用途 |
| --- | --- | --- |
| Traces(链路追踪) | OpenTelemetry + GenAI 语义约定 | 调试、根因分析 |
| Metrics(指标) | Azure Monitor / Application Insights | 性能监控、趋势分析 |
| Logs(日志) | Log Analytics / KQL | 异常诊断、审计 |
| Evaluation(评估) | Azure AI Evaluation SDK | 质量/安全量化 |
| Cost Attribution(成本归因) | Token 用量追踪 + 成本 KQL | 费用优化、预算管控 |

核心主张 :可观测性不是锦上添花,而是 AI Agent 生产化的前提条件。不可观测的 Agent,就是不可信赖的 Agent。


目录

  1. 为什么 AI Agent 需要全栈可观测性
  2. Azure AI Foundry 可观测性架构全景
  3. OpenTelemetry + GenAI 语义约定
  4. Traces 链路追踪实战
  5. Metrics 指标体系
  6. Logs 日志管理与 KQL 查询
  7. Azure AI Evaluation SDK:质量与安全评估
  8. SLO / SLA 设计与告警配置
  9. 成本归因与 Token 用量优化
  10. Grafana + Azure Monitor 仪表盘实战
  11. Agent 监控仪表盘(Foundry 内置)
  12. 持续评估与生产流量监控
  13. 安全可观测性:内容安全事件追踪
  14. 企业落地最佳实践与检查清单
  15. 总结与 Blog #6 预告
  16. 参考资料

一、为什么 AI Agent 需要全栈可观测性

1.1 传统软件 vs AI Agent:可观测性的本质差异

传统软件的行为是确定性的:相同输入 → 相同输出。可观测性主要关注:

  • 服务是否在线(Uptime)
  • 响应时间是否达标(Latency)
  • 错误率是否可接受(Error Rate)

AI Agent 的行为是非确定性的。即使是相同的 prompt,不同时刻、不同温度参数下,模型输出可能大相径庭。更重要的是:

用户 prompt
    ↓
[Orchestrator Agent]  ← 可能选择不同的子 Agent
    ↓
[Tool Call 1]  ← 可能返回过期数据
    ↓
[Model Inference]  ← 输出质量随机变化
    ↓
[Safety Filter]  ← 可能误拦截合法内容
    ↓
最终响应

任何一个环节的问题,都可能导致用户看到错误、有害或无关的输出。

1.2 生产环境中的 AI Agent 失败模式

基于 Microsoft 内部案例研究(参考 CarMax Skye 2.0 项目),AI Agent 在生产环境中的典型失败模式包括:

| 失败类型 | 发生频率 | 是否可通过传统监控发现 |
| --- | --- | --- |
| 工具调用失败(API 超时/权限错误) |  | ✅ 可以 |
| 模型输出质量下降(hallucination 增加) |  | ❌ 不能 |
| Retrieval 相关性退化(向量索引更新后) |  | ❌ 不能 |
| 安全过滤误触发(正常请求被拦截) |  | ❌ 不能 |
| Token 用量异常暴增(prompt 注入攻击) |  | 部分可以 |
| Agent 协作死循环(多 Agent 系统) |  | ❌ 不能 |
| 上下文窗口溢出导致截断 |  | ❌ 不能 |

结论 :AI Agent 可观测性必须超越传统 APM,引入语义级别的追踪和评估。

1.3 可观测性的商业价值

以 CarMax 的 Skye 2.0 为例,通过实施完整的可观测性体系:

  • 净正向反馈提升 10 个百分点
  • 问题自助解决率提高 25%
  • 从问题发现到修复的时间缩短 60%

"Trustworthy agents require measurable tracing and evaluation, not just smart models."

--- Dave R., Azure & AI MVP

1.4 三大关键问题

在设计 AI Agent 可观测性体系时,需要回答三个核心问题:

┌──────────────────────────────────────────────────────┐
│           AI Agent 可观测性三大核心问题               │
├──────────────────────────────────────────────────────┤
│  1. 发生了什么?         → Traces(链路追踪)         │
│     "这次 Agent 调用经过了哪些步骤?哪步慢?哪步失败?"  │
├──────────────────────────────────────────────────────┤
│  2. 输出质量如何?       → Evaluation(评估)         │
│     "回答是否准确、相关、安全、连贯?"                  │
├──────────────────────────────────────────────────────┤
│  3. 资源消耗多少?       → Metrics + Cost(指标+成本)  │
│     "花了多少 Token?延迟多少?成本怎么分配?"          │
└──────────────────────────────────────────────────────┘

二、Azure AI Foundry 可观测性架构全景

2.1 架构总览

Azure AI Foundry 的可观测性体系建立在以下核心组件之上:

┌─────────────────────────────────────────────────────────────────────┐
│                  Azure AI Foundry 可观测性全栈架构                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                    AI 应用层(数据源)                         │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌───────────────────────┐ │   │
│  │  │ MAF Agent   │  │  SK Agent   │  │ LangChain/LangGraph    │ │   │
│  │  │             │  │             │  │ Agent                  │ │   │
│  │  └──────┬──────┘  └──────┬──────┘  └──────────┬────────────┘ │   │
│  └─────────┼────────────────┼───────────────────┼──────────────┘   │
│            │                │                   │                   │
│  ┌─────────▼────────────────▼───────────────────▼──────────────┐   │
│  │              OpenTelemetry SDK(Instrumentation Layer)       │   │
│  │   gen_ai.* 语义约定 | execute_task spans | A2A spans         │   │
│  └─────────────────────────┬────────────────────────────────────┘   │
│                             │                                        │
│  ┌──────────────────────────▼─────────────────────────────────┐    │
│  │              Azure Monitor / Application Insights            │    │
│  │                                                              │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌───────────┐  │    │
│  │  │  Traces  │  │ Metrics  │  │   Logs   │  │ Workbooks │  │    │
│  │  └──────────┘  └──────────┘  └──────────┘  └───────────┘  │    │
│  └──────────────────────────┬─────────────────────────────────┘    │
│                             │                                        │
│  ┌──────────────────────────▼─────────────────────────────────┐    │
│  │              分析与可视化层                                   │    │
│  │                                                              │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │    │
│  │  │ Foundry 内置  │  │   Grafana    │  │  Azure Monitor   │  │    │
│  │  │ Agent Monitor │  │  Dashboard   │  │    Alerts/SLO    │  │    │
│  │  └──────────────┘  └──────────────┘  └──────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │              评估层(Evaluation Layer)                       │    │
│  │                                                              │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │    │
│  │  │ Offline Eval  │  │  Online Eval  │  │   Red Team Scan  │  │    │
│  │  │ (Batch SDK)   │  │ (Continuous)  │  │   (Scheduled)    │  │    │
│  │  └──────────────┘  └──────────────┘  └──────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

2.2 核心组件关系

| 组件 | 职责 | 对应 Azure 服务 |
| --- | --- | --- |
| OpenTelemetry SDK | 埋点、采集原始遥测数据 | - |
| Application Insights | 存储 Traces、Metrics、Logs | Azure Monitor |
| Log Analytics Workspace | KQL 查询、长期日志存储 | Azure Monitor |
| Azure AI Evaluation SDK | 离线/在线质量评估 | Azure AI Foundry |
| Foundry Agent Monitor | 内置可视化仪表盘 | Azure AI Foundry |
| Grafana | 高级可视化、SLO 管理 | Azure Managed Grafana |
| Azure Monitor Alerts | 阈值告警、SLA 通知 | Azure Monitor |

2.3 数据流全景

[Agent 执行] 
    ↓ OTel SDK 自动埋点
[OTLP Exporter]
    ↓ 导出到
[Azure Monitor Exporter]
    ↓ 写入
[Application Insights]
    ├── traces 表(链路数据)
    ├── customMetrics 表(自定义指标)
    ├── exceptions 表(异常日志)
    └── customEvents 表(业务事件)
    ↓ 通过
[Log Analytics Workspace]
    ↓ KQL 查询
[Workbooks / Grafana / Alerts]

三、OpenTelemetry + GenAI 语义约定:标准化追踪基础

3.1 为什么需要 GenAI 语义约定

在没有统一标准的情况下,不同框架产生的 Span 属性名称各不相同:

  • LangChain 可能叫 llm.model_name
  • Semantic Kernel 可能叫 semantic_kernel.model
  • 自定义代码可能叫 model_id

这导致跨框架的追踪数据无法统一分析。
OpenTelemetry GenAI 语义约定(gen_ai.*)解决了这个问题。

参考:Semantic Conventions for Generative AI Spans

3.2 核心 Span 属性一览

LLM 调用 Span 属性

| 属性名 | 类型 | 描述 | 示例 |
| --- | --- | --- | --- |
| gen_ai.system | string | AI 提供商 | "az.ai.inference" |
| gen_ai.request.model | string | 请求的模型 | "gpt-4.1" |
| gen_ai.response.model | string | 实际响应的模型 | "gpt-4.1-2025-04-14" |
| gen_ai.usage.input_tokens | int | 输入 Token 数 | 1250 |
| gen_ai.usage.output_tokens | int | 输出 Token 数 | 380 |
| gen_ai.request.temperature | float | 温度参数 | 0.7 |
| gen_ai.request.max_tokens | int | 最大输出 Token | 4096 |
| gen_ai.response.finish_reason | string | 结束原因 | "stop" |

Agent Span 属性(Microsoft 扩展,已提交 OTel 规范):

| 属性名 | 类型 | 描述 |
| --- | --- | --- |
| gen_ai.agent.name | string | Agent 名称 |
| gen_ai.agent.id | string | Agent 唯一标识 |
| gen_ai.tool.name | string | 工具名称 |
| gen_ai.tool.call.id | string | 工具调用 ID |
| tool.call.arguments | string | 工具调用参数(JSON) |
| tool.call.results | string | 工具调用结果 |
| tool_definitions | string | 已注册工具定义 |

Agent 协作 Span(新增,多 Agent 专用):

| Span 名称 | 描述 |
| --- | --- |
| execute_task | Agent 执行任务 |
| agent_to_agent_interaction | A2A 调用 |
| agent.state.management | 状态读写 |
| agent_planning | 规划步骤 |
| agent_orchestration | 编排决策 |

参考:Azure AI Foundry: Advancing OpenTelemetry
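手工埋点时,属性名拼写错误或类型不符会让后续的 KQL 聚合悄悄失效。下面是一个按上表约定做最小校验的示意(REQUIRED_LLM_ATTRS 与 validate_genai_attrs 均为本文虚构的辅助代码,并非 SDK 提供):

```python
# 极简校验器:检查手工埋点的 gen_ai.* 属性是否齐全、类型正确
# 属性名与类型取自上表;本函数仅为示意用途,非官方 API
REQUIRED_LLM_ATTRS = {
    "gen_ai.system": str,
    "gen_ai.request.model": str,
    "gen_ai.usage.input_tokens": int,
    "gen_ai.usage.output_tokens": int,
}

def validate_genai_attrs(attrs: dict) -> list:
    """返回缺失或类型不符的属性描述,空列表表示通过"""
    problems = []
    for name, expected_type in REQUIRED_LLM_ATTRS.items():
        if name not in attrs:
            problems.append(f"missing: {name}")
        elif not isinstance(attrs[name], expected_type):
            problems.append(f"bad type: {name}")
    return problems

span_attrs = {
    "gen_ai.system": "az.ai.inference",
    "gen_ai.request.model": "gpt-4.1",
    "gen_ai.usage.input_tokens": 1250,
    "gen_ai.usage.output_tokens": 380,
}
print(validate_genai_attrs(span_attrs))  # []
```

在 CI 或埋点单元测试里跑一遍这类校验,可以在上线前拦住大部分"属性名写错导致仪表盘空白"的问题。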

3.3 一个完整 Trace 的 Span 树

对于一次典型的 RAG Agent 调用,Trace 的 Span 树如下:

Trace: user_query_abc123
│
├── [Span] agent_run (orchestrator_agent)
│   │  gen_ai.agent.name = "research_agent"
│   │  gen_ai.agent.id = "agent_001"
│   │  duration = 4.2s
│   │
│   ├── [Span] agent_planning
│   │      gen_ai.usage.input_tokens = 450
│   │      gen_ai.usage.output_tokens = 120
│   │      duration = 0.8s
│   │
│   ├── [Span] tool_call: search_knowledge_base
│   │      gen_ai.tool.name = "search_knowledge_base"
│   │      tool.call.arguments = {"query": "Azure pricing 2026"}
│   │      tool.call.results = "[{...}, {...}]"
│   │      duration = 0.5s
│   │
│   ├── [Span] chat (gpt-4.1)
│   │      gen_ai.request.model = "gpt-4.1"
│   │      gen_ai.usage.input_tokens = 1250
│   │      gen_ai.usage.output_tokens = 380
│   │      gen_ai.response.finish_reason = "stop"
│   │      duration = 2.1s
│   │
│   └── [Span] safety_check
│          content_safety.result = "safe"
│          duration = 0.3s
│
└── [Event] Evaluation
       groundedness = 0.92
       relevance = 0.88
       fluency = 0.95

3.4 安装与配置 OpenTelemetry

# 安装必要依赖
pip install "azure-ai-projects>=2.0.0b3"
pip install opentelemetry-sdk
pip install azure-monitor-opentelemetry-exporter
pip install opentelemetry-instrumentation-openai-v2
# otel_setup.py - OpenTelemetry 初始化
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter

def setup_otel_tracing(
    connection_string: str,
    service_name: str = "azure-ai-agent",
    enable_content_recording: bool = False  # 生产环境建议 False
) -> TracerProvider:
    """
    初始化 OpenTelemetry 链路追踪
    
    Args:
        connection_string: Application Insights 连接字符串
        service_name: 服务名称,用于标识 Agent 应用
        enable_content_recording: 是否记录 prompt/response 内容
                                  开发环境设 True,生产环境设 False
    Returns:
        TracerProvider 实例
    """
    from opentelemetry.sdk.resources import Resource
    
    resource = Resource.create({
        "service.name": service_name,
        "service.version": os.getenv("APP_VERSION", "1.0.0"),
        "deployment.environment": os.getenv("ENVIRONMENT", "production"),
    })
    
    # 创建 Azure Monitor Exporter
    exporter = AzureMonitorTraceExporter(
        connection_string=connection_string
    )
    
    # 创建 TracerProvider
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    
    # 设置为全局 TracerProvider
    trace.set_tracer_provider(provider)
    
    # 设置内容记录标志(通过环境变量传递给 SDK)
    if enable_content_recording:
        os.environ["AZURE_TRACING_GEN_AI_CONTENT_RECORDING_ENABLED"] = "true"
    else:
        os.environ.pop("AZURE_TRACING_GEN_AI_CONTENT_RECORDING_ENABLED", None)
    
    print(f"✅ OTel Tracing 初始化完成: {service_name}")
    return provider


# 使用示例
if __name__ == "__main__":
    provider = setup_otel_tracing(
        connection_string=os.environ["APPLICATION_INSIGHTS_CONNECTION_STRING"],
        service_name="my-foundry-agent",
        enable_content_recording=(os.getenv("ENVIRONMENT") == "development")
    )

四、Traces 链路追踪实战

4.1 Microsoft Agent Framework (MAF) 追踪配置

# maf_tracing.py - MAF Agent 追踪配置
import os
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from opentelemetry import trace

# 初始化 OTel(参考上节)
from otel_setup import setup_otel_tracing

setup_otel_tracing(
    connection_string=os.environ["APPLICATION_INSIGHTS_CONNECTION_STRING"],
    service_name="maf-research-agent",
    enable_content_recording=False,
)

# MAF 自动为每个 Agent 调用、工具调用、模型推理创建 Span
# 显式关闭 prompt/response 内容记录(生产环境建议)
os.environ["AZURE_TRACING_GEN_AI_CONTENT_RECORDING_ENABLED"] = "false"

tracer = trace.get_tracer(__name__)

async def run_agent_with_tracing(user_query: str) -> str:
    """
    带完整追踪的 MAF Agent 调用
    """
    with tracer.start_as_current_span("user_session") as session_span:
        session_span.set_attribute("user.query.length", len(user_query))
        session_span.set_attribute("session.id", f"sess_{os.urandom(4).hex()}")
        
        project_client = AIProjectClient(
            endpoint=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
            credential=DefaultAzureCredential(),
        )
        
        async with project_client:
            agent = await project_client.agents.get_agent(
                agent_id=os.environ["AZURE_AI_AGENT_ID"]
            )
            
            # 创建 Thread(每次对话一个 Thread = 一个 Trace)
            thread = await project_client.agents.threads.create()
            
            # 添加用户消息
            await project_client.agents.messages.create(
                thread_id=thread.id,
                role="user",
                content=user_query,
            )
            
            # 运行 Agent(MAF 自动创建 execute_task span)
            run = await project_client.agents.runs.create_and_process(
                thread_id=thread.id,
                agent_id=agent.id,
            )
            
            # 记录运行结果
            session_span.set_attribute("run.status", run.status)
            session_span.set_attribute("run.id", run.id)
            
            if run.status == "failed":
                session_span.set_status(
                    trace.StatusCode.ERROR,
                    description=str(run.last_error)
                )
                raise RuntimeError(f"Agent run failed: {run.last_error}")
            
            # 获取最终消息
            messages = await project_client.agents.messages.list(
                thread_id=thread.id
            )
            
            for msg in messages:
                if msg.role == "assistant":
                    return msg.content[0].text.value
            
            return ""

4.2 Semantic Kernel Agent 追踪配置

# sk_tracing.py - Semantic Kernel 追踪配置
import os
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion
from semantic_kernel.agents import ChatCompletionAgent
from opentelemetry import trace
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor

# SK 启用 OTel 追踪(一行代码)
OpenAIInstrumentor().instrument()

tracer = trace.get_tracer("sk-agent")

def create_sk_kernel_with_tracing() -> Kernel:
    """创建带追踪的 SK Kernel"""
    kernel = Kernel()
    
    # Azure OpenAI 服务(自动被 OTel 拦截追踪)
    kernel.add_service(
        AzureChatCompletion(
            service_id="gpt4-1",
            deployment_name=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"],
            endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
            api_key=os.environ["AZURE_OPENAI_API_KEY"],
        )
    )
    
    return kernel


async def run_sk_agent_with_tracing(
    kernel: Kernel,
    agent_name: str,
    instructions: str,
    user_message: str,
) -> str:
    """带追踪的 SK Agent 调用"""
    
    with tracer.start_as_current_span(
        f"sk_agent_run_{agent_name}",
        attributes={
            "gen_ai.agent.name": agent_name,
            "user.message.length": len(user_message),
        }
    ) as span:
        agent = ChatCompletionAgent(
            kernel=kernel,
            name=agent_name,
            instructions=instructions,
        )
        
        from semantic_kernel.contents import ChatHistory
        history = ChatHistory()
        history.add_user_message(user_message)
        
        result = ""
        async for response in agent.invoke(history):
            result += response.content
        
        span.set_attribute("response.length", len(result))
        span.set_status(trace.StatusCode.OK)
        
        return result

4.3 LangChain / LangGraph 追踪配置

# langchain_tracing.py - LangChain + Azure AI 追踪
import os
from langchain_azure_ai.callbacks import AzureAIOpenTelemetryTracer
from langchain_openai import AzureChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool

# 创建 Azure AI OTel 追踪器
tracer_callback = AzureAIOpenTelemetryTracer(
    enable_content_recording=False  # 生产环境设 False
)

# 创建 Azure OpenAI LLM(追踪自动注入)
llm = AzureChatOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"],
    api_version=os.environ["AZURE_OPENAI_VERSION"],
    callbacks=[tracer_callback],  # 注入追踪回调
)

# 定义工具
@tool
def search_knowledge_base(query: str) -> str:
    """在知识库中搜索信息"""
    # 实际实现...
    return f"Knowledge base result for: {query}"

@tool
def calculate_cost(tokens: int, model: str) -> float:
    """计算 Token 成本"""
    rates = {
        "gpt-4.1": {"input": 2.0, "output": 8.0},   # $/1M tokens
        "gpt-4.1-mini": {"input": 0.4, "output": 1.6},
    }
    rate = rates.get(model, {"input": 2.0, "output": 8.0})
    return tokens * rate["input"] / 1_000_000

tools = [search_knowledge_base, calculate_cost]

# 创建 Agent
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful Azure AI assistant with access to search tools."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    callbacks=[tracer_callback],  # 注入追踪回调
    verbose=False,
)

# 运行(追踪自动收集)
result = agent_executor.invoke(
    {"input": "查询 Azure OpenAI GPT-4.1 的最新定价"},
    config={"callbacks": [tracer_callback]},
)

4.4 OpenAI Agents SDK 追踪配置

# openai_agents_tracing.py - OpenAI Agents SDK 追踪
import os
from openai import AsyncAzureOpenAI
from agents import Agent, Runner, add_trace_processor
from agents.tracing import TracingProcessor
from opentelemetry import trace as otel_trace

# 自定义 OTel 追踪处理器
# 注意:SDK 回调收到的是 Trace / Span 对象(而非 dict),这里用 getattr 做防御性读取
class AzureMonitorTracingProcessor(TracingProcessor):
    """将 OpenAI Agents SDK 的追踪事件桥接到 OTel"""

    def __init__(self):
        self.tracer = otel_trace.get_tracer("openai-agents")
        self._spans = {}

    def on_trace_start(self, t):
        span = self.tracer.start_span(
            "agent_run",
            attributes={"trace.workflow_name": getattr(t, "name", "")},
        )
        self._spans[t.trace_id] = span

    def on_trace_end(self, t):
        span = self._spans.pop(t.trace_id, None)
        if span:
            span.end()

    def on_span_start(self, s):
        parent_span = self._spans.get(s.trace_id)
        ctx = otel_trace.set_span_in_context(parent_span) if parent_span else None
        child_span = self.tracer.start_span(
            type(getattr(s, "span_data", s)).__name__,  # 如 GenerationSpanData / FunctionSpanData
            context=ctx,
            attributes=self._extract_attributes(getattr(s, "span_data", None)),
        )
        self._spans[s.span_id] = child_span

    def on_span_end(self, s):
        span = self._spans.pop(s.span_id, None)
        if span:
            span.end()

    def _extract_attributes(self, span_data) -> dict:
        attrs = {}
        model = getattr(span_data, "model", None)
        if model:
            attrs["gen_ai.request.model"] = str(model)
        tool_name = getattr(span_data, "name", None)
        if tool_name:
            attrs["gen_ai.tool.name"] = str(tool_name)
        return attrs

    def force_flush(self):
        pass

    def shutdown(self):
        pass


# 注册自定义追踪处理器
add_trace_processor(AzureMonitorTracingProcessor())

# 创建 Agent
client = AsyncAzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2025-01-01-preview",
)

agent = Agent(
    name="azure_assistant",
    instructions="You are a helpful Azure cloud expert.",
    model="gpt-4.1",
)

async def run_with_tracing(user_input: str) -> str:
    with otel_trace.get_tracer(__name__).start_as_current_span("user_request"):
        result = await Runner.run(agent, user_input)
        return result.final_output

4.5 在 Foundry Portal 验证 Trace

追踪数据写入后,可在 Azure AI Foundry Portal 查看:

  1. 进入 Azure AI Foundry → 选择项目
  2. 左侧导航 → ObservabilityTraces
  3. 可以看到每次 Agent 调用的完整 Span 树
  4. 点击某个 Span 查看详细属性(Token 数量、延迟、工具调用参数等)

⚠️ 注意 :Trace 数据从写入到在 Portal 显示,最长可能需要 25 分钟。开发测试时需耐心等待。


五、Metrics 指标体系:关键性能与质量度量

5.1 AI Agent 核心指标分类

┌─────────────────────────────────────────────────────────────┐
│                   AI Agent 指标体系                          │
├─────────────────────────┬───────────────────────────────────┤
│     性能指标             │           质量指标                 │
├─────────────────────────┼───────────────────────────────────┤
│ P50/P95/P99 延迟        │ Groundedness Score                 │
│ 请求成功率               │ Relevance Score                    │
│ 工具调用成功率            │ Coherence Score                    │
│ Token 吞吐量(TPM)      │ Fluency Score                      │
│ 并发 Run 数量            │ Task Adherence Score               │
├─────────────────────────┼───────────────────────────────────┤
│     成本指标             │           安全指标                 │
├─────────────────────────┼───────────────────────────────────┤
│ 输入 Token 总量          │ 内容安全触发率                     │
│ 输出 Token 总量          │ Jailbreak 尝试次数                 │
│ 按模型分摊成本            │ PII 检测次数                      │
│ 按用户/部门分摊成本       │ Red Team 评分                     │
│ 成本趋势(日/周/月)      │ 敏感数据暴露次数                   │
└─────────────────────────┴───────────────────────────────────┘
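上表中的 P50/P95/P99 延迟可以在本地用最近秩(nearest-rank)法手工核算,便于和监控端的分位数口径相互验证。下面是一个纯 Python 的示意(percentile 为本文虚构的辅助函数,KQL 的 percentile() 是近似算法,结果可能略有差异):

```python
import math

def percentile(samples, p):
    """最近秩法计算第 p 百分位(p 取 0-100)"""
    xs = sorted(samples)
    # 第 p 百分位对应排序后的第 ceil(p/100 * n) 个样本(1-based)
    k = max(0, math.ceil(p / 100 * len(xs)) - 1)
    return xs[k]

latencies_ms = [120, 150, 180, 200, 250, 300, 450, 800, 1200, 4200]
print(percentile(latencies_ms, 50))  # 250
print(percentile(latencies_ms, 99))  # 4200
```

注意长尾样本(如上面的 4200ms)对 P99 的支配作用:均值看起来健康时,P95/P99 仍可能已经超标。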

5.2 通过 Python SDK 自定义指标

# custom_metrics.py - 自定义 AI Agent 指标
import time
import os
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from azure.monitor.opentelemetry.exporter import AzureMonitorMetricExporter

# 初始化 Metrics Provider
metric_exporter = AzureMonitorMetricExporter(
    connection_string=os.environ["APPLICATION_INSIGHTS_CONNECTION_STRING"]
)

reader = PeriodicExportingMetricReader(
    exporter=metric_exporter,
    export_interval_millis=30_000  # 每 30 秒导出一次
)

provider = MeterProvider(metric_readers=[reader])
metrics.set_meter_provider(provider)

meter = metrics.get_meter("azure-ai-agent-metrics")

# 定义指标
# 1. 请求延迟直方图(P50/P95/P99)
request_latency = meter.create_histogram(
    name="agent.request.duration",
    description="Agent 请求完成时间(毫秒)",
    unit="ms",
)

# 2. Token 用量计数器
token_input_counter = meter.create_counter(
    name="agent.token.input.total",
    description="输入 Token 累计数量",
    unit="token",
)

token_output_counter = meter.create_counter(
    name="agent.token.output.total",
    description="输出 Token 累计数量",
    unit="token",
)

# 3. 请求结果计数器
request_counter = meter.create_counter(
    name="agent.request.total",
    description="Agent 请求总数",
    unit="request",
)

# 4. 工具调用成功率
tool_call_counter = meter.create_counter(
    name="agent.tool_call.total",
    description="工具调用总数",
    unit="call",
)

# 5. 评估分数 Gauge(Observable Gauge 必须通过回调提供取值)
from opentelemetry.metrics import CallbackOptions, Observation

_latest_eval_scores = {"groundedness": 0.0, "relevance": 0.0}

def _observe_eval_scores(options: CallbackOptions):
    return [
        Observation(score, attributes={"eval.metric": name})
        for name, score in _latest_eval_scores.items()
    ]

eval_score_gauge = meter.create_observable_gauge(
    name="agent.eval.score",
    callbacks=[_observe_eval_scores],
    description="最近一批次的评估分数",
    unit="score",
)

# 使用示例
class AgentMetricsCollector:
    """Agent 指标收集器"""
    
    def __init__(self, agent_name: str, model: str, environment: str):
        self.agent_name = agent_name
        self.model = model
        self.environment = environment
        self._base_attrs = {
            "agent.name": agent_name,
            "model": model,
            "environment": environment,
        }
    
    def record_request(
        self,
        duration_ms: float,
        status: str,  # "success" | "failed" | "timeout"
        input_tokens: int,
        output_tokens: int,
        user_id: str = "anonymous",
        department: str = "unknown",
    ):
        """记录一次 Agent 请求的完整指标"""
        attrs = {
            **self._base_attrs,
            "status": status,
            "user.id": user_id,
            "department": department,
        }
        
        # 记录延迟
        request_latency.record(duration_ms, attributes=attrs)
        
        # 记录请求数
        request_counter.add(1, attributes=attrs)
        
        # 记录 Token 用量(按部门/用户归因)
        token_attrs = {
            **attrs,
            "token.type": "input",
        }
        token_input_counter.add(input_tokens, attributes=token_attrs)
        
        token_attrs["token.type"] = "output"
        token_output_counter.add(output_tokens, attributes=token_attrs)
    
    def record_tool_call(
        self,
        tool_name: str,
        success: bool,
        duration_ms: float,
    ):
        """记录工具调用指标"""
        tool_call_counter.add(
            1,
            attributes={
                **self._base_attrs,
                "tool.name": tool_name,
                "success": str(success),
            }
        )
        
        request_latency.record(
            duration_ms,
            attributes={
                **self._base_attrs,
                "operation": "tool_call",
                "tool.name": tool_name,
            }
        )


# 使用示例
collector = AgentMetricsCollector(
    agent_name="research_agent",
    model="gpt-4.1",
    environment="production",
)

# 模拟记录指标
start = time.time()
# ... 执行 Agent ...
end = time.time()

collector.record_request(
    duration_ms=(end - start) * 1000,
    status="success",
    input_tokens=1250,
    output_tokens=380,
    user_id="user_123",
    department="engineering",
)

5.3 Foundry Agent Monitor 内置指标

Foundry 内置 Agent 监控仪表盘提供以下开箱即用的摘要卡片:

| 摘要卡片 | 指标含义 | 告警阈值建议 |
| --- | --- | --- |
| Token Usage | 输入/输出 Token 总量与趋势 | 环比增长 > 50% |
| Latency | P50/P95 请求延迟 | P95 > 10s 需调查 |
| Run Success Rate | Agent 运行成功率 | < 95% 需告警 |
| Evaluation Scores | Groundedness/Relevance 平均分 | < 0.8 需关注 |
| Red Team Results | 安全评估通过率 | < 100% 立即处理 |

参考:Monitor agents with Agent Monitoring Dashboard
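上表的阈值建议也可以落成一段简单的巡检逻辑,在定时拉取指标快照后自动判断哪些项需要告警(指标键名与 check_alerts 均为本文假设的示意,实际应映射到你采集的指标名):

```python
# 示意:按上表阈值巡检一次指标快照,返回需要关注的指标名
# 阈值沿用上表建议;("lt", x) 表示低于 x 告警,("gt", x) 表示高于 x 告警
THRESHOLDS = {
    "run_success_rate": ("lt", 0.95),   # 成功率低于 95%
    "latency_p95_s":    ("gt", 10.0),   # P95 超过 10s
    "eval_avg_score":   ("lt", 0.8),    # 评估均分低于 0.8
}

def check_alerts(snapshot: dict) -> list:
    alerts = []
    for metric, (op, limit) in THRESHOLDS.items():
        value = snapshot.get(metric)
        if value is None:  # 缺失的指标跳过,由采集侧另行告警
            continue
        if (op == "lt" and value < limit) or (op == "gt" and value > limit):
            alerts.append(metric)
    return alerts

print(check_alerts({"run_success_rate": 0.93, "latency_p95_s": 4.2}))  # ['run_success_rate']
```

真实环境中这类逻辑通常直接配置在 Azure Monitor Alerts 里,本地版本主要用于开发期自测与阈值演练。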


六、Logs 日志管理与 KQL 查询

6.1 结构化日志最佳实践

# structured_logging.py - 结构化日志配置
import logging
import json
import os
from datetime import datetime, timezone
from opentelemetry.instrumentation.logging import LoggingInstrumentor

# 启用日志与 OTel Trace 关联(自动注入 trace_id、span_id)
LoggingInstrumentor().instrument(set_logging_format=True)

# LogRecord 的标准字段集合,用于从 record 中分离出通过 extra 注入的自定义字段
_STANDARD_RECORD_ATTRS = set(
    logging.LogRecord("", 0, "", 0, "", (), None).__dict__
) | {"message", "asctime"}

# 自定义 JSON 日志格式
class JsonFormatter(logging.Formatter):
    """结构化 JSON 日志格式,便于 KQL 查询"""
    
    def format(self, record: logging.LogRecord) -> str:
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": os.getenv("SERVICE_NAME", "ai-agent"),
            "environment": os.getenv("ENVIRONMENT", "production"),
        }
        
        # 注入 OTel Trace 上下文
        if hasattr(record, "otelTraceID"):
            log_data["trace_id"] = record.otelTraceID
        if hasattr(record, "otelSpanID"):
            log_data["span_id"] = record.otelSpanID
        
        # 注入额外字段:logging 的 extra 参数会把键直接挂到 record 上,
        # 这里挑出所有非标准字段并入日志
        for key, value in record.__dict__.items():
            if key not in _STANDARD_RECORD_ATTRS and not key.startswith("otel"):
                log_data[key] = value
        
        # 注入异常信息
        if record.exc_info:
            log_data["exception"] = {
                "type": record.exc_info[0].__name__,
                "message": str(record.exc_info[1]),
                "stacktrace": self.formatException(record.exc_info),
            }
        
        return json.dumps(log_data, ensure_ascii=False)


# 配置日志
def setup_logging(level: str = "INFO"):
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    
    root_logger = logging.getLogger()
    root_logger.setLevel(getattr(logging, level))
    root_logger.addHandler(handler)
    
    return logging.getLogger("azure-ai-agent")


# 专用日志类
logger = setup_logging()

class AgentLogger:
    """Agent 专用结构化日志器"""
    
    @staticmethod
    def log_agent_start(agent_name: str, thread_id: str, user_query_hash: str):
        logger.info("Agent run started", extra={
            "event": "agent.run.start",
            "agent.name": agent_name,
            "thread.id": thread_id,
            "query.hash": user_query_hash,  # 不记录原始 query(隐私保护)
        })
    
    @staticmethod
    def log_tool_call(tool_name: str, success: bool, duration_ms: float, error: str | None = None):
        level = logging.INFO if success else logging.WARNING
        logger.log(level, f"Tool call {'succeeded' if success else 'failed'}", extra={
            "event": "agent.tool_call",
            "tool.name": tool_name,
            "success": success,
            "duration_ms": duration_ms,
            "error": error,
        })
    
    @staticmethod
    def log_safety_event(event_type: str, category: str, severity: str):
        logger.warning("Safety event detected", extra={
            "event": "agent.safety",
            "safety.event_type": event_type,  # "content_filtered" | "jailbreak_attempt"
            "safety.category": category,       # "violence" | "hate" | "sexual" | "self_harm"
            "safety.severity": severity,        # "low" | "medium" | "high"
        })
    
    @staticmethod
    def log_cost_event(model: str, input_tokens: int, output_tokens: int,
                       user_id: str, department: str, estimated_cost_usd: float):
        logger.info("Token usage recorded", extra={
            "event": "agent.cost",
            "model": model,
            "tokens.input": input_tokens,
            "tokens.output": output_tokens,
            "user.id": user_id,
            "department": department,
            "cost.estimated_usd": estimated_cost_usd,
        })
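上面 log_agent_start 里的 query.hash 可以用单向哈希实现:既能在日志中关联同一条 query 的多次出现,又不落盘用户原文(隐私保护)。一个最小示意(hash_query 为本文虚构的辅助函数):

```python
import hashlib

def hash_query(query: str) -> str:
    """对用户 query 做 SHA-256 单向哈希,截取前 16 位十六进制作为日志关联键"""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]

h = hash_query("查询 Azure OpenAI GPT-4.1 的最新定价")
print(len(h))  # 16
# 同一 query 的哈希稳定,可用于 KQL 中按 query.hash 聚合
print(h == hash_query("查询 Azure OpenAI GPT-4.1 的最新定价"))  # True
```

如果同一 query 的可关联性本身也是隐私风险,可以在哈希前拼接一个按租户轮换的盐值。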

6.2 KQL 查询示例库

以下是生产环境中最常用的 KQL 查询,可直接在 Log Analytics Workspace 或 Application Insights 中使用:

查询 1:P95 请求延迟趋势(过去 24 小时)
// Agent 请求 P95 延迟趋势
customMetrics
| where name == "agent.request.duration"
| where timestamp > ago(24h)
| extend
    agent_name = tostring(customDimensions["agent.name"]),
    model = tostring(customDimensions["model"]),
    status = tostring(customDimensions["status"])
| where status == "success"
| summarize
    p50_ms = percentile(value, 50),
    p95_ms = percentile(value, 95),
    p99_ms = percentile(value, 99),
    request_count = count()
    by bin(timestamp, 1h), agent_name, model
| order by timestamp desc
| render timechart
查询 2:Token 用量与成本归因(按部门)
// 按部门统计 Token 用量(用于成本分摊)
customMetrics
| where name in ("agent.token.input.total", "agent.token.output.total")
| where timestamp > ago(7d)
| extend
    token_type = name,
    department = tostring(customDimensions["department"]),
    model = tostring(customDimensions["model"])
| summarize
    total_tokens = sum(value)
    by token_type, department, model, bin(timestamp, 1d)
| extend
    // 成本计算($/1M tokens)
    cost_usd = case(
        model == "gpt-4.1" and token_type == "agent.token.input.total",
        total_tokens * 2.0 / 1000000,
        model == "gpt-4.1" and token_type == "agent.token.output.total",
        total_tokens * 8.0 / 1000000,
        model == "gpt-4.1-mini" and token_type == "agent.token.input.total",
        total_tokens * 0.4 / 1000000,
        model == "gpt-4.1-mini" and token_type == "agent.token.output.total",
        total_tokens * 1.6 / 1000000,
        total_tokens * 2.0 / 1000000  // 默认
    )
| summarize
    total_cost_usd = sum(cost_usd),
    total_tokens_by_type = sum(total_tokens)
    by department, model, timestamp
| order by total_cost_usd desc
查询 3:工具调用失败率告警查询
kusto
// 工具调用失败率(滑动窗口 5 分钟)
customMetrics
| where name == "agent.tool_call.total"
| where timestamp > ago(1h)
| extend
    tool_name = tostring(customDimensions["tool.name"]),
    success = tostring(customDimensions["success"])
| summarize
    total_calls = count(),
    failed_calls = countif(success == "False")
    by tool_name, bin(timestamp, 5m)
| extend failure_rate = todecimal(failed_calls) / todecimal(total_calls)
| where failure_rate > 0.1  // 失败率超过 10% 触发告警
| order by failure_rate desc
查询 4:异常追踪与根因分析
kusto
// Agent 异常日志分析
exceptions
| where timestamp > ago(24h)
| where customDimensions["service"] == "ai-agent"
| extend
    exception_type = type,
    error_message = outerMessage,
    trace_id = tostring(customDimensions["trace_id"]),
    agent_name = tostring(customDimensions["agent.name"]),
    tool_name = tostring(customDimensions["tool.name"])
| project timestamp, exception_type, error_message, agent_name, tool_name, trace_id
| summarize
    error_count = count(),
    latest_occurrence = max(timestamp),
    sample_trace_ids = make_set(trace_id, 5)
    by exception_type, error_message, agent_name
| order by error_count desc
| take 20
查询 5:内容安全事件统计
kusto
// 内容安全事件趋势
customEvents
| where name == "agent.safety"
| where timestamp > ago(7d)
| extend
    event_type = tostring(customDimensions["safety.event_type"]),
    category = tostring(customDimensions["safety.category"]),
    severity = tostring(customDimensions["safety.severity"])
| summarize
    event_count = count()
    by event_type, category, severity, bin(timestamp, 1d)
| order by event_count desc
| render barchart
查询 6:端到端链路延迟分解(Trace 分析)
kusto
// 分解各阶段延迟(从 Trace 数据)
traces
| where timestamp > ago(1h)
| where customDimensions["service.name"] == "azure-ai-agent"
| extend
    span_name = tostring(customDimensions["span_name"]),
    duration_ms = todouble(customDimensions["duration_ms"]),
    trace_id = tostring(customDimensions["trace_id"])
| summarize
    avg_duration_ms = avg(duration_ms),
    p95_duration_ms = percentile(duration_ms, 95),
    call_count = count()
    by span_name
| order by avg_duration_ms desc
| render columnchart

6.3 创建 Log Analytics 告警规则

bash
#!/bin/bash
# create_log_alert.sh - 创建 KQL 告警规则

RESOURCE_GROUP="rg-ai-foundry-prod"
SUBSCRIPTION_ID=$(az account show --query id -o tsv)
WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group $RESOURCE_GROUP \
  --workspace-name "law-ai-agent-prod" \
  --query id -o tsv)

# 创建告警规则:工具调用失败率 > 10%
# (--condition 中的 'FailureCheck' 为占位符,其 KQL 由 --condition-query 定义)
az monitor scheduled-query create \
  --resource-group $RESOURCE_GROUP \
  --name "alert-tool-call-failure-rate" \
  --scopes $WORKSPACE_ID \
  --condition "count 'FailureCheck' > 0" \
  --condition-query FailureCheck="
    customMetrics
    | where name == 'agent.tool_call.total'
    | where timestamp > ago(5m)
    | extend success = tostring(customDimensions['success'])
    | summarize 
        total = count(),
        failed = countif(success == 'False')
    | extend failure_rate = todecimal(failed) / todecimal(total)
    | where failure_rate > 0.1
  " \
  --evaluation-frequency 5m \
  --window-size 5m \
  --severity 2 \
  --description "工具调用失败率超过 10%,需立即调查" \
  --action-groups "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Insights/actionGroups/ag-ai-agent-oncall"

七、Azure AI Evaluation SDK:质量与安全评估

7.1 评估体系概览

Azure AI Evaluation SDK(azure-ai-evaluation)提供了**离线评估**与**在线持续评估**两种模式:

                    评估体系架构
┌────────────────────────────────────────────────────┐
│                                                    │
│  离线评估(Offline Evaluation)                     │
│  ┌─────────────────────────────────────────────┐   │
│  │  测试数据集 (JSONL)                           │   │
│  │       ↓                                      │   │
│  │  evaluate() API                              │   │
│  │       ↓                                      │   │
│  │  并行调用多个 Evaluator                       │   │
│  │       ↓                                      │   │
│  │  聚合指标 + 逐行结果                          │   │
│  │       ↓                                      │   │
│  │  上传至 Foundry 项目(可选)                  │   │
│  └─────────────────────────────────────────────┘   │
│                                                    │
│  在线持续评估(Online Evaluation)                  │
│  ┌─────────────────────────────────────────────┐   │
│  │  生产流量(实时 Agent 运行)                  │   │
│  │       ↓                                      │   │
│  │  连续评估规则(Continuous Evaluation Rule)   │   │
│  │       ↓                                      │   │
│  │  自动采样 + 评估(max 100 runs/hour)         │   │
│  │       ↓                                      │   │
│  │  Foundry Agent Monitor 仪表盘               │   │
│  └─────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────┘

7.2 内置 Evaluator 完整列表

python
# evaluator_catalog.py - 完整 Evaluator 目录
from azure.ai.evaluation import (
    # 通用质量评估
    CoherenceEvaluator,      # 连贯性(0-5分)
    FluencyEvaluator,        # 流畅性(0-5分)
    QAEvaluator,             # 问答综合评估(捆绑质量指标)
    
    # 文本相似性
    SimilarityEvaluator,     # 语义相似度
    F1ScoreEvaluator,        # F1 分数
    BleuScoreEvaluator,      # BLEU 分数(机器翻译)
    RougeScoreEvaluator,     # ROUGE 分数(摘要)
    MeteorScoreEvaluator,    # METEOR 分数
    GleuScoreEvaluator,      # GLEU 分数
    
    # RAG 专用评估
    RetrievalEvaluator,      # 检索相关性
    GroundednessEvaluator,   # 基础性(是否基于上下文)
    RelevanceEvaluator,      # 相关性(回答是否切题)
    
    # 风险与安全
    ViolenceEvaluator,       # 暴力内容
    SexualEvaluator,         # 性内容
    SelfHarmEvaluator,       # 自我伤害
    HateUnfairnessEvaluator, # 仇恨/不公平
    IndirectAttackEvaluator, # 间接攻击
    
    # Agent 专用评估(新增)
    IntentResolutionEvaluator,  # 意图识别准确性
    ToolCallAccuracyEvaluator,  # 工具调用准确性
    TaskAdherenceEvaluator,     # 任务遵从性
    
    # 复合评估器
    ContentSafetyEvaluator,  # 安全综合评估(捆绑安全指标)
)

7.3 离线批量评估完整示例

python
# offline_evaluation.py - 完整的离线批量评估
import json
import os
from pathlib import Path
from azure.ai.evaluation import (
    evaluate,
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    ViolenceEvaluator,
    IntentResolutionEvaluator,
    ToolCallAccuracyEvaluator,
)
from azure.identity import DefaultAzureCredential

# 配置 Azure AI 项目(用于 AI 辅助评估)
azure_ai_project = {
    "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
    "resource_group_name": os.environ["AZURE_RESOURCE_GROUP"],
    "project_name": os.environ["AZURE_AI_PROJECT_NAME"],
}

# 评估模型配置(推荐 gpt-4o-mini 以降低成本)
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "azure_deployment": "gpt-4o-mini",  # 评估专用部署
    "api_version": "2024-08-01-preview",
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
}

# 初始化评估器
evaluators = {
    # 质量评估
    "groundedness": GroundednessEvaluator(model_config=model_config),
    "relevance": RelevanceEvaluator(model_config=model_config),
    "coherence": CoherenceEvaluator(model_config=model_config),
    "fluency": FluencyEvaluator(model_config=model_config),
    
    # 安全评估(需要 Azure AI 项目)
    "violence": ViolenceEvaluator(
        azure_ai_project=azure_ai_project,
        credential=DefaultAzureCredential(),
    ),
    
    # Agent 专用评估
    "intent_resolution": IntentResolutionEvaluator(model_config=model_config),
    "tool_call_accuracy": ToolCallAccuracyEvaluator(model_config=model_config),
}

# 列映射(数据集列名 → 评估器期望的键名)
evaluator_config = {
    "groundedness": {
        "column_mapping": {
            "query": "${data.user_query}",
            "context": "${data.retrieved_context}",
            "response": "${data.agent_response}",
        }
    },
    "relevance": {
        "column_mapping": {
            "query": "${data.user_query}",
            "response": "${data.agent_response}",
        }
    },
    "coherence": {
        "column_mapping": {
            "query": "${data.user_query}",
            "response": "${data.agent_response}",
        }
    },
    "fluency": {
        "column_mapping": {
            "response": "${data.agent_response}",
        }
    },
    "violence": {
        "column_mapping": {
            "query": "${data.user_query}",
            "response": "${data.agent_response}",
        }
    },
    "intent_resolution": {
        "column_mapping": {
            "query": "${data.user_query}",
            "response": "${data.agent_response}",
            "conversation": "${data.conversation}",
        }
    },
    "tool_call_accuracy": {
        "column_mapping": {
            "query": "${data.user_query}",
            "tool_calls": "${data.tool_calls}",
            "tool_definitions": "${data.tool_definitions}",
        }
    },
}

# 准备测试数据集
def prepare_test_dataset(output_path: str):
    """从 Agent 运行历史中创建评估数据集"""
    test_cases = [
        {
            "user_query": "Azure OpenAI GPT-4.1 的上下文窗口有多大?",
            "retrieved_context": "GPT-4.1 支持 1M token 上下文窗口,可处理超长文档。",
            "agent_response": "GPT-4.1 拥有 100 万 token 的超大上下文窗口,特别适合需要处理大量文档的场景。",
            "conversation": [
                {"role": "user", "content": "Azure OpenAI GPT-4.1 的上下文窗口有多大?"},
                {"role": "assistant", "content": "GPT-4.1 拥有 100 万 token 的超大上下文窗口..."},
            ],
            "tool_calls": [
                {
                    "function": {
                        "name": "search_knowledge_base",
                        "arguments": '{"query": "GPT-4.1 context window"}'
                    }
                }
            ],
            "tool_definitions": [
                {
                    "name": "search_knowledge_base",
                    "description": "在知识库中搜索技术文档",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string", "description": "搜索关键词"}
                        },
                        "required": ["query"]
                    }
                }
            ]
        },
        # 添加更多测试用例...
    ]
    
    with open(output_path, "w", encoding="utf-8") as f:
        for case in test_cases:
            f.write(json.dumps(case, ensure_ascii=False) + "\n")
    
    return output_path


# 执行评估
def run_evaluation(dataset_path: str, output_dir: str):
    """执行批量评估并上传结果到 Foundry"""
    
    results = evaluate(
        data=dataset_path,
        evaluators=evaluators,
        evaluator_config=evaluator_config,
        output_path=f"{output_dir}/eval_results.jsonl",
        
        # 上传到 Foundry 项目(可选,用于在 Portal 查看)
        azure_ai_project=azure_ai_project,
        
        # 结果追踪
        evaluation_name=f"regression_eval_{Path(dataset_path).stem}",
    )
    
    # 打印聚合指标
    # 注意:此阈值判定仅适用于质量分(0-1,越高越好);
    # 安全类指标为缺陷率(越低越好),由下方 safety_gate 单独判定
    print("\n=== 评估结果摘要 ===")
    for metric_name, metric_value in results["metrics"].items():
        status = "✅" if metric_value >= 0.8 else "⚠️" if metric_value >= 0.6 else "❌"
        print(f"  {status} {metric_name}: {metric_value:.3f}")
    
    # 检查是否通过质量门控
    quality_gate_passed = all(
        results["metrics"].get(m, 0) >= 0.8
        for m in ["groundedness", "relevance", "coherence"]
    )
    
    safety_gate_passed = (
        results["metrics"].get("violence_defect_rate", 0) == 0.0
    )
    
    print(f"\n质量门控: {'✅ 通过' if quality_gate_passed else '❌ 未通过'}")
    print(f"安全门控: {'✅ 通过' if safety_gate_passed else '❌ 未通过'}")
    
    return results, quality_gate_passed and safety_gate_passed


# 主程序
if __name__ == "__main__":
    dataset_path = prepare_test_dataset("/tmp/eval_dataset.jsonl")
    results, passed = run_evaluation(dataset_path, "/tmp/eval_output")
    
    if not passed:
        print("\n⚠️  评估未通过,建议延迟部署直至问题解决!")
        exit(1)
    else:
        print("\n✅ 评估通过,可以安全部署!")

7.4 Agent 专用评估器详解

IntentResolutionEvaluator(意图识别准确性)
python
# intent_resolution_example.py
from azure.ai.evaluation import IntentResolutionEvaluator

evaluator = IntentResolutionEvaluator(model_config=model_config)

# 单次评估
result = evaluator(
    query="帮我查一下明天北京的天气,顺便告诉我最近的机场",
    response="明天北京天气晴,最高气温 22°C。距离最近的机场是首都国际机场(T3航站楼距离市区约 28 公里)。",
    conversation=[
        {"role": "user", "content": "帮我查一下明天北京的天气,顺便告诉我最近的机场"},
        {"role": "assistant", "content": "明天北京天气晴..."},
    ]
)

print(f"Intent Resolution Score: {result['intent_resolution']}")
# 评分范围: 1(完全不符合)到 5(完全符合)
# 若 > 3 表示意图基本解析正确
ToolCallAccuracyEvaluator(工具调用准确性)
python
# tool_call_accuracy_example.py
from azure.ai.evaluation import ToolCallAccuracyEvaluator

evaluator = ToolCallAccuracyEvaluator(model_config=model_config)

result = evaluator(
    query="查询 Azure OpenAI 在东亚区的可用性",
    tool_calls=[
        {
            "function": {
                "name": "search_service_availability",
                "arguments": '{"service": "Azure OpenAI", "region": "East Asia"}'
            }
        }
    ],
    tool_definitions=[
        {
            "name": "search_service_availability",
            "description": "查询 Azure 服务在指定区域的可用性",
            "parameters": {
                "type": "object",
                "properties": {
                    "service": {"type": "string"},
                    "region": {"type": "string"}
                },
                "required": ["service", "region"]
            }
        }
    ]
)

print(f"Tool Call Accuracy: {result['tool_call_accuracy']}")
# 评分: 1(工具选择/参数完全错误)到 5(工具选择和参数完全正确)

7.5 在线持续评估配置

python
# continuous_evaluation.py - 在线持续评估配置
# 注意:以下持续评估相关类名属于预览版 API,可能随 azure-ai-projects
# 版本变化,请以官方 SDK 文档为准
import os
import asyncio
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    ContinuousEvaluationConfiguration,
    BuiltinEvaluator,
)
from azure.identity import DefaultAzureCredential
from azure.identity import DefaultAzureCredential

async def setup_continuous_evaluation():
    """
    配置在线持续评估
    自动对生产流量进行采样和评估
    """
    project_client = AIProjectClient(
        endpoint=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
        credential=DefaultAzureCredential(),
    )
    
    async with project_client:
        # 获取 Agent
        agent = await project_client.agents.get_agent(
            agent_id=os.environ["AZURE_AI_AGENT_ID"]
        )
        
        # 配置持续评估规则
        # 内置评估器列表:
        # - builtin.violence(暴力内容)
        # - builtin.sexual(性内容)
        # - builtin.self_harm(自我伤害)
        # - builtin.hate_unfairness(仇恨)
        # - builtin.groundedness(基础性)
        # - builtin.relevance(相关性)
        
        eval_config = ContinuousEvaluationConfiguration(
            evaluators=[
                BuiltinEvaluator(name="builtin.groundedness"),
                BuiltinEvaluator(name="builtin.relevance"),
                BuiltinEvaluator(name="builtin.violence"),
                BuiltinEvaluator(name="builtin.hate_unfairness"),
            ],
            max_runs_per_hour=100,  # 每小时最多评估 100 次
            sampling_rate=0.1,       # 采样 10% 的生产流量
        )
        
        # 绑定到 Agent
        await project_client.agents.update_agent(
            agent_id=agent.id,
            evaluation_configuration=eval_config,
        )
        
        print(f"✅ 持续评估已配置: {agent.name}")
        print(f"   采样率: 10%, 最大 100 次/小时")
        print(f"   评估器: Groundedness, Relevance, Violence, Hate")

asyncio.run(setup_continuous_evaluation())



八、SLO / SLA 设计与告警配置

8.1 AI Agent SLO 设计框架

传统 Web 服务的 SLO 主要关注延迟和可用性。AI Agent 的 SLO 还必须涵盖**质量维度**:

┌─────────────────────────────────────────────────────────┐
│              AI Agent SLO 三维框架                      │
├──────────────────┬──────────────────┬───────────────────┤
│   可用性 (A)      │   性能 (P)        │   质量 (Q)        │
├──────────────────┼──────────────────┼───────────────────┤
│ 服务正常运行率    │ P50 延迟 ≤ 2s    │ Groundedness ≥ 0.8│
│ ≥ 99.5%          │ P95 延迟 ≤ 8s    │ Relevance ≥ 0.8   │
│                  │ P99 延迟 ≤ 15s   │ 工具调用成功率≥98% │
│ 可成功处理的请    │                  │                   │
│ 求比率 ≥ 99%     │ Token 吞吐量 ≥   │ 安全过滤误触发率  │
│                  │ 10,000 TPM       │ ≤ 1%              │
└──────────────────┴──────────────────┴───────────────────┘

8.2 SLO 指标计算公式

python
# slo_calculator.py - SLO 指标计算
class SLOCalculator:
    """AI Agent SLO 计算器"""
    
    # SLO 目标定义
    SLO_TARGETS = {
        "availability": 0.995,          # 99.5% 可用性
        "p95_latency_ms": 8000,         # P95 ≤ 8 秒
        "p99_latency_ms": 15000,        # P99 ≤ 15 秒
        "success_rate": 0.99,           # 99% 请求成功率
        "tool_success_rate": 0.98,      # 98% 工具调用成功率
        "groundedness_score": 0.80,     # Groundedness ≥ 0.8
        "relevance_score": 0.80,        # Relevance ≥ 0.8
        "safety_defect_rate": 0.0,      # 安全缺陷率 = 0
    }
    
    @staticmethod
    def calculate_error_budget(
        slo_target: float,
        time_window_hours: int = 720  # 30 天
    ) -> dict:
        """计算错误预算(Error Budget)"""
        total_minutes = time_window_hours * 60
        allowed_bad_minutes = total_minutes * (1 - slo_target)
        allowed_bad_seconds = allowed_bad_minutes * 60
        
        return {
            "slo_target_pct": f"{slo_target * 100:.2f}%",
            "time_window_hours": time_window_hours,
            "allowed_downtime_minutes": f"{allowed_bad_minutes:.1f}",
            "allowed_downtime_seconds": f"{allowed_bad_seconds:.0f}",
        }
    
    @staticmethod
    def assess_slo_breach(
        current_availability: float,
        current_p95_ms: float,
        current_success_rate: float,
        current_groundedness: float,
    ) -> list[dict]:
        """检查 SLO 违反情况"""
        targets = SLOCalculator.SLO_TARGETS
        breaches = []
        
        checks = [
            ("availability", current_availability, targets["availability"]),
            ("p95_latency_ms", current_p95_ms, targets["p95_latency_ms"], "lte"),
            ("success_rate", current_success_rate, targets["success_rate"]),
            ("groundedness", current_groundedness, targets["groundedness_score"]),
        ]
        
        for metric, current, target, *mode in checks:
            compare_mode = mode[0] if mode else "gte"
            
            if compare_mode == "gte":
                breached = current < target
            else:  # lte
                breached = current > target
            
            if breached:
                gap = abs(current - target)
                # 按相对偏差判定严重程度(绝对差值对毫秒级延迟指标不适用)
                gap_ratio = gap / target if target else float("inf")
                severity = "critical" if gap_ratio > 0.05 else "warning"
                
                breaches.append({
                    "metric": metric,
                    "current": current,
                    "target": target,
                    "gap": gap,
                    "severity": severity,
                })
        
        return breaches


# 使用示例
budget_99_5 = SLOCalculator.calculate_error_budget(0.995)
print(f"99.5% 可用性 → 每月允许宕机: {budget_99_5['allowed_downtime_minutes']} 分钟")
# 输出: 99.5% 可用性 → 每月允许宕机: 216.0 分钟 (3.6 小时)

8.3 Azure Monitor 告警配置

python
# alerts_setup.py - 使用 Python SDK 创建告警规则
import os
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    MetricAlertResource,
    MetricAlertSingleResourceMultipleMetricCriteria,
    MetricCriteria,
    MetricAlertAction,
)
from azure.identity import DefaultAzureCredential

def create_ai_agent_alerts(
    resource_group: str,
    app_insights_resource_id: str,
    action_group_id: str,
):
    """创建 AI Agent 生产监控告警套件"""
    
    credential = DefaultAzureCredential()
    subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
    
    client = MonitorManagementClient(credential, subscription_id)
    
    alert_definitions = [
        # 1. 高延迟告警
        {
            "name": "alert-agent-high-latency",
            "description": "Agent P95 延迟超过 10 秒",
            "severity": 2,
            "metric_name": "agent.request.duration",
            "operator": "GreaterThan",
            "threshold": 10000,  # 毫秒
            "time_aggregation": "Percentile95",
            "window_size": "PT5M",
        },
        # 2. 低成功率告警
        {
            "name": "alert-agent-low-success-rate",
            "description": "Agent 运行成功率低于 95%",
            "severity": 1,  # Critical
            "metric_name": "agent.request.total",
            "operator": "LessThan",
            "threshold": 0.95,
            "time_aggregation": "Average",
            "window_size": "PT10M",
        },
        # 3. Token 用量异常告警
        {
            "name": "alert-agent-token-spike",
            "description": "Token 用量异常飙升(可能的 Prompt Injection)",
            "severity": 2,
            "metric_name": "agent.token.input.total",
            "operator": "GreaterThan",
            "threshold": 50000,  # 5 分钟内 50K tokens
            "time_aggregation": "Total",
            "window_size": "PT5M",
        },
    ]
    
    created_alerts = []
    for alert_def in alert_definitions:
        # 注意:此处仅打印定义作示意;实际创建需将 alert_def 映射为
        # MetricCriteria / MetricAlertResource,并调用
        # client.metric_alerts.create_or_update(resource_group, alert_def["name"], ...)
        print(f"创建告警规则: {alert_def['name']}")
        created_alerts.append(alert_def["name"])
    
    print(f"\n✅ 已生成 {len(created_alerts)} 条告警规则定义")
    return created_alerts

8.4 Grafana SLO 配置

在 Azure Managed Grafana 中配置 SLO:

yaml
# grafana-slo-config.yaml
apiVersion: slo.grafana.com/v1alpha1
kind: SLO
metadata:
  name: ai-agent-availability-slo
  namespace: azure-ai-monitoring

spec:
  title: "AI Agent 可用性 SLO"
  description: "生产 AI Agent 服务可用性 99.5%"
  
  service: ai-agent-service
  
  # SLI(服务水平指标)定义
  query:
    type: ratio
    ratio:
      successMetric:
        prometheusMetric:
          type: counter
          metric: "agent_request_success_total"
      totalMetric:
        prometheusMetric:
          type: counter
          metric: "agent_request_total"
  
  # SLO 目标
  objectives:
    - value: 0.995
      window: 30d  # 30 天滚动窗口
  
  # 告警配置
  alerting:
    fastBurn:
      annotations:
        summary: "AI Agent 可用性快速下降"
        description: "错误预算消耗速率超过 14x"
      labels:
        severity: critical
    slowBurn:
      annotations:
        summary: "AI Agent 可用性缓慢下降"
        description: "错误预算消耗速率超过 2x"
      labels:
        severity: warning
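
上述 fastBurn / slowBurn 告警中的 14x 与 2x 阈值,指的是错误预算消耗速率(burn rate)。可以用几行 Python 说明其计算方式(示意实现,与 Grafana 内部逻辑无关):

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """错误预算消耗速率 = 观测窗口内实际错误率 / SLO 允许的错误率"""
    allowed_error_rate = 1 - slo_target  # 99.5% SLO → 允许 0.5% 错误
    return error_rate / allowed_error_rate

# 过去 1 小时实际错误率 7%:消耗速率 14x,按此速度约 2 天烧光 30 天预算
print(f"{burn_rate(0.07, 0.995):.1f}x")  # 14.0x
# 实际错误率 1%:消耗速率 2x,触发 slowBurn 级别告警
print(f"{burn_rate(0.01, 0.995):.1f}x")  # 2.0x
```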

参考Set service level objectives and manage SLO alerts in Grafana Cloud


九、成本归因与 Token 用量优化

9.1 成本计算体系

Azure AI Foundry 的成本主要由以下部分构成:

| 成本类型 | 计费单位 | GPT-4.1 价格 | GPT-4.1 mini 价格 |
| --- | --- | --- | --- |
| 输入 Token | $/1M tokens | $2.00 | $0.40 |
| 输出 Token | $/1M tokens | $8.00 | $1.60 |
| 缓存输入 Token | $/1M tokens | $0.50 | $0.10 |
| 嵌入(text-embedding-3-large) | $/1M tokens | $0.13 | - |
| 图像输入(GPT-4.1) | 按 tile 计费 | 可变 | - |

**注意**:价格随时可能更新,请以 Azure 官方定价页面为准。
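
按上表价格,单次请求的成本估算只是一个线性公式。下面用几行纯 Python 演示(价格取自上表,实际请以官方定价为准):

```python
# 单次请求成本估算(价格单位:$ / 1M tokens,以 Azure 官方定价页为准)
PRICES = {
    "gpt-4.1":      {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 同样的 2000 输入 + 500 输出,两个模型成本相差 5 倍
print(f"gpt-4.1:      ${estimate_cost('gpt-4.1', 2000, 500):.4f}")       # $0.0080
print(f"gpt-4.1-mini: ${estimate_cost('gpt-4.1-mini', 2000, 500):.4f}")  # $0.0016
```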

9.2 成本追踪与归因系统

python
# cost_tracker.py - 完整的成本追踪系统
import os
import json
from datetime import datetime
from dataclasses import dataclass, field, asdict
from typing import Optional
from opentelemetry import trace

@dataclass
class TokenUsage:
    """Token 用量记录"""
    request_id: str
    timestamp: str
    agent_name: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_input_tokens: int = 0
    
    # 成本归因标签
    user_id: str = "anonymous"
    department: str = "unallocated"
    project: str = "default"
    environment: str = "production"
    use_case: str = "general"
    
    # 计算字段
    input_cost_usd: float = field(default=0.0, init=False)
    output_cost_usd: float = field(default=0.0, init=False)
    total_cost_usd: float = field(default=0.0, init=False)
    
    # 模型价格表($/1M tokens)
    MODEL_PRICES = {
        "gpt-4.1": {"input": 2.0, "output": 8.0, "cached_input": 0.5},
        "gpt-4.1-mini": {"input": 0.4, "output": 1.6, "cached_input": 0.1},
        "gpt-4.1-nano": {"input": 0.1, "output": 0.4, "cached_input": 0.025},
        "gpt-4o": {"input": 2.5, "output": 10.0, "cached_input": 1.25},
        "gpt-4o-mini": {"input": 0.15, "output": 0.6, "cached_input": 0.075},
        "text-embedding-3-large": {"input": 0.13, "output": 0.0, "cached_input": 0.0},
    }
    
    def __post_init__(self):
        prices = self.MODEL_PRICES.get(
            self.model,
            {"input": 2.0, "output": 8.0, "cached_input": 0.5}
        )
        
        # 计算成本
        regular_input = max(0, self.input_tokens - self.cached_input_tokens)
        
        self.input_cost_usd = (
            regular_input * prices["input"] / 1_000_000 +
            self.cached_input_tokens * prices["cached_input"] / 1_000_000
        )
        self.output_cost_usd = self.output_tokens * prices["output"] / 1_000_000
        self.total_cost_usd = self.input_cost_usd + self.output_cost_usd


class CostAttributionSystem:
    """成本归因系统:追踪并分析 AI Agent 使用成本"""
    
    def __init__(self):
        self.tracer = trace.get_tracer("cost-attribution")
        self._usage_log: list[TokenUsage] = []
    
    def record_usage(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
        agent_name: str,
        user_id: str = "anonymous",
        department: str = "unallocated",
        project: str = "default",
        use_case: str = "general",
        cached_input_tokens: int = 0,
    ) -> TokenUsage:
        """记录一次 Token 用量"""
        
        import uuid
        usage = TokenUsage(
            request_id=str(uuid.uuid4()),
            timestamp=datetime.utcnow().isoformat() + "Z",
            agent_name=agent_name,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cached_input_tokens=cached_input_tokens,
            user_id=user_id,
            department=department,
            project=project,
            use_case=use_case,
        )
        
        self._usage_log.append(usage)
        
        # 将成本数据写入 OTel Span 属性
        current_span = trace.get_current_span()
        if current_span.is_recording():
            current_span.set_attribute("cost.input_usd", usage.input_cost_usd)
            current_span.set_attribute("cost.output_usd", usage.output_cost_usd)
            current_span.set_attribute("cost.total_usd", usage.total_cost_usd)
            current_span.set_attribute("cost.department", department)
            current_span.set_attribute("cost.project", project)
        
        return usage
    
    def get_cost_summary(
        self,
        group_by: list[str] = ["department"],
        period_days: int = 30,
    ) -> dict:
        """按维度汇总成本"""
        from collections import defaultdict
        
        summary = defaultdict(lambda: {
            "total_cost_usd": 0.0,
            "input_tokens": 0,
            "output_tokens": 0,
            "request_count": 0,
        })
        
        for usage in self._usage_log:
            # 构建分组键
            key_parts = []
            for dim in group_by:
                key_parts.append(getattr(usage, dim, "unknown"))
            key = "|".join(key_parts)
            
            summary[key]["total_cost_usd"] += usage.total_cost_usd
            summary[key]["input_tokens"] += usage.input_tokens
            summary[key]["output_tokens"] += usage.output_tokens
            summary[key]["request_count"] += 1
        
        return dict(summary)
    
    def identify_cost_anomalies(
        self,
        threshold_multiplier: float = 3.0
    ) -> list[dict]:
        """识别成本异常(用量突增)"""
        if len(self._usage_log) < 10:
            return []
        
        costs = [u.total_cost_usd for u in self._usage_log]
        avg_cost = sum(costs) / len(costs)
        
        anomalies = []
        for usage in self._usage_log:
            if usage.total_cost_usd > avg_cost * threshold_multiplier:
                anomalies.append({
                    "request_id": usage.request_id,
                    "timestamp": usage.timestamp,
                    "cost_usd": usage.total_cost_usd,
                    "avg_cost_usd": avg_cost,
                    "multiplier": usage.total_cost_usd / avg_cost,
                    "user_id": usage.user_id,
                    "agent_name": usage.agent_name,
                    "input_tokens": usage.input_tokens,
                    "output_tokens": usage.output_tokens,
                })
        
        return sorted(anomalies, key=lambda x: x["multiplier"], reverse=True)


# 使用示例
tracker = CostAttributionSystem()

usage = tracker.record_usage(
    model="gpt-4.1",
    input_tokens=2500,
    output_tokens=800,
    cached_input_tokens=500,
    agent_name="research_agent",
    user_id="user_456",
    department="engineering",
    project="azure-migration",
    use_case="code_review",
)

print(f"本次请求成本: ${usage.total_cost_usd:.6f}")
# 输出: 本次请求成本: $0.010650
# (2000 常规输入 × $2/1M + 500 缓存输入 × $0.5/1M + 800 输出 × $8/1M)

# 打印成本汇总
summary = tracker.get_cost_summary(group_by=["department", "project"])
print("\n部门/项目成本汇总:")
for key, data in sorted(summary.items(), key=lambda x: x[1]["total_cost_usd"], reverse=True):
    print(f"  {key}: ${data['total_cost_usd']:.4f} ({data['request_count']} 次请求)")

9.3 Token 优化策略

python
# token_optimizer.py - Token 用量优化
import os
import hashlib
from functools import lru_cache
from typing import Any, Optional

class PromptOptimizer:
    """Prompt 优化器:减少 Token 消耗"""
    
    # 策略 1: Prompt 缓存(利用 Azure OpenAI 的 Cached Tokens)
    # 将稳定不变的系统 prompt 放在请求开头以提高缓存命中;
    # 缓存输入 token 按常规输入价格的约 25% 计费(相当于最高 75% 折扣)
    
    CACHED_SYSTEM_PROMPT = """
    You are an expert Azure cloud architect specializing in AI solutions.
    You have deep knowledge of Azure AI Foundry, Azure OpenAI, and related services.
    Always provide accurate, concise, and actionable guidance.
    """.strip()
    
    # 策略 2: 动态精简上下文
    @staticmethod
    def trim_conversation_history(
        messages: list[dict],
        max_tokens: int = 4000,
        keep_last_n: int = 10,
    ) -> list[dict]:
        """
        精简对话历史,避免上下文窗口溢出导致截断
        保留系统 prompt + 最近 N 条消息
        """
        if not messages:
            return []
        
        # 始终保留系统 prompt
        system_msgs = [m for m in messages if m["role"] == "system"]
        user_assistant_msgs = [m for m in messages if m["role"] != "system"]
        
        # 保留最近 N 条
        if len(user_assistant_msgs) > keep_last_n:
            kept_msgs = user_assistant_msgs[-keep_last_n:]
            trimmed_count = len(user_assistant_msgs) - keep_last_n
            print(f"ℹ️  已裁剪 {trimmed_count} 条历史消息以节省 Token")
        else:
            kept_msgs = user_assistant_msgs
        
        return system_msgs + kept_msgs
    
    # 策略 3: 选择合适的模型(成本差异最高 20x)
    @staticmethod
    def recommend_model(task_complexity: str, require_reasoning: bool = False) -> str:
        """
        根据任务复杂度推荐最经济的模型
        
        task_complexity: "simple" | "medium" | "complex"
        """
        if require_reasoning:
            return "o4-mini"  # 推理任务
        
        model_map = {
            "simple": "gpt-4.1-nano",    # 分类、提取、简单问答 $0.1/$0.4 per 1M
            "medium": "gpt-4.1-mini",    # 摘要、翻译、代码补全 $0.4/$1.6 per 1M
            "complex": "gpt-4.1",        # 深度分析、复杂推理  $2.0/$8.0 per 1M
        }
        return model_map.get(task_complexity, "gpt-4.1-mini")
    
    # 策略 4: 响应缓存(相同问题不重复调用 LLM)
    @staticmethod
    def generate_cache_key(system_prompt: str, user_message: str) -> str:
        """生成缓存键(基于 prompt 的哈希)"""
        content = f"{system_prompt}|{user_message}"
        return hashlib.sha256(content.encode()).hexdigest()[:16]

    # 策略 5: 流式输出减少感知延迟(不减少 Token,但改善 UX)
    @staticmethod
    async def stream_response_example(client, messages: list[dict]):
        """流式输出示例(client 为已初始化的 AsyncAzureOpenAI 实例)"""
        stream = await client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=messages,
            stream=True,
            stream_options={"include_usage": True},  # 让最后一个 chunk 携带 usage 统计
            max_tokens=1000,
        )
        
        full_response = ""
        async for chunk in stream:
            # 开启 include_usage 后,最后一个 chunk 的 choices 为空,仅含 usage
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                full_response += content
                print(content, end="", flush=True)
        
        # Token 用量可从最后一个 chunk 的 chunk.usage 字段读取并上报
        return full_response


# Token 节省计算器
def calculate_savings():
    """计算不同优化策略的节省效果"""
    
    # 假设场景:每天 10,000 次请求,平均 2000 input tokens
    daily_requests = 10_000
    avg_input_tokens = 2_000
    avg_output_tokens = 500
    
    # 基准:全部使用 GPT-4.1
    baseline_input_cost = daily_requests * avg_input_tokens * 2.0 / 1_000_000
    baseline_output_cost = daily_requests * avg_output_tokens * 8.0 / 1_000_000
    baseline_daily = baseline_input_cost + baseline_output_cost
    
    print(f"基准成本(全部 GPT-4.1): ${baseline_daily:.2f}/天, ${baseline_daily * 30:.2f}/月")
    
    # 优化方案 1:路由到 GPT-4.1-mini(70% 请求)
    mini_ratio = 0.7
    optimized_1_daily = (
        (1 - mini_ratio) * daily_requests * (avg_input_tokens * 2.0 + avg_output_tokens * 8.0) / 1_000_000 +
        mini_ratio * daily_requests * (avg_input_tokens * 0.4 + avg_output_tokens * 1.6) / 1_000_000
    )
    print(f"方案1(70%路由mini): ${optimized_1_daily:.2f}/天, 节省 {(1 - optimized_1_daily/baseline_daily)*100:.1f}%")
    
    # 优化方案 2:利用 50% 缓存命中(cached_input 节省 75%)
    cache_hit_rate = 0.5
    optimized_2_input_cost = (
        (1 - cache_hit_rate) * daily_requests * avg_input_tokens * 2.0 / 1_000_000 +
        cache_hit_rate * daily_requests * avg_input_tokens * 0.5 / 1_000_000  # 缓存价格
    )
    optimized_2_daily = optimized_2_input_cost + baseline_output_cost
    print(f"方案2(50%缓存命中): ${optimized_2_daily:.2f}/天, 节省 {(1 - optimized_2_daily/baseline_daily)*100:.1f}%")

calculate_savings()
# 输出示例(与上方代码中的单价一致):
# 基准成本(全部 GPT-4.1): $80.00/天, $2400.00/月
# 方案1(70%路由mini): $35.20/天, 节省 56.0%
# 方案2(50%缓存命中): $65.00/天, 节省 18.8%
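上面策略 4 只给出了缓存键的生成。下面补一个配套的最小化 TTL 内存缓存示意(`ResponseCache` 为本文虚构的辅助类,仅演示思路;生产环境通常换成 Redis 等共享缓存):

```python
import hashlib
import time

def generate_cache_key(system_prompt: str, user_message: str) -> str:
    """与上文策略 4 相同的缓存键生成逻辑"""
    content = f"{system_prompt}|{user_message}"
    return hashlib.sha256(content.encode()).hexdigest()[:16]

class ResponseCache:
    """最小化的 TTL 内存缓存示意"""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (写入时间, 响应文本)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        written_at, value = entry
        if time.monotonic() - written_at > self.ttl:
            del self._store[key]  # 过期条目惰性清理
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._store[key] = (time.monotonic(), value)
```

调用侧先用 `generate_cache_key` 查缓存,命中则直接返回,未命中再调用 LLM 并写回缓存。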

9.4 成本治理 KQL 仪表盘

kusto
// 成本治理仪表盘 - Top 20 高成本请求
customEvents
| where name == "agent.cost"
| where timestamp > ago(7d)
| extend
    model = tostring(customDimensions["model"]),
    user_id = tostring(customDimensions["user.id"]),
    department = tostring(customDimensions["department"]),
    project = tostring(customDimensions["project"]),
    cost_usd = todouble(customDimensions["cost.estimated_usd"]),
    input_tokens = toint(customDimensions["tokens.input"]),
    output_tokens = toint(customDimensions["tokens.output"])
| summarize
    total_cost = sum(cost_usd),
    total_input_tokens = sum(input_tokens),
    total_output_tokens = sum(output_tokens),
    request_count = count(),
    avg_cost_per_request = avg(cost_usd)
    by department, project, model
| extend total_tokens = total_input_tokens + total_output_tokens
| order by total_cost desc
| take 20
| project 
    department, project, model,
    total_cost_usd = round(total_cost, 4),
    request_count,
    avg_cost_usd = round(avg_cost_per_request, 6),
    total_tokens

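上述 KQL 假设应用侧已经把 `cost.estimated_usd` 等维度写入了 `agent.cost` 事件。成本估算本身可以用一个简单的辅助函数完成(`estimate_cost_usd` 为本文虚构的函数名,单价沿用上文示例,实际以官方定价为准;department/project 等归因维度由调用方在上报事件时附加):

```python
# 假设的单价表(USD / 1M tokens),与上文模型路由示例一致,实际以官方定价为准
MODEL_PRICES_PER_1M = {
    "gpt-4.1":      {"input": 2.0, "output": 8.0},
    "gpt-4.1-mini": {"input": 0.4, "output": 1.6},
    "gpt-4.1-nano": {"input": 0.1, "output": 0.4},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """估算单次请求成本,对应 customDimensions 中的 cost.estimated_usd 字段"""
    price = MODEL_PRICES_PER_1M[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
```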
十、Grafana + Azure Monitor 仪表盘实战

10.1 连接 Azure Monitor 数据源

在 Azure Managed Grafana 中配置 Azure Monitor 数据源:

json
{
  "type": "grafana-azure-monitor-datasource",
  "name": "Azure Monitor - AI Foundry",
  "jsonData": {
    "cloudName": "azuremonitor",
    "subscriptionId": "${AZURE_SUBSCRIPTION_ID}",
    "tenantId": "${AZURE_TENANT_ID}",
    "clientId": "${GRAFANA_SP_CLIENT_ID}",
    "azureAuthType": "clientsecret",
    "logAnalyticsDefaultWorkspace": "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.OperationalInsights/workspaces/law-ai-agent-prod"
  },
  "secureJsonData": {
    "clientSecret": "${GRAFANA_SP_CLIENT_SECRET}"
  }
}

10.2 AI Agent 综合监控仪表盘 JSON

以下是一个可以直接导入 Grafana 的仪表盘配置:

json
{
  "title": "Azure AI Agent - Production Dashboard",
  "description": "全栈 AI Agent 可观测性仪表盘",
  "tags": ["azure", "ai-agent", "foundry"],
  "refresh": "30s",
  "panels": [
    {
      "title": "请求成功率 (5min)",
      "type": "stat",
      "targets": [
        {
          "queryType": "Azure Log Analytics",
          "query": "customMetrics | where name == 'agent.request.total' | where timestamp > ago(5m) | extend status = tostring(customDimensions['status']) | summarize success_rate = 1.0 * countif(status == 'success') / count()"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "red", "value": null},
              {"color": "yellow", "value": 0.95},
              {"color": "green", "value": 0.99}
            ]
          }
        }
      }
    },
    {
      "title": "P95 延迟趋势",
      "type": "timeseries",
      "targets": [
        {
          "queryType": "Azure Log Analytics",
          "query": "customMetrics | where name == 'agent.request.duration' | where timestamp > ago(1h) | summarize p95 = percentile(value, 95) by bin(timestamp, 5m) | render timechart"
        }
      ]
    },
    {
      "title": "Token 用量(小时)",
      "type": "timeseries",
      "targets": [
        {
          "queryType": "Azure Log Analytics",
          "query": "customMetrics | where name in ('agent.token.input.total', 'agent.token.output.total') | where timestamp > ago(24h) | summarize total = sum(value) by name, bin(timestamp, 1h) | render timechart"
        }
      ]
    },
    {
      "title": "部门成本归因(本月)",
      "type": "piechart",
      "targets": [
        {
          "queryType": "Azure Log Analytics",
          "query": "customEvents | where name == 'agent.cost' | where timestamp > startofmonth(now()) | extend dept = tostring(customDimensions['department']), cost = todouble(customDimensions['cost.estimated_usd']) | summarize total_cost = sum(cost) by dept"
        }
      ]
    }
  ]
}

参考:Configure Azure Monitor data source in Grafana


十一、Agent 监控仪表盘(Foundry 内置)

11.1 访问 Agent 监控仪表盘

Foundry 内置的 Agent Monitor 仪表盘(目前处于 Preview 阶段)提供了开箱即用的监控体验:

访问步骤

  1. 登录 Microsoft Foundry Portal
  2. 选择您的 Foundry 项目
  3. 左侧导航 → Build → 选择 Agent
  4. 顶部 Tab → Monitor

仪表盘内容

┌──────────────────────────────────────────────────────────────┐
│              Agent 监控仪表盘(Foundry 内置)                  │
├────────────┬───────────┬─────────────┬──────────────────────┤
│ Token 用量  │  延迟      │  成功率      │   评估分数           │
│ 12,345 /h  │ P95: 3.2s │  99.2%      │  GND: 0.89          │
│ ↑ 8%       │ ↑ 0.3s    │  ↓ 0.1%     │  REL: 0.87          │
├────────────┴───────────┴─────────────┴──────────────────────┤
│                                                              │
│  延迟趋势(过去 24 小时)                                     │
│  ████████████████████████████████████████████████           │
│  4s|                        ▄▄                              │
│  2s|              ▄▄▄▄▄▄▄▄▄▄  ▄▄▄▄▄▄▄                    │
│  0s|______________________________________________           │
│      0   4   8  12  16  20  24 (hours)                      │
│                                                              │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  评估分数趋势(持续评估)                                      │
│  1.0|                                                        │
│  0.8|------ Groundedness -----   ---- Relevance ---         │
│  0.6|                                                        │
│  0.4|_________________________________________________       │
│                                                              │
├──────────────────────────────────────────────────────────────┤
│  Red Team 扫描结果 ✅ 最近一次扫描:全部通过(0 高危问题)      │
└──────────────────────────────────────────────────────────────┘

11.2 配置仪表盘告警

在 Foundry Agent Monitor 的设置面板中(齿轮图标),可以配置以下告警:

python
# dashboard_alerts.py - 通过 SDK 侧维护 Agent 监控告警配置
import os
import asyncio
from azure.ai.projects.aio import AIProjectClient  # 异步用法需使用 aio 子包
from azure.identity.aio import DefaultAzureCredential

async def configure_agent_monitoring_alerts():
    """配置 Agent 监控仪表盘告警"""
    
    project_client = AIProjectClient(
        endpoint=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
        credential=DefaultAzureCredential(),
    )
    
    async with project_client:
        # 配置告警阈值(以下结构为示意;实际告警规则需在 Foundry Portal
        # 的 Monitor 设置面板或 Azure Monitor 中创建)
        alert_config = {
            "latency_threshold_ms": 10000,   # 延迟 > 10s 告警
            "success_rate_threshold": 0.95,   # 成功率 < 95% 告警
            "token_usage_threshold": 100000,  # 每小时 > 10万 Token 告警
            "eval_score_threshold": 0.7,      # 评估分 < 0.7 告警
            
            # 告警通知渠道
            "notification_channels": {
                "email": os.environ.get("ALERT_EMAIL", ""),
                "teams_webhook": os.environ.get("TEAMS_WEBHOOK_URL", ""),
                "pagerduty": os.environ.get("PAGERDUTY_INTEGRATION_KEY", ""),
            },
            
            # 定时评估(每小时自动运行红队测试)
            "scheduled_evaluations": {
                "red_team": {
                    "enabled": True,
                    "schedule": "0 * * * *",  # 每小时
                    "attack_categories": ["jailbreak", "prompt_injection"],
                },
                "quality": {
                    "enabled": True,
                    "schedule": "0 6 * * *",  # 每天早上 6 点
                    "evaluators": ["groundedness", "relevance", "coherence"],
                }
            }
        }
        
        print("✅ Agent 监控告警配置完成:")
        for key, value in alert_config.items():
            if key != "notification_channels":  # 不打印可能含敏感信息的通知渠道
                print(f"   {key}: {value}")
        
        return alert_config

if __name__ == "__main__":
    asyncio.run(configure_agent_monitoring_alerts())

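上面配置中的 `teams_webhook` 通知渠道可以配合一个简单的发送函数使用。以下为示意实现,仅用标准库;`build_teams_alert` / `send_teams_alert` 为本文虚构的辅助函数,负载采用 Teams Incoming Webhook 支持的经典 MessageCard 格式:

```python
import json
import urllib.request

def build_teams_alert(title: str, text: str) -> dict:
    """构造 Teams Incoming Webhook 的 MessageCard 负载"""
    return {
        "@type": "MessageCard",
        "@context": "https://schema.org/extensions",
        "title": title,
        "text": text,
    }

def send_teams_alert(webhook_url: str, title: str, text: str) -> None:
    """将告警 POST 到 Teams Webhook(需要网络与有效的 webhook_url)"""
    payload = json.dumps(build_teams_alert(title, text)).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()
```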
十二、持续评估与生产流量监控

12.1 持续评估架构

生产流量 (100%)
    │
    ├── 正常处理(100% 均响应用户)
    │       │
    │       └── 响应给用户
    │
    └── 评估采样 (10%,异步旁路,不阻塞响应)
            │
            ├── 自动提交到评估引擎
            │       │
            │       ├── Groundedness 评估
            │       ├── Relevance 评估
            │       └── Safety 评估
            │
            └── 结果写入 Agent Monitor 仪表盘
                    │
                    ├── 指标趋势
                    ├── 告警触发
                    └── 模型升级决策

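图中 10% 的评估采样可以用基于请求 ID 哈希的确定性采样实现(示意函数,本文虚构;好处是同一请求在重试/回放时得到相同的采样决策,便于排查):

```python
import hashlib

def should_sample_for_eval(request_id: str, sample_rate: float = 0.10) -> bool:
    """基于 request_id 哈希的确定性采样:同一请求总是得到相同的采样决策"""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # 取摘要前 8 字节映射到 [0, 1) 区间,与 sample_rate 比较
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```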
12.2 评估门控(Evaluation Gate)CI/CD 集成

yaml
# .github/workflows/agent-eval-gate.yml
# 将评估门控集成到 CI/CD 流水线

name: AI Agent Evaluation Gate

on:
  pull_request:
    branches: [main]
    paths:
      - 'agents/**'
      - 'prompts/**'

jobs:
  evaluation-gate:
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      
      - name: Install dependencies
        run: |
          pip install "azure-ai-evaluation>=1.0.0"
          pip install azure-identity
      
      - name: Run Evaluation Gate
        id: eval-gate
        env:
          AZURE_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
          AZURE_RESOURCE_GROUP: ${{ secrets.AZURE_RESOURCE_GROUP }}
          AZURE_AI_PROJECT_NAME: ${{ secrets.AZURE_AI_PROJECT_NAME }}
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
          AZURE_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
          AZURE_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }}
          AZURE_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
        run: |
          python scripts/run_eval_gate.py \
            --dataset tests/eval_dataset.jsonl \
            --thresholds '{"groundedness": 0.8, "relevance": 0.8, "coherence": 0.75}' \
            --output eval_results.json
      
      - name: Check Gate Results
        run: |
          python -c "
          import json, sys
          with open('eval_results.json') as f:
              results = json.load(f)
          passed = results.get('gate_passed', False)
          print('评估门控结果:', '✅ 通过' if passed else '❌ 未通过')
          for metric, score in results.get('metrics', {}).items():
              print(f'  {metric}: {score:.3f}')
          sys.exit(0 if passed else 1)
          "
      
      - name: Comment PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval_results.json', 'utf8'));
            const status = results.gate_passed ? '✅ 通过' : '❌ 未通过';
            
            let comment = `## AI Agent 评估门控结果: ${status}\n\n`;
            comment += '| 指标 | 分数 | 目标 | 状态 |\n';
            comment += '|------|------|------|------|\n';
            
            for (const [metric, data] of Object.entries(results.metrics_detail)) {
              const icon = data.passed ? '✅' : '❌';
              comment += `| ${metric} | ${data.score.toFixed(3)} | ≥${data.threshold} | ${icon} |\n`;
            }
            
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });
python
# scripts/run_eval_gate.py - 评估门控脚本
import argparse
import json
import os
import sys
from azure.ai.evaluation import (
    evaluate,
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
)

def run_evaluation_gate(dataset_path: str, thresholds: dict, output_path: str):
    """执行评估门控,返回是否通过"""
    
    model_config = {
        "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
        "azure_deployment": "gpt-4o-mini",
        "api_version": "2024-08-01-preview",
        "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    }
    
    # 根据阈值配置评估器
    evaluators = {}
    evaluator_config = {}
    
    if "groundedness" in thresholds:
        evaluators["groundedness"] = GroundednessEvaluator(model_config=model_config)
        evaluator_config["groundedness"] = {
            "column_mapping": {
                "query": "${data.user_query}",
                "context": "${data.retrieved_context}",
                "response": "${data.agent_response}",
            }
        }
    
    if "relevance" in thresholds:
        evaluators["relevance"] = RelevanceEvaluator(model_config=model_config)
        evaluator_config["relevance"] = {
            "column_mapping": {
                "query": "${data.user_query}",
                "response": "${data.agent_response}",
            }
        }
    
    if "coherence" in thresholds:
        evaluators["coherence"] = CoherenceEvaluator(model_config=model_config)
        evaluator_config["coherence"] = {
            "column_mapping": {
                "query": "${data.user_query}",
                "response": "${data.agent_response}",
            }
        }
    
    # 运行评估
    results = evaluate(
        data=dataset_path,
        evaluators=evaluators,
        evaluator_config=evaluator_config,
    )
    
    # 检查门控
    metrics_detail = {}
    gate_passed = True
    
    for metric, threshold in thresholds.items():
        # evaluate() 汇总指标的键通常是 "<评估器名>.<指标名>",如 "groundedness.groundedness"
        score = results["metrics"].get(f"{metric}.{metric}", results["metrics"].get(metric, 0.0))
        # 评估器返回 1-5 分制,需要归一化到 0-1(1.0 及以下视为已归一化)
        normalized_score = score / 5.0 if score > 1.0 else score
        passed = normalized_score >= threshold
        
        metrics_detail[metric] = {
            "score": normalized_score,
            "threshold": threshold,
            "passed": passed,
        }
        
        if not passed:
            gate_passed = False
            print(f"❌ {metric}: {normalized_score:.3f} < {threshold} (门控未通过)")
        else:
            print(f"✅ {metric}: {normalized_score:.3f} >= {threshold}")
    
    # 保存结果
    gate_result = {
        "gate_passed": gate_passed,
        "metrics": {k: v["score"] for k, v in metrics_detail.items()},
        "metrics_detail": metrics_detail,
    }
    
    with open(output_path, "w") as f:
        json.dump(gate_result, f, indent=2)
    
    return gate_passed


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--thresholds", required=True)
    parser.add_argument("--output", default="eval_results.json")
    args = parser.parse_args()
    
    thresholds = json.loads(args.thresholds)
    passed = run_evaluation_gate(args.dataset, thresholds, args.output)
    sys.exit(0 if passed else 1)

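评估门控脚本读取的 `eval_dataset.jsonl` 中,每行字段名需要与 `column_mapping` 中的 `${data.user_query}` 等占位符一一对应。下面的小脚本生成一条示例记录(数据为虚构,仅演示格式):

```python
import json

# 字段名与上文 column_mapping 的占位符对应
sample = {
    "user_query": "你们的退货政策是什么?",
    "retrieved_context": "政策文档:订单签收后 30 天内可无理由退货。",
    "agent_response": "您可以在签收后 30 天内无理由退货。",
}

line = json.dumps(sample, ensure_ascii=False)
print(line)  # eval_dataset.jsonl 中每行就是这样一个 JSON 对象
```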
十三、安全可观测性:内容安全事件追踪

13.1 内容安全与可观测性集成

python
# safety_observability.py - 内容安全事件追踪
import os
from opentelemetry import trace
from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import (
    AnalyzeTextOptions,
    TextCategory,
)
from azure.identity import DefaultAzureCredential

tracer = trace.get_tracer("content-safety")

class ContentSafetyObserver:
    """内容安全可观测性:追踪所有安全事件"""
    
    SEVERITY_LABELS = {0: "safe", 2: "low", 4: "medium", 6: "high"}
    
    def __init__(self):
        self.client = ContentSafetyClient(
            endpoint=os.environ["CONTENT_SAFETY_ENDPOINT"],
            credential=DefaultAzureCredential(),
        )
    
    def analyze_and_trace(
        self,
        text: str,
        context: str = "user_input",
        user_id: str = "anonymous",
    ) -> dict:
        """分析内容安全并生成 OTel 事件(ContentSafetyClient 为同步客户端,异步场景可改用 aio 子包)"""
        
        with tracer.start_as_current_span("content_safety_check") as span:
            span.set_attribute("safety.context", context)
            span.set_attribute("safety.user_id", user_id)
            span.set_attribute("safety.text_length", len(text))
            
            try:
                response = self.client.analyze_text(
                    AnalyzeTextOptions(
                        text=text,
                        categories=[
                            TextCategory.HATE,
                            TextCategory.VIOLENCE,
                            TextCategory.SEXUAL,
                            TextCategory.SELF_HARM,
                        ],
                        output_type="FourSeverityLevels",
                    )
                )
                
                safety_results = {}
                highest_severity = 0
                
                for category_result in response.categories_analysis:
                    severity = category_result.severity or 0
                    # category 可能是枚举或普通字符串,统一转为小写字符串
                    category_name = str(category_result.category).lower()
                    safety_results[category_name] = severity
                    
                    if severity > highest_severity:
                        highest_severity = severity
                    
                    # 记录每个类别的安全状态
                    span.set_attribute(
                        f"safety.{category_name}.severity",
                        severity
                    )
                
                # 总体安全判断
                is_safe = highest_severity < 4  # 低于 medium 视为安全
                
                span.set_attribute("safety.is_safe", is_safe)
                span.set_attribute("safety.max_severity", highest_severity)
                span.set_attribute(
                    "safety.max_severity_label",
                    self.SEVERITY_LABELS.get(highest_severity, "unknown")
                )
                
                if not is_safe:
                    # 记录安全事件(不包含原始文本)
                    span.add_event(
                        "safety_violation_detected",
                        attributes={
                            "severity": highest_severity,
                            "categories": str([k for k, v in safety_results.items() if v >= 4]),
                            "user_id": user_id,
                            "context": context,
                        }
                    )
                    
                    # 设置 Span 状态为错误
                    span.set_status(
                        trace.StatusCode.ERROR,
                        description=f"Content safety violation: severity={highest_severity}"
                    )
                
                return {
                    "is_safe": is_safe,
                    "max_severity": highest_severity,
                    "categories": safety_results,
                }
                
            except Exception as e:
                span.record_exception(e)
                span.set_status(trace.StatusCode.ERROR, str(e))
                # 检查失败时可选择放行(fail-open)或拦截(fail-closed)
                # 生产环境建议 fail-closed:此处向上抛出异常,由上层拦截请求
                raise


class JailbreakDetector:
    """Jailbreak 攻击检测与追踪"""
    
    # Jailbreak 攻击特征模式(简化版)
    JAILBREAK_PATTERNS = [
        "ignore previous instructions",
        "ignore your system prompt",
        "act as if you have no restrictions",
        "pretend you are",
        "DAN mode",
        "developer mode",
        "忽略之前的指令",
        "忽视系统提示",
        "扮演没有限制的",
    ]
    
    @staticmethod
    def detect_and_trace(user_input: str, user_id: str = "anonymous") -> bool:
        """检测 Jailbreak 攻击并记录追踪事件"""
        
        input_lower = user_input.lower()
        detected_patterns = [
            p for p in JailbreakDetector.JAILBREAK_PATTERNS
            if p.lower() in input_lower
        ]
        
        if detected_patterns:
            current_span = trace.get_current_span()
            if current_span.is_recording():
                current_span.add_event(
                    "jailbreak_attempt_detected",
                    attributes={
                        "attack.type": "jailbreak",
                        "attack.pattern_count": len(detected_patterns),
                        "user_id": user_id,
                        # 不记录实际输入内容(安全原则)
                        "input.length": len(user_input),
                    }
                )
                current_span.set_attribute("security.jailbreak_detected", True)
            
            return True
        
        return False

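上面 `JailbreakDetector` 的模式匹配部分可以抽成不依赖 OTel 的纯函数,便于单独测试(示意版本,模式列表为节选):

```python
# 与上文 JailbreakDetector 相同的匹配逻辑(模式统一写成小写)
JAILBREAK_PATTERNS = [
    "ignore previous instructions",
    "ignore your system prompt",
    "dan mode",
    "忽略之前的指令",
]

def matched_patterns(user_input: str) -> list[str]:
    """返回命中的 Jailbreak 特征模式(大小写不敏感)"""
    lowered = user_input.lower()
    return [p for p in JAILBREAK_PATTERNS if p in lowered]
```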
13.2 安全可观测性 KQL 查询

kusto
// 安全仪表盘:内容安全事件趋势(过去 7 天)
traces
| where timestamp > ago(7d)
| where customDimensions["event"] == "safety_violation_detected"
| extend
    severity = toint(customDimensions["severity"]),
    categories = tostring(customDimensions["categories"]),
    user_id = tostring(customDimensions["user_id"]),
    context = tostring(customDimensions["context"])
| summarize
    violation_count = count(),
    unique_users = dcount(user_id),
    high_severity_count = countif(severity >= 6)
    by categories, bin(timestamp, 1d)
| order by violation_count desc
| render timechart

// --------

// Jailbreak 攻击尝试追踪
traces
| where timestamp > ago(24h)
| where customDimensions["attack.type"] == "jailbreak"
| extend
    user_id = tostring(customDimensions["user_id"]),
    pattern_count = toint(customDimensions["attack.pattern_count"])
| summarize
    attempt_count = count(),
    max_patterns = max(pattern_count),
    first_attempt = min(timestamp),
    last_attempt = max(timestamp)
    by user_id
| where attempt_count > 3  // 重复攻击用户
| order by attempt_count desc
| project
    user_id,
    attempt_count,
    max_patterns,
    first_attempt,
    last_attempt,
    suspected_attacker = iff(attempt_count > 10, true, false)

十四、企业落地最佳实践与检查清单

14.1 可观测性成熟度模型

Level 0: 无可观测性
  └─ Agent 运行如同黑盒,无任何监控

Level 1: 基础监控
  ├─ 服务健康检查
  ├─ HTTP 状态码监控
  └─ 基础延迟/吞吐量指标

Level 2: 链路追踪
  ├─ OTel SDK 接入
  ├─ Span 树可视化
  └─ 工具调用追踪

Level 3: 质量评估
  ├─ 离线批量评估
  ├─ 持续在线评估
  └─ 评估门控 CI/CD

Level 4: 全栈可观测性(本文目标)
  ├─ Traces + Metrics + Logs
  ├─ SLO/Error Budget 管理
  ├─ 成本归因与优化
  ├─ 安全可观测性
  └─ Grafana 综合仪表盘

Level 5: 自适应优化
  ├─ 自动模型路由(基于质量/成本)
  ├─ 自动扩缩容(基于 TPM)
  └─ 自动化 A/B 评估

14.2 生产环境部署检查清单

markdown
### AI Agent 可观测性生产部署检查清单(详细)

### 基础设施
- [ ] Application Insights 已创建并链接到 Foundry 项目
- [ ] Log Analytics Workspace 已配置(保留期 ≥ 90 天)
- [ ] Azure Managed Grafana 已创建并连接 Azure Monitor
- [ ] RBAC 权限已配置(Log Analytics Reader 角色)
- [ ] 托管身份(Managed Identity)已配置

### OpenTelemetry 接入
- [ ] OTel SDK 版本 >= 1.20.0
- [ ] Azure Monitor Exporter 已安装并配置
- [ ] 服务名称(service.name)已正确设置
- [ ] 部署环境(deployment.environment)已标记
- [ ] 内容记录(Content Recording)在生产环境已关闭

### Traces 追踪
- [ ] 所有 Agent 框架已启用 OTel 追踪
- [ ] 工具调用 Span 包含关键属性(tool.name, duration)
- [ ] 用户会话 ID 已作为 Span 属性传递
- [ ] 敏感信息(PII)未记录在 Span 属性中
- [ ] Trace 采样率已合理配置(建议生产环境 10-20%)

### Metrics 指标
- [ ] 请求延迟直方图已配置(用于 P95/P99 计算)
- [ ] Token 用量计数器已按部门/项目标记
- [ ] 工具调用成功率指标已收集
- [ ] 成本归因标签(department, project, user_id)已添加

### Logs 日志
- [ ] 结构化 JSON 日志格式已启用
- [ ] 日志与 Trace 关联(trace_id 注入日志)
- [ ] 异常日志包含堆栈信息
- [ ] 安全事件(内容过滤、Jailbreak)已记录

### Evaluation 评估
- [ ] 基础测试数据集已准备(至少 100 条用例)
- [ ] 离线评估已集成到 CI/CD 流水线
- [ ] 质量门控阈值已设定(Groundedness ≥ 0.8)
- [ ] 在线持续评估已配置(采样率 10%)
- [ ] 安全评估(Violence/Hate)已启用

### SLO/告警
- [ ] P95 延迟 SLO 已定义(建议 ≤ 8s)
- [ ] 成功率 SLO 已定义(建议 ≥ 99%)
- [ ] 评估质量 SLO 已定义(Groundedness ≥ 0.8)
- [ ] 告警规则已创建(延迟、成功率、Token 异常)
- [ ] On-call 轮值已配置(告警通知渠道)
- [ ] SLO 违反时的 Runbook 已编写

### 成本控制
- [ ] 每部门/项目的 Token 用量限额已设置
- [ ] 成本异常告警已配置(环比 > 50% 触发)
- [ ] 模型路由策略已实施(按任务复杂度选模型)
- [ ] Prompt 缓存已启用(系统 prompt 置于消息开头)

### 安全可观测性
- [ ] Azure AI Content Safety 已集成
- [ ] 内容安全事件(违规类别、严重程度)已追踪
- [ ] Jailbreak 检测已启用
- [ ] PII 数据不记录在 Traces/Logs 中
- [ ] Red Team 扫描已定期调度(至少每周一次)

### 仪表盘
- [ ] Foundry Agent Monitor 仪表盘已访问验证
- [ ] Grafana AI Agent 综合仪表盘已导入
- [ ] Azure Monitor Workbook 已创建
- [ ] 仪表盘访问权限已按角色配置
- [ ] 关键告警已在仪表盘中可视化

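清单中的成本异常告警(环比 > 50% 触发)的判断逻辑可以用一个简单函数示意(`cost_spike` 为本文虚构的函数名,实际可作为告警规则的评估逻辑):

```python
def cost_spike(current_usd: float, previous_usd: float, threshold: float = 0.5) -> bool:
    """当本期成本环比增幅超过 threshold(默认 50%)时返回 True"""
    if previous_usd <= 0:
        # 没有历史基线时,出现任何新增成本都值得人工确认
        return current_usd > 0
    return (current_usd - previous_usd) / previous_usd > threshold
```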
14.3 常见问题排查指南

| 问题现象 | 可能原因 | 排查步骤 |
|----------|----------|----------|
| Trace 数据不出现在 Foundry Portal | 1. Foundry 项目未链接 App Insights 2. 权限不足 | 检查 Foundry 项目设置 → Observability → Application Insights;检查 Contributor 角色 |
| LangChain Span 不出现 | 未安装正确版本的 langchain-azure-ai | pip install "langchain-azure-ai>=0.1.0";确认回调已注入 |
| 内容记录了 PII 数据 | enable_content_recording=True | 生产环境设置 AZURE_TRACING_GEN_AI_CONTENT_RECORDING_ENABLED=false |
| Token 用量统计不准确 | 指标标签缺失或不一致 | 检查 customDimensions 中的归因标签是否在每次请求中传递 |
| 评估分数长期偏低 | 1. 系统 prompt 质量 2. RAG 检索质量下降 | 分析 Groundedness 低分样本;检查向量索引更新情况 |
| KQL 查询无数据 | 时区问题或表名错误 | 确认 customMetrics/customEvents 表名;检查时间范围 |
| 持续评估每小时运行超限 | max_runs_per_hour 设置过低 | 提升 max_runs_per_hour 或降低流量采样率 |
| Grafana 无法连接 Azure Monitor | Service Principal 权限不足 | 为 Grafana SP 分配 Monitoring Reader 角色 |

十五、总结与 Blog #6 预告

15.1 核心要点回顾

本文系统讲解了 Azure AI Foundry 全栈可观测性的五大支柱:

┌─────────────────────────────────────────────────────────────┐
│              Blog #5 核心要点总结                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1️⃣  Traces(链路追踪)                                      │
│     • OpenTelemetry + gen_ai.* 语义约定                     │
│     • 支持 MAF、SK、LangChain、OpenAI Agents SDK            │
│     • 多 Agent Span 树:execute_task, A2A, state管理        │
│                                                             │
│  2️⃣  Metrics(指标)                                         │
│     • 自定义 Histogram/Counter/Gauge                        │
│     • Foundry Agent Monitor 内置摘要卡片                    │
│     • P50/P95/P99 延迟 + Token 用量 + 成功率               │
│                                                             │
│  3️⃣  Logs(日志)                                            │
│     • 结构化 JSON 日志 + Trace 关联                         │
│     • 6 大 KQL 查询模板                                     │
│     • 安全事件日志(内容过滤、Jailbreak)                    │
│                                                             │
│  4️⃣  Evaluation(评估)                                      │
│     • 离线批量评估(azure-ai-evaluation SDK)               │
│     • 在线持续评估(10% 采样,100次/小时)                  │
│     • CI/CD 评估门控(GitHub Actions 集成)                 │
│                                                             │
│  5️⃣  Cost Attribution(成本归因)                            │
│     • 按部门/项目/用户分摊成本                              │
│     • Token 缓存利用(节省最高 75%)                        │
│     • 智能模型路由(成本差异最高 20x)                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

15.2 可观测性工具矩阵

| 工具 | 适用场景 | 主要优势 |
|------|----------|----------|
| Foundry Agent Monitor | 开箱即用监控 | 无需配置,内置评估 |
| Application Insights | 深度 Trace/Metric/Log | 强大 KQL,长期存储 |
| Azure Monitor Workbooks | 自定义报告/仪表盘 | 灵活模板,可分享 |
| Azure Managed Grafana | SLO 管理,跨平台 | Error Budget,丰富图表 |
| Azure AI Evaluation SDK | 批量/持续质量评估 | 标准化指标,CI 集成 |
| Azure AI Content Safety | 安全内容过滤 | 多类别、低延迟 |

15.3 实施路线图

第 1 周:基础监控
  ├── 接入 Application Insights
  ├── 配置 OTel SDK
  └── 启用 Foundry Agent Monitor

第 2 周:深度追踪
  ├── 实现结构化日志
  ├── 添加成本归因标签
  └── 配置关键告警规则

第 3 周:质量评估
  ├── 准备评估数据集(100+ 用例)
  ├── 集成 CI/CD 评估门控
  └── 启用在线持续评估

第 4 周:成本与 SLO 优化
  ├── 实施模型路由策略
  ├── 配置 SLO + Error Budget
  └── 搭建 Grafana 综合仪表盘

15.4 Blog #6 预告

下一篇:Blog #6《Azure AI Foundry 安全架构深度:RAI / RBAC / 网络隔离 / 数据治理》

将覆盖:

  • Responsible AI(RAI):Azure AI Content Safety、Groundedness 检测、Prompt Shield
  • RBAC 精细化权限控制:Foundry 项目角色、资源级访问控制、条件访问
  • 网络隔离:Private Endpoint、VNet 集成、Azure AI Foundry 网络安全
  • 数据治理:数据驻留、加密(CMK)、审计日志合规
  • 企业零信任架构:Managed Identity、Key Vault 集成、Secret 轮换

参考资料

官方文档

| 资源 | 链接 |
|------|------|
| Azure AI Foundry 可观测性概念 | learn.microsoft.com |
| 配置 Agent 框架追踪 | learn.microsoft.com |
| Agent 监控仪表盘指南 | learn.microsoft.com |
| Azure AI Evaluation SDK 本地评估 | learn.microsoft.com |
| Agent Evaluate SDK 指南 | learn.microsoft.com |
| Application Insights 监控 Agent | learn.microsoft.com |
| Azure Monitor OpenTelemetry 集成 | learn.microsoft.com |
| Azure Monitor Workbook 模板 | learn.microsoft.com |

OpenTelemetry 规范

| 资源 | 链接 |
|------|------|
| GenAI Span 语义约定 | opentelemetry.io |
| GenAI Agent Span 规范 | opentelemetry.io |
| OTel 语义约定 GitHub | github.com |

博客与社区

| 资源 | 链接 |
|------|------|
| Microsoft Foundry 推进 OTel | techcommunity.microsoft.com |
| AI 可观测性实战(Dave R.) | itnext.io |
| Arize: 大规模评估 AI Agent | arize.com |
| Azure Application Insights KQL 指南 | dev.to |
| Grafana SLO 管理 | grafana.com |
| Azure Monitor Grafana 仪表盘 | blog.aks.azure.com |
| Foundry 2025年9月更新 | devblogs.microsoft.com |
| Foundry 2025年12月/2026年1月更新 | devblogs.microsoft.com |

GitHub 示例代码

| 资源 | 链接 |
|------|------|
| 连续评估规则示例 | github.com |
| 调度评估示例 | github.com |
| Azure AI Foundry 教程 | github.com |
| AI Agent Evals GitHub Action | github.com |
| Azure Monitor Workbook 模板库 | github.com |

© 2026 Azure AI 实战系列。本文基于 Azure AI Foundry 2026年3月发布版本编写。功能和价格以 Azure 官方文档为准。

📌 本系列博客持续更新,欢迎关注获取最新内容。认可内容的话,请点个赞
