LangChain Custom Output Parsers: A Practical Guide from Principles to 10 Business Scenarios

1. Why Do We Need Custom Parsers?

In real-world LLM applications, the output parser is often the most overlooked, yet most failure-prone, link in the chain. LangChain ships general-purpose components such as PydanticOutputParser, CommaSeparatedListOutputParser, and StructuredOutputParser, but in production they frequently fall short. Four core pain points:

1. The Wrapper Problem

The LLM's output is frequently "wrapped" in Markdown code fences, explanatory text, or formatting tags:

````text
Here is the JSON data you requested:

```json
{"name": "张三", "age": 28}
```
````

PydanticOutputParser expects bare JSON; given this kind of wrapping it throws OutputParserException outright. In production, even when the prompt explicitly demands "output JSON only", the model still adds a prefix or suffix with non-trivial probability.

2. Format Drift

Given the same prompt, different models (GPT-4, Claude, Llama, Qwen) interpret format requirements differently. Take dates: one model outputs 2024-01-15, another 2024年1月15日, another a full ISO 8601 timestamp. Without a custom parser to normalize them, downstream systems pay an ongoing data-cleaning tax.

3. Business Semantics

Generic parsers validate only syntax; they know nothing about business rules. In finance, the model may output "medium risk" (中等风险) where the business system needs the standardized enum value MEDIUM_RISK. In healthcare, the model's diagnosis descriptions must be mapped onto the ICD-10 coding system. This semantic translation has to happen at the parsing stage.

4. Multi-Source Heterogeneity

Enterprise applications usually integrate several LLM vendors plus local models, each with its own output habits: OpenAI models like to include their reasoning, Claude favors XML tags, and local models may emit sloppy JSON. A custom parser can act as a unified adapter layer that hides these differences and hands the upper layers consistent structured data.


2. Core Principles

2.1 The Design Philosophy of BaseOutputParser[T]

LangChain's parser hierarchy is built on the abstract base class BaseOutputParser[T], whose design embodies the idea that **parsing is a responsibility**:

```python
from abc import abstractmethod
from typing import Generic, TypeVar

T = TypeVar("T")

class BaseOutputParser(Generic[T]):
    @abstractmethod
    def parse(self, text: str) -> T:
        """Parse raw text into the structured type T."""
        raise NotImplementedError

    @property
    @abstractmethod
    def _type(self) -> str:
        """Identifier used for serialization."""
        raise NotImplementedError

    # Optional: format instructions injected into the prompt
    def get_format_instructions(self) -> str:
        return ""
```

Key design decisions:

| Design point | Intent | Practical implication |
| --- | --- | --- |
| `Generic[T]` | Type safety; parse results are statically checkable | Specify the return type explicitly when defining a parser, e.g. `BaseOutputParser[MedicalDiagnosis]` |
| `parse()` as a pure function | No side effects; easy to test and cache | Keep all cleaning, conversion, and validation logic inside `parse()` |
| `get_format_instructions()` | Injects the format requirements back into the prompt | A custom parser should maintain both the "parsing rules" and the "generation rules" to keep them consistent |
| `_type` property | Supports serialization (e.g. LangServe) | Must be implemented, or the parser cannot be exposed via an API |
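To make the contract concrete, here is a minimal sketch that fills in all three members. The class name and keyword lists are illustrative, not part of LangChain:

```python
from langchain.schema import OutputParserException

class YesNoOutputParser(BaseOutputParser[bool]):
    """Maps free-form model output onto a strict boolean."""

    def parse(self, text: str) -> bool:
        normalized = text.strip().lower()
        if normalized.startswith(("yes", "true", "是")):
            return True
        if normalized.startswith(("no", "false", "否")):
            return False
        raise OutputParserException(f"Cannot map to bool: {text!r}", llm_output=text)

    @property
    def _type(self) -> str:
        return "yes_no_parser"

    def get_format_instructions(self) -> str:
        return "Answer with a single word: yes or no."
```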

2.2 Key Components

OutputParserException

This is the unified exception type for parse failures. Its llm_output field preserves the raw output for later repair:

```python
import json

from langchain.schema import BaseOutputParser, OutputParserException

class CleanJsonOutputParser(BaseOutputParser[dict]):
    def parse(self, text: str) -> dict:
        try:
            # Strip Markdown wrapping (full implementation in Scenario 1)
            cleaned = self._extract_json(text)
            return json.loads(cleaned)
        except (json.JSONDecodeError, ValueError) as e:
            raise OutputParserException(
                f"Failed to parse JSON from: {text[:100]}...",
                llm_output=text  # Key point: keep the raw output for repair
            ) from e
```
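Downstream code can then recover the raw output straight from the exception; a small sketch, assuming the full CleanJsonOutputParser from Scenario 1 below:

```python
parser = CleanJsonOutputParser()
try:
    parser.parse("抱歉,我无法生成 JSON。")
except OutputParserException as e:
    # The raw model output survives on the exception for repair and auditing
    print(e.llm_output)
```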

Runnable Integration (LangChain Expression Language)

In LCEL, a parser is natively a Runnable and can be chained directly with the | pipe operator:

```python
from langchain_core.runnables import RunnableLambda

chain = prompt | model | CleanJsonOutputParser()
# Roughly equivalent to: chain = prompt | model | RunnableLambda(lambda x: parser.parse(x.content))
```

This means custom parsers plug seamlessly into complex RAG and Agent pipelines.
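Because the real BaseOutputParser (unlike the simplified sketch above) implements the Runnable protocol, a parser can also be invoked on its own, which is convenient for testing outside a chain. A sketch, assuming Scenario 1's CleanJsonOutputParser:

````python
parser = CleanJsonOutputParser()
print(parser.invoke('```json\n{"ok": true}\n```'))  # {'ok': True}
````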


3. Ten Business Scenarios in Practice

Scenario 1: E-commerce - Cleaning Markdown-Wrapped JSON ⭐⭐

Pain point: model output is wrapped in Markdown code fences, and the standard JSON parser fails outright.

````python
import re
import json
from typing import Dict, Any
from langchain.schema import BaseOutputParser, OutputParserException

class CleanJsonOutputParser(BaseOutputParser[Dict[str, Any]]):
    """Cleans Markdown-wrapped JSON; handles arbitrarily nested payloads."""

    def _extract_json(self, text: str) -> str:
        # Prefer ```json ... ``` or ``` ... ``` fences
        patterns = [
            r"```(?:json)?\s*([\s\S]*?)\s*```",  # Markdown code fence
            r"(\{[\s\S]*\})",                    # bare JSON object (outermost braces)
            r"(\[[\s\S]*\])"                     # bare JSON array
        ]

        for pattern in patterns:
            match = re.search(pattern, text.strip())
            if match:
                return match.group(1).strip()

        # Nothing matched: strip common prefixes and suffixes
        cleaned = re.sub(r"^(?:以下是|Here is|JSON:)\s*", "", text, flags=re.I)
        cleaned = re.sub(r"\s*(?:请查收|谢谢|以上)$", "", cleaned, flags=re.I)
        return cleaned.strip()

    def parse(self, text: str) -> Dict[str, Any]:
        cleaned = self._extract_json(text)
        try:
            result = json.loads(cleaned)
            if not isinstance(result, (dict, list)):
                raise ValueError(f"Expected dict or list, got {type(result)}")
            return result
        except Exception as e:
            raise OutputParserException(
                f"JSON parse failed: {str(e)}",
                llm_output=text
            ) from e

    @property
    def _type(self) -> str:
        return "clean_json_parser"

    def get_format_instructions(self) -> str:
        return '请直接输出 JSON 对象,不要添加 Markdown 代码块标记或其他解释性文字。'
````

Usage example:

````python
parser = CleanJsonOutputParser()

# Exercise the different wrapping styles
test_cases = [
    '```json\n{"product": "iPhone", "price": 5999}\n```',
    '```\n{"product": "iPhone", "price": 5999}\n```',
    '以下是结果:{"product": "iPhone", "price": 5999} 请查收',
    '{"product": "iPhone", "price": 5999}'  # bare JSON
]

for case in test_cases:
    print(parser.parse(case))  # all four parse successfully
````

Scenario 2: Finance - Risk-Level Mapping ⭐⭐

Pain point: the model outputs natural-language descriptions that must be mapped to standardized enum values consumed by the risk-control system.

```python
import json
import re
from enum import Enum
from dataclasses import dataclass
from typing import Optional

class RiskLevel(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    CRITICAL = "CRITICAL"

@dataclass
class RiskAssessment:
    level: RiskLevel
    score: float  # 0-100
    factors: list[str]
    suggestion: Optional[str] = None

class RiskLevelOutputParser(BaseOutputParser[RiskAssessment]):
    """Maps the LLM's risk description onto a standardized risk level."""

    # Semantic mapping table: normalizes many phrasings
    LEVEL_MAPPINGS = {
        RiskLevel.LOW: ["低", "low", "轻微", "minimal", "绿色"],
        RiskLevel.MEDIUM: ["中", "medium", "一般", "moderate", "黄色", "中等"],
        RiskLevel.HIGH: ["高", "high", "严重", "severe", "红色", "重大"],
        RiskLevel.CRITICAL: ["极高", "critical", "紧急", "urgent", "致命", "黑色"]
    }

    def _normalize_level(self, raw_text: str) -> RiskLevel:
        raw_lower = raw_text.lower().strip()
        for level, keywords in self.LEVEL_MAPPINGS.items():
            if any(kw in raw_lower for kw in keywords):
                return level
        # Default fallback when no keyword matches
        return RiskLevel.MEDIUM

    def _extract_score(self, text: str) -> float:
        # Extract a number in the 0-100 range
        scores = re.findall(r"(\d+(?:\.\d+)?)\s*(?:分|%|percent)", text.lower())
        if scores:
            return min(100.0, max(0.0, float(scores[0])))
        # No explicit score: estimate from keywords
        if any(k in text for k in ["极高", "critical", "致命"]):
            return 95.0
        elif any(k in text for k in ["高", "high", "严重"]):
            return 75.0
        elif any(k in text for k in ["中", "medium", "一般"]):
            return 50.0
        return 25.0

    def parse(self, text: str) -> RiskAssessment:
        try:
            # Prefer JSON-formatted output
            clean_json = CleanJsonOutputParser()._extract_json(text)
            data = json.loads(clean_json)

            level = self._normalize_level(data.get("level", data.get("risk_level", "")))
            score = data.get("score", self._extract_score(text))
            factors = data.get("factors", data.get("risk_factors", []))
            suggestion = data.get("suggestion", data.get("advice", None))

            return RiskAssessment(
                level=level,
                score=float(score),
                factors=factors if isinstance(factors, list) else [factors],
                suggestion=suggestion
            )
        except Exception:
            # Fall back to plain-text parsing
            level = self._normalize_level(text)
            score = self._extract_score(text)
            # Extract risk factors (naive split on bullet lines)
            factors = [line.strip("- ") for line in text.split("\n")
                      if line.strip().startswith("-")]
            return RiskAssessment(
                level=level, score=score, factors=factors or ["未明确列出"]
            )

    @property
    def _type(self) -> str:
        return "risk_level_parser"

    def get_format_instructions(self) -> str:
        return '''请输出 JSON 格式:
{
    "level": "风险等级(低/中/高/极高)",
    "score": 0-100,
    "factors": ["风险因素1", "风险因素2"],
    "suggestion": "风控建议(可选)"
}'''
```
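A quick smoke test of both paths, the JSON branch and the plain-text fallback (sample inputs are illustrative):

```python
parser = RiskLevelOutputParser()

# JSON branch
result = parser.parse('{"level": "高风险", "score": 78, "factors": ["高杠杆"]}')
print(result.level, result.score)  # RiskLevel.HIGH 78.0

# Plain-text fallback branch
result = parser.parse("综合评估为中等风险\n- 现金流紧张\n- 行业波动大")
print(result.level, result.factors)  # RiskLevel.MEDIUM ['现金流紧张', '行业波动大']
```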

Scenario 3: Healthcare - Structured Diagnoses from Clinical Notes ⭐⭐⭐

Pain point: clinical notes are unstructured and must be extracted into diagnosis objects compatible with the HL7 FHIR standard, while handling medical synonyms.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Diagnosis:
    icd10_code: Optional[str]  # International Classification of Diseases code
    name: str
    confidence: float  # model confidence, 0-1
    symptoms: List[str] = field(default_factory=list)
    differential: List[str] = field(default_factory=list)  # differential diagnoses

@dataclass
class MedicalRecord:
    chief_complaint: str
    diagnoses: List[Diagnosis]
    recommendations: List[str] = field(default_factory=list)

class MedicalDiagnosisParser(BaseOutputParser[MedicalRecord]):
    """Structures clinical notes, with ICD-10 code mapping."""

    # Simplified term -> ICD-10 mapping (use a standard terminology service in production)
    ICD10_MAPPING = {
        "头痛": "R51", "headache": "R51",
        "发热": "R50.9", "fever": "R50.9",
        "高血压": "I10", "hypertension": "I10",
        "糖尿病": "E11.9", "diabetes": "E11.9",
        "胸痛": "R07.4", "chest pain": "R07.4",
    }

    def _map_icd10(self, symptom: str) -> Optional[str]:
        for key, code in self.ICD10_MAPPING.items():
            if key in symptom.lower():
                return code
        return None

    def _normalize_confidence(self, val) -> float:
        if isinstance(val, (int, float)):
            return min(1.0, max(0.0, float(val)))
        if isinstance(val, str):
            # Handle qualitative labels such as "高"/"中"/"低"
            mapping = {"高": 0.9, "中高": 0.75, "中": 0.5, "低": 0.2, "不确定": 0.3}
            return mapping.get(val.strip(), 0.5)
        return 0.5

    def parse(self, text: str) -> MedicalRecord:
        try:
            data = CleanJsonOutputParser().parse(text)

            diagnoses = []
            for d in data.get("diagnoses", []):
                name = d.get("name", d.get("diagnosis", "未知诊断"))
                symptoms = d.get("symptoms", d.get("symptom", []))
                if isinstance(symptoms, str):
                    symptoms = [symptoms]

                # Try to infer the ICD-10 code from the name or the symptoms
                icd10 = d.get("icd10") or self._map_icd10(name) or self._map_icd10(" ".join(symptoms))

                diagnoses.append(Diagnosis(
                    icd10_code=icd10,
                    name=name,
                    confidence=self._normalize_confidence(d.get("confidence", 0.5)),
                    symptoms=symptoms,
                    differential=d.get("differential", d.get("differential_diagnosis", []))
                ))

            return MedicalRecord(
                chief_complaint=data.get("chief_complaint", data.get("主诉", "")),
                diagnoses=diagnoses,
                recommendations=data.get("recommendations", data.get("建议", []))
            )
        except Exception as e:
            raise OutputParserException(
                f"Medical diagnosis parse failed: {str(e)}",
                llm_output=text
            ) from e

    @property
    def _type(self) -> str:
        return "medical_diagnosis_parser"

    def get_format_instructions(self) -> str:
        return '''请以 JSON 输出病历结构化结果,遵循以下格式:
{
    "chief_complaint": "患者主诉",
    "diagnoses": [
        {
            "name": "诊断名称",
            "confidence": 0.85,
            "symptoms": ["症状1", "症状2"],
            "differential": ["鉴别诊断1"]
        }
    ],
    "recommendations": ["检查建议1", "用药建议1"]
}
注意:confidence 使用 0-1 小数表示模型置信度。'''
```
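A brief usage sketch showing ICD-10 inference and normalization of qualitative confidence labels (the input is illustrative):

```python
parser = MedicalDiagnosisParser()
record = parser.parse('''{
    "chief_complaint": "反复头痛三天",
    "diagnoses": [{"name": "紧张性头痛", "confidence": "高", "symptoms": ["头痛"]}]
}''')
print(record.diagnoses[0].icd10_code)  # R51, inferred from "头痛"
print(record.diagnoses[0].confidence)  # 0.9, mapped from the label "高"
```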

Scenario 4: Content Moderation - Multi-Dimensional Scoring ⭐⭐⭐

Pain point: content safety requires scoring along multiple dimensions (pornography, politics, violence, advertising, ...), each with its own threshold.

```python
from dataclasses import dataclass, field
from typing import Dict
from enum import Enum

class ViolationType(Enum):
    PORNOGRAPHY = "pornography"
    POLITICAL = "political"
    VIOLENCE = "violence"
    ADVERTISING = "advertising"
    DISCRIMINATION = "discrimination"
    FRAUD = "fraud"

@dataclass
class DimensionScore:
    score: float  # 0-1; the closer to 1, the more clearly violating
    evidence: str  # text snippet supporting the verdict
    confidence: float

@dataclass
class AuditResult:
    is_violation: bool
    overall_score: float
    dimensions: Dict[ViolationType, DimensionScore] = field(default_factory=dict)
    action: str = "pass"  # pass, review, block

class ContentAuditParser(BaseOutputParser[AuditResult]):
    """Multi-dimensional content-moderation score parser."""

    # Per-dimension thresholds
    THRESHOLDS = {
        ViolationType.PORNOGRAPHY: 0.7,
        ViolationType.POLITICAL: 0.6,
        ViolationType.VIOLENCE: 0.8,
        ViolationType.ADVERTISING: 0.75,
        ViolationType.DISCRIMINATION: 0.65,
        ViolationType.FRAUD: 0.7,
    }

    def _determine_action(self, dimensions: Dict[ViolationType, DimensionScore]) -> str:
        max_score = max((d.score for d in dimensions.values()), default=0)
        if max_score >= 0.9:
            return "block"
        elif max_score >= 0.5:
            return "review"
        return "pass"

    def parse(self, text: str) -> AuditResult:
        try:
            data = CleanJsonOutputParser().parse(text)

            dimensions = {}
            for raw_type, info in data.get("dimensions", {}).items():
                try:
                    v_type = ViolationType(raw_type.lower())
                except ValueError:
                    continue

                dimensions[v_type] = DimensionScore(
                    score=min(1.0, max(0.0, float(info.get("score", 0)))),
                    evidence=info.get("evidence", info.get("reason", "")),
                    confidence=min(1.0, max(0.0, float(info.get("confidence", 0.5))))
                )

            # Overall score: weighted average (politics and fraud weigh more)
            weights = {ViolationType.POLITICAL: 1.5, ViolationType.FRAUD: 1.3}
            if dimensions:
                weighted_sum = sum(
                    d.score * weights.get(t, 1.0)
                    for t, d in dimensions.items()
                )
                weight_total = sum(weights.get(t, 1.0) for t in dimensions.keys())
                overall = weighted_sum / weight_total
            else:
                overall = 0.0

            is_violation = any(
                d.score >= self.THRESHOLDS.get(t, 0.7)
                for t, d in dimensions.items()
            )

            return AuditResult(
                is_violation=is_violation,
                overall_score=overall,
                dimensions=dimensions,
                action=self._determine_action(dimensions)
            )
        except Exception as e:
            raise OutputParserException(
                f"Content audit parse failed: {str(e)}",
                llm_output=text
            ) from e

    @property
    def _type(self) -> str:
        return "content_audit_parser"

    def get_format_instructions(self) -> str:
        return '''请对内容进行多维度安全审核,输出 JSON:
{
    "dimensions": {
        "pornography": {"score": 0.1, "evidence": "无明确证据", "confidence": 0.9},
        "political": {"score": 0.0, "evidence": "", "confidence": 0.95},
        "violence": {"score": 0.0, "evidence": "", "confidence": 0.95},
        "advertising": {"score": 0.2, "evidence": "包含联系方式", "confidence": 0.7},
        "discrimination": {"score": 0.0, "evidence": "", "confidence": 0.95},
        "fraud": {"score": 0.0, "evidence": "", "confidence": 0.95}
    }
}
score 范围 0-1(1 表示高度违规),evidence 提供判定依据文本片段。'''
```
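A quick check of the threshold and action logic (scores are illustrative):

```python
parser = ContentAuditParser()
result = parser.parse('''{
    "dimensions": {
        "advertising": {"score": 0.8, "evidence": "包含微信号", "confidence": 0.9},
        "fraud": {"score": 0.3, "evidence": "", "confidence": 0.6}
    }
}''')
print(result.is_violation)  # True: advertising 0.8 >= its threshold 0.75
print(result.action)        # "review": max score 0.8 is below the 0.9 block line
```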

Scenario 5: Customer Service - Intent Recognition + Slot Filling ⭐⭐⭐

Pain point: a customer-service bot must recognize user intent and extract key parameters (slots), supporting multi-turn dialogue state tracking.

```python
from dataclasses import dataclass, field
from typing import List, Dict, Optional
from enum import Enum

class IntentType(Enum):
    ORDER_INQUIRY = "order_inquiry"      # order status inquiry
    REFUND_REQUEST = "refund_request"    # refund request
    PRODUCT_CONSULT = "product_consult"  # product consultation
    COMPLAINT = "complaint"              # complaint
    TECH_SUPPORT = "tech_support"        # technical support
    CHITCHAT = "chitchat"                # small talk

@dataclass
class Slot:
    name: str
    value: Optional[str]
    is_required: bool
    is_filled: bool = False

@dataclass
class IntentResult:
    intent: IntentType
    confidence: float
    slots: Dict[str, Slot] = field(default_factory=dict)
    missing_slots: List[str] = field(default_factory=list)  # slots awaiting clarification
    reply_template: Optional[str] = None  # suggested reply template

class CustomerServiceIntentParser(BaseOutputParser[IntentResult]):
    """Customer-service intent recognition and slot-filling parser."""

    # Intent synonym mapping
    INTENT_SYNONYMS = {
        IntentType.ORDER_INQUIRY: ["订单", "查单", "物流", "到哪了", "发货"],
        IntentType.REFUND_REQUEST: ["退款", "退货", "退钱", "不要了", "取消"],
        IntentType.PRODUCT_CONSULT: ["多少钱", "价格", "尺寸", "颜色", "有货吗"],
        IntentType.COMPLAINT: ["投诉", "差评", "举报", "不满", "生气"],
        IntentType.TECH_SUPPORT: ["怎么用", "不会用", "故障", "坏了", "连不上"],
        IntentType.CHITCHAT: ["你好", "谢谢", "再见", "在吗", "人工"],
    }

    # Required slots per intent
    REQUIRED_SLOTS = {
        IntentType.ORDER_INQUIRY: ["order_id", "phone"],
        IntentType.REFUND_REQUEST: ["order_id", "reason"],
        IntentType.PRODUCT_CONSULT: ["product_name"],
        IntentType.COMPLAINT: ["issue_type", "contact"],
        IntentType.TECH_SUPPORT: ["product_model", "issue_desc"],
        IntentType.CHITCHAT: [],
    }

    def _normalize_intent(self, raw: str) -> IntentType:
        raw_lower = raw.lower().strip()
        # Exact match
        try:
            return IntentType(raw_lower)
        except ValueError:
            pass
        # Fuzzy match
        for intent, keywords in self.INTENT_SYNONYMS.items():
            if any(kw in raw_lower for kw in keywords):
                return intent
        return IntentType.CHITCHAT  # fall back to small talk

    def _build_slots(self, intent: IntentType, slot_data: dict) -> Dict[str, Slot]:
        required = self.REQUIRED_SLOTS.get(intent, [])
        slots = {}
        for name in required:
            value = slot_data.get(name)
            slots[name] = Slot(
                name=name,
                value=value,
                is_required=True,
                is_filled=value is not None and str(value).strip() != ""
            )
        # Add any optional slots the model extracted
        for name, value in slot_data.items():
            if name not in slots:
                slots[name] = Slot(
                    name=name, value=value, is_required=False, is_filled=True
                )
        return slots

    def parse(self, text: str) -> IntentResult:
        try:
            data = CleanJsonOutputParser().parse(text)

            intent = self._normalize_intent(data.get("intent", ""))
            confidence = min(1.0, max(0.0, float(data.get("confidence", 0.5))))

            slots = self._build_slots(intent, data.get("slots", {}))
            missing = [name for name, slot in slots.items() if slot.is_required and not slot.is_filled]

            # Suggest a reply template based on intent and slot state
            template = self._select_reply_template(intent, missing)

            return IntentResult(
                intent=intent,
                confidence=confidence,
                slots=slots,
                missing_slots=missing,
                reply_template=template
            )
        except Exception as e:
            raise OutputParserException(
                f"Intent parse failed: {str(e)}",
                llm_output=text
            ) from e

    def _select_reply_template(self, intent: IntentType, missing: List[str]) -> Optional[str]:
        templates = {
            IntentType.ORDER_INQUIRY: "为您查询订单状态,请提供{slots}",
            IntentType.REFUND_REQUEST: "办理退款需要{slots},请补充信息",
            IntentType.PRODUCT_CONSULT: "关于{product_name}的咨询收到",
        }
        if missing:
            return templates.get(intent, "请补充以下信息:" + ", ".join(missing)).replace(
                "{slots}", ", ".join(missing)
            )
        return None

    @property
    def _type(self) -> str:
        return "customer_service_intent_parser"

    def get_format_instructions(self) -> str:
        return '''请识别用户意图并提取槽位,输出 JSON:
{
    "intent": "意图类型(order_inquiry/refund_request/product_consult/complaint/tech_support/chitchat)",
    "confidence": 0.95,
    "slots": {
        "order_id": "12345",
        "phone": "13800138000",
        "product_name": "iPhone 15"
    }
}
只输出 JSON,不要解释。'''
```
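A sketch of the missing-slot flow, which is what drives the next clarification turn (the payload is illustrative):

```python
parser = CustomerServiceIntentParser()
result = parser.parse(
    '{"intent": "refund_request", "confidence": 0.92, "slots": {"order_id": "A1001"}}'
)
print(result.missing_slots)   # ['reason']: required slot not yet filled
print(result.reply_template)  # 办理退款需要reason,请补充信息
```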

Scenario 6: Recruiting - Resume Information Extraction ⭐⭐⭐

Pain point: resumes come in many formats (PDF, image, plain text) and must be normalized into structured data for the ATS.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from datetime import datetime
import re

@dataclass
class WorkExperience:
    company: str
    title: str
    start_date: Optional[str]
    end_date: Optional[str]
    description: str
    duration_months: Optional[int] = None

@dataclass
class Resume:
    name: str
    phone: Optional[str]
    email: Optional[str]
    education: List[dict] = field(default_factory=list)
    experiences: List[WorkExperience] = field(default_factory=list)
    skills: List[str] = field(default_factory=list)
    expected_salary: Optional[str] = None

class ResumeParser(BaseOutputParser[Resume]):
    """Resume information extraction parser."""

    def _extract_phone(self, text: str) -> Optional[str]:
        # Mainland China mobile number
        match = re.search(r"1[3-9]\d{9}", text)
        return match.group(0) if match else None

    def _extract_email(self, text: str) -> Optional[str]:
        match = re.search(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", text)
        return match.group(0) if match else None

    def _parse_date(self, date_str: str) -> Optional[str]:
        if not date_str or date_str in ["至今", "present", "now"]:
            return None
        # Accepts 2020.03, 2020-03, 2020/03, and similar formats
        patterns = [
            r"(\d{4})[./-](\d{1,2})",
            r"(\d{4})",
        ]
        for p in patterns:
            m = re.search(p, date_str)
            if m:
                year = m.group(1)
                month = m.group(2) if len(m.groups()) > 1 else "01"
                return f"{year}-{month.zfill(2)}"
        return date_str

    def _calc_duration(self, start: Optional[str], end: Optional[str]) -> Optional[int]:
        if not start:
            return None
        try:
            start_dt = datetime.strptime(start, "%Y-%m")
            end_dt = datetime.strptime(end, "%Y-%m") if end else datetime.now()
            return (end_dt.year - start_dt.year) * 12 + (end_dt.month - start_dt.month)
        except ValueError:
            return None

    def parse(self, text: str) -> Resume:
        try:
            data = CleanJsonOutputParser().parse(text)

            # Basic fields, with regex backstops against the raw text
            raw_text = text  # keep the raw text for regex extraction
            name = data.get("name", "未知")
            phone = data.get("phone") or self._extract_phone(raw_text)
            email = data.get("email") or self._extract_email(raw_text)

            # Structure the work history
            experiences = []
            for exp in data.get("experiences", data.get("work_experience", [])):
                start = self._parse_date(exp.get("start_date", exp.get("start", "")))
                end = self._parse_date(exp.get("end_date", exp.get("end", "")))

                experiences.append(WorkExperience(
                    company=exp.get("company", ""),
                    title=exp.get("title", exp.get("position", "")),
                    start_date=start,
                    end_date=end,
                    description=exp.get("description", ""),
                    duration_months=self._calc_duration(start, end)
                ))

            # Deduplicate and clean the skills
            skills = list(set([
                s.strip().lower()
                for s in data.get("skills", [])
                if s and len(s.strip()) > 1
            ]))

            return Resume(
                name=name,
                phone=phone,
                email=email,
                education=data.get("education", []),
                experiences=experiences,
                skills=skills,
                expected_salary=data.get("expected_salary", data.get("salary_expectation"))
            )
        except Exception as e:
            raise OutputParserException(
                f"Resume parse failed: {str(e)}",
                llm_output=text
            ) from e

    @property
    def _type(self) -> str:
        return "resume_parser"

    def get_format_instructions(self) -> str:
        return '''请从简历中提取结构化信息,输出 JSON:
{
    "name": "姓名",
    "phone": "手机号",
    "email": "邮箱",
    "education": [
        {"school": "学校", "degree": "本科", "major": "计算机", "graduation": "2020"}
    ],
    "experiences": [
        {
            "company": "公司名",
            "title": "职位",
            "start_date": "2020-03",
            "end_date": "2023-05",
            "description": "工作描述"
        }
    ],
    "skills": ["Python", "机器学习"],
    "expected_salary": "25k-35k"
}
日期格式统一为 YYYY-MM,end_date 为"至今"表示在职。'''
```
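A usage sketch exercising the regex backstops and the duration computation (sample data is illustrative):

```python
parser = ResumeParser()
resume = parser.parse('''{
    "name": "李雷",
    "phone": "",
    "experiences": [{"company": "ACME", "title": "工程师",
                     "start_date": "2020.03", "end_date": "2023/05",
                     "description": "联系方式 13800138000"}],
    "skills": ["Python", "python", "SQL"]
}''')
print(resume.phone)                           # 13800138000, recovered by regex from the raw text
print(resume.experiences[0].duration_months)  # 38
print(sorted(resume.skills))                  # ['python', 'sql'] after dedup and lowercasing
```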

Scenario 7: Legal - Contract Risk Flagging ⭐⭐⭐

Pain point: contract review must flag risky clauses, cite the relevant legal grounds, and grade the risk.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from enum import Enum

class RiskSeverity(Enum):
    INFO = "info"           # advisory
    WARNING = "warning"     # warning
    CRITICAL = "critical"   # critical

@dataclass
class RiskClause:
    clause_number: str      # clause reference, e.g. "第3.2条"
    original_text: str      # verbatim quote from the contract
    risk_type: str          # risk category, e.g. unbalanced liability
    severity: RiskSeverity
    legal_basis: str        # supporting statute
    suggestion: str         # suggested amendment
    confidence: float

@dataclass
class ContractReview:
    contract_type: Optional[str]
    parties: List[str]
    risks: List[RiskClause] = field(default_factory=list)
    overall_risk_score: float = 0.0  # 0-100

class ContractRiskParser(BaseOutputParser[ContractReview]):
    """Contract risk-flagging parser."""

    SEVERITY_MAP = {
        "info": RiskSeverity.INFO, "提示": RiskSeverity.INFO, "低": RiskSeverity.INFO,
        "warning": RiskSeverity.WARNING, "警告": RiskSeverity.WARNING, "中": RiskSeverity.WARNING,
        "critical": RiskSeverity.CRITICAL, "严重": RiskSeverity.CRITICAL, "高": RiskSeverity.CRITICAL,
    }

    def _normalize_severity(self, raw: str) -> RiskSeverity:
        return self.SEVERITY_MAP.get(raw.lower().strip(), RiskSeverity.WARNING)

    def _calculate_overall_score(self, risks: List[RiskClause]) -> float:
        if not risks:
            return 0.0
        weights = {RiskSeverity.INFO: 1, RiskSeverity.WARNING: 5, RiskSeverity.CRITICAL: 20}
        total = sum(weights[r.severity] * (1 + r.confidence) for r in risks)
        return min(100.0, total)

    def parse(self, text: str) -> ContractReview:
        try:
            data = CleanJsonOutputParser().parse(text)

            risks = []
            for r in data.get("risks", data.get("clauses", [])):
                severity = self._normalize_severity(r.get("severity", r.get("level", "warning")))

                risks.append(RiskClause(
                    clause_number=r.get("clause_number", r.get("article", "未知条款")),
                    original_text=r.get("original_text", r.get("text", "")),
                    risk_type=r.get("risk_type", r.get("type", "未分类风险")),
                    severity=severity,
                    legal_basis=r.get("legal_basis", r.get("law", "暂无")),
                    suggestion=r.get("suggestion", r.get("advice", "建议咨询专业律师")),
                    confidence=min(1.0, max(0.0, float(r.get("confidence", 0.5))))
                ))

            # Sort by severity, most severe first
            severity_order = {RiskSeverity.CRITICAL: 0, RiskSeverity.WARNING: 1, RiskSeverity.INFO: 2}
            risks.sort(key=lambda x: severity_order[x.severity])

            return ContractReview(
                contract_type=data.get("contract_type", data.get("type")),
                parties=data.get("parties", data.get("party", [])),
                risks=risks,
                overall_risk_score=self._calculate_overall_score(risks)
            )
        except Exception as e:
            raise OutputParserException(
                f"Contract risk parse failed: {str(e)}",
                llm_output=text
            ) from e

    @property
    def _type(self) -> str:
        return "contract_risk_parser"

    def get_format_instructions(self) -> str:
        return '''请审查合同并标记风险条款,输出 JSON:
{
    "contract_type": "劳动合同",
    "parties": ["甲方公司", "乙方员工"],
    "risks": [
        {
            "clause_number": "第3.2条",
            "original_text": "原文引用(50字内)",
            "risk_type": "违约责任不对等",
            "severity": "critical/warning/info",
            "legal_basis": "《劳动合同法》第XX条",
            "suggestion": "建议修改为...",
            "confidence": 0.9
        }
    ]
}
severity 分级:critical(严重,必须修改)、warning(警告,建议修改)、info(提示,知晓即可)。'''
```
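A short sketch of the scoring and sorting behavior (numbers are illustrative):

```python
parser = ContractRiskParser()
review = parser.parse('''{
    "contract_type": "劳动合同",
    "parties": ["甲方", "乙方"],
    "risks": [
        {"clause_number": "第5条", "severity": "提示", "confidence": 0.6},
        {"clause_number": "第3.2条", "severity": "严重", "confidence": 0.9}
    ]
}''')
print([r.clause_number for r in review.risks])  # critical first: ['第3.2条', '第5条']
print(review.overall_risk_score)                # 20*(1+0.9) + 1*(1+0.6) = 39.6
```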

Scenario 8: E-commerce - Aspect-Level Review Sentiment ⭐⭐⭐

Pain point: product reviews call for fine-grained aspect-level sentiment (e.g. "slow shipping but great quality"), not a single positive/negative label.

```python
from dataclasses import dataclass, field
from typing import List, Dict
from enum import Enum

class AspectType(Enum):
    PRODUCT_QUALITY = "product_quality"    # product quality
    LOGISTICS = "logistics"                # shipping speed
    SERVICE = "service"                    # customer service
    PRICE = "price"                        # price
    PACKAGING = "packaging"                # packaging
    DESCRIPTION = "description"            # accuracy of the listing

class SentimentPolarity(Enum):
    POSITIVE = 1
    NEUTRAL = 0
    NEGATIVE = -1

@dataclass
class AspectSentiment:
    aspect: AspectType
    polarity: SentimentPolarity
    keywords: List[str]  # trigger words
    quote: str           # verbatim quote

@dataclass
class ReviewAnalysis:
    overall_sentiment: SentimentPolarity
    overall_score: float  # -1 to 1
    aspects: List[AspectSentiment] = field(default_factory=list)
    summary: str = ""     # one-sentence summary

class ReviewAnalysisParser(BaseOutputParser[ReviewAnalysis]):
    """Fine-grained aspect-level sentiment parser for product reviews."""

    ASPECT_KEYWORDS = {
        AspectType.PRODUCT_QUALITY: ["质量", "做工", "材质", "正品", "假货", "瑕疵"],
        AspectType.LOGISTICS: ["物流", "快递", "发货", "配送", "慢", "快", "到货"],
        AspectType.SERVICE: ["客服", "售后", "态度", "回复", "处理"],
        AspectType.PRICE: ["价格", "贵", "便宜", "划算", "性价比", "值"],
        AspectType.PACKAGING: ["包装", "破损", "完好", "精美", "简陋"],
        AspectType.DESCRIPTION: ["描述", "相符", "实物", "图片", "色差", "尺寸"],
    }

    def _normalize_aspect(self, raw: str) -> AspectType:
        raw_lower = raw.lower().strip()
        try:
            return AspectType(raw_lower)
        except ValueError:
            for aspect, keywords in self.ASPECT_KEYWORDS.items():
                if any(kw in raw_lower for kw in keywords):
                    return aspect
            return AspectType.PRODUCT_QUALITY  # fallback

    def _normalize_polarity(self, raw) -> SentimentPolarity:
        if isinstance(raw, (int, float)):
            val = float(raw)
            if val > 0.2: return SentimentPolarity.POSITIVE
            elif val < -0.2: return SentimentPolarity.NEGATIVE
            return SentimentPolarity.NEUTRAL

        mapping = {
            "positive": SentimentPolarity.POSITIVE, "正面": SentimentPolarity.POSITIVE,
            "negative": SentimentPolarity.NEGATIVE, "负面": SentimentPolarity.NEGATIVE,
            "neutral": SentimentPolarity.NEUTRAL, "中性": SentimentPolarity.NEUTRAL,
        }
        return mapping.get(str(raw).lower().strip(), SentimentPolarity.NEUTRAL)

    def parse(self, text: str) -> ReviewAnalysis:
        try:
            data = CleanJsonOutputParser().parse(text)

            aspects = []
            for a in data.get("aspects", []):
                aspect = self._normalize_aspect(a.get("aspect", a.get("dimension", "")))
                polarity = self._normalize_polarity(a.get("polarity", a.get("sentiment", 0)))

                aspects.append(AspectSentiment(
                    aspect=aspect,
                    polarity=polarity,
                    keywords=a.get("keywords", []),
                    quote=a.get("quote", a.get("text", ""))
                ))

            # Compute the overall score
            if aspects:
                scores = [asp.polarity.value * 0.3 + 0.5 for asp in aspects]  # map each aspect to [0.2, 0.8]
                overall = sum(scores) / len(scores) * 2 - 1  # map the mean back to [-1, 1]
            else:
                overall = 0.0

            # Determine the overall sentiment
            if overall > 0.3:
                overall_pol = SentimentPolarity.POSITIVE
            elif overall < -0.3:
                overall_pol = SentimentPolarity.NEGATIVE
            else:
                overall_pol = SentimentPolarity.NEUTRAL

            return ReviewAnalysis(
                overall_sentiment=overall_pol,
                overall_score=round(overall, 2),
                aspects=aspects,
                summary=data.get("summary", data.get("一句话总结", ""))
            )
        except Exception as e:
            raise OutputParserException(
                f"Review analysis parse failed: {str(e)}",
                llm_output=text
            ) from e

    @property
    def _type(self) -> str:
        return "review_analysis_parser"

    def get_format_instructions(self) -> str:
        return '''请分析商品评价的多维度情感,输出 JSON:
{
    "overall_sentiment": "positive/neutral/negative",
    "aspects": [
        {
            "aspect": "product_quality/logistics/service/price/packaging/description",
            "polarity": "positive/neutral/negative",
            "keywords": ["质量好", "做工精细"],
            "quote": "原文引用"
        }
    ],
    "summary": "一句话总结评价核心观点"
}
注意:一条评价可能同时包含正面和负面维度,请分别标注。'''
```
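A sketch showing mixed-polarity aspects within a single review (input is illustrative):

```python
parser = ReviewAnalysisParser()
analysis = parser.parse('''{
    "aspects": [
        {"aspect": "物流", "polarity": "负面", "quote": "发货太慢"},
        {"aspect": "product_quality", "polarity": "positive", "quote": "做工精细"}
    ],
    "summary": "物流慢但质量好"
}''')
print([(a.aspect.value, a.polarity.name) for a in analysis.aspects])
# [('logistics', 'NEGATIVE'), ('product_quality', 'POSITIVE')]
print(analysis.overall_sentiment)  # SentimentPolarity.NEUTRAL: the two cancel out
```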

Scenario 9: Logistics - Address Standardization ⭐⭐⭐

Pain point: user-entered addresses are inconsistent ("北京朝阳三里屯" vs. "北京市朝阳区三里屯街道") and must be normalized into the five-level administrative hierarchy.

```python
from dataclasses import dataclass
from typing import Optional
import re

@dataclass
class StandardAddress:
    province: str           # province / municipality
    city: str               # city
    district: str           # district / county
    street: Optional[str]   # sub-district / town
    detail: str             # detailed address
    zip_code: Optional[str]
    formatted: str          # fully formatted address

class AddressStandardizationParser(BaseOutputParser[StandardAddress]):
    """Address standardization parser: five-level split for Chinese addresses."""

    # Municipalities directly under the central government
    MUNICIPALITIES = {"北京", "上海", "天津", "重庆"}

    # Common suffixes to de-duplicate
    SUFFIXES = ["省", "市", "区", "县", "街道", "镇", "乡", "路", "街", "号", "栋", "单元", "室"]

    def _clean_suffix(self, text: str) -> str:
        """Collapse repeated suffix characters."""
        result = text
        for suffix in self.SUFFIXES:
            result = re.sub(f"{suffix}+", suffix, result)
        return result

    def _extract_zip(self, text: str) -> Optional[str]:
        match = re.search(r"\b\d{6}\b", text)
        return match.group(0) if match else None

    def _parse_address(self, text: str) -> dict:
        """Hybrid parsing: rules plus LLM output."""
        # Try JSON first
        try:
            data = CleanJsonOutputParser().parse(text)
            if isinstance(data, dict) and any(k in data for k in ["province", "city", "省"]):
                return data
        except Exception:
            pass

        # Plain-text fallback: match the "XX省XX市XX区..." pattern
        pattern = r"(.+?(?:省|自治区|特别行政区|北京|上海|天津|重庆))(.+?(?:市|自治州|地区|盟))(.+?(?:区|县|旗))(.+?(?:街道|镇|乡))?(.*)"
        match = re.match(pattern, self._clean_suffix(text))

        if match:
            groups = match.groups()
            return {
                "province": groups[0].strip(),
                "city": groups[1].strip(),
                "district": groups[2].strip(),
                "street": groups[3].strip() if groups[3] else None,
                "detail": groups[4].strip() if len(groups) > 4 else ""
            }

        # Last resort: keep the raw text as the detail
        return {"raw": text, "province": "", "city": "", "district": "", "detail": text}

    def _standardize(self, data: dict) -> StandardAddress:
        province = data.get("province", "").replace("省", "").replace("自治区", "")
        city = data.get("city", "").replace("市", "").replace("自治州", "")
        district = data.get("district", "").replace("区", "").replace("县", "")
        street = data.get("street", data.get("town", None))
        detail = data.get("detail", data.get("address", ""))

        # Special-case municipalities
        if province in self.MUNICIPALITIES:
            city = province  # municipality: province level == city level
            if not district:
                district = city  # fallback

        # Assemble the formatted address (skip the city when it duplicates the province)
        parts = [p for p in [province, city if city != province else None, district, street, detail] if p]
        formatted = "".join(parts)

        # Ensure municipalities carry the 市 suffix
        if province in self.MUNICIPALITIES and not formatted.startswith(province + "市"):
            formatted = province + "市" + formatted[len(province):]

        return StandardAddress(
            province=province,
            city=city,
            district=district,
            street=street,
            detail=detail,
            zip_code=self._extract_zip(str(data)) or data.get("zip_code"),
            formatted=formatted
        )

    def parse(self, text: str) -> StandardAddress:
        try:
            data = self._parse_address(text)
            return self._standardize(data)
        except Exception as e:
            raise OutputParserException(
                f"Address parse failed: {str(e)}",
                llm_output=text
            ) from e

    @property
    def _type(self) -> str:
        return "address_standardization_parser"

    def get_format_instructions(self) -> str:
        return '''请将地址标准化为五级行政区划,输出 JSON:
{
    "province": "北京市",
    "city": "北京市",
    "district": "朝阳区",
    "street": "三里屯街道",
    "detail": "太古里南区S2-11",
    "zip_code": "100027"
}
直辖市(北京/上海/天津/重庆)的 province 和 city 相同。'''
```
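A usage sketch covering both the regex fallback on raw text and the JSON path (addresses are illustrative):

```python
parser = AddressStandardizationParser()

# Regex fallback on raw text
addr = parser.parse("浙江省杭州市西湖区文三路100号")
print(addr.province, addr.city, addr.district)  # 浙江 杭州 西湖
print(addr.detail)                              # 文三路100号

# JSON path, as produced by the LLM
addr = parser.parse('{"province": "上海", "city": "上海", "district": "黄浦区", "detail": "南京东路100号"}')
print(addr.formatted)  # 上海市黄浦南京东路100号
```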

Scenario 10: General - Multi-Candidate Parsing ⭐⭐⭐

Pain point: business-critical paths must not abort on a parse failure; try several parsing strategies and return the best result.

```python
from dataclasses import dataclass, field
from typing import List, TypeVar, Generic, Optional, Callable

T = TypeVar("T")

@dataclass
class ParseAttempt(Generic[T]):
    parser_name: str
    success: bool
    result: Optional[T] = None
    error: Optional[str] = None
    confidence: float = 0.0  # parse confidence (e.g. estimated from result completeness)

@dataclass
class MultiCandidateResult(Generic[T]):
    best_result: T
    attempts: List[ParseAttempt] = field(default_factory=list)
    fallback_used: bool = False  # whether the fallback strategy was used

class MultiCandidateParser(BaseOutputParser[MultiCandidateResult[T]]):
    """Multi-candidate parser: tries several strategies in turn and returns the best result."""

    def __init__(
        self,
        candidates: List[tuple[str, BaseOutputParser[T], float]],  # (name, parser, weight)
        fallback_parser: Optional[BaseOutputParser[T]] = None,
        fallback_factory: Optional[Callable[[str], T]] = None
    ):
        self.candidates = candidates
        self.fallback_parser = fallback_parser
        self.fallback_factory = fallback_factory

    def parse(self, text: str) -> MultiCandidateResult[T]:
        attempts = []
        best_result: Optional[T] = None
        best_score = -1.0

        # Try each candidate parser in order
        for name, parser, weight in self.candidates:
            try:
                result = parser.parse(text)
                # Estimate confidence: parser weight x result completeness
                score = weight * self._evaluate_completeness(result)

                attempts.append(ParseAttempt(
                    parser_name=name,
                    success=True,
                    result=result,
                    confidence=score
                ))

                if score > best_score:
                    best_score = score
                    best_result = result

            except Exception as e:
                attempts.append(ParseAttempt(
                    parser_name=name,
                    success=False,
                    error=str(e)[:200],
                    confidence=0.0
                ))

        # If everything failed, fall back
        fallback_used = False
        if best_result is None:
            fallback_used = True
            if self.fallback_parser:
                try:
                    best_result = self.fallback_parser.parse(text)
                except Exception:
                    pass
            if best_result is None and self.fallback_factory:
                best_result = self.fallback_factory(text)

        if best_result is None:
            raise OutputParserException(
                f"All {len(self.candidates)} parsers failed. Last error: {attempts[-1].error if attempts else 'N/A'}",
                llm_output=text
            )

        return MultiCandidateResult(
            best_result=best_result,
            attempts=attempts,
            fallback_used=fallback_used
        )

    def _evaluate_completeness(self, result: T) -> float:
        """Score result completeness; subclasses may override."""
        if hasattr(result, '__dict__'):
            fields = [v for v in result.__dict__.values() if v is not None and v != "" and v != []]
            total = len(result.__dict__)
            return len(fields) / total if total > 0 else 0.5
        return 0.5

    @property
    def _type(self) -> str:
        return "multi_candidate_parser"

    def get_format_instructions(self) -> str:
        # Return the first candidate's instructions
        if self.candidates:
            return self.candidates[0][1].get_format_instructions()
        return "请输出结构化数据。"

# Usage example: multi-level fault tolerance for the medical scenario
medical_multi_parser = MultiCandidateParser(
    candidates=[
        ("strict_json", MedicalDiagnosisParser(), 1.0),
        ("clean_json", CleanJsonOutputParser(), 0.7),
        ("regex_fallback", RiskLevelOutputParser(), 0.3)  # reuse the risk parser as a crude fallback
    ],
    fallback_factory=lambda text: MedicalRecord(
        chief_complaint=text[:100],
        diagnoses=[Diagnosis(icd10_code=None, name="解析失败,需人工复核", confidence=0.0, symptoms=[])]
    )
)
```
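Calling it, every attempt is recorded, so you can log which strategy won and whether the fallback fired (a sketch):

```python
outcome = medical_multi_parser.parse("完全无法解析的自由文本")
print([(a.parser_name, a.success) for a in outcome.attempts])
# [('strict_json', False), ('clean_json', False), ('regex_fallback', True)]
print(outcome.fallback_used)  # False: the low-weight regex fallback still produced a result
```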

4. Fault Tolerance and Repair

Parse failures are unavoidable in production. LangChain provides several layers of fault-tolerance machinery, and custom parsers should take full advantage of them.

4.1 OutputFixingParser: LLM-Based Self-Repair

When parsing fails, feed the raw output plus the error message to another LLM and let it repair the formatting:

```python
from langchain.output_parsers import OutputFixingParser
from langchain_openai import ChatOpenAI

# Wrap any parser to give it self-repair capability
base_parser = CleanJsonOutputParser()
fixing_parser = OutputFixingParser.from_llm(
    parser=base_parser,
    llm=ChatOpenAI(temperature=0),  # repair with a deterministic model
    max_retries=2  # at most two repair attempts
)

# Use it exactly like an ordinary parser
try:
    result = fixing_parser.parse(malformed_output)
except Exception:
    # Even if repair fails, a clear exception is raised
    pass
```

Customizing the repair prompt:

```python
from langchain.output_parsers import OutputFixingParser

# Note: the stock OutputFixingParser customizes its repair prompt via the
# `prompt` argument of from_llm(); the hook below is this article's own extension point.
class CustomFixingParser(OutputFixingParser):
    def _get_fix_instructions(self, bad_output: str, error: str) -> str:
        return f"""原始输出解析失败,错误:{error}
请修复以下输出,使其符合 JSON 格式要求。注意:
1. 移除所有 Markdown 标记
2. 确保所有字符串使用双引号
3. 修复尾部逗号
4. 只输出修复后的 JSON,不要解释

原始输出:
{bad_output}
"""
```

4.2 RetryOutputParser: Prompt-Based Retries

Unlike OutputFixingParser, RetryOutputParser re-invokes the original LLM with the original prompt rather than merely patching the output:

```python
from langchain.output_parsers import RetryOutputParser
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnableParallel

template = """回答用户问题并输出 JSON。
{format_instructions}

问题:{query}
"""
prompt = PromptTemplate(
    template=template,
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

retry_parser = RetryOutputParser.from_llm(
    parser=base_parser,
    llm=ChatOpenAI(temperature=0.1),
    max_retries=2
)

# In LCEL: RetryOutputParser needs the original prompt at parse time,
# so feed it both the completion and the prompt value via parse_with_prompt
chain = RunnableParallel(
    completion=prompt | model | StrOutputParser(),
    prompt_value=prompt,
) | RunnableLambda(lambda x: retry_parser.parse_with_prompt(x["completion"], x["prompt_value"]))
```

4.3 Multi-Level Fault-Tolerance Strategy (Production-Grade Practice)

```python
from typing import TypeVar, Generic
from dataclasses import dataclass

T = TypeVar("T")

@dataclass
class RobustParseResult(Generic[T]):
    data: T
    parser_used: str      # which parser actually succeeded
    attempts: int         # number of attempts
    was_repaired: bool    # whether a repair pass was involved
    raw_output: str       # raw output kept for auditing

class RobustParser(Generic[T]):
    """Production-grade multi-level fault-tolerant parser."""

    def __init__(
        self,
        primary_parser: BaseOutputParser[T],
        repair_llm=None,
        max_repair_attempts: int = 2,
        fallback_factory=None
    ):
        self.primary = primary_parser
        self.repair_llm = repair_llm
        self.max_repair = max_repair_attempts
        self.fallback = fallback_factory

    def parse(self, text: str) -> RobustParseResult[T]:
        attempts = 0

        # Level 1: direct parse
        try:
            result = self.primary.parse(text)
            return RobustParseResult(
                data=result, parser_used="primary",
                attempts=1, was_repaired=False, raw_output=text
            )
        except OutputParserException as e:
            attempts += 1
            last_error = e

        # Level 2: LLM repair
        if self.repair_llm:
            for i in range(self.max_repair):
                try:
                    fixed = self._llm_fix(text, str(last_error))
                    result = self.primary.parse(fixed)
                    return RobustParseResult(
                        data=result, parser_used=f"repair_llm_{i+1}",
                        attempts=attempts + i + 1, was_repaired=True, raw_output=text
                    )
                except Exception as fix_error:
                    last_error = fix_error

        # Level 3: fallback factory
        if self.fallback:
            result = self.fallback(text)
            return RobustParseResult(
                data=result, parser_used="fallback_factory",
                attempts=attempts + self.max_repair + 1,
                was_repaired=False, raw_output=text
            )

        # Everything failed
        raise OutputParserException(
            f"Robust parse failed after {attempts} attempts",
            llm_output=text
        )

    def _llm_fix(self, bad_output: str, error: str) -> str:
        # Invoke the repair LLM
        prompt = f"修复以下 JSON 错误:{error}\n\n原始输出:\n{bad_output}\n\n只输出修复后的 JSON:"
        return self.repair_llm.invoke(prompt).content
```
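Wiring the three levels together (a sketch; the model choice and fallback shape are illustrative):

````python
robust = RobustParser(
    primary_parser=CleanJsonOutputParser(),
    repair_llm=ChatOpenAI(temperature=0),
    max_repair_attempts=2,
    fallback_factory=lambda text: {"raw": text, "needs_review": True},
)

# The trailing comma defeats the primary parser, so the repair LLM gets a chance
outcome = robust.parse('```json\n{"order": "A1001",}\n```')
print(outcome.parser_used, outcome.was_repaired)
````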

5. Best Practices

5.1 Six DOs

  1. Always inherit from BaseOutputParser[T] and specify the generic type explicitly

```python
# ✅ Correct: type-safe, with IDE autocompletion
class RiskLevelOutputParser(BaseOutputParser[RiskAssessment]):
    ...

# ❌ Wrong: fuzzy return type, hard to maintain
class BadParser(BaseOutputParser):
    def parse(self, text: str):
        return json.loads(text)  # return type is indeterminate
```
  2. On parse failure, raise OutputParserException and preserve llm_output

This is the only data source for the downstream repair chain. Without the raw output, OutputFixingParser and RetryOutputParser have nothing to work with.

  3. Keep get_format_instructions() and the parsing logic mutually consistent

```python
class MyParser(BaseOutputParser[MyType]):
    def parse(self, text: str) -> MyType:
        data = json.loads(text)
        # Validate the same fields that format_instructions promises
        assert "field_a" in data, "Missing field_a as specified in instructions"
        ...

    def get_format_instructions(self) -> str:
        # Keep in sync with the validation logic in parse()
        return '请输出 JSON,必须包含 "field_a" 和 "field_b"'
```
  4. Define parse-result types with dataclass

Compared with bare dicts, a dataclass gives you:

  • static type checking
  • an automatic __repr__ for logging
  • optional immutability (frozen=True)
  • compatibility with Pydantic
  5. Enforce business rules inside the parser, not just format conversion

The parser is the last line of defense for the business-semantics layer. Score-range checks in finance and ICD-10 code-validity checks in healthcare both belong inside the parser.

  6. Write unit tests for every parser, covering success, failure, and edge cases

(See the test template below.)

5.2 Five DON'Ts

  1. Don't call external APIs or do blocking I/O inside a parser

A parser should be a side-effect-free pure function. If external data is needed (such as an ICD-10 code lookup), fetch it before parsing via RAG or a Tool and pass the result in as context, as sketched below.
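One way to honor this rule is constructor injection: fetch the lookup data up front and hand it to the parser, so parse() stays pure. A sketch, reusing Scenario 3's parser (the table contents are illustrative):

```python
class InjectedIcdDiagnosisParser(MedicalDiagnosisParser):
    """Same parsing logic, but the code table is injected rather than fetched at parse time."""

    def __init__(self, icd10_mapping: dict):
        # Pre-fetched elsewhere (DB query, RAG retrieval) before parsing starts
        self.ICD10_MAPPING = icd10_mapping

# At startup or per request, outside the parser:
mapping = {"头痛": "R51", "发热": "R50.9"}  # e.g. loaded from a terminology service
parser = InjectedIcdDiagnosisParser(mapping)
```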

  2. Never run model output through eval() or exec()

```python
# ❌ Extremely dangerous: code-injection risk
result = eval(text)  # the model may emit malicious code

# ✅ Safe: use json.loads or ast.literal_eval
result = json.loads(text)
```
  3. Don't leave get_format_instructions() unimplemented

An empty implementation leaves the format requirements out of the prompt and degrades output quality. Even a parser that is mostly post-processing should provide basic instructions.

  4. Don't do heavyweight NLP inside the parser

If the parsing logic would require tokenization, NER, or syntactic analysis, split that into a dedicated preprocessing step and keep the parser's responsibility single.

  5. Don't hard-code business thresholds; inject them via configuration

```python
# ❌ Hard-coded, painful to tune
if score > 0.7: return "high"

# ✅ Configurable, supports dynamic adjustment
class RiskLevelOutputParser(BaseOutputParser[...]):
    def __init__(self, thresholds: Optional[dict] = None):
        self.thresholds = thresholds or {"high": 0.7, "medium": 0.4}
```

5.3 Unit Test Template

````python
import pytest
from langchain.schema import OutputParserException

class TestCleanJsonOutputParser:
    @pytest.fixture
    def parser(self):
        return CleanJsonOutputParser()

    # === Happy path ===
    def test_parse_bare_json(self, parser):
        result = parser.parse('{"key": "value"}')
        assert result == {"key": "value"}

    def test_parse_markdown_wrapped(self, parser):
        result = parser.parse('```json\n{"key": "value"}\n```')
        assert result == {"key": "value"}

    def test_parse_with_prefix_suffix(self, parser):
        result = parser.parse('以下是结果:{"key": "value"} 谢谢')
        assert result == {"key": "value"}

    # === Failure path ===
    def test_parse_invalid_json_raises(self, parser):
        with pytest.raises(OutputParserException) as exc_info:
            parser.parse("这不是 JSON")
        assert exc_info.value.llm_output == "这不是 JSON"  # raw output preserved

    def test_parse_empty_string_raises(self, parser):
        with pytest.raises(OutputParserException):
            parser.parse("")

    # === Edge cases ===
    def test_parse_nested_json(self, parser):
        result = parser.parse('{"outer": {"inner": [1, 2, 3]}}')
        assert result["outer"]["inner"] == [1, 2, 3]

    def test_parse_unicode(self, parser):
        result = parser.parse('{"name": "张三", "emoji": "🚀"}')
        assert result["name"] == "张三"
        assert result["emoji"] == "🚀"

    def test_parse_large_json(self, parser):
        large = '{"items": [' + ','.join(['{"id": ' + str(i) + '}' for i in range(1000)]) + ']}'
        result = parser.parse(large)
        assert len(result["items"]) == 1000

class TestRiskLevelOutputParser:
    @pytest.fixture
    def parser(self):
        return RiskLevelOutputParser()

    def test_normalize_chinese_description(self, parser):
        result = parser.parse('{"level": "风险较高", "score": 80, "factors": ["杠杆率过高"]}')
        assert result.level == RiskLevel.HIGH

    def test_normalize_english_description(self, parser):
        result = parser.parse('{"level": "moderate risk", "score": 50, "factors": []}')
        assert result.level == RiskLevel.MEDIUM

    def test_fallback_text_parsing(self, parser):
        # Degraded handling of non-JSON input
        result = parser.parse("经评估,该客户风险等级为高风险,主要因素包括:\n- 征信记录不良\n- 负债率过高")
        assert result.level == RiskLevel.HIGH
        assert len(result.factors) == 2
````

6. Summary

The one-sentence mantra

A parser is the "contract layer" between the LLM and the business system: it must understand not only the format but also the business.

Core value recap

| Dimension | Value |
| --- | --- |
| Reliability | Turns uncontrollable text output into type-safe structured data |
| Consistency | Shields downstream systems from multi-model, multi-version output differences behind a stable interface |
| Semantics | Applies business-rule mapping (e.g. ICD-10 codes, risk-level enums) at parse time |
| Observability | Enables end-to-end auditing via OutputParserException's llm_output |
| Maintainability | Separates data-cleaning logic from prompt engineering, achieving separation of concerns |

Custom parsers are not a nice-to-have; they are essential infrastructure for taking an LLM application from demo to production.


Appendix: Parser Selection Cheat Sheet

| Scenario profile | Recommended parser | Complexity | Key capability |
| --- | --- | --- | --- |
| Pure JSON, possibly Markdown-wrapped | CleanJsonOutputParser | ⭐⭐ | Multi-pattern regex cleaning |
| Enum value mapping required | RiskLevelOutputParser | ⭐⭐ | Semantic normalization + confidence estimation |
| Multi-dimensional scores/labels | ContentAuditParser / ReviewAnalysisParser | ⭐⭐⭐ | Weighted scoring + thresholding |
| Intent recognition + slot filling | CustomerServiceIntentParser | ⭐⭐⭐ | State tracking + missing-slot detection |
| Domain coding systems | MedicalDiagnosisParser / ContractRiskParser | ⭐⭐⭐ | Mapping to external code systems |
| Address/time standardization | AddressStandardizationParser | ⭐⭐⭐ | Hybrid rule + model parsing |
| High-availability critical path | MultiCandidateParser | ⭐⭐⭐ | Multi-level fault tolerance + automatic degradation |
| Rapid prototyping | PydanticOutputParser (built-in) | ⭐ | JSON Schema validation |
| Simple list extraction | CommaSeparatedListOutputParser (built-in) | ⭐ | Comma-separated parsing |

Suggested evolution path:

  1. MVP stage: validate business feasibility quickly with the built-in PydanticOutputParser
  2. Production stage: build custom parsers for the pain-point scenarios and wire in OutputFixingParser
  3. Scale-up stage: establish a parser registry with unified monitoring of parse success rate, latency, and fallback frequency
  4. Mature stage: consolidate high-frequency parsing logic into a standalone microservice with multi-tenant configuration isolation
人工智能·chatgpt·langchain