Transformers Data Processing Module: An In-Depth Analysis

Table of Contents

  • Overview
  • 1. Software Architecture Design
    • 1.1 Overall Architecture of the Data Processing System
    • 1.2 Core Directory Structure
    • 1.3 Architecture Design Principles
      • 1.3.1 Data Flow Principle
      • 1.3.2 Type Safety Principle
      • 1.3.3 Extensibility Principle
  • 2. In-Depth Analysis of Core Components
    • 2.1 DataCollator System Architecture
      • 2.1.1 Base DataCollator Implementation
      • 2.1.2 Default Data Collator
      • 2.1.3 Specialized Data Collators
    • 2.2 Evaluation Metric System
      • 2.2.1 Metric Base Class Architecture
      • 2.2.2 Accuracy Metric Implementation
      • 2.2.3 Text Generation Metrics
    • 2.3 Dataset Processing System
      • 2.3.1 Dataset Loader
      • 2.3.2 Dataset Dictionary
  • 3. In-Depth Analysis of Call Flows
    • 3.1 Data Loading and Processing Flow
      • 3.1.1 Detailed Loading Flow
    • 3.2 Batch Processing Flow
      • 3.2.1 Batch Processing Implementation Details
  • 4. Advanced Features and Optimizations
    • 4.1 Parallel Data Processing
    • 4.2 Dynamic Data Augmentation
    • 4.3 Adaptive Batching
  • 5. Summary and Outlook
    • 5.1 Architectural Strengths of the Data Processing Module
    • 5.2 Technical Highlights
    • 5.3 Future Directions
    • 5.4 Best-Practice Recommendations

Team blog: 汽车电子社区


Overview

The Transformers data processing module is the data hub of the framework: it handles the complete transformation from raw data to model inputs. It lives under src/transformers/data/ and bundles several key components, including data collators, dataset handling, evaluation metrics, and task-specific data processors. Through carefully designed abstraction layers it supports text, image, audio, and multimodal data, and provides efficient data loading, preprocessing, batching, and evaluation. As the bridge between data sources and model training, the quality of its design directly affects the performance and extensibility of the whole system. This article analyzes the module in depth from the perspectives of software architecture, call flows, and source code.

1. Software Architecture Design

1.1 Overall Architecture of the Data Processing System

The data processing module uses a layered pipeline architecture that forms a complete processing chain from data source to model input:

┌─────────────────────────────────────────────────────────────┐
│                Application Interface Layer                   │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │ DataLoader  │ │DataCollator │ │   Metrics   │            │
│  │   Classes   │ │   Classes   │ │   Classes   │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
├─────────────────────────────────────────────────────────────┤
│                Processing Logic Layer                        │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │Preprocessing│ │   Batch     │ │ Evaluation  │            │
│  │   Engine    │ │ Processing  │ │   Engine    │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
├─────────────────────────────────────────────────────────────┤
│                Transformation Service Layer                  │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │   Token     │ │   Image     │ │   Audio     │            │
│  │ Processing  │ │ Processing  │ │ Processing  │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │ Multimodal  │ │  Feature    │ │  Sequence   │            │
│  │ Processing  │ │ Extraction  │ │ Alignment   │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
├─────────────────────────────────────────────────────────────┤
│                Infrastructure Layer                          │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │  Caching    │ │  Memory     │ │  Parallel   │            │
│  │  System     │ │ Management  │ │ Processing  │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
└─────────────────────────────────────────────────────────────┘

1.2 Core Directory Structure

src/transformers/data/
├── __init__.py                          # module exports
├── data_collator.py                     # data collators (main file)
├── datasets/                            # dataset handling
│   ├── __init__.py
│   ├── load.py                          # dataset loading utilities
│   ├── dataset_dict.py                  # dataset dictionary
│   └── processing.py                    # dataset processing
├── metrics/                             # evaluation metrics
│   ├── __init__.py
│   ├── accuracy.py                      # accuracy
│   ├── f1.py                            # F1 score
│   ├── bleu.py                          # BLEU
│   ├── rouge.py                         # ROUGE
│   └── ...                              # other metrics
├── processors/                          # task-specific data processors
│   ├── __init__.py
│   ├── glue.py                          # GLUE processors
│   ├── squad.py                         # SQuAD processors
│   ├── wmt.py                           # WMT processors
│   └── ...                              # other processors
└── utils/                               # data processing utilities
    ├── __init__.py
    ├── numpy_utils.py                   # NumPy helpers
    ├── torch_utils.py                   # PyTorch helpers
    └── ...                              # other utilities

1.3 Architecture Design Principles

1.3.1 Data Flow Principle

Data processing follows a clear flow; every step from raw data to final output is explicitly defined:

raw data → preprocessing → feature extraction → batching → model input
   │            │                 │                │           │
   │            │                 │                │           └─ tensor format
   │            │                 │                └─ batch assembly
   │            │                 └─ feature vectors
   │            └─ cleaning, normalization
   └─ text / images / audio

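To make this flow concrete, here is a small, hedged sketch that pushes a couple of raw sentences through a Hugging Face tokenizer (preprocessing and feature extraction), pads them into a batch with DataCollatorWithPadding (batch assembly), and ends with PyTorch tensors ready for a model. The checkpoint name is only an example.

python
# Minimal sketch of the raw-data-to-model-input flow; requires the
# `transformers` and `torch` packages. The checkpoint is just an example.
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_texts = ["a short sentence", "a slightly longer example sentence"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# preprocessing + feature extraction: text -> input_ids / attention_mask
features = [tokenizer(text) for text in raw_texts]

# batching: pad to a common length and stack into tensors
collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
batch = collator(features)

# model input: a dict of tensors, e.g. batch["input_ids"].shape == (2, max_len)
print({key: value.shape for key, value in batch.items()})
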
1.3.2 Type Safety Principle

Type annotations and runtime checks ensure that data types are handled correctly:

python
from typing import (
    Union, List, Dict, Optional, Any,
    TypeVar, Generic, Tuple, Iterator,
    NamedTuple, Protocol
)
from torch import Tensor

# common data type aliases
BatchData = Dict[str, Union[List[int], Tensor]]
FeatureData = Union[Dict[str, Tensor], Tuple[Tensor, ...]]
ProcessedData = Dict[str, Any]

# generic data processor
DataType = TypeVar('DataType', bound=ProcessedData)

class DataProcessor(Protocol[DataType]):
    """Data processor protocol."""
    def __call__(self, raw_data: Any) -> DataType: ...
    def batch_process(self, batch_data: List[Any]) -> List[DataType]: ...

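A minimal sketch of how the DataProcessor protocol above can be satisfied by a concrete class; the whitespace tokenizer is just a stand-in, not part of the library:

python
# Hedged sketch: a concrete processor that structurally satisfies DataProcessor.
from typing import Any, Dict, List

class WhitespaceProcessor:
    """Maps raw text to a ProcessedData dict; conforms to DataProcessor."""

    def __call__(self, raw_data: Any) -> Dict[str, Any]:
        tokens = str(raw_data).split()
        return {"tokens": tokens, "length": len(tokens)}

    def batch_process(self, batch_data: List[Any]) -> List[Dict[str, Any]]:
        return [self(item) for item in batch_data]

processor = WhitespaceProcessor()
print(processor.batch_process(["hello world", "type safety matters"]))
# Because DataProcessor is a Protocol, WhitespaceProcessor needs no explicit
# inheritance; static checkers accept it wherever DataProcessor is expected.
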
1.3.3 Extensibility Principle

Abstract base classes and a plugin mechanism make it possible to hook in custom data processing logic:

python
class ExtensibleDataProcessor:
    """Extensible data processor."""

    # plugin registries
    _preprocessors = {}
    _postprocessors = {}
    _batch_processors = {}

    @classmethod
    def register_preprocessor(cls, name: str):
        """Register a preprocessor."""
        def decorator(preprocessor_func):
            cls._preprocessors[name] = preprocessor_func
            return preprocessor_func
        return decorator

    @classmethod
    def register_postprocessor(cls, name: str):
        """Register a postprocessor."""
        def decorator(postprocessor_func):
            cls._postprocessors[name] = postprocessor_func
            return postprocessor_func
        return decorator

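A short usage sketch for the plugin registry above: register a custom preprocessor under a name and fetch it later by that name (the lowercase_text function is hypothetical):

python
# Hedged usage sketch for the registry decorator defined above.
@ExtensibleDataProcessor.register_preprocessor("lowercase")
def lowercase_text(example: dict) -> dict:
    example["text"] = example["text"].lower()
    return example

# the registry can now be consulted by name, e.g. inside a pipeline
preprocess = ExtensibleDataProcessor._preprocessors["lowercase"]
print(preprocess({"text": "Hello WORLD"}))  # {'text': 'hello world'}
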
2. In-Depth Analysis of Core Components

2.1 DataCollator System Architecture

2.1.1 Base DataCollator Implementation

data_collator.py is the core file of the data processing module and implements several collation strategies:

python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple, Union
from collections.abc import Callable
import multiprocessing as mp
import warnings

@dataclass
class DataCollatorMixin:
    """Mixin shared by the data collators."""

    return_tensors: str = "pt"
    pad_to_multiple_of: Optional[int] = None
    mlm_probability: float = 0.15
    return_attention_mask: bool = True
    return_token_type_ids: bool = False

    def __call__(
        self,
        features: List[Dict[str, Any]],
        return_tensors: Optional[str] = None
    ) -> Dict[str, Any]:
        """Main entry point for collation."""

        # 1. resolve arguments
        return_tensors = return_tensors or self.return_tensors

        # 2. validate the input
        self._validate_features(features)

        # 3. run the collation logic
        batch = self._collate_batch(features)

        # 4. tensor conversion
        if return_tensors:
            batch = self._convert_to_tensors(batch, return_tensors)

        return batch

    def _validate_features(self, features: List[Dict[str, Any]]):
        """Validate the input features."""

        if not features:
            raise ValueError("Features list cannot be empty")

        # check that all features carry the same keys
        if len(features) > 1:
            first_keys = set(features[0].keys())
            for i, feature in enumerate(features[1:], 1):
                current_keys = set(feature.keys())
                if current_keys != first_keys:
                    missing = first_keys - current_keys
                    extra = current_keys - first_keys

                    if missing:
                        warnings.warn(
                            f"Feature {i} is missing keys: {missing}"
                        )
                    if extra:
                        warnings.warn(
                            f"Feature {i} has extra keys: {extra}"
                        )

    @abstractmethod
    def _collate_batch(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Collate a batch -- must be implemented by subclasses."""
        raise NotImplementedError

    def _convert_to_tensors(
        self,
        batch: Dict[str, Any],
        return_tensors: str
    ) -> Dict[str, Any]:
        """Convert the batch to the requested tensor format."""

        if return_tensors == "pt":
            import torch
            for key, value in batch.items():
                if isinstance(value, list):
                    # nested integer lists become long tensors; everything else
                    # relies on torch's default dtype inference
                    if value and isinstance(value[0], list) and all(
                        isinstance(x, int) for x in value[0]
                    ):
                        batch[key] = torch.tensor(value, dtype=torch.long)
                    else:
                        batch[key] = torch.tensor(value)
        elif return_tensors == "np":
            import numpy as np
            for key, value in batch.items():
                if isinstance(value, list):
                    batch[key] = np.array(value)

        return batch

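A hedged usage sketch of the mixin: a minimal subclass that only implements _collate_batch and lets the mixin handle validation and tensor conversion. The SimplePaddingCollator class is illustrative, not part of the library.

python
# Hedged sketch: a concrete collator built on the DataCollatorMixin above.
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class SimplePaddingCollator(DataCollatorMixin):
    padding_value: int = 0

    def _collate_batch(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        input_ids = [f["input_ids"] for f in features]
        max_len = max(len(seq) for seq in input_ids)
        # pad every sequence on the right up to the batch maximum
        padded = [seq + [self.padding_value] * (max_len - len(seq)) for seq in input_ids]
        return {"input_ids": padded}

collator = SimplePaddingCollator(return_tensors="pt")
batch = collator([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}])
print(batch["input_ids"])  # tensor([[1, 2, 3], [4, 5, 0]])
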
2.1.2 Default Data Collator

python
def default_data_collator(
    features: List[Dict[str, Any]],
    return_tensors: str = "pt"
) -> Dict[str, Any]:
    """Default data collator implementation."""

    # 1. collect the union of feature keys
    feature_keys = set()
    for feature in features:
        feature_keys.update(feature.keys())

    # 2. handle each key separately
    batch = {}

    for key in feature_keys:
        # gather every value for this key
        values = [feature.get(key) for feature in features]

        # choose a strategy based on the value type
        if all(isinstance(v, (int, float)) for v in values if v is not None):
            # scalars: keep as a flat list (stacked during tensor conversion)
            batch[key] = values
        elif all(isinstance(v, list) for v in values if v is not None):
            # lists: need padding
            batch[key] = pad_sequence(values, batch_first=True)
        elif all(isinstance(v, dict) for v in values if v is not None):
            # dicts: recurse
            batch[key] = _collate_dict_values(values)
        else:
            # mixed types: leave untouched
            batch[key] = values

    # 3. tensor conversion
    if return_tensors == "pt":
        batch = _convert_to_pytorch_tensors(batch)
    elif return_tensors == "np":
        batch = _convert_to_numpy_arrays(batch)

    return batch

def _collate_dict_values(dict_list: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Collate a list of dict values."""

    if not dict_list:
        return {}

    # gather all sub-keys
    sub_keys = set()
    for d in dict_list:
        if d:
            sub_keys.update(d.keys())

    # collate each sub-key recursively
    result = {}
    for sub_key in sub_keys:
        sub_values = [d.get(sub_key) for d in dict_list]
        if all(isinstance(v, (int, float)) for v in sub_values if v is not None):
            result[sub_key] = sub_values
        elif all(isinstance(v, list) for v in sub_values if v is not None):
            result[sub_key] = pad_sequence(sub_values, batch_first=True)
        else:
            result[sub_key] = sub_values

    return result

def pad_sequence(
    sequences: List[Union[List[int], List[float]]],
    batch_first: bool = True,
    padding_value: int = 0
) -> List[List[int]]:
    """Pad sequences to a common length."""

    if not sequences:
        return []

    # 1. find the longest sequence
    max_len = max(len(seq) for seq in sequences)

    # 2. pad every sequence
    padded_sequences = []
    for seq in sequences:
        if len(seq) < max_len:
            # pad up to the maximum length
            if batch_first:
                padded_seq = seq + [padding_value] * (max_len - len(seq))
            else:
                padded_seq = [padding_value] * (max_len - len(seq)) + seq
        else:
            padded_seq = seq

        padded_sequences.append(padded_seq)

    return padded_sequences

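default_data_collator above calls two conversion helpers (_convert_to_pytorch_tensors and _convert_to_numpy_arrays) that the excerpt does not show. A minimal sketch of the PyTorch one, followed by a small usage example, might look like this; the helper body is an assumption, and the snippets are assumed to share one module so the name is visible when default_data_collator runs:

python
# Hedged sketch of the missing PyTorch conversion helper, plus usage.
from typing import Any, Dict
import torch

def _convert_to_pytorch_tensors(batch: Dict[str, Any]) -> Dict[str, Any]:
    # turn plain Python lists into tensors, leave everything else untouched
    return {
        key: torch.tensor(value) if isinstance(value, list) else value
        for key, value in batch.items()
    }

features = [
    {"input_ids": [101, 2023, 102], "label": 1},
    {"input_ids": [101, 102], "label": 0},
]
batch = default_data_collator(features, return_tensors="pt")
print(batch["input_ids"].shape)  # torch.Size([2, 3]) after padding
print(batch["label"])            # tensor([1, 0])
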
2.1.3 Specialized Data Collators

python
@dataclass
class DataCollatorForLanguageModeling(DataCollatorMixin):
    """Data collator for language modeling."""

    mlm: bool = False
    mlm_probability: float = 0.15
    pad_to_multiple_of: Optional[int] = None
    tokenizer: Optional[Any] = None

    def __post_init__(self):
        """Post-initialization checks."""
        if self.mlm and self.tokenizer is None:
            raise ValueError(
                "For masked language modeling, a tokenizer is required"
            )

    def _collate_batch(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Collate a language-modeling batch."""

        if self.mlm:
            return self._collate_mlm_batch(features)
        else:
            return self._collate_clm_batch(features)

    def _collate_mlm_batch(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Collate a masked language modeling (MLM) batch."""

        # 1. extract the input ids
        input_ids = [feature["input_ids"] for feature in features]
        labels = []

        # 2. apply random masking
        masked_inputs = []
        for input_id in input_ids:
            masked_input, label = self._mask_tokens(input_id)
            masked_inputs.append(masked_input)
            labels.append(label)

        # 3. build the attention mask from the unpadded lengths, then pad
        attention_mask = self._create_attention_mask(masked_inputs)
        masked_inputs, labels = self._pad_to_same_length(masked_inputs, labels)

        return {
            "input_ids": masked_inputs,
            "attention_mask": attention_mask,
            "labels": labels
        }

    def _mask_tokens(self, inputs: List[int]) -> Tuple[List[int], List[int]]:
        """Apply MLM masking to a token sequence."""

        import random

        labels = []
        masked_inputs = []

        for token_id in inputs:
            # decide whether this position is selected for masking
            prob = random.random()
            if prob < self.mlm_probability:
                # 80% of selected tokens: replace with [MASK]
                if prob < 0.8 * self.mlm_probability:
                    masked_inputs.append(self.tokenizer.mask_token_id)
                    labels.append(token_id)
                # 10% of selected tokens: replace with a random token
                elif prob < 0.9 * self.mlm_probability:
                    masked_inputs.append(random.randint(1, self.tokenizer.vocab_size - 1))
                    labels.append(token_id)
                # 10% of selected tokens: keep unchanged (but still predict them)
                else:
                    masked_inputs.append(token_id)
                    labels.append(token_id)
            else:
                masked_inputs.append(token_id)
                labels.append(-100)  # -100 means the loss ignores this position

        return masked_inputs, labels

    def _collate_clm_batch(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Collate a causal language modeling (CLM) batch."""

        # 1. extract the input ids and labels
        input_ids = [feature["input_ids"] for feature in features]
        labels = [feature["input_ids"] for feature in features]

        # 2. for CLM the labels are a shifted copy of the input ids
        for i in range(len(labels)):
            labels[i] = labels[i][1:] + [-100]  # append -100 at the end

        # 3. build the attention mask from the unpadded lengths, then pad
        attention_mask = self._create_attention_mask(input_ids)
        padded_inputs, padded_labels = self._pad_to_same_length(input_ids, labels)

        return {
            "input_ids": padded_inputs,
            "attention_mask": attention_mask,
            "labels": padded_labels
        }

    def _pad_to_same_length(
        self, input_ids: List[List[int]], labels: List[List[int]]
    ) -> Tuple[List[List[int]], List[List[int]]]:
        """Pad inputs with 0 and labels with -100 to a common length."""
        max_len = max(len(seq) for seq in input_ids)
        padded_inputs = [seq + [0] * (max_len - len(seq)) for seq in input_ids]
        padded_labels = [seq + [-100] * (max_len - len(seq)) for seq in labels]
        return padded_inputs, padded_labels

    def _create_attention_mask(self, sequences: List[List[int]]) -> List[List[int]]:
        """1 for real token positions, 0 for padding positions."""
        max_len = max(len(seq) for seq in sequences)
        return [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]

@dataclass
class DataCollatorForTokenClassification(DataCollatorMixin):
    """Data collator for token classification."""

    label_pad_token_id: int = -100
    return_attention_mask: bool = True

    def _collate_batch(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Collate a token-classification batch."""

        # 1. extract input ids, labels and attention masks
        input_ids = [feature["input_ids"] for feature in features]
        labels = [feature["labels"] for feature in features]
        attention_masks = [feature.get("attention_mask") for feature in features]

        # 2. pad the label sequences
        padded_labels = self._pad_labels(labels)

        # 3. pad the input sequences
        padded_inputs = self._pad_sequences(input_ids)

        # 4. pad the attention masks if they are present
        if attention_masks and all(mask is not None for mask in attention_masks):
            padded_attention_masks = self._pad_sequences(attention_masks)
        else:
            padded_attention_masks = None

        return {
            "input_ids": padded_inputs,
            "attention_mask": padded_attention_masks,
            "labels": padded_labels
        }

    def _pad_labels(self, labels: List[List[int]]) -> List[List[int]]:
        """Pad the label sequences."""

        # find the maximum length
        max_len = max(len(label) for label in labels)

        # pad every label sequence up to it
        padded_labels = []
        for label in labels:
            if len(label) < max_len:
                padded_label = label + [self.label_pad_token_id] * (max_len - len(label))
            else:
                padded_label = label
            padded_labels.append(padded_label)

        return padded_labels

    def _pad_sequences(self, sequences: List[List[int]]) -> List[List[int]]:
        """Pad input or attention-mask sequences with 0."""
        max_len = max(len(seq) for seq in sequences)
        return [seq + [0] * (max_len - len(seq)) for seq in sequences]

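A short usage sketch for the token-classification collator sketched above, using toy features so no tokenizer is needed:

python
# Hedged usage sketch for the simplified DataCollatorForTokenClassification.
collator = DataCollatorForTokenClassification(return_tensors="pt")
features = [
    {"input_ids": [101, 7592, 102], "labels": [0, 1, 0]},
    {"input_ids": [101, 102],       "labels": [0, 0]},
]
batch = collator(features)
print(batch["input_ids"])  # padded with 0 to shape (2, 3)
print(batch["labels"])     # padded with -100: tensor([[0, 1, 0], [0, 0, -100]])
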
2.2 Evaluation Metric System

2.2.1 Metric Base Class Architecture

python
from abc import ABC, abstractmethod
from typing import Dict, List, Union, Any, Optional
from dataclasses import dataclass
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support,
    confusion_matrix, classification_report
)

@dataclass
class MetricConfig:
    """Metric configuration."""

    average: Optional[str] = None   # 'micro', 'macro', 'weighted'
    zero_division: str = 'warn'     # how to handle division by zero
    pos_label: int = 1              # positive-class label
    multi_class: bool = False       # multi-class classification

class BaseMetric(ABC):
    """Base class for evaluation metrics."""

    def __init__(self, config: Optional[MetricConfig] = None):
        self.config = config or MetricConfig()
        self.reset()

    @abstractmethod
    def add_batch(self, predictions: np.ndarray, references: np.ndarray):
        """Add a batch of predictions."""
        raise NotImplementedError

    @abstractmethod
    def compute(self) -> Dict[str, float]:
        """Compute the final metric values."""
        raise NotImplementedError

    def reset(self):
        """Reset the internal state."""
        self.predictions = []
        self.references = []

    def add_example(self, prediction: Any, reference: Any):
        """Add a single prediction."""
        self.predictions.append(prediction)

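To illustrate the add_batch / compute contract, here is a minimal concrete metric built on the BaseMetric sketch above; the ExactMatchMetric class is illustrative, not part of the library.

python
# Hedged sketch: exact match over already-aligned prediction/reference arrays.
import numpy as np

class ExactMatchMetric(BaseMetric):
    def add_batch(self, predictions: np.ndarray, references: np.ndarray):
        self.predictions.extend(np.asarray(predictions).tolist())
        self.references.extend(np.asarray(references).tolist())

    def compute(self) -> dict:
        if not self.predictions:
            return {"exact_match": 0.0}
        preds = np.array(self.predictions)
        refs = np.array(self.references)
        return {"exact_match": float(np.mean(preds == refs))}

metric = ExactMatchMetric()
metric.add_batch(np.array([1, 0, 1]), np.array([1, 1, 1]))
print(metric.compute())  # {'exact_match': 0.666...}
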
2.2.2 Accuracy Metric Implementation

python
class AccuracyMetric(BaseMetric):
    """Accuracy metric implementation."""

    def add_batch(self, predictions: np.ndarray, references: np.ndarray):
        """Add a batch of predictions."""

        # 1. validate the input
        if predictions.shape != references.shape:
            raise ValueError("Predictions and references must have same shape")

        # 2. append to the internal state
        self.predictions.extend(predictions.tolist())
        self.references.extend(references.tolist())

    def compute(self) -> Dict[str, float]:
        """Compute the accuracy."""

        if not self.predictions:
            return {"accuracy": 0.0}

        # 1. convert to numpy arrays
        pred_array = np.array(self.predictions)
        ref_array = np.array(self.references)

        # 2. compute the accuracy
        if len(pred_array.shape) == 1:
            # 1-D array: plain classification accuracy
            accuracy = accuracy_score(ref_array, pred_array)
            return {"accuracy": accuracy}

        elif len(pred_array.shape) == 2:
            # 2-D array: sequence accuracy (the whole sequence must match)
            sequence_accuracy = np.mean(
                np.all(pred_array == ref_array, axis=1)
            )

            # element-wise accuracy
            element_accuracy = accuracy_score(
                ref_array.flatten(), pred_array.flatten()
            )

            return {
                "sequence_accuracy": sequence_accuracy,
                "element_accuracy": element_accuracy,
                "overall_accuracy": element_accuracy
            }

        else:
            raise ValueError("Unsupported prediction shape")

    def add_example(self, prediction: Any, reference: Any):
        """Add a single example."""

        self.predictions.append(prediction)
        self.references.append(reference)

class F1Metric(BaseMetric):
    """F1 score metric implementation."""

    def __init__(self, config: Optional[MetricConfig] = None):
        super().__init__(config)
        self._supports_multi_class = True

    def add_batch(self, predictions: np.ndarray, references: np.ndarray):
        """Add a batch of predictions."""

        # 1. handle score/logit inputs
        if len(predictions.shape) > 1 and predictions.shape[-1] > 1:
            # class scores: take the argmax
            pred_labels = np.argmax(predictions, axis=-1)
        else:
            pred_labels = predictions.flatten()

        ref_labels = references.flatten()

        # 2. append to the internal state
        self.predictions.extend(pred_labels.tolist())
        self.references.extend(ref_labels.tolist())

    def compute(self) -> Dict[str, float]:
        """Compute the F1 score."""

        if not self.predictions:
            return {
                "f1": 0.0, "precision": 0.0, "recall": 0.0
            }

        # 1. precision, recall and F1
        precision, recall, f1, support = precision_recall_fscore_support(
            self.references,
            self.predictions,
            average=self.config.average,
            zero_division=self.config.zero_division
        )

        # 2. build the result dict
        if self.config.average is None:
            # per-class metrics
            results = {
                f"precision_class_{i}": p
                for i, p in enumerate(precision)
            }
            results.update({
                f"recall_class_{i}": r
                for i, r in enumerate(recall)
            })
            results.update({
                f"f1_class_{i}": f
                for i, f in enumerate(f1)
            })

            # add macro and micro averages
            micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
                self.references,
                self.predictions,
                average="micro",
                zero_division=self.config.zero_division
            )
            results.update({
                "precision_macro": np.mean(precision),
                "recall_macro": np.mean(recall),
                "f1_macro": np.mean(f1),
                "precision_micro": micro_p,
                "recall_micro": micro_r,
                "f1_micro": micro_f1
            })

            return results

        else:
            # averaged metrics (support is None when an average is requested)
            return {
                "precision": float(precision),
                "recall": float(recall),
                "f1": float(f1),
                "support": support
            }

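A short usage sketch for the two metric classes above (scikit-learn is required):

python
# Hedged usage sketch for AccuracyMetric and F1Metric.
import numpy as np

accuracy = AccuracyMetric()
accuracy.add_batch(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0]))
print(accuracy.compute())  # {'accuracy': 0.75}

f1 = F1Metric(MetricConfig(average="macro"))
f1.add_batch(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0]))
print(f1.compute())  # roughly {'precision': 0.75, 'recall': 0.83, 'f1': 0.73, 'support': None}
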
2.2.3 Text Generation Metrics

python
from collections import Counter

class BLEUMetric(BaseMetric):
    """BLEU metric implementation."""

    def __init__(self, config: Optional[MetricConfig] = None):
        super().__init__(config)
        self.max_order = 4
        self.smooth = True
        self._bleu = None

    def _ngram_counts(self, sentence: List[str], n: int) -> Dict:
        """Count the n-grams of a token list."""

        if n == 1:
            return Counter(sentence)

        ngrams = [
            ' '.join(sentence[i:i+n])
            for i in range(len(sentence) - n + 1)
        ]

        return Counter(ngrams)

    def _modified_precision(self,
                         candidate: List[str],
                         reference: List[str],
                         n: int) -> float:
        """Compute the clipped (modified) n-gram precision."""

        candidate_counts = self._ngram_counts(candidate, n)
        max_counts = {}

        # maximum count of each candidate n-gram over the references
        for ref in reference:
            ref_counts = self._ngram_counts(ref, n)
            for ngram in candidate_counts:
                max_counts[ngram] = max(
                    max_counts.get(ngram, 0),
                    ref_counts.get(ngram, 0)
                )

        # clipped counts
        clipped_counts = {
            ngram: min(candidate_counts[ngram], max_counts.get(ngram, 0))
            for ngram in candidate_counts
        }

        total_clipped = sum(clipped_counts.values())
        total_candidate = sum(candidate_counts.values())

        if total_candidate == 0:
            return 0.0

        return total_clipped / total_candidate

    def add_batch(self,
                  predictions: List[List[str]],
                  references: List[List[str]]):
        """Add a batch of predictions."""

        for pred, ref in zip(predictions, references):
            self.predictions.append(pred)
            self.references.append(ref)

    def compute(self) -> Dict[str, float]:
        """Compute the BLEU score."""

        if not self.predictions:
            return {"bleu": 0.0}

        # 1. modified precision for each n-gram order
        precisions = []
        for n in range(1, self.max_order + 1):
            prec = []
            for pred, ref in zip(self.predictions, self.references):
                p = self._modified_precision(pred, [ref], n)
                prec.append(p)

            if prec:
                precisions.append(np.mean(prec))
            else:
                precisions.append(0.0)

        # 2. smoothing
        if self.smooth:
            for i in range(len(precisions)):
                if precisions[i] == 0.0:
                    precisions[i] = 1.0 / (10 ** (len(precisions) - i - 1))

        # 3. geometric mean of the precisions
        log_precisions = np.log(precisions)
        avg_log_precision = np.mean(log_precisions)

        # 4. brevity penalty
        bp = self._brevity_penalty(self.predictions, self.references)

        # 5. BLEU
        bleu = bp * np.exp(avg_log_precision)

        return {
            "bleu": bleu,
            "precisions": precisions,
            "brevity_penalty": bp
        }

    def _brevity_penalty(self,
                         candidates: List[List[str]],
                         references: List[List[str]]) -> float:
        """Brevity penalty factor."""

        total_cand_len = sum(len(c) for c in candidates)
        # effective reference length (a single reference per candidate here)
        total_ref_len = sum(len(ref) for ref in references)

        if total_cand_len == 0:
            return 0.0
        if total_cand_len > total_ref_len:
            return 1.0
        else:
            return np.exp(1 - total_ref_len / total_cand_len)

class ROUGEMetric(BaseMetric):
    """ROUGE metric implementation."""

    def __init__(self, config: Optional[MetricConfig] = None):
        super().__init__(config)
        self._rouge_types = ['rouge1', 'rouge2', 'rougeL']
        self._rouge = None

    def add_batch(self,
                  predictions: List[str],
                  references: List[List[str]]):
        """Add a batch of predictions."""

        for pred, ref in zip(predictions, references):
            self.predictions.append(pred)
            self.references.append(ref)

    def compute(self) -> Dict[str, float]:
        """Compute the ROUGE scores."""

        if not self.predictions:
            return {rouge_type: 0.0 for rouge_type in self._rouge_types}

        # 1. n-gram / LCS overlap per example
        scores = {}

        for rouge_type in self._rouge_types:
            rouge_scores = []

            for pred, refs in zip(self.predictions, self.references):
                if rouge_type == 'rougeL':
                    # longest common subsequence
                    score = self._rouge_l(pred, refs)
                else:
                    # ROUGE-1, ROUGE-2
                    n = int(rouge_type[-1])
                    score = self._rouge_n(pred, refs, n)

                rouge_scores.append(score)

            # 2. average over the corpus
            scores[rouge_type] = np.mean(rouge_scores)

        return scores

    def _rouge_n(self,
                 candidate: str,
                 references: List[str],
                 n: int) -> float:
        """Compute the ROUGE-n score."""

        def get_ngrams(text: str, n: int) -> Counter:
            tokens = text.lower().split()
            ngrams = [
                ' '.join(tokens[i:i+n])
                for i in range(len(tokens) - n + 1)
            ]
            return Counter(ngrams)

        # candidate n-grams
        cand_ngrams = get_ngrams(candidate, n)

        # reference n-grams (maximum count over the references)
        ref_counts = Counter()
        for ref in references:
            ref_ngrams = get_ngrams(ref, n)
            for ngram, count in ref_ngrams.items():
                ref_counts[ngram] = max(ref_counts[ngram], count)

        # overlap
        overlap = 0
        for ngram, count in cand_ngrams.items():
            overlap += min(count, ref_counts.get(ngram, 0))

        total_cand = sum(cand_ngrams.values())
        total_ref = sum(ref_counts.values())

        if total_cand == 0 or total_ref == 0:
            return 0.0

        # precision
        precision = overlap / total_cand

        # recall
        recall = overlap / total_ref

        # F1
        if precision + recall == 0:
            return 0.0

        f1 = 2 * precision * recall / (precision + recall)
        return f1

    def _rouge_l(self,
                 candidate: str,
                 references: List[str]) -> float:
        """Compute the ROUGE-L score (longest common subsequence)."""

        def longest_common_subsequence(seq1: List[str], seq2: List[str]) -> int:
            """Length of the longest common subsequence of two token lists."""

            m, n = len(seq1), len(seq2)
            dp = [[0] * (n + 1) for _ in range(m + 1)]

            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    if seq1[i-1] == seq2[j-1]:
                        dp[i][j] = dp[i-1][j-1] + 1
                    else:
                        dp[i][j] = max(dp[i-1][j], dp[i][j-1])

            return dp[m][n]

        # word-level LCS against every reference, keep the maximum
        cand_tokens = candidate.split()
        max_lcs = 0
        for ref in references:
            lcs_len = longest_common_subsequence(cand_tokens, ref.split())
            max_lcs = max(max_lcs, lcs_len)

        # precision, recall and F1 over word counts
        cand_len = len(cand_tokens)
        ref_len = max(len(ref.split()) for ref in references)

        if max_lcs == 0:
            return 0.0

        precision = max_lcs / cand_len
        recall = max_lcs / ref_len
        f1 = 2 * precision * recall / (precision + recall)

        return f1

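A toy usage sketch for the BLEU and ROUGE sketches above; note that this BLEU variant expects token lists, while the ROUGE variant expects raw strings with a list of reference strings per prediction:

python
# Hedged usage sketch for the BLEUMetric and ROUGEMetric classes above.
bleu = BLEUMetric()
bleu.add_batch(
    predictions=[["the", "cat", "sat", "on", "the", "mat"]],
    references=[["the", "cat", "is", "on", "the", "mat"]],
)
print(bleu.compute()["bleu"])

rouge = ROUGEMetric()
rouge.add_batch(
    predictions=["the cat sat on the mat"],
    references=[["the cat is on the mat"]],
)
print(rouge.compute())  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ...}
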
2.3 Dataset Processing System

2.3.1 Dataset Loader

python
import os
import logging

logger = logging.getLogger(__name__)

class DatasetLoader:
    """Dataset loader."""
    # NOTE: the cache-path and raw-download helpers are omitted from this excerpt

    def __init__(self, cache_dir: Optional[str] = None):
        self.cache_dir = cache_dir
        self.downloaded_datasets = {}
        self._setup_cache()

    def _setup_cache(self):
        """Prepare the cache directory."""

        if self.cache_dir:
            os.makedirs(self.cache_dir, exist_ok=True)

    def load_dataset(
        self,
        dataset_name: str,
        split: str = "train",
        cache_dir: Optional[str] = None,
        download_mode: bool = True
    ):
        """Load a dataset."""

        # 1. check the cache
        cache_key = f"{dataset_name}_{split}"
        cache_path = self._get_cache_path(cache_key, cache_dir)

        if cache_path and os.path.exists(cache_path):
            logger.info(f"Loading {cache_key} from cache")
            return self._load_cached_dataset(cache_path)

        # 2. download and load the dataset
        if download_mode:
            logger.info(f"Downloading {dataset_name} {split} split")
            dataset = self._download_and_process(dataset_name, split)

            # 3. cache it
            if cache_path:
                self._save_to_cache(dataset, cache_path)

            return dataset
        else:
            raise FileNotFoundError(f"Dataset {cache_key} not found and download disabled")

    def _download_and_process(self, dataset_name: str, split: str):
        """Download and process a dataset."""

        # 1. pick a processor
        processor = self._get_processor(dataset_name)

        # 2. download the raw data
        raw_data = self._download_raw_data(dataset_name, split)

        # 3. process it
        processed_data = processor.process(raw_data, split)

        return processed_data

    def _get_processor(self, dataset_name: str):
        """Look up the data processor for a dataset."""

        processors = {
            'glue': GLUEProcessor,
            'squad': SQuADProcessor,
            'wmt': WMTProcessor,
            'cnn_dailymail': CNNDailyMailProcessor,
            'imdb': IMDBProcessor
        }

        processor_class = processors.get(dataset_name.lower())
        if processor_class is None:
            raise ValueError(f"Unsupported dataset: {dataset_name}")

        return processor_class()

    def _download_raw_data(self, dataset_name: str, split: str):
        """Download the raw data."""

        # pick a download path per dataset
        if dataset_name == 'glue':
            return self._download_glue_data(split)
        elif dataset_name == 'squad':
            return self._download_squad_data(split)
        else:
            # fall back to the HuggingFace Hub
            from datasets import load_dataset
            return load_dataset(dataset_name, split=split)

class GLUEProcessor:
    """GLUE dataset processor."""

    def __init__(self):
        self.task_to_keys = {
            "cola": ["sentence"],
            "mnli": ["premise", "hypothesis"],
            "mrpc": ["sentence1", "sentence2"],
            "qnli": ["question", "sentence"],
            "qqp": ["question1", "question2"],
            "rte": ["sentence1", "sentence2"],
            "sst": ["sentence"],
            "sts-b": ["sentence1", "sentence2"],
            "wnli": ["sentence1", "sentence2"],
        }

    def process(self, raw_data, split: str):
        """Process GLUE data."""

        # 1. determine the task
        task_name = self._detect_task_name(raw_data)
        text_keys = self.task_to_keys[task_name]

        # 2. extract text and labels
        processed_data = []

        for example in raw_data:
            # text input
            if len(text_keys) == 1:
                # single-sentence task
                text = example[text_keys[0]]
                inputs = text
            else:
                # sentence-pair task
                text1 = example[text_keys[0]]
                text2 = example[text_keys[1]]
                inputs = f"{text1} {text2}"

            # label
            if 'label' in example:
                label = example['label']
            elif 'idx' in example:
                # some tasks use idx as the label
                label = example['idx']
            else:
                label = None

            processed_data.append({
                'text': inputs,
                'label': label,
                'task': task_name
            })

        return processed_data

    def _detect_task_name(self, raw_data):
        """Infer the task name from the field names."""

        if not raw_data:
            return "unknown"

        example = raw_data[0]

        # infer the task from the available fields
        if "sentence" in example:
            if "premise" in example:
                return "mnli"
            else:
                return "sst"
        elif "premise" in example:
            return "mnli"
        elif "sentence1" in example:
            return "mrpc"
        elif "question" in example:
            return "qnli"
        elif "question1" in example:
            return "qqp"
        else:
            return "unknown"

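A short usage sketch for the GLUEProcessor above on two MRPC-style sentence-pair examples (the sentences are made up):

python
# Hedged usage sketch for the GLUEProcessor class above.
raw_data = [
    {"sentence1": "A plane is taking off.", "sentence2": "An air plane is taking off.", "label": 1},
    {"sentence1": "A man is playing a flute.", "sentence2": "A man is eating pasta.", "label": 0},
]

processor = GLUEProcessor()
processed = processor.process(raw_data, split="train")
print(processed[0])
# {'text': 'A plane is taking off. An air plane is taking off.', 'label': 1, 'task': 'mrpc'}
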
2.3.2 Dataset Dictionary

python
import random

class DatasetDict:
    """A simple dataset dictionary."""

    def __init__(self,
                 train: Optional[Any] = None,
                 validation: Optional[Any] = None,
                 test: Optional[Any] = None):
        self.train = train
        self.validation = validation
        self.test = test

    def __getitem__(self, key: str):
        """Return the dataset for a given split."""
        if key not in ['train', 'validation', 'test']:
            raise KeyError(f"Invalid split: {key}")
        return getattr(self, key)

    def __setitem__(self, key: str, value: Any):
        """Set the dataset for a given split."""
        if key not in ['train', 'validation', 'test']:
            raise KeyError(f"Invalid split: {key}")
        setattr(self, key, value)

    def keys(self):
        """Return the available splits."""
        splits = []
        if self.train is not None:
            splits.append('train')
        if self.validation is not None:
            splits.append('validation')
        if self.test is not None:
            splits.append('test')
        return splits

    def map(self, function, batched: bool = False):
        """Apply a mapping function to every split."""
        return DatasetDict(
            train=self._map_split(self.train, function, batched) if self.train else None,
            validation=self._map_split(self.validation, function, batched) if self.validation else None,
            test=self._map_split(self.test, function, batched) if self.test else None
        )

    def _map_split(self, dataset, function, batched):
        """Apply the mapping function to a single split."""
        if dataset is None:
            return None

        if batched:
            return [function(batch) for batch in dataset]
        else:
            return [function(example) for example in dataset]

    def filter(self, function):
        """Filter every split."""
        return DatasetDict(
            train=self._filter_split(self.train, function) if self.train else None,
            validation=self._filter_split(self.validation, function) if self.validation else None,
            test=self._filter_split(self.test, function) if self.test else None
        )

    def _filter_split(self, dataset, function):
        """Filter a single split."""
        if dataset is None:
            return None

        return [example for example in dataset if function(example)]

    def shuffle(self, seed: Optional[int] = None):
        """Shuffle every split."""

        if seed is not None:
            random.seed(seed)

        return DatasetDict(
            train=self._shuffle_split(self.train) if self.train else None,
            validation=self._shuffle_split(self.validation) if self.validation else None,
            test=self._shuffle_split(self.test) if self.test else None
        )

    def _shuffle_split(self, dataset):
        """Shuffle a single split."""
        if dataset is None:
            return None

        shuffled = dataset.copy()
        random.shuffle(shuffled)
        return shuffled

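A usage sketch for the simplified DatasetDict above, using plain lists of dicts as splits:

python
# Hedged usage sketch for the simplified DatasetDict class above.
train = [{"text": "good movie", "label": 1}, {"text": "bad movie", "label": 0}]
test = [{"text": "fine movie", "label": 1}]

ds = DatasetDict(train=train, test=test)
print(ds.keys())                      # ['train', 'test']

upper = ds.map(lambda ex: {**ex, "text": ex["text"].upper()})
print(upper["train"][0]["text"])      # 'GOOD MOVIE'

positives = ds.filter(lambda ex: ex["label"] == 1)
print(len(positives["train"]))        # 1
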
3. In-Depth Analysis of Call Flows

3.1 Data Loading and Processing Flow

The loading flow can be summarized as follows:

user calls load_dataset → detect the dataset type → local cache available?
  ├─ yes: load from the cache
  └─ no: download from the Hub
→ data validation → raw-data processing → feature extraction → dataset splitting → return a DatasetDict
→ the user consumes the data:
  ├─ training: DataCollator → batching → model input
  └─ inference: use the data directly

3.1.1 Detailed Loading Flow

python
import os
import logging

logger = logging.getLogger(__name__)

class DataLoadingFlow:
    """Implementation of the data loading flow."""
    # NOTE: the cache helpers (_get_cache_path, _load_from_cache, _save_to_cache)
    # and _create_dataset_dict are omitted from this excerpt

    def load_and_process_dataset(
        self,
        dataset_name: str,
        splits: List[str] = ["train", "validation", "test"],
        processor_name: Optional[str] = None,
        **kwargs
    ):
        """Complete data loading and processing flow."""

        # 1. dataset discovery
        dataset_info = self._discover_dataset(dataset_name)

        # 2. cache check
        cache_info = self._check_cache(dataset_name, splits)

        # 3. load from cache or download
        if cache_info['all_cached']:
            raw_data = self._load_from_cache(cache_info['paths'])
        else:
            raw_data = self._download_and_load(dataset_name, splits, **kwargs)
            self._save_to_cache(raw_data, dataset_name, splits)

        # 4. data processing
        processed_data = self._process_raw_data(
            raw_data, processor_name or dataset_name
        )

        # 5. build the dataset dictionary
        dataset_dict = self._create_dataset_dict(processed_data, splits)

        return dataset_dict

    def _discover_dataset(self, dataset_name: str):
        """Discover dataset metadata."""

        # 1. built-in datasets
        builtin_datasets = {
            'glue': {
                'description': 'General Language Understanding Evaluation',
                'tasks': ['cola', 'mnli', 'mrpc', 'qnli', 'qqp', 'rte', 'sst', 'sts-b', 'wnli'],
                'formats': {'text', 'text_pair'},
                'splits': ['train', 'validation', 'test']
            },
            'squad': {
                'description': 'Stanford Question Answering Dataset',
                'tasks': ['question_answering'],
                'formats': {'question_context_pair'},
                'splits': ['train', 'validation']
            },
            'imdb': {
                'description': 'IMDB Movie Reviews Dataset',
                'tasks': ['sentiment_classification'],
                'formats': {'text'},
                'splits': ['train', 'test']
            }
        }

        return builtin_datasets.get(dataset_name.lower(), None)

    def _check_cache(self, dataset_name: str, splits: List[str]):
        """Check the cache status."""

        cache_info = {
            'paths': {},
            'all_cached': True
        }

        for split in splits:
            cache_path = self._get_cache_path(dataset_name, split)
            cache_info['paths'][split] = cache_path

            if not cache_path or not os.path.exists(cache_path):
                cache_info['all_cached'] = False

        return cache_info

    def _download_and_load(self, dataset_name: str, splits: List[str], **kwargs):
        """Download and load the raw data."""

        raw_data = {}

        for split in splits:
            try:
                # 1. use the HuggingFace datasets library
                from datasets import load_dataset
                raw_data[split] = load_dataset(dataset_name, split=split)

            except Exception as e:
                logger.error(f"Failed to load {dataset_name} {split}: {e}")
                raise

        return raw_data

    def _process_raw_data(self, raw_data: Dict[str, Any], processor_name: str):
        """Process the raw data."""

        # 1. pick a processor
        processor = self._get_data_processor(processor_name)

        # 2. process every split
        processed_data = {}

        for split, data in raw_data.items():
            logger.info(f"Processing {split} split with {processor_name} processor")
            processed_data[split] = processor.process(data, split)

        return processed_data

    def _get_data_processor(self, processor_name: str):
        """Look up a data processor."""

        # processor registry
        processors = {
            'glue': GLUEProcessor,
            'squad': SQuADProcessor,
            'wmt': WMTProcessor,
            'cnn_dailymail': CNNDailyMailProcessor,
            'imdb': IMDBProcessor
        }

        processor_class = processors.get(processor_name.lower())
        if processor_class is None:
            logger.warning(f"Processor not found for {processor_name}, using default")
            return DefaultProcessor()

        return processor_class()

3.2 Batch Processing Flow

The batching flow can be summarized as follows:

user supplies batch features → DataCollator initialization → input validation → choose a strategy by data type:
  ├─ sequence data: sequence padding
  ├─ classification data: classification-specific handling
  └─ language modeling: MLM/CLM handling
→ attention-mask generation → tensor conversion → return the collated batch

3.2.1 Batch Processing Implementation Details

python
import numpy as np

class BatchProcessingFlow:
    """Implementation of the batch processing flow."""

    def __init__(self, collator_type: str = "default"):
        self.collator_type = collator_type
        self.collator = self._create_collator(collator_type)

    def _create_collator(self, collator_type: str):
        """Create a data collator (some entries are not defined in this article)."""

        collators = {
            'default': default_data_collator,
            'language_modeling': DataCollatorForLanguageModeling,
            'token_classification': DataCollatorForTokenClassification,
            'sequence_classification': DataCollatorForSequenceClassification,
            'question_answering': DataCollatorWithPadding,
            'vision': DataCollatorForVision
        }

        collator_class = collators.get(collator_type)
        if collator_class is None:
            raise ValueError(f"Unknown collator type: {collator_type}")

        # 'default' maps to a plain function, the other entries are classes
        return collator_class() if isinstance(collator_class, type) else collator_class

    def process_batch(
        self,
        batch_features: List[Dict[str, Any]],
        tokenizer = None,
        return_tensors: str = "pt",
        **collator_kwargs
    ):
        """Process a batch of features."""

        # 1. input preprocessing
        preprocessed_features = self._preprocess_batch(batch_features)

        # 2. data type detection
        data_type = self._detect_data_type(preprocessed_features)

        # 3. dispatch to the matching strategy
        # (the per-type handlers are omitted from this excerpt)
        if data_type == 'language_modeling':
            return self._process_language_modeling_batch(
                preprocessed_features, tokenizer, return_tensors, **collator_kwargs
            )
        elif data_type == 'token_classification':
            return self._process_token_classification_batch(
                preprocessed_features, return_tensors, **collator_kwargs
            )
        elif data_type == 'sequence_classification':
            return self._process_sequence_classification_batch(
                preprocessed_features, return_tensors, **collator_kwargs
            )
        else:
            return self._process_default_batch(
                preprocessed_features, return_tensors, **collator_kwargs
            )

    def _preprocess_batch(self, batch_features: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Preprocess a batch."""

        processed_features = []

        for feature in batch_features:
            processed_feature = {}

            for key, value in feature.items():
                # 1. type conversion
                if isinstance(value, str):
                    # tokenize raw text if a tokenizer is attached
                    if hasattr(self, 'tokenizer') and self.tokenizer:
                        processed_feature[key] = self.tokenizer.encode(value)
                    else:
                        processed_feature[key] = value
                elif isinstance(value, list):
                    # keep lists as they are
                    processed_feature[key] = value
                elif isinstance(value, np.ndarray):
                    # convert arrays to lists
                    processed_feature[key] = value.tolist()
                else:
                    # keep everything else unchanged
                    processed_feature[key] = value

            # 2. data cleaning
            processed_feature = self._clean_feature_data(processed_feature)
            processed_features.append(processed_feature)

        return processed_features

    def _detect_data_type(self, features: List[Dict[str, Any]]) -> str:
        """Detect the data type of a batch."""

        if not features:
            return 'default'

        feature = features[0]

        # 1. language modeling
        if 'input_ids' in feature and 'labels' in feature:
            if isinstance(feature['labels'], list) and len(feature['labels']) == len(feature['input_ids']):
                return 'language_modeling'

        # 2. token classification
        if 'labels' in feature and isinstance(feature['labels'], list):
            return 'token_classification'

        # 3. sequence classification
        if 'labels' in feature and isinstance(feature['labels'], (int, float)):
            return 'sequence_classification'

        return 'default'

    def _clean_feature_data(self, feature: Dict[str, Any]) -> Dict[str, Any]:
        """Clean a single feature dict."""

        cleaned_feature = {}

        for key, value in feature.items():
            # 1. drop None values
            if value is None:
                continue

            # 2. type validation and normalization
            if isinstance(value, str):
                cleaned_feature[key] = value.strip()
            elif isinstance(value, (list, tuple)):
                # drop empty entries
                cleaned_feature[key] = [v for v in value if v is not None]
            else:
                cleaned_feature[key] = value

        return cleaned_feature

4. Advanced Features and Optimizations

4.1 Parallel Data Processing

python
class ParallelDataProcessor:
    """Parallel data processing."""

    def __init__(self, num_workers: Optional[int] = None):
        self.num_workers = num_workers or mp.cpu_count()
        self.process_pool = None

    def __enter__(self):
        """Context manager entry."""
        self.process_pool = mp.Pool(self.num_workers)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit."""
        if self.process_pool:
            self.process_pool.close()
            self.process_pool.join()

    def parallel_process(
        self,
        data: List[Any],
        process_func: Callable,
        chunk_size: Optional[int] = None
    ) -> List[Any]:
        """Process data in parallel."""

        if not self.process_pool:
            raise RuntimeError("Parallel processor not initialized")

        # 1. split the data into chunks
        if chunk_size is None:
            chunk_size = max(1, len(data) // self.num_workers)

        chunks = [
            data[i:i + chunk_size]
            for i in range(0, len(data), chunk_size)
        ]

        # 2. map the chunks over the pool
        # NOTE: process_func must be picklable, i.e. defined at module level
        results = self.process_pool.map(process_func, chunks)

        # 3. flatten the per-chunk results
        flat_results = []
        for chunk_result in results:
            flat_results.extend(chunk_result)

        return flat_results

    def parallel_tokenize(
        self,
        texts: List[str],
        tokenizer,
        batch_size: int = 1000
    ) -> List[List[int]]:
        """Tokenize texts in parallel."""

        # NOTE: a local closure like this is only illustrative; multiprocessing
        # requires the mapped function to be defined at module level
        def tokenize_batch(text_batch):
            return [tokenizer.encode(text) for text in text_batch]

        with ParallelDataProcessor() as processor:
            return processor.parallel_process(texts, tokenize_batch, batch_size)

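A hedged usage sketch for ParallelDataProcessor: the mapped function has to be defined at module level so that multiprocessing can pickle it, and the pool should be created behind an if __name__ == "__main__" guard.

python
# Hedged usage sketch for the ParallelDataProcessor class above.
def count_tokens(chunk):
    # each worker receives one chunk (a list) of texts
    return [len(text.split()) for text in chunk]

if __name__ == "__main__":
    texts = ["one two three", "four five", "six"] * 100
    with ParallelDataProcessor(num_workers=2) as processor:
        lengths = processor.parallel_process(texts, count_tokens, chunk_size=50)
    print(len(lengths), lengths[:3])  # 300 [3, 2, 1]
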
class MemoryOptimizedDataLoader:
    """Memory-aware data loader."""

    def __init__(
        self,
        dataset,
        batch_size: int,
        shuffle: bool = True,
        memory_limit_mb: int = 1024
    ):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.memory_limit_bytes = memory_limit_mb * 1024 * 1024

        # memory bookkeeping
        self.current_memory_usage = 0
        self.loaded_batches = []

    def __iter__(self):
        """Generator-style iterator."""

        # 1. index the dataset
        indices = list(range(len(self.dataset)))

        if self.shuffle:
            import random
            random.shuffle(indices)

        # 2. yield batches
        for i in range(0, len(indices), self.batch_size):
            batch_indices = indices[i:i + self.batch_size]
            batch = [self.dataset[idx] for idx in batch_indices]

            # 3. memory check
            if self._check_memory_limit(batch):
                yield self._process_batch(batch)
            else:
                # over the limit: fall back to single samples
                for sample in batch:
                    yield [sample]

    def _process_batch(self, batch: List[Any]) -> List[Any]:
        """Hook for batch post-processing (identity by default)."""
        return batch

    def _check_memory_limit(self, batch: List[Any]) -> bool:
        """Check the estimated batch footprint against the memory limit."""

        # estimate the batch memory footprint
        batch_size_bytes = self._estimate_batch_memory(batch)

        if batch_size_bytes > self.memory_limit_bytes:
            return False

        self.current_memory_usage += batch_size_bytes
        return True

    def _estimate_batch_memory(self, batch: List[Any]) -> int:
        """Estimate the memory footprint of a batch."""

        if not batch:
            return 0

        # rough estimate based on the first sample
        sample = batch[0]

        if isinstance(sample, dict):
            # dict sample
            sample_size = sum(
                self._estimate_object_size(value)
                for value in sample.values()
            )
        else:
            # any other sample type
            sample_size = self._estimate_object_size(sample)

        return sample_size * len(batch)

    def _estimate_object_size(self, obj) -> int:
        """Estimate the memory footprint of an object."""

        if isinstance(obj, str):
            return len(obj.encode('utf-8'))
        elif isinstance(obj, (list, tuple)):
            return sum(self._estimate_object_size(item) for item in obj)
        elif isinstance(obj, dict):
            return sum(self._estimate_object_size(v) for v in obj.values())
        elif isinstance(obj, np.ndarray):
            return obj.nbytes
        elif hasattr(obj, '__sizeof__'):
            return obj.__sizeof__()
        else:
            return 64  # default guess

4.2 Dynamic Data Augmentation

python
class DataAugmentationPipeline:
    """Data augmentation pipeline."""

    def __init__(self, augmentation_config: Dict[str, Any]):
        self.config = augmentation_config
        self.augmenters = self._setup_augmenters()

    def _setup_augmenters(self) -> Dict[str, Any]:
        """Instantiate the augmenters (only some of them are defined below)."""

        augmenters = {}

        # text augmentation
        if 'text' in self.config:
            text_config = self.config['text']
            augmenters['text'] = {
                'synonym_replacement': SynonymReplacementAugmenter(text_config),
                'random_insertion': RandomInsertionAugmenter(text_config),
                'random_swap': RandomSwapAugmenter(text_config),
                'random_deletion': RandomDeletionAugmenter(text_config)
            }

        # image augmentation
        if 'image' in self.config:
            image_config = self.config['image']
            augmenters['image'] = {
                'random_crop': RandomCropAugmenter(image_config),
                'horizontal_flip': HorizontalFlipAugmenter(image_config),
                'color_jitter': ColorJitterAugmenter(image_config),
                'gaussian_noise': GaussianNoiseAugmenter(image_config)
            }

        return augmenters

    def augment_batch(self, batch_data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Augment a whole batch."""

        augmented_batch = []

        for sample in batch_data:
            augmented_sample = self._augment_sample(sample)
            augmented_batch.append(augmented_sample)

        return augmented_batch

    def _augment_sample(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        """Augment a single sample."""

        augmented_sample = sample.copy()

        # pick a strategy based on the data type
        if 'input_ids' in sample:
            # text data
            augmented_sample = self._augment_text_sample(augmented_sample)
        elif 'pixel_values' in sample:
            # image data
            augmented_sample = self._augment_image_sample(augmented_sample)

        return augmented_sample

    def _augment_text_sample(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        """Augment a text sample."""

        text_augmenters = self.augmenters.get('text', {})

        if not text_augmenters:
            return sample

        # decide whether to augment this sample at all
        augmentation_prob = self.config.get('augmentation_probability', 0.5)

        import random

        if random.random() > augmentation_prob:
            return sample

        # pick one augmentation method at random
        augmenter_name = random.choice(list(text_augmenters.keys()))
        augmenter = text_augmenters[augmenter_name]

        # de-tokenize, augment, re-tokenize
        if hasattr(self, 'tokenizer') and self.tokenizer:
            text = self.tokenizer.decode(sample['input_ids'])
            augmented_text = augmenter.augment(text)
            augmented_input_ids = self.tokenizer.encode(augmented_text)
            sample['input_ids'] = augmented_input_ids

        return sample

class SynonymReplacementAugmenter:
    """Synonym replacement augmenter."""

    def __init__(self, config: Dict[str, Any]):
        self.replacement_prob = config.get('replacement_prob', 0.1)
        self.synonym_dict = config.get('synonym_dict', {})

    def augment(self, text: str) -> str:
        """Replace words with synonyms."""

        import random

        words = text.split()
        augmented_words = []

        for word in words:
            if (word in self.synonym_dict and
                random.random() < self.replacement_prob):
                # replace the word with one of its synonyms
                synonyms = self.synonym_dict[word]
                augmented_word = random.choice(synonyms)
                augmented_words.append(augmented_word)
            else:
                augmented_words.append(word)

        return ' '.join(augmented_words)

class RandomCropAugmenter:
    """Random crop augmenter."""

    def __init__(self, config: Dict[str, Any]):
        self.crop_ratio = config.get('crop_ratio', (0.8, 1.0))

    def augment(self, image: np.ndarray) -> np.ndarray:
        """Apply a random crop."""

        import random

        h, w = image.shape[:2]
        crop_h = int(h * random.uniform(*self.crop_ratio))
        crop_w = int(w * random.uniform(*self.crop_ratio))

        # pick a random crop position
        top = random.randint(0, h - crop_h)
        left = random.randint(0, w - crop_w)

        # crop
        cropped_image = image[top:top+crop_h, left:left+crop_w]

        return cropped_image

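A short usage sketch for the synonym-replacement augmenter above, with a tiny hand-written synonym dictionary and a replacement probability of 1.0 so the effect is deterministic:

python
# Hedged usage sketch for the SynonymReplacementAugmenter class above.
config = {
    "replacement_prob": 1.0,
    "synonym_dict": {"quick": ["fast", "speedy"], "happy": ["glad"]},
}
augmenter = SynonymReplacementAugmenter(config)
print(augmenter.augment("the quick fox looks happy"))
# e.g. "the fast fox looks glad"
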
4.3 Adaptive Batching

python
import logging

import numpy as np

logger = logging.getLogger(__name__)

class AdaptiveBatchProcessor:
    """Adaptive batch processor."""

    def __init__(
        self,
        base_batch_size: int = 32,
        max_batch_size: int = 512,
        memory_limit_mb: int = 4096,
        adaptation_rate: float = 0.1
    ):
        self.base_batch_size = base_batch_size
        self.max_batch_size = max_batch_size
        self.memory_limit_bytes = memory_limit_mb * 1024 * 1024
        self.adaptation_rate = adaptation_rate

        self.current_batch_size = base_batch_size
        self.performance_history = []

    def adaptive_batch_collate(
        self,
        features: List[Dict[str, Any]],
        performance_monitor: Optional[Callable] = None
    ) -> Dict[str, Any]:
        """Adaptive batch collation."""

        # 1. cap the batch at the current adaptive batch size
        if len(features) > self.current_batch_size:
            features = features[:self.current_batch_size]

        # 2. collate
        batch = default_data_collator(features)

        # 3. monitor performance and adjust
        if performance_monitor:
            performance_metrics = performance_monitor(batch)
            self._update_batch_size(performance_metrics)

        return batch

    def _update_batch_size(self, performance_metrics: Dict[str, float]):
        """Adjust the batch size based on performance metrics."""

        # 1. record the metrics
        self.performance_history.append(performance_metrics)

        # 2. analyze the recent trend
        if len(self.performance_history) >= 3:
            recent_performance = self.performance_history[-3:]

            # throughput trend
            throughput_trend = self._analyze_throughput_trend(recent_performance)

            # memory-usage trend
            memory_trend = self._analyze_memory_trend(recent_performance)

            # 3. adjust the batch size
            if throughput_trend == 'decreasing' and memory_trend == 'under_utilized':
                # grow the batch
                new_batch_size = min(
                    int(self.current_batch_size * (1 + self.adaptation_rate)),
                    self.max_batch_size
                )
                self.current_batch_size = new_batch_size
                logger.info(f"Increased batch size to {new_batch_size}")

            elif memory_trend == 'over_utilized':
                # shrink the batch
                new_batch_size = max(
                    int(self.current_batch_size * (1 - self.adaptation_rate)),
                    1
                )
                self.current_batch_size = new_batch_size
                logger.info(f"Decreased batch size to {new_batch_size}")

    def _analyze_throughput_trend(self, performance_history: List[Dict[str, float]]) -> str:
        """Analyze the throughput trend."""

        throughputs = [p.get('throughput', 0) for p in performance_history]

        if len(throughputs) < 2:
            return 'unknown'

        # compare the recent average with the older one
        recent_avg = np.mean(throughputs[-2:])
        older_avg = np.mean(throughputs[:-2]) if len(throughputs) > 2 else throughputs[0]

        if recent_avg > older_avg * 1.05:    # 5% improvement threshold
            return 'increasing'
        elif recent_avg < older_avg * 0.95:  # 5% degradation threshold
            return 'decreasing'
        else:
            return 'stable'

    def _analyze_memory_trend(self, performance_history: List[Dict[str, float]]) -> str:
        """Analyze the memory-usage trend."""

        memory_usage = [p.get('memory_usage_mb', 0) for p in performance_history]

        if not memory_usage:
            return 'unknown'

        avg_memory_usage = np.mean(memory_usage)
        memory_limit_mb = self.memory_limit_bytes / (1024 * 1024)

        utilization_ratio = avg_memory_usage / memory_limit_mb

        if utilization_ratio < 0.7:
            return 'under_utilized'
        elif utilization_ratio > 0.9:
            return 'over_utilized'
        else:
            return 'optimal'

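A hedged sketch of how the adaptation loop behaves: feed the processor three rounds of synthetic performance metrics (falling throughput, plenty of free memory) and it grows its batch size. The metric values are made up.

python
# Hedged sketch exercising the batch-size adaptation logic directly.
processor = AdaptiveBatchProcessor(base_batch_size=32, max_batch_size=128,
                                   memory_limit_mb=1000)

for throughput in (200.0, 180.0, 150.0):   # throughput keeps dropping
    processor._update_batch_size({
        "throughput": throughput,
        "memory_usage_mb": 400.0,           # well below the 1000 MB limit
    })

print(processor.current_batch_size)         # 35 (grew from 32)
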
5. Summary and Outlook

5.1 Architectural Strengths of the Data Processing Module

Through its carefully designed architecture, the Transformers data processing module reflects current best practice in data processing:

1. Modular design: clear separation of concerns, with data collators, metric computation, and dataset handling each doing one job

2. Extensibility: abstract base classes and a plugin mechanism support custom data processing logic

3. Performance optimization: optimization at several levels, from parallel processing to memory management

4. Type safety: strong typing and runtime checks keep data processing reliable

5. Ease of use: unified interfaces and sensible defaults lower the barrier to entry

5.2 Technical Highlights

1. Adaptive batching: batch sizes adjust automatically to hardware performance

2. Dynamic data augmentation: run-time augmentation improves model robustness

3. Memory optimization: smart memory management and streaming support large-scale data

4. Parallel processing: multi-process and multi-thread parallelism makes full use of the hardware

5. Metric system: a rich library of evaluation metrics supports thorough model evaluation

5.3 Future Directions

1. Intelligent preprocessing: AI-driven selection of preprocessing strategies

2. Cloud data processing: native support for cloud storage and distributed data processing

3. Real-time data streams: support for streaming data and online learning

4. Multimodal fusion: more advanced multimodal data processing and feature fusion

5. Automated labeling: integrated automatic data labeling and quality checks

5.4 Best-Practice Recommendations

1. Choose the collator deliberately: pick a data collator that matches the task type (see the short sketch after this list)

2. Watch memory usage: monitor memory closely when processing large-scale data

3. Use parallel processing: enable parallelism whenever the hardware allows it

4. Check data quality: validate and clean the data before training

5. Benchmark regularly: run periodic data-processing benchmarks and optimize accordingly

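As a pointer for recommendation 1 above, the mapping below shows typical task-to-collator choices using collators that ship with the transformers library (as opposed to the simplified sketches earlier in this article); treat it as an illustrative sketch rather than an exhaustive list.

python
# Hedged sketch: typical task-to-collator choices in the transformers library.
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,              # sequence classification
    DataCollatorForTokenClassification,   # NER / token-level labels
    DataCollatorForLanguageModeling,      # masked or causal LM
    DataCollatorForSeq2Seq,               # translation / summarization
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

collators = {
    "sequence_classification": DataCollatorWithPadding(tokenizer=tokenizer),
    "token_classification": DataCollatorForTokenClassification(tokenizer=tokenizer),
    "masked_lm": DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                                 mlm_probability=0.15),
    "seq2seq": DataCollatorForSeq2Seq(tokenizer=tokenizer),
}
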
Through its solid architectural design and rich feature set, the Transformers data processing module gives deep learning models a strong data-processing infrastructure and is an important contributor to the training efficiency and quality of modern AI systems. Its design philosophy is a useful reference for the data processing modules of other machine learning frameworks.
