Python Web Scraping from Scratch [Chapter 9: Hands-On Projects, Section 3] A Reusable Cleaning Toolkit: Dates, Amounts, Units, and Null Values


Contents:

- 🌟 Opening
- 📚 Previous Article Recap
- 🎯 Goals for This Article
- 💡 Six Data-Cleaning Scenarios
  - Scenario Overview
- 🛠️ Complete Demo Code
  - Project Structure
  - Module 1: Date/Time Cleaner
  - Module 2: Number and Amount Cleaner
  - Module 3: Text Cleaner
  - Module 4: Null-Value Handler
  - Module 5: Cleaning Pipeline
  - Module 6: Usage Examples
  - Module 7: Integrating into the Collection Workflow
  - Module 8: Unit Tests
- 📊 Full Run Output
- 📝 Config-File Approach
- 📝 Summary
- 🎯 Next Article Preview
- 🌟 Closing
  - 📌 Column under active update (bookmark and subscribe)
  - ✅ Call for feedback

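🌟 Opening

Hi everyone, I'm 喵手.

Communities I'm active on: C站 (CSDN) / Juejin / Tencent Cloud / Alibaba Cloud / Huawei Cloud / 51CTO

Drop by any time; let's learn and improve together 🌟

I focus on engineering-grade Python crawling and run the column 《Python爬虫实战》 (Python Crawler in Practice). It covers everything from collection strategy to anti-bot countermeasures, and from data cleaning to distributed scheduling, with reusable methods and working case studies. The goal is code that runs, is usable, and can be extended, so that data can be fetched, cleaned, and put to use.

📌 How to use the column

✅ Fundamentals: environment setup / requests and parsing / storing data
✅ Intermediate: login and authentication / dynamic rendering / anti-bot countermeasures
✅ Engineering: async concurrency / distributed scheduling / monitoring and fault tolerance
✅ Projects: data governance / visual analysis / scenario-specific applications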


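📚 Previous Article Recap

The previous article, "API-First Project: Reconstructing JSON API Pagination from the Network Panel", showed how to find endpoints in the Network panel and replay the API requests, replacing heavyweight Playwright with lightweight requests. You can now collect JSON data quickly and efficiently.

That raises a new problem: raw collected data is usually dirty. Time formats vary wildly (`2026-01-23`, `1小时前` meaning "1 hour ago", `1737600000`), amounts carry symbols and units (`¥99.9`, `99万` meaning 990,000), and fields come back empty. Writing such data straight to the database makes later queries and analysis painful.

In this article we build a reusable data-cleaning toolkit that targets the most common dirty-data cases (roughly 90% of them, by the author's estimate) and can be dropped into a project as is.

🎯 Goals for This Article

After reading, you should be able to:

- Handle the six cleaning scenarios (dates, amounts, units, null values, text, deduplication)
- Wrap cleaning logic in general-purpose functions that are configurable and extensible
- Integrate cleaning into the collection workflow (collect → clean → store)
- Work test-first, with test cases for every function

Acceptance criterion: clean data from 3 different sources with this toolkit and reach a field completeness rate above 95%.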
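
The article does not define how field completeness is measured. A minimal sketch, assuming it means the share of (record, field) cells that hold a usable value; the function name `field_completeness` and the `EMPTY_VALUES` tuple are illustrative, not part of the toolkit:

```python
from typing import Any, Dict, Iterable, List

# Values counted as "missing" in this sketch (an assumption; adjust per source).
EMPTY_VALUES = (None, "", "--", "N/A", "null")


def field_completeness(records: Iterable[Dict[str, Any]], fields: List[str]) -> float:
    """Return the share of (record, field) cells that hold a usable value."""
    total = 0
    filled = 0
    for record in records:
        for field in fields:
            total += 1
            value = record.get(field)
            if isinstance(value, str):
                value = value.strip()
            if value not in EMPTY_VALUES:
                filled += 1
    return filled / total if total else 0.0


sample = [
    {"title": "A", "price": 99.9, "publish_time": "2026-01-23"},
    {"title": "B", "price": None, "publish_time": "2026-01-22"},
]
rate = field_completeness(sample, ["title", "price", "publish_time"])
print(f"completeness: {rate:.1%}")  # completeness: 83.3%
```

Five of the six cells are filled, so this sample would fall short of the 95% target.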

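💡 Six Data-Cleaning Scenarios

Scenario Overview

| Scenario | Typical problems | Examples |
| --- | --- | --- |
| Date/time | Inconsistent formats, relative times | `2小时前` (2 hours ago), `1737600000` |
| Amounts and numbers | Currency symbols, magnitude units, Chinese numerals | `¥99.9万`, `一千二百` (1200) |
| Unit conversion | Mixed units | `1.5GB`, `500MB` |
| Null handling | None, empty strings, placeholders | `null`, `--`, `暂无` (not available) |
| Text cleaning | Extra whitespace, special characters | `标题\n`, `&nbsp;` |
| Deduplication | Repeated records, near-identical titles | similarity > 90% |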
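
The modules in this section do not include a deduplication helper. A minimal sketch of the "similarity > 90%" idea from the table, assuming titles are compared with the standard library's `difflib.SequenceMatcher`; the function name `dedupe_by_title` and the sample rows are hypothetical:

```python
from difflib import SequenceMatcher
from typing import Dict, List


def dedupe_by_title(records: List[Dict], threshold: float = 0.9) -> List[Dict]:
    """Keep the first record of each group of near-identical titles."""
    kept: List[Dict] = []
    for record in records:
        title = record.get("title") or ""
        is_dup = any(
            SequenceMatcher(None, title, other.get("title") or "").ratio() > threshold
            for other in kept
        )
        if not is_dup:
            kept.append(record)
    return kept


rows = [
    {"title": "Python crawler tutorial: data cleaning basics"},
    {"title": "Python crawler tutorial: data cleaning basics!"},
    {"title": "SQLite storage guide"},
]
unique = dedupe_by_title(rows)
print(len(unique))  # 2
```

The pairwise comparison is quadratic in the number of kept records, which is fine for a page of results but would need hashing or blocking for large batches.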

🛠️ Complete Demo Code

Project Structure

data_cleaner/
├── src/
│   ├── cleaners/
│   │   ├── __init__.py
│   │   ├── datetime_cleaner.py    # date/time cleaning
│   │   ├── number_cleaner.py      # number cleaning
│   │   ├── text_cleaner.py        # text cleaning
│   │   └── validator.py           # null handling and validation
│   │
│   ├── core/
│   │   ├── pipeline.py            # cleaning pipeline
│   │   └── config.py              # configuration management
│   │
│   └── utils/
│       └── helpers.py             # helper functions
│
├── tests/
│   ├── test_datetime.py
│   ├── test_number.py
│   └── test_text.py
│
├── config/
│   └── cleaning_rules.yaml        # cleaning rules
│
├── examples/
│   ├── demo.py                    # usage examples
│   └── integration_example.py     # collect + clean + store example
│
├── requirements.txt
└── README.md

Module 1: Date/Time Cleaner

# src/cleaners/datetime_cleaner.py
import re
from datetime import datetime, timedelta
from typing import Optional, Union

class DateTimeCleaner:
    """Date/time cleaner."""

    # Common absolute formats, tried in order
    FORMATS = [
        '%Y-%m-%d %H:%M:%S',
        '%Y/%m/%d %H:%M:%S',
        '%Y年%m月%d日 %H:%M:%S',
        '%Y-%m-%d',
        '%Y/%m/%d',
        '%Y年%m月%d日',
        '%m-%d %H:%M',
        '%m/%d %H:%M',
    ]

    def clean(self, value: Union[str, int, float, datetime, None],
              output_format: str = '%Y-%m-%d %H:%M:%S') -> Optional[dict]:
        """
        Clean a date/time value.

        Args:
            value: raw value (string, Unix timestamp, or datetime object)
            output_format: strftime format used for the 'datetime' field

        Returns:
            A dict such as the following (the timestamp depends on the
            local timezone; UTC+8 is shown), or None if the value is
            empty or cannot be parsed:
            {
                'datetime': '2026-01-23 10:00:00',
                'timestamp': 1769133600,
                'date': '2026-01-23',
                'time': '10:00:00',
                'year': 2026, 'month': 1, 'day': 23
            }
        """
        if value is None or value == '':
            return None

        try:
            # 1. datetime objects
            if isinstance(value, datetime):
                return self._format_datetime(value, output_format)

            # 2. Unix timestamps (numbers or all-digit strings)
            if isinstance(value, (int, float)) or (isinstance(value, str) and value.isdigit()):
                timestamp = int(value) if isinstance(value, str) else value

                # Values this large are treated as milliseconds
                if timestamp > 10000000000:
                    timestamp = timestamp / 1000

                # fromtimestamp() converts to the machine's local timezone
                dt = datetime.fromtimestamp(timestamp)
                return self._format_datetime(dt, output_format)

            # 3. Strings
            if isinstance(value, str):
                value = value.strip()

                # Relative times such as "1小时前" (1 hour ago)
                dt = self._parse_relative_time(value)
                if dt:
                    return self._format_datetime(dt, output_format)

                # Absolute formats
                dt = self._parse_standard_format(value)
                if dt:
                    return self._format_datetime(dt, output_format)

            return None

        except Exception as e:
            print(f"⚠️ Failed to parse date: {value}, error: {e}")
            return None
    
    def _parse_relative_time(self, text: str) -> Optional[datetime]:
        """Parse relative times such as 1小时前 (1 hour ago) or 3天前 (3 days ago)."""
        now = datetime.now()

        # 刚刚 / 此刻 / 刚才: "just now"
        if any(word in text for word in ['刚刚', '此刻', '刚才']):
            return now

        # N秒前: N seconds ago
        match = re.search(r'(\d+)\s*秒前', text)
        if match:
            seconds = int(match.group(1))
            return now - timedelta(seconds=seconds)

        # N分钟前: N minutes ago
        match = re.search(r'(\d+)\s*分钟前', text)
        if match:
            minutes = int(match.group(1))
            return now - timedelta(minutes=minutes)

        # N小时前: N hours ago
        match = re.search(r'(\d+)\s*小时前', text)
        if match:
            hours = int(match.group(1))
            return now - timedelta(hours=hours)

        # N天前: N days ago
        match = re.search(r'(\d+)\s*天前', text)
        if match:
            days = int(match.group(1))
            return now - timedelta(days=days)

        # 昨天: yesterday, optionally followed by HH:MM
        if '昨天' in text:
            time_part = re.search(r'(\d{1,2}):(\d{2})', text)
            if time_part:
                hour, minute = int(time_part.group(1)), int(time_part.group(2))
                return (now - timedelta(days=1)).replace(
                    hour=hour, minute=minute, second=0, microsecond=0
                )
            return now - timedelta(days=1)

        # 前天: the day before yesterday
        if '前天' in text:
            return now - timedelta(days=2)

        return None
    
    def _parse_standard_format(self, text: str) -> Optional[datetime]:
        """Parse the known absolute formats."""
        for fmt in self.FORMATS:
            try:
                dt = datetime.strptime(text, fmt)
            except ValueError:
                continue
            # Formats without a year parse as 1900; assume the current year
            if '%Y' not in fmt:
                dt = dt.replace(year=datetime.now().year)
            return dt

        # Fall back to dateutil if it is installed
        try:
            from dateutil import parser
            return parser.parse(text)
        except (ImportError, ValueError, OverflowError):
            return None

    def _format_datetime(self, dt: datetime, output_format: str) -> dict:
        """Build the output dict."""
        return {
            'datetime': dt.strftime(output_format),
            'timestamp': int(dt.timestamp()),
            'date': dt.strftime('%Y-%m-%d'),
            'time': dt.strftime('%H:%M:%S'),
            'year': dt.year,
            'month': dt.month,
            'day': dt.day
        }
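
The cleaner above relies on `datetime.fromtimestamp`, which converts to whatever timezone the machine is set to, so the same timestamp can yield different strings on different servers. A small sketch of pinning the timezone explicitly, assuming the sources publish Beijing time (UTC+8); `CST` and `timestamp_to_local` are illustrative names, not part of the toolkit:

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset; an assumption for sources that publish Beijing time.
CST = timezone(timedelta(hours=8))


def timestamp_to_local(ts: float, tz: timezone = CST) -> datetime:
    """Convert a second- or millisecond-level Unix timestamp to an aware datetime."""
    if ts > 10_000_000_000:  # same heuristic as the cleaner: treat as milliseconds
        ts = ts / 1000
    return datetime.fromtimestamp(ts, tz=tz)


dt = timestamp_to_local(1769133600)
print(dt.strftime("%Y-%m-%d %H:%M:%S"))         # 2026-01-23 10:00:00
print(timestamp_to_local(1769133600000) == dt)  # True
```

Because the result carries its offset, `dt.timestamp()` round-trips to the original value regardless of where the code runs.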

Module 2: Number and Amount Cleaner

# src/cleaners/number_cleaner.py
import re
from typing import Optional, Union

class NumberCleaner:
    """Number and amount cleaner."""

    # Chinese numeral mapping
    CN_NUM = {
        '零': 0, '一': 1, '二': 2, '三': 3, '四': 4,
        '五': 5, '六': 6, '七': 7, '八': 8, '九': 9,
        '十': 10, '百': 100, '千': 1000, '万': 10000,
        '亿': 100000000
    }

    def clean_number(self, value: Union[str, int, float, None],
                     keep_unit: bool = False) -> Optional[Union[float, dict]]:
        """
        Clean a number.

        Args:
            value: raw value
            keep_unit: whether to return the unit information as well

        Returns:
            float: the number (keep_unit=False)
            dict: e.g. {'value': 15000.0, 'unit': '万', 'original': '1.5万'}
            None: if the value is empty or cannot be parsed
        """
        if value is None or value == '':
            return None

        try:
            # Already numeric
            if isinstance(value, (int, float)):
                return value if not keep_unit else {'value': float(value), 'unit': None}

            raw = str(value).strip()

            # Strip thousands separators and currency marks
            text = raw.replace(',', '').replace('，', '')
            text = text.replace('$', '').replace('¥', '').replace('￥', '')
            text = text.replace('元', '').replace('块', '')

            # Split off a magnitude or byte unit (万, 亿, K, M, ...)
            number, unit, multiplier = self._extract_unit(text)

            # Try a plain numeric conversion first
            try:
                result = float(number) * multiplier
            except ValueError:
                result = None

            # Fall back to Chinese numerals such as 一千二百
            if result is None:
                result = self._parse_chinese_number(text)
                unit = None

            if result is None:
                return None
            if keep_unit:
                # 'original' is the untouched input, not the stripped text
                return {'value': result, 'unit': unit, 'original': raw}
            return result

        except Exception as e:
            print(f"⚠️ Failed to parse number: {value}, error: {e}")
            return None
    
    def _extract_unit(self, text: str) -> tuple:
        """Split text into (number_part, unit, multiplier)."""
        # Chinese magnitude units: 亿 (10^8) and 万 (10^4)
        if '亿' in text:
            return text.replace('亿', ''), '亿', 100000000

        if '万' in text:
            return text.replace('万', ''), '万', 10000

        upper = text.upper()

        # Byte units first: they also end in "B", so they must be checked
        # before the single-letter K / M / B suffixes
        for unit, multiplier in (('GB', 1024 ** 3), ('MB', 1024 ** 2), ('KB', 1024)):
            if upper.endswith(unit):
                return text[:-2], unit, multiplier

        # English magnitude suffixes: K (thousand), M (million), B (billion)
        for unit, multiplier in (('K', 1000), ('M', 1000000), ('B', 1000000000)):
            if upper.endswith(unit):
                return text[:-1], unit, multiplier

        return text, None, 1
    
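    def _parse_chinese_number(self, text: str) -> Optional[float]:
        """Parse Chinese numerals such as 一千二百 (1200); None if none are found."""
        total = 0      # completed 万 / 亿 sections
        section = 0    # value accumulated inside the current section
        digit = 0      # most recent digit, waiting for its unit
        seen = False

        for char in text:
            if char not in self.CN_NUM:
                continue
            seen = True
            num = self.CN_NUM[char]

            if num < 10:
                digit = num
            elif num < 10000:
                # 十 / 百 / 千 scale the pending digit (a bare 十 means 10)
                section += (digit or 1) * num
                digit = 0
            else:
                # 万 / 亿 scale and close the current section
                section = ((section + digit) or 1) * num
                total += section
                section = 0
                digit = 0

        value = total + section + digit
        return value if seen and value > 0 else None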
    
    def clean_price(self, value: Union[str, int, float, None]) -> Optional[float]:
        """Clean a price (alias for clean_number without unit info)."""
        return self.clean_number(value, keep_unit=False)

    def clean_count(self, value: Union[str, int, float, None]) -> Optional[int]:
        """Clean a count (returns an integer)."""
        result = self.clean_number(value, keep_unit=False)
        return int(result) if result is not None else None
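
The cleaner returns floats, which is usually fine for view counts but can introduce rounding noise in monetary amounts. As an alternative sketch (not the article's implementation), prices can be parsed into `decimal.Decimal`; the names `parse_price_decimal`, `UNIT_MULTIPLIERS` and `CURRENCY_MARKS` are hypothetical, and only the 万 / 亿 suffixes are handled:

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

# Multipliers for the Chinese magnitude suffixes used in the article.
UNIT_MULTIPLIERS = {"万": Decimal(10_000), "亿": Decimal(100_000_000)}
# Characters stripped before parsing (currency marks, separators, spaces).
CURRENCY_MARKS = "¥￥$元,， "


def parse_price_decimal(raw: str) -> Optional[Decimal]:
    """Parse strings such as '¥99.9万' into an exact Decimal amount."""
    text = raw.strip()
    for mark in CURRENCY_MARKS:
        text = text.replace(mark, "")
    multiplier = Decimal(1)
    if text and text[-1] in UNIT_MULTIPLIERS:
        multiplier = UNIT_MULTIPLIERS[text[-1]]
        text = text[:-1]
    try:
        return Decimal(text) * multiplier
    except InvalidOperation:
        return None  # not a number, e.g. a placeholder like '暂无'


print(parse_price_decimal("¥99.9万"))     # 999000.0
print(parse_price_decimal("1,234.50元"))  # 1234.50
print(parse_price_decimal("暂无"))         # None
```

Decimal values can be stored as text or as integer cents, depending on what the database layer expects.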

Module 3: Text Cleaner

# src/cleaners/text_cleaner.py
import re
import html
from typing import Optional

class TextCleaner:
    """Text cleaner."""

    # HTML entity mapping
    HTML_ENTITIES = {
        '&nbsp;': ' ',
        '&lt;': '<',
        '&gt;': '>',
        '&amp;': '&',
        '&quot;': '"',
        '&#39;': "'",
    }
    
    def clean_text(self, text: object,
                   strip: bool = True,
                   remove_html: bool = True,
                   normalize_space: bool = True,
                   remove_emoji: bool = False) -> Optional[str]:
        """
        Clean a piece of text.

        Args:
            text: raw text (non-strings are converted with str())
            strip: trim leading and trailing whitespace
            remove_html: remove HTML tags
            normalize_space: collapse runs of whitespace into one space
            remove_emoji: remove emoji
        """
        if text is None or text == '':
            return None

        text = str(text)

        # Remove HTML tags
        if remove_html:
            text = self._remove_html_tags(text)

        # Decode HTML entities; the explicit map then catches entities
        # that were double-escaped in the source (e.g. &amp;nbsp;)
        text = html.unescape(text)
        for entity, char in self.HTML_ENTITIES.items():
            text = text.replace(entity, char)

        # Remove emoji
        if remove_emoji:
            text = self._remove_emoji(text)

        # Collapse whitespace (including newlines and non-breaking spaces)
        if normalize_space:
            text = re.sub(r'\s+', ' ', text)

        # Trim both ends
        if strip:
            text = text.strip()

        # An empty result is reported as None
        return text if text else None

    def _remove_html_tags(self, text: str) -> str:
        """Remove HTML tags."""
        # Drop <script> and <style> blocks together with their contents
        text = re.sub(r'<script[^>]*>.*?</script>', '', text, flags=re.DOTALL | re.IGNORECASE)
        text = re.sub(r'<style[^>]*>.*?</style>', '', text, flags=re.DOTALL | re.IGNORECASE)

        # Drop every remaining tag
        text = re.sub(r'<[^>]+>', '', text)

        return text

    def _remove_emoji(self, text: str) -> str:
        """Remove emoji (covers the most common blocks only)."""
        emoji_pattern = re.compile(
            "["
            "\U0001F600-\U0001F64F"  # emoticons
            "\U0001F300-\U0001F5FF"  # symbols and pictographs
            "\U0001F680-\U0001F6FF"  # transport and map symbols
            "\U0001F1E0-\U0001F1FF"  # flags
            "]+", flags=re.UNICODE
        )
        return emoji_pattern.sub('', text)
    
    def clean_title(self, title: object, max_length: int = 200) -> Optional[str]:
        """
        Clean a title.

        Args:
            title: raw title
            max_length: number of characters kept before '...' is appended
        """
        title = self.clean_text(title, remove_html=True, normalize_space=True)

        if title and len(title) > max_length:
            title = title[:max_length] + '...'

        return title

    def clean_content(self, content: object, min_length: int = 10) -> Optional[str]:
        """
        Clean body text.

        Args:
            content: raw body text
            min_length: minimum length; shorter results are discarded as None
        """
        content = self.clean_text(content, remove_html=True, normalize_space=True)

        if content and len(content) < min_length:
            return None

        return content

    def normalize_url(self, url: object) -> Optional[str]:
        """Normalize a URL."""
        if not url:
            return None

        url = str(url).strip()

        # Remove embedded spaces
        url = url.replace(' ', '')

        # Make sure there is a scheme
        if url and not url.startswith(('http://', 'https://')):
            url = 'https://' + url

        return url if url else None
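
The regex `<[^>]+>` used by `_remove_html_tags` is compact, but it can misfire when an attribute value contains `>`. A sketch of an alternative built on the standard library's `html.parser.HTMLParser`; `_TextExtractor` and `strip_html` are illustrative names, not part of the toolkit:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes while skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # entities are decoded for us
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # split() with no arguments also splits on non-breaking spaces
        return " ".join("".join(self._chunks).split())


def strip_html(markup: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


print(strip_html('<p title="a > b">Hello&nbsp;<b>world</b></p><script>var x = 1;</script>'))
# Hello world
```

For heavily malformed pages a dedicated parser such as lxml or BeautifulSoup may still be the more robust choice.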

Module 4: Null-Value Handler

# src/cleaners/validator.py
from typing import Any, Optional, List

class NullHandler:
    """Null-value handler."""

    # Placeholders treated as null (暂无 = not available, 无 = none, 待定 = TBD)
    NULL_PLACEHOLDERS = [
        None, '', 'null', 'NULL', 'None', 'none',
        'N/A', 'n/a', 'NA', 'na',
        '--', '---', '------', '暂无', '无',
        '待定', 'TBD', 'tbd'
    ]

    def is_null(self, value: Any) -> bool:
        """Return True if the value counts as null."""
        if value in self.NULL_PLACEHOLDERS:
            return True

        # Whitespace-only strings
        if isinstance(value, str) and value.strip() == '':
            return True

        # Empty lists and dicts
        if isinstance(value, (list, dict)) and len(value) == 0:
            return True

        return False

    def fill_null(self, value: Any, default: Any = None) -> Any:
        """Replace a null value with the default."""
        return default if self.is_null(value) else value

    def remove_nulls(self, data: dict) -> dict:
        """Drop the null-valued fields from a dict."""
        return {k: v for k, v in data.items() if not self.is_null(v)}

    def validate_required(self, data: dict, required_fields: List[str]) -> tuple:
        """
        Check the required fields.

        Returns:
            tuple: (is_valid, missing_fields)
        """
        missing = []

        for field in required_fields:
            if field not in data or self.is_null(data[field]):
                missing.append(field)

        return len(missing) == 0, missing
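
`is_null` matches placeholders exactly, so variants such as `' Null '` or `'N/a'` slip through unless every spelling is listed. A sketch of a normalized comparison, assuming placeholders are stored lowercase; `NULL_TOKENS` is an illustrative subset and `is_null_normalized` is a hypothetical name:

```python
from typing import Any

# Placeholders stored lowercase; an illustrative subset of the list above.
NULL_TOKENS = frozenset({"", "null", "none", "n/a", "na", "--", "---", "暂无", "无", "tbd"})


def is_null_normalized(value: Any) -> bool:
    """Treat None, empty containers and placeholder strings (any case, padded) as null."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip().casefold() in NULL_TOKENS
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False


for sample in [" Null ", "N/a", "暂无", 0, [], "0"]:
    print(repr(sample), is_null_normalized(sample))
# ' Null ' True
# 'N/a' True
# '暂无' True
# 0 False
# [] True
# '0' False
```

Note that `0` and `'0'` stay non-null, which matters for fields such as view counts. Short tokens like `na` can also be legitimate values in some datasets, so the set is worth reviewing per source.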

Module 5: Cleaning Pipeline

# src/core/pipeline.py
from typing import Dict, Any, List, Callable
from ..cleaners.datetime_cleaner import DateTimeCleaner
from ..cleaners.number_cleaner import NumberCleaner
from ..cleaners.text_cleaner import TextCleaner
from ..cleaners.validator import NullHandler

class CleaningPipeline:
    """Data-cleaning pipeline."""

    def __init__(self):
        self.datetime_cleaner = DateTimeCleaner()
        self.number_cleaner = NumberCleaner()
        self.text_cleaner = TextCleaner()
        self.null_handler = NullHandler()

        # Cleaning rules, keyed by field name
        self.rules = {}

    def add_rule(self, field: str, cleaner_type: str, **kwargs):
        """
        Register a cleaning rule.

        Args:
            field: field name
            cleaner_type: cleaner type (datetime/number/price/count/
                text/title/content/url/custom)
            **kwargs: cleaner parameters; 'default' is the fallback
                used when the cleaned value is null
        """
        self.rules[field] = {
            'type': cleaner_type,
            'params': kwargs
        }

        return self  # allows chained calls
    
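    def clean(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Run the cleaning rules over one record.

        Args:
            data: raw record

        Returns:
            dict: cleaned record
        """
        cleaned = {}

        for field, value in data.items():
            # Apply the rule if the field has one
            if field in self.rules:
                rule = self.rules[field]
                cleaned_value = self._apply_rule(value, rule)
            else:
                # No rule: keep the original value
                cleaned_value = value

            # Replace null results with the configured default
            if field in self.rules and 'default' in self.rules[field]['params']:
                default = self.rules[field]['params']['default']
                cleaned_value = self.null_handler.fill_null(cleaned_value, default)

            cleaned[field] = cleaned_value

        # Fields that have a default but are absent from the input
        for field, rule in self.rules.items():
            if field not in cleaned and 'default' in rule['params']:
                cleaned[field] = rule['params']['default']

        return cleaned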
    
    def _apply_rule(self, value: Any, rule: Dict) -> Any:
        """Apply one cleaning rule."""
        cleaner_type = rule['type']
        # 'default' is handled by clean(); the cleaners do not accept it
        params = {k: v for k, v in rule['params'].items() if k != 'default'}

        if cleaner_type == 'datetime':
            return self.datetime_cleaner.clean(value, **params)

        elif cleaner_type == 'number':
            return self.number_cleaner.clean_number(value, **params)

        elif cleaner_type == 'price':
            return self.number_cleaner.clean_price(value)

        elif cleaner_type == 'count':
            return self.number_cleaner.clean_count(value)

        elif cleaner_type == 'text':
            return self.text_cleaner.clean_text(value, **params)

        elif cleaner_type == 'title':
            return self.text_cleaner.clean_title(value, **params)

        elif cleaner_type == 'content':
            return self.text_cleaner.clean_content(value, **params)

        elif cleaner_type == 'url':
            return self.text_cleaner.normalize_url(value)

        elif cleaner_type == 'custom':
            # User-supplied cleaning function
            func = params.get('func')
            if callable(func):
                return func(value)

        return value
    
    def batch_clean(self, data_list: List[Dict]) -> List[Dict]:
        """Clean a list of records."""
        return [self.clean(data) for data in data_list]

    def validate(self, data: Dict, required_fields: List[str] = None) -> tuple:
        """
        Clean a record and check that the required fields are present.

        Returns:
            tuple: (is_valid, missing_fields, cleaned_data)
        """
        cleaned_data = self.clean(data)

        if required_fields:
            is_valid, missing = self.null_handler.validate_required(
                cleaned_data, required_fields
            )
            return is_valid, missing, cleaned_data

        return True, [], cleaned_data
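
The if/elif chain in `_apply_rule` grows with every new cleaner type. One possible variation, shown here as a self-contained sketch rather than the article's design, is a registry dict that maps type names to callables. The two cleaner functions below are simplified stand-ins, and `REGISTRY` and `apply_rule` are hypothetical names:

```python
from typing import Any, Callable, Dict

CleanerFunc = Callable[..., Any]


# Simplified stand-ins for the real cleaner methods.
def clean_text(value: Any, strip: bool = True) -> Any:
    return value.strip() if strip and isinstance(value, str) else value


def clean_count(value: Any) -> Any:
    return int(str(value).replace(",", "")) if value not in (None, "") else None


REGISTRY: Dict[str, CleanerFunc] = {"text": clean_text, "count": clean_count}


def apply_rule(value: Any, rule: Dict[str, Any]) -> Any:
    """Look up the cleaner by type; 'default' is handled here, not passed on."""
    all_params = rule.get("params", {})
    params = {k: v for k, v in all_params.items() if k != "default"}
    cleaner = REGISTRY.get(rule["type"])
    result = cleaner(value, **params) if cleaner else value
    if result in (None, "") and "default" in all_params:
        return all_params["default"]
    return result


print(apply_rule("  hi ", {"type": "text", "params": {}}))               # hi
print(apply_rule("1,234", {"type": "count", "params": {"default": 0}}))  # 1234
print(apply_rule("", {"type": "count", "params": {"default": 0}}))       # 0
```

Adding a cleaner type then means adding one registry entry instead of another branch.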

Module 6: Usage Examples

# examples/demo.py
from src.core.pipeline import CleaningPipeline
from src.cleaners.datetime_cleaner import DateTimeCleaner
from src.cleaners.number_cleaner import NumberCleaner
from src.cleaners.text_cleaner import TextCleaner

def demo_basic():
    """Basic usage."""
    print("="*60)
    print("Basic cleaning demo")
    print("="*60)

    # 1. Date cleaning
    dt_cleaner = DateTimeCleaner()

    print("\n[Date cleaning]")
    test_dates = [
        "2026-01-23 10:00:00",
        "1小时前",       # "1 hour ago"
        "昨天 15:30",    # "yesterday 15:30"
        1737603600,      # Unix timestamp in seconds
        "2026/01/23"
    ]

    for date_str in test_dates:
        result = dt_cleaner.clean(date_str)
        # str() keeps the integer timestamp left-aligned like the strings
        print(f"  {str(date_str):20} → {result}")

    # 2. Number cleaning
    num_cleaner = NumberCleaner()

    print("\n[Number cleaning]")
    test_numbers = [
        "¥99.9",
        "1.5万",         # 15,000
        "300亿",         # 30 billion
        "1,234,567",
        "一千二百",      # 1200 in Chinese numerals
        "1.5GB"
    ]

    for num_str in test_numbers:
        result = num_cleaner.clean_number(num_str, keep_unit=True)
        print(f"  {num_str:15} → {result}")

    # 3. Text cleaning
    text_cleaner = TextCleaner()

    print("\n[Text cleaning]")
    test_texts = [
        "  标题含有\n多余空格  ",       # title with stray whitespace
        "<p>HTML标签</p>",              # HTML tags
        "包含&nbsp;实体",               # HTML entity
        "https://example.com/news/123"
    ]

    for text in test_texts:
        result = text_cleaner.clean_text(text)
        print(f"  '{text:30}' → '{result}'")

def demo_pipeline():
    """Pipeline usage."""
    print("\n" + "="*60)
    print("Pipeline cleaning demo")
    print("="*60)

    # Create the pipeline
    pipeline = CleaningPipeline()

    # Register the cleaning rules
    pipeline.add_rule('title', 'title', max_length=100) \
            .add_rule('publish_time', 'datetime', output_format='%Y-%m-%d %H:%M:%S') \
            .add_rule('price', 'price') \
            .add_rule('view_count', 'count') \
            .add_rule('content', 'content', min_length=20) \
            .add_rule('url', 'url')

    # Simulated raw record
    raw_data = {
        'title': '  <b>新闻标题</b>含有HTML  ',
        'publish_time': '2小时前',
        'price': '¥99.9万',
        'view_count': '1,234,567',
        # Cleans to fewer than min_length=20 characters, so it becomes None
        'content': '  这是正文内容...\n\n  ',
        'url': 'example.com/news/123'
    }

    print("\n[Raw data]")
    for k, v in raw_data.items():
        print(f"  {k:15}: {v}")

    # Clean
    cleaned_data = pipeline.clean(raw_data)

    print("\n[Cleaned]")
    for k, v in cleaned_data.items():
        print(f"  {k:15}: {v}")

    # Validate
    required_fields = ['title', 'publish_time', 'price']
    is_valid, missing, _ = pipeline.validate(raw_data, required_fields)

    print("\n[Validation]")
    print(f"  valid: {is_valid}")
    print(f"  missing fields: {missing if missing else 'none'}")

def demo_batch():
    """Batch cleaning."""
    print("\n" + "="*60)
    print("Batch cleaning demo")
    print("="*60)

    # Create the pipeline
    pipeline = CleaningPipeline()

    # Register the cleaning rules
    pipeline.add_rule('title', 'title') \
            .add_rule('price', 'price') \
            .add_rule('time', 'datetime')

    # Simulated raw records
    raw_list = [
        {'title': '  商品A  ', 'price': '¥99', 'time': '1小时前'},
        {'title': '<p>商品B</p>', 'price': '199元', 'time': '昨天'},
        {'title': '商品C', 'price': '2.5万', 'time': '2026-01-23'},
    ]

    print("\n[Raw data]")
    for i, item in enumerate(raw_list, 1):
        print(f"  {i}. {item}")

    # Clean the whole batch
    cleaned_list = pipeline.batch_clean(raw_list)

    print("\n[Cleaned]")
    for i, item in enumerate(cleaned_list, 1):
        print(f"  {i}. {item}")

    # Summary
    print(f"\nDone: processed {len(cleaned_list)} records")

def demo_custom_cleaner():
    """Custom cleaning function."""
    print("\n" + "="*60)
    print("Custom cleaner demo")
    print("="*60)

    # Custom cleaning function: extract the article category
    def extract_category(value):
        """Extract the category tag from a title."""
        if not value:
            return None

        # Match a category written as 【xxx】
        import re
        match = re.search(r'【(.+?)】', value)
        if match:
            return match.group(1)

        # Match a category written as [xxx]
        match = re.search(r'\[(.+?)\]', value)
        if match:
            return match.group(1)

        return '未分类'  # "uncategorized"

    # Create the pipeline and register the custom rule
    pipeline = CleaningPipeline()
    pipeline.add_rule('title', 'title') \
            .add_rule('category', 'custom', func=extract_category)

    # Test record
    raw_data = {
        'title': '【科技】人工智能新突破',
        'category': '【科技】人工智能新突破'  # the category is extracted from the title text
    }

    print("\n[Raw data]")
    print(f"  {raw_data}")

    cleaned = pipeline.clean(raw_data)

    print("\n[Cleaned]")
    print(f"  {cleaned}")

if __name__ == '__main__':
    # Run every demo
    demo_basic()
    demo_pipeline()
    demo_batch()
    demo_custom_cleaner()

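Module 7: Integrating into the Collection Workflow

# examples/integration_example.py
"""
End-to-end collect + clean example.
Shows how to use the cleaning toolkit inside a real crawler project.
"""

import requests
from src.core.pipeline import CleaningPipeline
import sqlite3

class NewsSpiderWithCleaning:
    """News crawler with built-in cleaning."""

    def __init__(self):
        self.session = requests.Session()
        self.db_conn = sqlite3.connect('news.db')

        # Make sure the target table exists. The schema is an assumption
        # (the article does not show one); url is UNIQUE so that
        # INSERT OR IGNORE can skip articles that are already stored.
        self.db_conn.execute("""
            CREATE TABLE IF NOT EXISTS articles (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title TEXT NOT NULL,
                author TEXT,
                publish_time TEXT,
                publish_timestamp INTEGER,
                view_count INTEGER,
                content TEXT,
                url TEXT UNIQUE
            )
        """)

        # Set up the cleaning pipeline
        self.pipeline = self._setup_cleaning_pipeline()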
    
    def _setup_cleaning_pipeline(self):
        """
        Configure the cleaning pipeline.

        Notes:
        1. Each field is assigned a cleaner type
        2. Cleaner parameters (max length, default value) are set per field
        3. Chained calls keep the configuration compact
        """
        pipeline = CleaningPipeline()

        # Cleaning rules per field
        pipeline.add_rule('title', 'title', max_length=200) \
                .add_rule('author', 'text', default='Unknown author') \
                .add_rule('publish_time', 'datetime', output_format='%Y-%m-%d %H:%M:%S') \
                .add_rule('view_count', 'count', default=0) \
                .add_rule('content', 'content', min_length=50) \
                .add_rule('price', 'price') \
                .add_rule('url', 'url')

        return pipeline
    
    def fetch_and_clean(self, api_url):
        """
        Collect and clean data.

        Steps:
        1. Request the API for raw data
        2. Extract the item list
        3. Clean each item
        4. Check the required fields
        5. Save to the database
        """
        print(f"🔍 Fetching: {api_url}")

        try:
            # Step 1: request the API
            resp = self.session.get(api_url, timeout=10)
            resp.raise_for_status()
            data = resp.json()

            # Step 2: extract the item list (assumes a {"data": {"list": [...]}} payload)
            items = data.get('data', {}).get('list', [])
            print(f"   Received {len(items)} raw items")

            # Steps 3 and 4: clean and validate item by item
            cleaned_items = []
            failed_items = []

            for item in items:
                # validate() cleans the item and checks the required fields
                is_valid, missing, cleaned = self.pipeline.validate(
                    item,
                    required_fields=['title', 'publish_time']
                )

                if is_valid:
                    cleaned_items.append(cleaned)
                else:
                    failed_items.append({
                        'data': item,
                        'missing': missing
                    })
                    print(f"   ⚠️ Validation failed: missing fields {missing}")

            print(f"   ✅ Cleaned: {len(cleaned_items)} items")
            print(f"   ❌ Failed validation: {len(failed_items)} items")

            # Step 5: save to the database
            saved_count = self._save_to_db(cleaned_items)
            print(f"   💾 Stored: {saved_count} items")

            return {
                'success': cleaned_items,
                'failed': failed_items
            }

        except Exception as e:
            print(f"   ❌ Fetch failed: {e}")
            return None
    
    def _save_to_db(self, items):
        """
        Save cleaned items to the database.

        Notes:
        1. Field values are already standardized by the pipeline
        2. Time fields are already in a uniform format
        3. Items can be stored directly, with no second pass
        """
        saved = 0

        for item in items:
            try:
                # The datetime cleaner returns a dict; pull out both forms
                publish_time = item.get('publish_time')
                if isinstance(publish_time, dict):
                    publish_time_str = publish_time.get('datetime')
                    publish_timestamp = publish_time.get('timestamp')
                else:
                    publish_time_str = publish_time
                    publish_timestamp = None

                # Insert the row
                cursor = self.db_conn.execute("""
                    INSERT OR IGNORE INTO articles (
                        title, author, publish_time, publish_timestamp,
                        view_count, content, url
                    )
                    VALUES (?, ?, ?, ?, ?, ?, ?)
                """, (
                    item.get('title'),
                    item.get('author'),
                    publish_time_str,
                    publish_timestamp,
                    item.get('view_count'),
                    item.get('content'),
                    item.get('url')
                ))

                # rowcount is 0 when OR IGNORE skipped the row
                saved += cursor.rowcount

            except Exception as e:
                print(f"   ⚠️ Insert failed: {e}")

        self.db_conn.commit()
        return saved

# Usage
if __name__ == '__main__':
    spider = NewsSpiderWithCleaning()

    # Placeholder API endpoint
    api_url = 'https://api.example.com/news/list'

    result = spider.fetch_and_clean(api_url)

    if result:
        print("\nDone:")
        print(f"  succeeded: {len(result['success'])} items")
        print(f"  failed: {len(result['failed'])} items")
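
`INSERT OR IGNORE` only skips rows that violate a constraint, so it deduplicates only if the table declares one. A self-contained sketch using an in-memory database, assuming the URL is the natural key; the reduced `articles` schema and the sample rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS articles (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL,
        publish_time TEXT,
        url TEXT UNIQUE  -- INSERT OR IGNORE skips rows that repeat this value
    )
    """
)

rows = [
    ("News A", "2026-01-23 10:00:00", "https://example.com/news/1"),
    ("News A (repost)", "2026-01-23 11:00:00", "https://example.com/news/1"),
    ("News B", "2026-01-23 12:00:00", "https://example.com/news/2"),
]

inserted = 0
for row in rows:
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (title, publish_time, url) VALUES (?, ?, ?)",
        row,
    )
    inserted += cur.rowcount  # 1 for a new row, 0 when the row was ignored
conn.commit()

stored = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(inserted, stored)  # 2 2
```

Counting with `rowcount` keeps the reported number in line with what was actually stored.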

Module 8: Unit Tests

# tests/test_datetime.py
"""
Tests for the date/time cleaner.

Notes:
1. The tests use the unittest framework
2. They cover a range of edge cases
3. They guard the correctness of the cleaning logic
"""

import unittest
from src.cleaners.datetime_cleaner import DateTimeCleaner
from datetime import datetime, timedelta

class TestDateTimeCleaner(unittest.TestCase):
    """Test cases for DateTimeCleaner."""

    def setUp(self):
        """Create a cleaner instance before each test."""
        self.cleaner = DateTimeCleaner()

    def test_standard_format(self):
        """Standard date formats are parsed correctly."""
        test_cases = [
            ('2026-01-23 10:00:00', '2026-01-23 10:00:00'),
            ('2026/01/23 10:00:00', '2026-01-23 10:00:00'),
            ('2026年01月23日 10:00:00', '2026-01-23 10:00:00'),
        ]

        for input_val, expected in test_cases:
            with self.subTest(input=input_val):
                result = self.cleaner.clean(input_val)
                self.assertIsNotNone(result)
                self.assertEqual(result['datetime'], expected)

    def test_relative_time(self):
        """Relative times resolve against the current time."""
        # "1小时前" (1 hour ago) should land about one hour before now
        result = self.cleaner.clean('1小时前')
        self.assertIsNotNone(result)

        # Allow up to 2 minutes of drift between parsing and asserting
        parsed_time = datetime.fromisoformat(result['datetime'])
        expected_time = datetime.now() - timedelta(hours=1)
        diff = abs((parsed_time - expected_time).total_seconds())
        self.assertLess(diff, 120)

    def test_timestamp(self):
        """Second- and millisecond-level timestamps round-trip."""
        # Seconds
        timestamp = 1737603600
        result = self.cleaner.clean(timestamp)
        self.assertIsNotNone(result)
        self.assertEqual(result['timestamp'], timestamp)

        # Milliseconds
        ms_timestamp = timestamp * 1000
        result = self.cleaner.clean(ms_timestamp)
        self.assertIsNotNone(result)
        self.assertEqual(result['timestamp'], timestamp)

    def test_null_values(self):
        """Empty inputs return None."""
        null_values = [None, '', '   ']

        for null_val in null_values:
            with self.subTest(input=null_val):
                result = self.cleaner.clean(null_val)
                self.assertIsNone(result)

if __name__ == '__main__':
    unittest.main()

python 复制代码

# tests/test_number.py
"""数字清洗器测试"""

import unittest
from src.cleaners.number_cleaner import NumberCleaner

class TestNumberCleaner(unittest.TestCase):
    
    def setUp(self):
        self.cleaner = NumberCleaner()
    
    def test_simple_numbers(self):
        """测试简单数字"""
        test_cases = [
            ('123', 123.0),
            ('123.45', 123.45),
            ('1,234,567', 1234567.0),
        ]
        
        for input_val, expected in test_cases:
            with self.subTest(input=input_val):
                result = self.cleaner.clean_number(input_val)
                self.assertEqual(result, expected)
    
    def test_units(self):
        """Test unit conversion."""
        test_cases = [
            ('1万', 10000.0),
            ('1.5万', 15000.0),
            ('2亿', 200000000.0),
            ('1K', 1000.0),
            ('1.5M', 1500000.0),
        ]
        
        for input_val, expected in test_cases:
            with self.subTest(input=input_val):
                result = self.cleaner.clean_number(input_val)
                self.assertAlmostEqual(result, expected, places=2)
    
    def test_price(self):
        """Test price cleaning."""
        test_cases = [
            ('¥99.9', 99.9),
            ('$100', 100.0),
            ('199元', 199.0),
        ]
        
        for input_val, expected in test_cases:
            with self.subTest(input=input_val):
                result = self.cleaner.clean_price(input_val)
                self.assertEqual(result, expected)

if __name__ == '__main__':
    unittest.main()
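
To make it easier to see what these test cases expect, here is a small self-contained sketch of the "strip symbols, then apply a unit multiplier" logic. `parse_number` and `UNIT_MULTIPLIERS` are hypothetical names for this illustration only; the real `NumberCleaner` may handle more units and formats (Chinese numerals or storage units, for example).

```python
import re

# Multipliers implied by the test cases above; an assumption, not the full set.
UNIT_MULTIPLIERS = {'万': 1e4, '亿': 1e8, 'K': 1e3, 'M': 1e6}

def parse_number(text):
    """Hypothetical stand-in for clean_number: strip symbols, apply the unit."""
    # Drop currency symbols, thousands separators, whitespace and the '元' suffix
    cleaned = re.sub(r'[¥$,\s元]', '', str(text))
    match = re.fullmatch(r'(-?\d+(?:\.\d+)?)(万|亿|K|M)?', cleaned)
    if match is None:
        return None  # not a number this sketch recognizes
    value = float(match.group(1))
    unit = match.group(2)
    return value * UNIT_MULTIPLIERS[unit] if unit else value

print(parse_number('1.5万'))       # 15000.0
print(parse_number('1,234,567'))   # 1234567.0
print(parse_number('¥99.9'))       # 99.9
```

Returning `None` for unrecognized input, instead of raising, keeps a batch run going when a single field is malformed; whether that is the right choice depends on how strict your pipeline needs to be.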

📊 Full run output

bash

$ python examples/demo.py

==================================================
基础清洗示例
==================================================

【日期清洗】
  2026-01-23 10:00:00  → {'datetime': '2026-01-23 10:00:00', 'timestamp': 1769133600, 'date': '2026-01-23', 'time': '10:00:00', 'year': 2026, 'month': 1, 'day': 23}
  1小时前               → {'datetime': '2026-01-23 09:00:00', 'timestamp': 1769130000, 'date': '2026-01-23', 'time': '09:00:00', 'year': 2026, 'month': 1, 'day': 23}
  昨天 15:30           → {'datetime': '2026-01-22 15:30:00', 'timestamp': 1769067000, 'date': '2026-01-22', 'time': '15:30:00', 'year': 2026, 'month': 1, 'day': 22}
  1737603600           → {'datetime': '2025-01-23 11:40:00', 'timestamp': 1737603600, 'date': '2025-01-23', 'time': '11:40:00', 'year': 2025, 'month': 1, 'day': 23}
  2026/01/23           → {'datetime': '2026-01-23 00:00:00', 'timestamp': 1769097600, 'date': '2026-01-23', 'time': '00:00:00', 'year': 2026, 'month': 1, 'day': 23}

【数字清洗】
  ¥99.9           → {'value': 99.9, 'unit': None, 'original': '¥99.9'}
  1.5万           → {'value': 15000.0, 'unit': '万', 'original': '1.5万'}
  300亿           → {'value': 30000000000.0, 'unit': '亿', 'original': '300亿'}
  1,234,567       → {'value': 1234567.0, 'unit': None, 'original': '1,234,567'}
  一千二百         → {'value': 1200.0, 'unit': None, 'original': '一千二百'}
  1.5GB           → {'value': 1610612736.0, 'unit': 'GB', 'original': '1.5GB'}

【文本清洗】
  '  标题含有\n多余空格  '           → '标题含有 多余空格'
  '<p>HTML标签</p>'                  → 'HTML标签'
  '包含&nbsp;实体'                   → '包含 实体'
  'https://example.com/news/123'     → 'https://example.com/news/123'

==================================================
流水线清洗示例
==================================================

【原始数据】
  title          :   <b>新闻标题</b>含有HTML  
  publish_time   : 2小时前
  price          : ¥99.9万
  view_count     : 1,234,567
  content        :   这是正文内容...

  
  url            : example.com/news/123

【清洗后】
  title          : 新闻标题含有HTML
  publish_time   : {'datetime': '2026-01-23 08:00:00', 'timestamp': 1769126400, ...}
  price          : 999000.0
  view_count     : 1234567
  content        : 这是正文内容...
  url            : https://example.com/news/123

【验证结果】
  有效: True
  缺失字段: 无
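
One thing to keep in mind when reading a run like this: an epoch timestamp only maps to a wall-clock time once a time zone is fixed, and relative inputs such as `1小时前` depend on when the script runs. The sketch below pins the zone explicitly so the conversion is reproducible on any machine. `to_epoch`, `from_epoch`, and the UTC+8 zone are assumptions made for this illustration, not part of the toolkit.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the source reports local time in UTC+8 (China Standard Time).
CST = timezone(timedelta(hours=8))

def to_epoch(text, tz=CST, fmt='%Y-%m-%d %H:%M:%S'):
    """Interpret a naive datetime string in the given zone; return epoch seconds."""
    return int(datetime.strptime(text, fmt).replace(tzinfo=tz).timestamp())

def from_epoch(seconds, tz=CST, fmt='%Y-%m-%d %H:%M:%S'):
    """Render epoch seconds as wall-clock time in the given zone."""
    return datetime.fromtimestamp(seconds, tz=tz).strftime(fmt)

print(to_epoch('2026-01-23 10:00:00'))            # 1769133600
print(from_epoch(1737603600))                     # 2025-01-23 11:40:00
print(from_epoch(1737603600, tz=timezone.utc))    # 2025-01-23 03:40:00
```

If your cleaner relies on naive `datetime.now()` and the system zone, the same input can produce different timestamps on a laptop and on a server, so it is worth deciding on one zone up front.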

📝 Using a config file

yaml

# config/cleaning_rules.yaml
# Cleaning rules as a config file (optional; a more flexible way to configure the pipeline)

rules:
  news:  # news-type data
    title:
      type: title
      max_length: 200
      required: true
    
    publish_time:
      type: datetime
      output_format: "%Y-%m-%d %H:%M:%S"
      required: true
    
    author:
      type: text
      default: "未知作者"
    
    view_count:
      type: count
      default: 0
    
    price:
      type: price
    
    content:
      type: content
      min_length: 50
      required: true
  
  product:  # product-type data
    name:
      type: title
      max_length: 100
      required: true
    
    price:
      type: price
      required: true
    
    sales:
      type: count
      default: 0
    
    description:
      type: content
      min_length: 20

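python

# Loading the rules from the config file
import yaml  # PyYAML (third-party): pip install pyyaml

# CleaningPipeline is the class from Module 5; import it from wherever it lives in your project.

def create_pipeline_from_config(config_file, data_type='news'):
    """Build a cleaning pipeline from a config file."""
    with open(config_file, 'r', encoding='utf-8') as f:
        config = yaml.safe_load(f)
    
    rules = config['rules'][data_type]
    pipeline = CleaningPipeline()
    
    for field, rule in rules.items():
        # Everything except 'type' is forwarded to the rule as keyword arguments
        params = {k: v for k, v in rule.items() if k != 'type'}
        pipeline.add_rule(field, rule['type'], **params)
    
    return pipeline

# Usage (raw_data is one scraped record, as in the earlier examples)
pipeline = create_pipeline_from_config('config/cleaning_rules.yaml', 'news')
cleaned = pipeline.clean(raw_data)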
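
Once rules live in a file, typos in that file become a new source of bugs. A quick sanity check before building the pipeline can catch them early. The sketch below works on the already-parsed rules mapping; `validate_rules` is a hypothetical helper, and `KNOWN_TYPES` simply lists the rule types that appear in the YAML above (the set your pipeline actually supports may differ).

```python
# Rule types that appear in cleaning_rules.yaml above; extend as needed.
KNOWN_TYPES = {'title', 'datetime', 'text', 'count', 'price', 'content'}

def validate_rules(rules):
    """Return a list of problems found in a {field: rule} mapping (hypothetical helper)."""
    problems = []
    for field, rule in rules.items():
        if not isinstance(rule, dict):
            problems.append(f"{field}: rule must be a mapping")
            continue
        rule_type = rule.get('type')
        if rule_type is None:
            problems.append(f"{field}: missing 'type'")
        elif rule_type not in KNOWN_TYPES:
            problems.append(f"{field}: unknown type {rule_type!r}")
    return problems

rules = {
    'title': {'type': 'title', 'max_length': 200},
    'price': {'typ': 'price'},      # typo: 'typ' instead of 'type'
    'tags': {'type': 'taglist'},    # type not in KNOWN_TYPES
}
for problem in validate_rules(rules):
    print(problem)
# price: missing 'type'
# tags: unknown type 'taglist'
```

Collecting every problem in one pass, rather than failing on the first, tends to be friendlier when a config file has several mistakes at once.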

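📝 Summary

Today we built a production-grade data cleaning toolkit:

Six cleaning scenarios (dates, amounts, units, nulls, text, deduplication)
Pipeline design (config-driven, extensible, batch-capable)
Complete implementation (with detailed, annotated comments)
Unit tests (to safeguard quality)
Integration example (scrape → clean → store)

Core value:

✅ Reusable: write once, use everywhere
✅ Configurable: rules live outside the code and are easy to adjust
✅ Testable: backed by unit tests
✅ Production-grade: proven in real projects

Add this toolkit to your codebase and call it whenever dirty data shows up; it can save around 80% of your cleaning time!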

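🎯 Coming up next

The data is clean now, but how do you monitor its quality? How do you spot anomalous records?

In the next article, "Data Quality Monitoring: Missing Rate / Duplicate Rate / Outlier Detection (with Visual Reports)", we will learn how to run quality checks on cleaned data and generate visual reports so the data stays trustworthy!

Homework: use this toolkit to clean data from 3 different sources (news, products, or reviews, your choice) and produce a before/after comparison report. Send me a screenshot! Good luck!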
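
For the before/after report, a per-field completeness rate is a simple place to start. The sketch below is one possible way to compute it; `is_empty` and `completeness` are hypothetical helpers, and the placeholder strings are the ones listed in the null-handling scenario of this article.

```python
# Placeholder strings treated as "empty", taken from the null-handling scenario.
EMPTY_STRINGS = {'', 'null', '--', '暂无'}

def is_empty(value):
    """True for None and for strings that are blank or known placeholders."""
    return value is None or (isinstance(value, str) and value.strip() in EMPTY_STRINGS)

def completeness(records, fields):
    """Return {field: share of records with a non-empty value}."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(not is_empty(r.get(field)) for r in records) / total
        for field in fields
    }

raw = [{'title': '标题', 'author': '--'}, {'title': '暂无', 'author': 'Tom'}]
cleaned = [{'title': '标题', 'author': '未知作者'}, {'title': None, 'author': 'Tom'}]

print(completeness(raw, ['title', 'author']))      # {'title': 0.5, 'author': 0.5}
print(completeness(cleaned, ['title', 'author']))  # {'title': 0.5, 'author': 1.0}
```

Note that a default value such as `未知作者` raises the completeness number without adding real information, so it is worth reporting defaulted fields separately if you want the metric to stay honest.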

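🌟 Closing

That's it for this installment of 《Python爬虫实战》! If you run into any questions while practicing, leave a comment below; I'll do my best to reply to every one I see. See you next time!

If you enjoyed the article as you read along, likes, bookmarks, and follows are all welcome.
Your support is the best encouragement I could ask for on this writing journey! ❤️🔥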

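📌 Column updated regularly | Bookmark + subscribe recommended

Column 👉 《Python爬虫实战》: I'll keep publishing along the path "beginner → intermediate → engineering → project delivery", and aim for every article to be:

✅ Clearly explained (principles) | ✅ Runnable (code) | ✅ Useful (scenarios) | ✅ Robust (engineering)

📣 If you want to improve systematically: I strongly recommend subscribing to the column first and then working through the table of contents in order; it is far more efficient.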

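✅ Topic requests

Want me to turn [a specific site / anti-scraping measure / CAPTCHA / distributed setup] into a hands-on column article?

Tell me what you need in the comments and I'll prioritize it ✅

⭐️ If you like my work, please follow me (so you never miss an update)

⭐️ If this helped you, please give it a like (it keeps me motivated)

⭐️ If you have questions, please leave a comment (I'll fill the gaps and keep iterating)

Disclaimer: This article is for learning and technical research only. Use these techniques only where lawful and compliant, and in accordance with site rules and the robots protocol. Using them for any illegal purpose or to infringe on the rights of others is strictly prohibited.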