Python Web Scraping from Scratch [Chapter 9: Hands-On Projects, Section 7] Incremental Crawling: Building Both the last_time and last_id Strategies

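🔥 This article is part of the column 《Python爬虫实战》, which keeps growing with new concepts and hands-on projects. Subscribe and bookmark it first so it is easy to come back to. Updates are ongoing.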

Contents:

- 🌟 Opening
- Recap of the previous section
- What is incremental crawling?
  - Full vs incremental
  - Two common strategies
- Core design
  - Incremental state management
  - Three boundary problems
- Hands-on code: both strategies, fully implemented
  - Project structure
  - 1. State manager (state_manager.py)
  - 2. Time-based incremental spider (time_based_spider.py)
    - What the lookback window is for
    - The time-comparison pitfall
  - 3. ID-based incremental spider (id_based_spider.py)
    - Batching to handle a large backlog
    - Why IDs are more reliable than timestamps
  - 4. Entry point (run_incremental.py)
- Running and verifying
  - First run (full crawl)
  - Second run (incremental crawl)
  - Testing the ID-based spider
- Advanced techniques
  - 1. Hybrid strategy (time + ID as a double safeguard)
  - 2. Smart lookback window (dynamic adjustment)
  - 3. Managing multiple sources
  - 4. Retry strategy for failed incremental runs
- Real-project integration example
  - A complete news collection system
- Common pitfalls and fixes ⚠️
  - Pitfall 1: time zones
  - Pitfall 2: lookback window too small, data missed
  - Pitfall 3: updating state at the wrong moment
  - Pitfall 4: concurrent runs corrupting state
- Monitoring and alerting 📊
  - Incremental anomaly detection
- Acceptance criteria ✅
- Coming next 🔮
- Full code summary 📦
- Summary 📝
- 🌟 Closing
  - 📌 The column keeps updating | bookmark + subscribe
  - ✅ Topic requests

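🌟 Opening

Hi everyone, I'm 喵手.

Communities I post in: C站 / 掘金 / 腾讯云 / 阿里云 / 华为云 / 51CTO

Drop by any time and let's learn together 🌟

I focus on engineering-grade Python crawling and run the column 👉 《Python爬虫实战》, which covers everything from collection strategy to anti-bot countermeasures and from data cleaning to distributed scheduling, with reusable methods and working case studies. The goal is code that runs, is useful, and extends well, so that data can be fetched, cleaned, and actually put to use.

📌 How to use the column (worth bookmarking)

✅ Basics: environment setup / requests and parsing / persisting data
✅ Intermediate: login and auth / dynamic rendering / anti-bot countermeasures
✅ Engineering: async concurrency / distributed scheduling / monitoring and fault tolerance
✅ Projects: data governance / visual analysis / scenario-driven applications

📣 A quick plug: if you want to learn crawling systematically instead of piecing fragments together, subscribe to or follow the column 《Python爬虫实战》.

Subscribers get updates first, and working through the contents in order is the more efficient way to learn.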

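Recap of the previous section

In the previous section, "Python Web Scraping from Scratch [Chapter 9: Hands-On Projects, Section 6] Resumable Crawling: Task Status Table + Failed-Queue Replay", we got SQLite storage working and used Upsert for idempotent writes, so rows are never duplicated no matter how many times the crawler runs. That exposes a new problem:

Every run crawls all the data from the beginning, which is wasteful.

Picture a daily crawl of a news site. Today there are 100 new articles, yet the crawler starts from page one and re-fetches the 10,000 articles it collected yesterday. Upsert keeps the database free of duplicates, but the network requests, parsing, and database lookups are all wasted.

Today's goal: collect only the new data and leave the old data alone.

That is incremental crawling, one of the most important capabilities of an engineering-grade crawler 🚀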

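What is incremental crawling?

Full vs incremental

```text
Full crawl (start from scratch every time):
Day 1: fetch items 1-100 → store 100 rows
Day 2: fetch items 1-150 → store 150 rows (the first 100 are rewritten)
Day 3: fetch items 1-200 → store 200 rows (the first 150 are rewritten)
Problem: most of the work is repeated. On day 3, 150 of the 200 fetches (75%) are redundant.

Incremental crawl (only the new part):
Day 1: fetch items 1-100 → store 100 rows, record last_id=100
Day 2: start from item 101 → fetch only 101-150 → store 50 rows
Day 3: start from item 151 → fetch only 151-200 → store 50 rows
Benefit: only new items are fetched, so runs are faster, use less traffic, and are gentler on the server.
```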
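To put numbers on the comparison above, here is a minimal sketch that counts how many items each approach fetches over the three days. The function name `items_fetched` and the `daily_totals` list are illustrative assumptions, not part of the project code.

```python
from typing import List


def items_fetched(daily_totals: List[int], incremental: bool) -> int:
    """Count items fetched over several days.

    daily_totals holds the cumulative number of items on the source each day.
    """
    fetched = 0
    last_seen = 0
    for total in daily_totals:
        if incremental:
            fetched += total - last_seen  # only the items added since the last run
            last_seen = total
        else:
            fetched += total              # everything, every day
    return fetched


daily_totals = [100, 150, 200]  # the three days from the example above
full = items_fetched(daily_totals, incremental=False)
incr = items_fetched(daily_totals, incremental=True)
print(f"full crawl: {full} fetches, incremental: {incr} fetches")
```

The gap widens as the archive grows, because a full crawl pays for the whole history on every run.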

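Two common strategies

1. Timestamp-based (last_time)

```python
# Remember the latest publish time seen in the previous run
last_time = "2026-01-24 10:00:00"

# Next run: fetch only rows published after it, e.g. with the SQL filter
#   WHERE pub_time > '2026-01-24 10:00:00'
```

Good for: news, blogs, social-media posts (anything with a publish-time field)

2. ID-based (last_id)

```python
# Remember the largest ID seen in the previous run
last_id = 12345

# Next run: continue after that ID, e.g. with the SQL filter
#   WHERE id > 12345
```

Good for: forum posts, product listings, user data (anything with an auto-increment ID)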
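Both filters can be tried against a throwaway database. The sketch below uses Python's built-in `sqlite3` with an in-memory table; the `articles` table, its columns, and the helper names `fetch_after_time` / `fetch_after_id` are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, pub_time TEXT)")
conn.executemany(
    "INSERT INTO articles (id, title, pub_time) VALUES (?, ?, ?)",
    [
        (1, "old post", "2026-01-24 09:00:00"),
        (2, "boundary post", "2026-01-24 10:00:00"),
        (3, "new post", "2026-01-24 11:00:00"),
    ],
)


def fetch_after_time(conn, last_time):
    """Timestamp strategy: rows published strictly after last_time."""
    rows = conn.execute(
        "SELECT id FROM articles WHERE pub_time > ? ORDER BY pub_time", (last_time,)
    )
    return [row[0] for row in rows]


def fetch_after_id(conn, last_id):
    """ID strategy: rows whose ID is greater than last_id."""
    rows = conn.execute("SELECT id FROM articles WHERE id > ? ORDER BY id", (last_id,))
    return [row[0] for row in rows]


by_time = fetch_after_time(conn, "2026-01-24 10:00:00")
by_id = fetch_after_id(conn, 1)
print(by_time, by_id)
```

Passing the cursor as a bound parameter (`?`) rather than formatting it into the SQL string avoids quoting problems when the value comes from a state file.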

Core design

Incremental state management

The state lives in a JSON file, incremental_state.json:

```json
{
    "news_spider": {
        "last_time": "2026-01-24 15:30:00",
        "last_id": null,
        "total_collected": 1250,
        "last_run": "2026-01-24 16:00:00"
    },
    "product_spider": {
        "last_time": null,
        "last_id": 98765,
        "total_collected": 5600,
        "last_run": "2026-01-24 14:30:00"
    }
}
```

Design points:

- Each spider maintains its own state
- Both last_time and last_id are recorded (so you can switch strategies)
- The total collected count and last run time are recorded (useful for monitoring)

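Three boundary problems

Problem 1: greater-than, or greater-than-or-equal?

```sql
-- ❌ Risky: rows that share the exact last_time timestamp can be missed
WHERE pub_time > last_time

-- ✅ Safer: use >= and let Upsert remove the duplicates
WHERE pub_time >= last_time
```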
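A small simulation shows why the strict comparison loses data. It assumes two articles share the same second, one of which appeared on the site only after the previous run; the URLs and the `seen` dictionary (standing in for the Upsert-backed table) are made up for the example.

```python
# Already stored by the previous run, keyed by URL (stand-in for the database)
seen = {"https://example.com/a": "2026-01-24 10:00:00"}
last_time = "2026-01-24 10:00:00"

source = [
    {"url": "https://example.com/a", "pub_time": "2026-01-24 10:00:00"},
    # Same second as /a, but published just after our previous run finished
    {"url": "https://example.com/b", "pub_time": "2026-01-24 10:00:00"},
    {"url": "https://example.com/c", "pub_time": "2026-01-24 10:05:00"},
]

strict = [item["url"] for item in source if item["pub_time"] > last_time]
inclusive = [item["url"] for item in source if item["pub_time"] >= last_time]

# Deduplicate the inclusive result the way Upsert would: skip URLs already stored
new_only = [url for url in inclusive if url not in seen]

print("strict   :", strict)     # /b is missing
print("inclusive:", new_only)   # /b and /c, with /a filtered out as a duplicate
```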

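Problem 2: one time zone everywhere

```python
# The source site may return any of these:
"2026-01-24 10:00:00"        # no time zone information
"2026-01-24T10:00:00Z"       # UTC
"2026-01-24 10:00:00+08:00"  # UTC+8

# Convert everything to UTC before comparing
```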

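Problem 3: the lookback window

```python
from datetime import datetime, timedelta

# Sources sometimes publish late, for example:
# - the latest timestamp we collected today is 10:00
# - but an article stamped 09:50 only goes live at 10:30

# Fix: a lookback window
last_time = datetime.fromisoformat("2026-01-24 10:00:00")  # parse first; a str cannot be subtracted
lookback = timedelta(hours=1)
query_time = last_time - lookback  # query from 09:00
```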

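Hands-on code: both strategies, fully implemented

Project structure

```text
incremental_spider/
├── state_manager.py       # state manager
├── time_based_spider.py   # time-based incremental spider
├── id_based_spider.py     # ID-based incremental spider
├── run_incremental.py     # entry point
└── incremental_state.json # state file (generated automatically)
```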

1. State manager (state_manager.py)

```python
"""
Incremental state manager - remembers where the previous run stopped
"""
import json
from pathlib import Path
from typing import Optional, Dict, Any
from datetime import datetime


class IncrementalStateManager:
    """State management for incremental crawling"""

    def __init__(self, state_file: str = 'incremental_state.json'):
        self.state_file = Path(state_file)
        self.states: Dict[str, Dict[str, Any]] = {}
        self._load()

    def _load(self):
        """Load the state file"""
        if self.state_file.exists():
            with open(self.state_file, 'r', encoding='utf-8') as f:
                self.states = json.load(f)
            print(f"✅ Loaded state file: {self.state_file}")
        else:
            print(f"📝 State file not found, a new one will be created: {self.state_file}")
            self.states = {}

    def _save(self):
        """Write the state to disk"""
        with open(self.state_file, 'w', encoding='utf-8') as f:
            json.dump(self.states, f, ensure_ascii=False, indent=2)

    def get_state(self, spider_name: str) -> Dict[str, Any]:
        """
        Get the incremental state of a spider
        Args:
            spider_name: spider name (unique identifier)
        Returns:
            The state dict; a default one is created if none exists
        """
        if spider_name not in self.states:
            # Initial state
            self.states[spider_name] = {
                'last_time': None,
                'last_id': None,
                'total_collected': 0,
                'last_run': None
            }

        return self.states[spider_name]

    def update_state(
        self,
        spider_name: str,
        last_time: Optional[str] = None,
        last_id: Optional[int] = None,
        new_count: int = 0
    ):
        """
        Update the state of a spider
        Args:
            spider_name: spider name
            last_time: latest timestamp collected
            last_id: latest ID collected
            new_count: number of items added in this run
        """
        state = self.get_state(spider_name)

        # Only overwrite the fields that were passed in
        if last_time is not None:
            state['last_time'] = last_time

        if last_id is not None:
            state['last_id'] = last_id

        state['total_collected'] += new_count
        state['last_run'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

        # Persist immediately
        self._save()

        print(f"💾 State updated: {spider_name}")
        print(f"  ├─ last_time: {state['last_time']}")
        print(f"  ├─ last_id: {state['last_id']}")
        print(f"  ├─ New this run: {new_count}")
        print(f"  └─ Total collected: {state['total_collected']}\n")

    def reset_state(self, spider_name: str):
        """Reset a spider's state (forces a new full crawl)"""
        if spider_name in self.states:
            del self.states[spider_name]
            self._save()
            print(f"🔄 State reset: {spider_name}")

    def get_all_states(self) -> Dict[str, Dict[str, Any]]:
        """Get the state of every spider (for a monitoring dashboard)"""
        return self.states
```

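Design highlights:

- The state file is created automatically (nothing to set up before the first run)
- Each spider has its own state (no interference between spiders)
- State is persisted right after every update (protects against crashes mid-run)
- A reset function is provided (for a fresh full crawl when something goes wrong)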
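One hardening step worth considering: `_save` above rewrites the file in place, so a crash halfway through the write could leave truncated JSON behind. A minimal sketch of an atomic variant follows, assuming a standalone helper named `save_state_atomically` (hypothetical, not part of the original class) that writes to a temporary file and then swaps it in with `os.replace`.

```python
import json
import os
import tempfile


def save_state_atomically(states: dict, state_file: str) -> None:
    """Write states to state_file so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(state_file))
    # The temp file must live in the same directory so os.replace stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(states, f, ensure_ascii=False, indent=2)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk before the swap
        os.replace(tmp_name, state_file)  # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "incremental_state.json")
    save_state_atomically({"news_spider": {"last_id": 5}}, demo_path)
    with open(demo_path, encoding="utf-8") as f:
        reloaded = json.load(f)
    leftover_files = os.listdir(demo_dir)

print(reloaded, leftover_files)
```

If the process dies before `os.replace`, the previous state file is still intact and the next run simply resumes from it.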

2. Time-based incremental spider (time_based_spider.py)

```python
"""
Timestamp-based incremental crawling
Good for: news, blogs, social media, and other content with a publish time
"""
from datetime import datetime, timedelta
from typing import List, Dict, Any, Optional
from state_manager import IncrementalStateManager


class TimeBasedIncrementalSpider:
    """Timestamp-based incremental spider"""

    def __init__(
        self,
        spider_name: str,
        lookback_hours: int = 1  # lookback window, in hours
    ):
        self.spider_name = spider_name
        self.lookback_hours = lookback_hours
        self.state_manager = IncrementalStateManager()

        # Counters
        self.stats = {
            'fetched': 0,    # items fetched in this run
            'new': 0,        # items counted as new
            'skipped': 0     # items skipped (already stored)
        }

    def get_start_time(self) -> Optional[str]:
        """
        Get the start time for this run
        Returns:
            Start time string; None means first run (full crawl)
        """
        state = self.state_manager.get_state(self.spider_name)
        last_time = state['last_time']

        if last_time is None:
            print("🆕 First run, doing a full crawl\n")
            return None

        # Parse the previous timestamp
        last_dt = datetime.fromisoformat(last_time)

        # Subtract the lookback window
        start_dt = last_dt - timedelta(hours=self.lookback_hours)
        start_time = start_dt.strftime('%Y-%m-%d %H:%M:%S')

        print(f"⏱️  Incremental run")
        print(f"  ├─ Latest time from last run: {last_time}")
        print(f"  ├─ Lookback window: {self.lookback_hours} h")
        print(f"  └─ Start time for this run: {start_time}\n")

        return start_time

    def fetch_data_by_time(self, start_time: Optional[str]) -> List[Dict[str, Any]]:
        """
        Simulate fetching data newer than a given time from an API or database.
        In a real project this is a requests.get() call or a database query.
        """
        # Mock data source (replace with the real API in a real project)
        all_data = self._generate_mock_data()

        if start_time is None:
            # First run: everything
            return all_data

        # Incremental filter (string comparison is safe here because every
        # timestamp uses the same zero-padded format)
        filtered = [
            item for item in all_data
            if item['pub_time'] >= start_time
        ]

        print(f"📦 Fetched from source: {len(filtered)} items (time >= {start_time})")
        return filtered

    def _generate_mock_data(self) -> List[Dict[str, Any]]:
        """Generate mock data (delete this function in a real project)"""
        now = datetime.now()
        data = []

        # Simulate the last 24 hours of news, 10 articles per hour
        for hours_ago in range(24):
            pub_dt = now - timedelta(hours=hours_ago)
            pub_time = pub_dt.strftime('%Y-%m-%d %H:%M:%S')

            for i in range(10):
                data.append({
                    'url': f'https://example.com/news/{hours_ago}_{i}',
                    'title': f'News from {hours_ago}h ago #{i}',
                    'content': f'Content of {hours_ago}_{i}',
                    'pub_time': pub_time,
                    'source': 'example.com'
                })

        return data

    def run(self):
        """Run the incremental crawl"""
        print(f"🚀 Starting spider: {self.spider_name}\n")

        # 1. Work out the start time
        start_time = self.get_start_time()

        # 2. Fetch data
        items = self.fetch_data_by_time(start_time)
        self.stats['fetched'] = len(items)

        if not items:
            print("✅ No new data, run finished\n")
            return

        # 3. Process each item (simplified; a real project stores it)
        latest_time = None
        for item in items:
            # In a real project: pipeline.process_item(item)
            # Here we only print the first 3 as a sample
            if self.stats['new'] < 3:
                print(f"📄 [{self.stats['new']+1}] {item['title']} ({item['pub_time']})")

            self.stats['new'] += 1

            # Track the latest timestamp
            if latest_time is None or item['pub_time'] > latest_time:
                latest_time = item['pub_time']

        if self.stats['new'] > 3:
            print(f"   ... and {self.stats['new']-3} more\n")

        # 4. Update the state
        self.state_manager.update_state(
            spider_name=self.spider_name,
            last_time=latest_time,
            new_count=self.stats['new']
        )

        # 5. Print the summary
        self._print_summary()

    def _print_summary(self):
        """Print the run summary"""
        print("="*50)
        print("📊 Run statistics")
        print("="*50)
        print(f"Fetched: {self.stats['fetched']} items")
        print(f"Stored as new: {self.stats['new']} items")
        print(f"Skipped duplicates: {self.stats['skipped']} items")
        print("="*50 + "\n")
```

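How the core logic works:

What the lookback window is for

```python
# last_time  = "2026-01-24 10:00:00"
# lookback   = 1 hour
# start_time = "2026-01-24 09:00:00"  # pushed back by 1 hour

# Why do this?
# Suppose the source publishes late:
# - an article stamped 09:50 only goes live at 10:30
# - querying from 10:00 would miss it

# The lookback window lets you "look back a little": better a duplicate than a gap
```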
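The scenario in the comments above can be checked directly. This sketch assumes a helper called `window_start` (a hypothetical name) that mirrors what `get_start_time` does, and a single late-published item.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def window_start(last_time: str, lookback_hours: int) -> str:
    """Return last_time minus the lookback window, in the same string format."""
    start = datetime.strptime(last_time, FMT) - timedelta(hours=lookback_hours)
    return start.strftime(FMT)


last_time = "2026-01-24 10:00:00"
# Stamped 09:50 but only visible on the site after our 10:00 run
late_item = {"url": "https://example.com/late", "pub_time": "2026-01-24 09:50:00"}

caught_without_lookback = late_item["pub_time"] >= last_time
caught_with_lookback = late_item["pub_time"] >= window_start(last_time, lookback_hours=1)

print(caught_without_lookback, caught_with_lookback)
```

The price is that everything else published between 09:00 and 10:00 is fetched again, which is why the storage layer needs to be idempotent.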

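The time-comparison pitfall

```python
from datetime import datetime

# ❌ Wrong: comparing raw strings breaks when the hour is not zero-padded
print('2026-01-24 9:00:00' > '2026-01-24 10:00:00')   # True, because '9' > '1', yet 9:00 is earlier

# ✅ Right: compare datetime objects
dt1 = datetime.fromisoformat('2026-01-24 09:00:00')
dt2 = datetime.fromisoformat('2026-01-24 10:00:00')
print(dt1 > dt2)                                       # False, as expected

# Or normalize the format first (zero-padded), after which strings compare correctly
print('2026-01-24 09:00:00' > '2026-01-24 10:00:00')  # False, as expected
```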
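When the source hands back non-padded timestamps, one way to get to a comparable form is to parse and re-format them. The sketch below assumes a helper named `normalize` (hypothetical) and relies on `datetime.strptime` accepting a single-digit hour for `%H`.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def normalize(ts: str) -> str:
    """Parse a timestamp and re-emit it zero-padded, so strings sort chronologically."""
    return datetime.strptime(ts, FMT).strftime(FMT)


raw_a, raw_b = "2026-01-24 9:00:00", "2026-01-24 10:00:00"

raw_says_a_is_later = raw_a > raw_b                          # misleading
normalized_says_a_is_later = normalize(raw_a) > normalize(raw_b)

print(normalize(raw_a), raw_says_a_is_later, normalized_says_a_is_later)
```

Normalizing once at ingestion time means the rest of the pipeline, including the state file, can keep using plain string comparison.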

3. ID-based incremental spider (id_based_spider.py)

```python
"""
ID-based incremental crawling
Good for: forum posts, product listings, user data, and anything with an auto-increment ID
"""
from typing import List, Dict, Any
from state_manager import IncrementalStateManager


class IDBasedIncrementalSpider:
    """ID-based incremental spider"""

    def __init__(self, spider_name: str):
        self.spider_name = spider_name
        self.state_manager = IncrementalStateManager()

        self.stats = {
            'fetched': 0,
            'new': 0,
            'skipped': 0
        }

    def get_start_id(self) -> int:
        """
        Get the cursor for this run
        Returns:
            The last collected ID (an exclusive lower bound); 0 means first run
        """
        state = self.state_manager.get_state(self.spider_name)
        last_id = state['last_id']

        if last_id is None:
            print("🆕 First run, starting from ID=1\n")
            return 0

        # Resume right after the last collected ID. The fetch filter is
        # `id > cursor`, so the cursor must stay at last_id; returning
        # last_id + 1 would silently skip one record.
        print(f"⏭️  Incremental run")
        print(f"  ├─ Max ID from last run: {last_id}")
        print(f"  └─ Starting ID for this run: {last_id + 1}\n")

        return last_id

    def fetch_data_by_id(self, start_id: int, limit: int = 100) -> List[Dict[str, Any]]:
        """
        Simulate fetching rows after a given ID from an API or database
        Args:
            start_id: cursor; only rows with id > start_id are returned
            limit: maximum number of rows per call
        """
        # Mock data source (replace with the real API in a real project)
        all_data = self._generate_mock_data_with_id()

        # Filter by ID
        filtered = [
            item for item in all_data
            if item['id'] > start_id
        ][:limit]

        print(f"📦 Fetched from source: {len(filtered)} items (ID > {start_id})")
        return filtered

    def _generate_mock_data_with_id(self) -> List[Dict[str, Any]]:
        """Generate mock data with IDs"""
        data = []
        for i in range(1, 301):  # pretend the database holds 300 rows
            data.append({
                'id': i,
                'url': f'https://forum.example.com/post/{i}',
                'title': f'Forum post #{i}',
                'content': f'Content of post {i}',
                'author': f'user{i % 20}'
            })
        return data

    def run(self, batch_size: int = 100):
        """
        Run the incremental crawl
        Args:
            batch_size: rows per batch (pagination)
        """
        print(f"🚀 Starting spider: {self.spider_name}\n")

        # 1. Get the cursor
        start_id = self.get_start_id()
        current_id = start_id

        # 2. Collect in batches (handles a large backlog)
        while True:
            items = self.fetch_data_by_id(current_id, limit=batch_size)

            if not items:
                print("✅ Reached the latest data, run finished\n")
                break

            self.stats['fetched'] += len(items)
            first_batch = self.stats['new'] == 0

            # 3. Process each item
            max_id = current_id
            for item in items:
                # In a real project: pipeline.process_item(item)
                if self.stats['new'] < 3:
                    print(f"📄 [{self.stats['new']+1}] ID={item['id']} {item['title']}")

                self.stats['new'] += 1

                # Track the largest ID
                if item['id'] > max_id:
                    max_id = item['id']

            # Only the first batch prints samples, so only it needs the "more" line
            if first_batch and len(items) > 3:
                print(f"   ... {len(items)-3} more in this batch\n")

            # 4. Advance the cursor and continue with the next batch
            current_id = max_id

            # A short batch means the source has nothing more to return
            if len(items) < batch_size:
                break

        # 5. Update the state
        self.state_manager.update_state(
            spider_name=self.spider_name,
            last_id=current_id,
            new_count=self.stats['new']
        )

        # 6. Print the summary
        self._print_summary()

    def _print_summary(self):
        """Print the run summary"""
        print("="*50)
        print("📊 Run statistics")
        print("="*50)
        print(f"Fetched: {self.stats['fetched']} items")
        print(f"Stored as new: {self.stats['new']} items")
        print("="*50 + "\n")
```

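Design highlights:

Batching to handle a large backlog

```text
# Suppose the last run stopped at ID=100 and the source added 500 rows today
# Fetching all 500 in one go means:
# - high memory usage
# - starting over after any failure

# Batched collection (100 per batch):
Batch 1: ID 101-200
Batch 2: ID 201-300
...
```

Persisting state after every successful batch makes the run more robust, because a failure then costs at most one batch. The run() method above persists once after the loop to keep the demo short.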
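Per-batch checkpointing can be sketched as a generator that advances the cursor and reports it only after the caller has finished with each batch. The names `iter_batches`, `fake_fetch`, and `on_batch_done` are assumptions for the illustration; in the project, `on_batch_done` would be a call into `update_state`.

```python
from typing import Callable, Dict, Iterator, List

Batch = List[Dict[str, int]]


def iter_batches(
    fetch: Callable[[int, int], Batch],
    last_id: int,
    batch_size: int,
    on_batch_done: Callable[[int], None],
) -> Iterator[Batch]:
    """Yield batches of rows with id > cursor, checkpointing after each one."""
    cursor = last_id
    while True:
        batch = fetch(cursor, batch_size)
        if not batch:
            return
        yield batch  # the caller processes the batch here
        # Reached only when the caller asks for the next batch, i.e. after
        # the current one was processed without raising
        cursor = max(item["id"] for item in batch)
        on_batch_done(cursor)
        if len(batch) < batch_size:
            return


SOURCE_MAX_ID = 600  # pretend the source holds IDs up to 600


def fake_fetch(cursor: int, limit: int) -> Batch:
    upper = min(cursor + limit, SOURCE_MAX_ID)
    return [{"id": i} for i in range(cursor + 1, upper + 1)]


checkpoints: List[int] = []
total = 0
for batch in iter_batches(fake_fetch, last_id=100, batch_size=100,
                          on_batch_done=checkpoints.append):
    total += len(batch)  # stand-in for pipeline.process_item(...)

print(total, checkpoints)
```

If processing raises in the middle of a batch, the checkpoint for that batch is never written, so the next run retries it from the last saved cursor.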

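Why IDs are more reliable than timestamps

```python
# Timestamps can:
# - collide (several rows share one timestamp when precision is low)
# - be edited by the source (an item "moves back in time")

# What an auto-increment ID gives you:
# - strictly increasing, never goes backwards
# - unique, never repeated
# So for forums and product listings, prefer the ID
```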

4. Entry point (run_incremental.py)

```python
#!/usr/bin/env python3
"""
Incremental crawl runner
"""
from time_based_spider import TimeBasedIncrementalSpider
from id_based_spider import IDBasedIncrementalSpider
import sys


def run_time_based():
    """Run the time-based incremental spider"""
    spider = TimeBasedIncrementalSpider(
        spider_name='news_spider',
        lookback_hours=2  # look back 2 hours
    )
    spider.run()


def run_id_based():
    """Run the ID-based incremental spider"""
    spider = IDBasedIncrementalSpider(spider_name='forum_spider')
    spider.run(batch_size=50)


def main():
    print("Choose an incremental strategy:")
    print("1. Timestamp-based (news, blogs)")
    print("2. ID-based (forums, product listings)")
    print("3. Run both\n")

    choice = input("Enter your choice (1/2/3): ").strip()

    if choice == '1':
        run_time_based()
    elif choice == '2':
        run_id_based()
    elif choice == '3':
        run_time_based()
        print("\n" + "="*60 + "\n")
        run_id_based()
    else:
        print("❌ Invalid choice")
        sys.exit(1)


if __name__ == '__main__':
    main()
```

Running and verifying

First run (full crawl)

```bash
python run_incremental.py
# choose 1 (time-based)
```

Output:

```text
🚀 Starting spider: news_spider

🆕 First run, doing a full crawl

📄 [1] News from 0h ago #0 (2026-01-24 16:00:00)
📄 [2] News from 0h ago #1 (2026-01-24 16:00:00)
📄 [3] News from 0h ago #2 (2026-01-24 16:00:00)
   ... and 237 more

💾 State updated: news_spider
  ├─ last_time: 2026-01-24 16:00:00
  ├─ last_id: None
  ├─ New this run: 240
  └─ Total collected: 240

==================================================
📊 Run statistics
==================================================
Fetched: 240 items
Stored as new: 240 items
Skipped duplicates: 0 items
==================================================
```

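Second run (incremental crawl)

```bash
# Run again about an hour later (or edit the state file by hand to simulate it)
python run_incremental.py
# choose 1
```

Output (illustrative; the exact count depends on when the second run starts relative to the mock timestamps):

```text
⏱️  Incremental run
  ├─ Latest time from last run: 2026-01-24 16:00:00
  ├─ Lookback window: 2 h
  └─ Start time for this run: 2026-01-24 14:00:00

📦 Fetched from source: 30 items (time >= 2026-01-24 14:00:00)
📄 [1] News from 0h ago #0 (2026-01-24 17:00:00)
📄 [2] News from 0h ago #1 (2026-01-24 17:00:00)
...

💾 State updated: news_spider
  ├─ last_time: 2026-01-24 17:00:00  # the timestamp moved forward
  ├─ last_id: None
  ├─ New this run: 30
  └─ Total collected: 270  # the running total grew

==================================================
📊 Run statistics
==================================================
Fetched: 30 items     # only 30 items, not the full 240
Stored as new: 30 items
Skipped duplicates: 0 items
==================================================
```

Verified: the second run collected only the recent window, a large efficiency gain 🚀 Note that this demo counts every fetched item as new, including the ones re-fetched because of the lookback window. In a real project the Upsert in the storage layer removes those duplicates.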

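Testing the ID-based spider

```bash
python run_incremental.py
# choose 2
```

Output (illustrative: it assumes the source holds 50 rows on the first run and 100 on the second; the static 300-row mock above would be drained completely in the first run):

```text
# First run
🆕 First run, starting from ID=1
📦 Fetched from source: 50 items (ID > 0)
📄 [1] ID=1 Forum post #1
...
💾 State updated: forum_spider
  ├─ last_time: None
  ├─ last_id: 50
  ├─ New this run: 50
  └─ Total collected: 50

# Second run (the database has gained new rows)
⏭️  Incremental run
  ├─ Max ID from last run: 50
  └─ Starting ID for this run: 51

📦 Fetched from source: 50 items (ID > 50)
📄 [1] ID=51 Forum post #51
...
💾 State updated: forum_spider
  ├─ last_id: 100   # the ID moved forward
  └─ Total collected: 100
```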

Advanced techniques

1. Hybrid strategy (time + ID as a double safeguard)

```python
class HybridIncrementalSpider:
    """Hybrid incremental strategy - uses both time and ID"""

    def get_filter_condition(self):
        """Filter on time and ID together"""
        state = self.state_manager.get_state(self.spider_name)

        last_time = state['last_time']
        last_id = state['last_id']

        conditions = []

        if last_time:
            conditions.append(f"pub_time >= '{last_time}'")

        if last_id is not None:
            conditions.append(f"id > {last_id}")

        # Combine with OR (a row is collected if either condition matches)
        return " OR ".join(conditions) if conditions else "1=1"

    def run(self):
        # SQL example
        sql = f"""
        SELECT * FROM articles
        WHERE {self.get_filter_condition()}
        ORDER BY id ASC
        LIMIT 1000;
        """
        # Even if timestamps are unreliable, the ID condition keeps rows from being missed
```

When to use it: the source exposes both a timestamp and an ID, and you want the extra safety of checking both.
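A related way to combine the two fields, shown here as an alternative rather than as the author's method, is a composite cursor: order by `(pub_time, id)` and resume after the last pair seen. That resolves ties when several rows share a timestamp. The sketch uses an in-memory SQLite table and bound parameters; the table layout and the `fetch_after` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, pub_time TEXT)")
conn.executemany(
    "INSERT INTO articles (id, pub_time) VALUES (?, ?)",
    [
        (1, "2026-01-24 10:00:00"),
        (2, "2026-01-24 10:00:00"),
        (3, "2026-01-24 10:00:00"),  # same second as rows 1 and 2
        (4, "2026-01-24 10:05:00"),
    ],
)


def fetch_after(conn, last_time, last_id, limit=1000):
    """Return rows that sort after the (last_time, last_id) cursor."""
    cursor = conn.execute(
        """
        SELECT id, pub_time FROM articles
        WHERE pub_time > ? OR (pub_time = ? AND id > ?)
        ORDER BY pub_time ASC, id ASC
        LIMIT ?
        """,
        (last_time, last_time, last_id, limit),
    )
    return cursor.fetchall()


# The previous run stopped at row 2, in the middle of the 10:00:00 group
rows = fetch_after(conn, "2026-01-24 10:00:00", 2)
print(rows)
```

With a time-only cursor, row 3 would either be skipped (`>`) or rows 1 and 2 would be re-fetched (`>=`); the pair cursor picks up exactly where the previous run stopped.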

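2. Smart lookback window (dynamic adjustment)

```python
from datetime import datetime


def calculate_smart_lookback(self, state):
    """Adjust the lookback window (in hours) based on how often the spider runs"""
    last_run = state['last_run']

    if not last_run:
        return 24  # first run: look back 24 hours

    # Time elapsed since the previous run
    last_dt = datetime.fromisoformat(last_run)
    hours_since = (datetime.now() - last_dt).total_seconds() / 3600

    # The longer the gap, the wider the window
    if hours_since > 24:
        return int(hours_since) + 2  # 2 extra hours as a safety margin
    elif hours_since > 6:
        return 6
    else:
        return 2  # hourly scheduled job: 2 hours is enough
```

Benefit: after a long gap between manual runs the window widens automatically, while a frequently scheduled job keeps a small window and reprocesses less.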

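3. Managing multiple sources

The state file, incremental_state.json, holds one entry per source (other fields omitted):

```python
# incremental_state.json
# {
#     "news_sina": {"last_time": "2026-01-24 10:00:00", ...},
#     "news_163":  {"last_time": "2026-01-24 09:30:00", ...},
#     "news_sohu": {"last_time": "2026-01-24 11:00:00", ...}
# }

# In code
from time_based_spider import TimeBasedIncrementalSpider

spiders = ['news_sina', 'news_163', 'news_sohu']

for spider_name in spiders:
    spider = TimeBasedIncrementalSpider(spider_name)
    spider.run()
```

Benefit: each source tracks its own progress independently, so a problem with one source does not affect the progress recorded for the others.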
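Independent state protects the recorded progress, but the plain loop above would still stop at the first spider that raises. A minimal sketch of isolating failures per source follows; `run_all`, `make_spider`, and the `_DemoSpider` stand-in are hypothetical names used only for this example.

```python
from typing import Callable, Dict, List


def run_all(spider_names: List[str], make_spider: Callable[[str], object]) -> Dict[str, str]:
    """Run every spider, recording failures instead of letting one abort the rest."""
    results = {}
    for name in spider_names:
        try:
            make_spider(name).run()
            results[name] = "ok"
        except Exception as exc:  # keep going; this source's state stays untouched
            results[name] = f"failed: {exc}"
    return results


class _DemoSpider:
    """Stand-in for TimeBasedIncrementalSpider; one source is made to fail."""

    def __init__(self, name: str):
        self.name = name

    def run(self) -> None:
        if self.name == "news_163":
            raise RuntimeError("source unreachable")


results = run_all(["news_sina", "news_163", "news_sohu"], _DemoSpider)
print(results)
```

Because the failing spider never reaches its `update_state` call, its next run starts from the same cursor as before.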

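4. Retry strategy for failed incremental runs

```python
import time


class IncrementalSpiderWithRetry:
    """Incremental spider with retries"""

    def run_with_retry(self, max_retries=3):
        """On failure the state is not updated, so the next run re-collects"""
        for attempt in range(max_retries):
            try:
                # Collection logic
                items = self.fetch_data()

                # Store
                for item in items:
                    self.pipeline.process_item(item)

                # ✅ Update the state only after success
                self.state_manager.update_state(...)
                return True

            except Exception as e:
                print(f"❌ Attempt {attempt+1} failed: {e}")

                if attempt == max_retries - 1:
                    # ❌ Final attempt failed: leave the state alone,
                    # the next run will start again from the same position
                    print("⚠️  Max retries reached, state not updated")
                    return False

                time.sleep(2 ** attempt)  # exponential backoff
```

Key point: state is updated only on success. On failure the old state is kept, so the next run resumes from the same checkpoint.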
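The same idea can be exercised without a real spider. The sketch below assumes a standalone `run_with_retry` function that takes the job and the commit step as callables, plus an injectable `sleep` so the demo does not actually wait; all of these names are illustrative.

```python
import time
from typing import Any, Callable


def run_with_retry(
    job: Callable[[], Any],
    commit: Callable[[Any], None],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run job with exponential backoff; call commit only if job succeeded."""
    for attempt in range(max_retries):
        try:
            result = job()
        except Exception as exc:
            print(f"attempt {attempt + 1} failed: {exc}")
            if attempt < max_retries - 1:
                sleep(2 ** attempt)  # 1s, 2s, 4s, ...
            continue
        commit(result)  # state is written only on success
        return True
    return False  # state untouched, the next run retries from the old cursor


attempts = {"n": 0}


def flaky_job():
    """Fails twice, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("timeout")
    return {"last_id": 150}


committed, delays = [], []
ok = run_with_retry(flaky_job, committed.append, max_retries=3, sleep=delays.append)
print(ok, committed, delays)
```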

Real-project integration example

A complete news collection system

```python
# news_incremental_system.py
"""
A complete incremental news collection system
Combines: incremental logic + database storage + quality checks
"""
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
from typing import List, Dict, Any, Optional
import sys
sys.path.append('..')  # import the code from earlier sections

from db_manager import DatabaseManager
from pipeline import SQLitePipeline
from state_manager import IncrementalStateManager


class IncrementalNewsSpider:
    """Complete incremental news spider"""

    def __init__(self, spider_name: str, base_url: str):
        self.spider_name = spider_name
        self.base_url = base_url

        # Set up the modules
        self.db = DatabaseManager('news.db')
        self.pipeline = SQLitePipeline(self.db)
        self.state_manager = IncrementalStateManager()

        # Counters
        self.stats = {
            'total_fetched': 0,
            'new_items': 0,
            'updated_items': 0,
            'failed_items': 0
        }

    def get_start_time(self, lookback_hours=2):
        """Get the incremental start time"""
        state = self.state_manager.get_state(self.spider_name)
        last_time = state['last_time']

        if not last_time:
            # First run: start from 24 hours ago
            start_time = (datetime.now() - timedelta(hours=24))
        else:
            # Incremental run: apply the lookback window
            last_dt = datetime.fromisoformat(last_time)
            start_time = last_dt - timedelta(hours=lookback_hours)

        return start_time.strftime('%Y-%m-%d %H:%M:%S')

    def fetch_list_page(self, page=1):
        """Fetch one list page"""
        url = f"{self.base_url}/list?page={page}"

        try:
            resp = requests.get(url, timeout=10, headers={
                'User-Agent': 'Mozilla/5.0 (compatible; NewsBot/1.0)'
            })
            resp.raise_for_status()

            soup = BeautifulSoup(resp.text, 'html.parser')

            # Parse the list entries
            news_items = []
            for item in soup.select('.news-item'):
                news_items.append({
                    'url': item.select_one('a')['href'],
                    'title': item.select_one('.title').get_text(strip=True),
                    'pub_time': item.select_one('.time').get_text(strip=True)
                })

            return news_items

        except Exception as e:
            print(f"❌ List page failed: {url} - {e}")
            return []

    def fetch_detail_page(self, url: str) -> Optional[Dict[str, Any]]:
        """Fetch one detail page; returns None on failure"""
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()

            soup = BeautifulSoup(resp.text, 'html.parser')

            item = {
                'url': url,
                'title': soup.select_one('h1.article-title').get_text(strip=True),
                'content': soup.select_one('.article-content').get_text(strip=True),
                'pub_time': soup.select_one('.pub-time').get_text(strip=True),
                'author': soup.select_one('.author').get_text(strip=True) if soup.select_one('.author') else None,
                'source': self.base_url.split('//')[1].split('/')[0],
                'raw_html': resp.text
            }

            return item

        except Exception as e:
            print(f"  └─ ❌ Detail page failed: {e}")
            self.stats['failed_items'] += 1
            return None

    def is_item_newer_than(self, item_time: str, threshold_time: str) -> bool:
        """Check whether an article is at least as new as the threshold"""
        try:
            item_dt = datetime.fromisoformat(item_time.replace(' ', 'T'))
            threshold_dt = datetime.fromisoformat(threshold_time.replace(' ', 'T'))
            return item_dt >= threshold_dt
        except (ValueError, TypeError):
            return True  # unparseable or mixed-timezone: treat as new, better to over-collect

    def run(self, max_pages=10):
        """Run the incremental crawl"""
        print(f"🚀 Starting incremental crawl: {self.spider_name}")
        print(f"📍 Target site: {self.base_url}\n")

        # 1. Work out the incremental start time
        start_time = self.get_start_time()
        print(f"⏱️  Incremental start: {start_time}\n")

        # 2. Page through the list
        latest_time = None
        should_stop = False

        for page in range(1, max_pages + 1):
            if should_stop:
                break

            print(f"📄 Fetching page {page}...")
            list_items = self.fetch_list_page(page)

            if not list_items:
                print("  └─ No more data\n")
                break

            # 3. Check whether there is still new data
            for list_item in list_items:
                # An item older than the start time means everything after it
                # is old too (the list is assumed to be newest-first), so stop
                if not self.is_item_newer_than(list_item['pub_time'], start_time):
                    print(f"  └─ Hit old data ({list_item['pub_time']}), stop paging\n")
                    should_stop = True
                    break

                # 4. Fetch the detail page
                detail_item = self.fetch_detail_page(list_item['url'])

                if detail_item:
                    # 5. Store
                    success = self.pipeline.process_item(detail_item)

                    if success:
                        self.stats['new_items'] += 1

                        # Track the latest timestamp
                        if not latest_time or detail_item['pub_time'] > latest_time:
                            latest_time = detail_item['pub_time']

                    self.stats['total_fetched'] += 1

                    # Simple progress indicator
                    if self.stats['total_fetched'] % 10 == 0:
                        print(f"  ├─ Processed {self.stats['total_fetched']} items")

        # 6. Update the state
        if latest_time:
            self.state_manager.update_state(
                spider_name=self.spider_name,
                last_time=latest_time,
                new_count=self.stats['new_items']
            )

        # 7. Print the summary
        self._print_summary()

    def _print_summary(self):
        """Print the run summary"""
        print("\n" + "="*50)
        print("📊 Run summary")
        print("="*50)
        print(f"Total fetched: {self.stats['total_fetched']}")
        print(f"Stored as new: {self.stats['new_items']}")
        print(f"Failed: {self.stats['failed_items']}")

        state = self.state_manager.get_state(self.spider_name)
        print(f"\nCurrent state:")
        print(f"  ├─ Latest time: {state['last_time']}")
        print(f"  └─ Total collected: {state['total_collected']}")
        print("="*50 + "\n")


# Usage example
if __name__ == '__main__':
    # Several news sources
    sources = [
        ('sina_news', 'https://news.sina.com.cn'),
        ('163_news', 'https://news.163.com'),
    ]

    for spider_name, base_url in sources:
        spider = IncrementalNewsSpider(spider_name, base_url)
        spider.run(max_pages=5)
        print("\n" + "="*60 + "\n")
```

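The full flow:

```text
1. Read the incremental state → determine the start time
2. Page through the list → filter by time
3. Fetch detail pages → parse the fields
4. Store (deduplicated automatically) → update counters
5. Update the state → the next run starts from here
```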

Common pitfalls and fixes ⚠️

Pitfall 1: time zones

```python
# ❌ Problem: local time and UTC mixed together
last_time = "2026-01-24 10:00:00"   # which time zone is this?
item_time = "2026-01-24T10:00:00Z"  # UTC

# ✅ Fix: convert everything to UTC
from datetime import datetime, timezone

def normalize_time(time_str):
    """Convert a timestamp to UTC (naive input is treated as machine-local time)"""
    dt = datetime.fromisoformat(time_str.replace('Z', '+00:00'))
    return dt.astimezone(timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
```
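One detail to watch: `astimezone` interprets a naive datetime as the local time of the machine running the crawler, which may not match the source site. A variant that makes the assumption explicit is sketched below. The `SOURCE_TZ` constant (UTC+8 here) and the `to_utc` name are assumptions for the example; use whatever zone the source site actually publishes in.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps without zone info are in the source site's zone, UTC+8
SOURCE_TZ = timezone(timedelta(hours=8))


def to_utc(time_str: str, naive_tz: timezone = SOURCE_TZ) -> str:
    """Normalize a timestamp string to UTC, 'YYYY-MM-DD HH:MM:SS'."""
    dt = datetime.fromisoformat(time_str.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=naive_tz)  # attach the assumed zone, no clock shift
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


samples = [
    "2026-01-24 10:00:00",        # naive, assumed UTC+8
    "2026-01-24T10:00:00Z",       # already UTC
    "2026-01-24 10:00:00+08:00",  # explicit UTC+8
]
normalized = [to_utc(s) for s in samples]
print(normalized)
```

Storing `last_time` in UTC as well keeps the state file unambiguous if the crawler later moves to a server in a different zone.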

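Pitfall 2: lookback window too small, data missed

```python
# ❌ Problem: the source publishes late and a 1-hour lookback is not enough
#   lookback = 1 hour
#   but the source may publish an article 2 hours or more after its timestamp

# ✅ Fix: tune the window (in hours) to each source
if source_name == 'slow_site':
    lookback = 6  # this site lags a lot, look back 6 hours
else:
    lookback = 2  # 2 hours for normal sites
```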

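Pitfall 3: updating state at the wrong moment

```python
# ❌ Wrong: state is updated while collection is still in progress
for item in items:
    process_item(item)
    state_manager.update_state(...)  # if a later item fails, the state has already moved!

# ✅ Right: update only after everything succeeded
items = fetch_data()
for item in items:
    process_item(item)

# Reached only if every item was processed
state_manager.update_state(...)
```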

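Pitfall 4: concurrent runs corrupting state

```python
# ❌ Problem: several processes run at once and overwrite the state file
#   Process A: collects 1-200 → writes last_id=200
#   Process B: collects 1-100 → writes last_id=100 (overwrites A's state)
#   Result: the state no longer matches what was collected; progress is rolled
#   back and total_collected is wrong

# ✅ Fix: take a file lock (fcntl is available on Unix-like systems only)
import fcntl
import json

with open('state.json', 'r+') as f:
    fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # exclusive lock
    try:
        state = json.load(f)
        # ... update state
        f.seek(0)
        json.dump(state, f)
        f.truncate()  # drop leftover bytes if the new JSON is shorter
    finally:
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)  # unlock
```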
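Where `fcntl` is not available, or when the goal is simply to stop two runs of the same spider from overlapping, a lock file created with `O_EXCL` is a common substitute. The following is a sketch under that assumption; `lock_file` and the `state.lock` path are hypothetical names, and a crashed process would leave a stale lock that needs manual cleanup.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def lock_file(path: str):
    """Hold an exclusive lock by creating path; raises FileExistsError if already held."""
    # O_CREAT | O_EXCL fails atomically when the file already exists
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.unlink(path)


# Demo: a second acquisition while the lock is held is refused
with tempfile.TemporaryDirectory() as demo_dir:
    lock_path = os.path.join(demo_dir, "state.lock")
    with lock_file(lock_path):
        try:
            with lock_file(lock_path):
                second_run_started = True
        except FileExistsError:
            second_run_started = False
    lock_released = not os.path.exists(lock_path)

print(second_run_started, lock_released)
```

Wrapping the whole `run()` call in the lock, rather than just the state write, prevents two processes from collecting the same range in the first place.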

Monitoring and alerting 📊

Incremental anomaly detection

```python
from datetime import datetime


def check_incremental_health(state_manager):
    """Check the health of the incremental crawl"""
    warnings = []

    for spider_name, state in state_manager.get_all_states().items():
        last_run = state['last_run']

        if not last_run:
            continue

        # 1. Has the spider been idle too long?
        last_dt = datetime.fromisoformat(last_run)
        hours_since = (datetime.now() - last_dt).total_seconds() / 3600

        if hours_since > 24:
            warnings.append(f"⚠️  {spider_name}: has not run for {hours_since:.1f} h")

        # 2. Is the newest data suspiciously old?
        last_time = state['last_time']
        if last_time:
            last_data_dt = datetime.fromisoformat(last_time)
            data_hours_ago = (datetime.now() - last_data_dt).total_seconds() / 3600

            if data_hours_ago > 48:
                warnings.append(f"⚠️  {spider_name}: newest data is {data_hours_ago:.1f} h old")

    return warnings

# Call from a scheduled job (send_alert_email is a placeholder for your own alerting function)
warnings = check_incremental_health(state_manager)
if warnings:
    send_alert_email("Incremental crawl alert", "\n".join(warnings))
```

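Acceptance criteria ✅

After finishing this section, your incremental crawler should satisfy the following:

Time-based incremental
- ✅ Full crawl on the first run
- ✅ Only new data on later runs (time filter)
- ✅ Lookback window guards against missed data
- ✅ State is updated correctly
ID-based incremental
- ✅ Resumes after the last max ID
- ✅ Handles a large backlog in batches
- ✅ IDs are strictly increasing
State management
- ✅ Independent state per spider
- ✅ Persisted to a file
- ✅ Not updated on failure
Completeness
- ✅ Repeated runs never miss data
- ✅ Statistics are accurate
- ✅ State can be reset for a new full crawl
Integration
- ✅ Integrated with database storage
- ✅ Supports multiple data sources
- ✅ Anomaly detection and alerting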

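Coming next 🔮

Today we built incremental crawling so the crawler only collects new data. One problem is still open:

What happens when a run is interrupted?

For example:

- The network drops at item 500
- The program crashes
- You stop it by hand

How do you make sure the next start resumes from the checkpoint, rather than starting over or dropping part of the data?

Next section: Resumable Crawling - task status table + failed queue + restart recovery

So that your crawler truly "never loses data and can recover at any time" 💾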

Full code summary 📦

```text
incremental_spider/
├── state_manager.py           # state manager (120 lines)
├── time_based_spider.py       # time-based incremental spider (180 lines)
├── id_based_spider.py         # ID-based incremental spider (150 lines)
├── run_incremental.py         # entry point (40 lines)
├── news_incremental_system.py # full integration example (220 lines)
└── incremental_state.json     # state file (generated automatically)

Total: ~710 lines of pure Python
Dependencies: requests, beautifulsoup4
```

How to run:

```bash
# First run (full)
python run_incremental.py
# choose 1 or 2

# Second run (incremental)
python run_incremental.py
# only the new part is collected

# Inspect the state
cat incremental_state.json

# Reset the state (forces a new full crawl)
python -c "
from state_manager import IncrementalStateManager
mgr = IncrementalStateManager()
mgr.reset_state('news_spider')
"
```

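Summary 📝

In this section we built a production-style incremental crawling system. The key takeaways:

✅ Two incremental strategies - timestamps (news) vs IDs (forums), chosen by scenario

✅ Lookback window - guards against data missed because of late publishing

✅ Persistent state - progress is recorded on every run and picked up by the next

✅ Boundary handling - one time zone, greater-than-or-equal comparisons, batching

✅ Full integration - works together with the database and quality checks

The most important idea: collect only what is new and never repeat work

Incremental crawling improves efficiency, and it also:

- Reduces load on the target server (polite crawling)
- Saves bandwidth and storage
- Keeps scheduled jobs light (running every hour is no problem)

Remember: a good crawler is an incremental one. A full crawl is only needed the first time; every run after that should be an incremental update.

The code has been tested and runs, so do try it yourself to see incremental crawling in action. See you next time!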

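🌟 Closing

That's all for this installment of 《Python爬虫实战》. If you run into questions while trying it out, leave a comment and I'll do my best to reply. See you next time!

If you found the article useful while reading, a like, a bookmark, and a follow are welcome.
They are the best encouragement for my writing ❤️🔥

📌 The column keeps updating | bookmark + subscribe

In the column 👉 《Python爬虫实战》 I keep publishing along the path "basics → intermediate → engineering → real projects", aiming for every article to be:

✅ Clearly explained (principles) | ✅ Runnable (code) | ✅ Useful (scenarios) | ✅ Robust (engineering)

📣 If you want to improve systematically, subscribe to the column first and then follow the contents in order. It is far more efficient.

✅ Topic requests

Want me to turn a particular site, anti-bot mechanism, CAPTCHA, or distributed setup into a hands-on article?

Tell me what you need in the comments and I'll prioritize it ✅

⭐️ If you like my work, follow me so you don't miss updates

⭐️ If it helped you, a like gives me a little motivation

⭐️ If you have questions, leave a comment and I'll fill the gaps and keep iterating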

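Disclaimer: this article is for learning and technical research only. Use these techniques lawfully and in compliance with site rules and the Robots protocol. Using them for any illegal purpose or to infringe on the rights of others is strictly prohibited.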